CN114881888A - Video moiré removal method based on linear sparse attention Transformer - Google Patents

Video moiré removal method based on linear sparse attention Transformer

Info

Publication number
CN114881888A
Authority
CN
China
Prior art keywords
layer
video
attention
matrix
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210649880.XA
Other languages
Chinese (zh)
Inventor
牛玉贞
林志华
刘文犀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202210649880.XA
Publication of CN114881888A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a video moiré removal method based on a linear sparse attention Transformer, which trains a video moiré removal network based on the linear sparse attention Transformer and uses the trained network to remove moiré from an input video. The video moiré removal network based on the linear sparse attention Transformer comprises: a feature extraction module, which extracts the features of the video frames; a spatial Transformer module, which uses the spatial attention of the spatial Transformer to capture the positions where moiré appears in a single-frame image and remove it in a targeted manner; a temporal Transformer module, which uses the temporal attention of the temporal Transformer to capture the complementary information existing among multiple frames and restores the image using the complementary information of adjacent frames; and an image reconstruction module, which decodes the video frame features output by the spatial Transformer module and the temporal Transformer module and restores them into demoiréd video frames with the same scale as the input video.

Description

Video moiré removal method based on linear sparse attention Transformer
Technical Field
The invention belongs to the technical field of video processing and computer vision, and particularly relates to a video moiré removal method based on a linear sparse attention Transformer.
Background
With the rapid development of mobile devices and multimedia technology, smartphones have become indispensable tools in daily life, and mobile photography has become increasingly popular as imaging quality has improved. Images and videos are an indispensable part of modern communication and information transmission and are of great significance to many aspects of society. Digital screens are ubiquitous in modern life, such as television screens at home, computers, and large LED screens in public places, and capturing these screens with a mobile phone to quickly store information is common practice; sometimes capturing images and videos is the only practical way to store the information. However, when shooting digital screens, moiré patterns often appear and contaminate the underlying clean images and videos. Moiré is caused by mutual interference between the camera Color Filter Array (CFA) and the sub-pixel layout of the screen, resulting in color-distorted stripes in the captured images and videos that seriously degrade their visual quality. Advances in computer vision and hardware upgrades make it feasible to address this problem, so many researchers have begun to study moiré removal for images, but moiré removal for videos has rarely been studied.
Removing moiré is a challenging task because moiré is irregular in shape and color and spans both low and high frequencies. Unlike other image and video restoration tasks, such as image or video denoising, image demosaicing and image or video super-resolution, the moiré removal task must cope with complex low-frequency and high-frequency moiré fringes while also restoring the details of images and videos; at the same time, moiré fringes also cause color distortion in the captured images. Moiré formation is closely related to the camera imaging process, especially the frequency of the Color Filter Array (CFA), so many methods have been proposed that aim to improve the imaging pipeline to eliminate moiré. However, these methods have high computational complexity and are not suitable for practical application. In 2018, Sun et al. created a large-scale moiré removal benchmark, the TIP2018 dataset, containing over one hundred thousand image pairs, and proposed a novel multi-resolution fully convolutional network to remove moiré, which greatly promoted the development of the image moiré removal task. Compared with image moiré removal, removing moiré from video is more difficult: simply removing moiré frame by frame introduces artifacts and flicker into the video, the temporal coherence between frames cannot be guaranteed, and the resulting performance is unsatisfactory. Therefore, a new method for the video moiré removal task is urgently needed.
Removing moiré from video has important practical significance: given the huge number of digital videos, removing moiré manually would consume enormous labor and time. A video moiré removal algorithm solves exactly this problem: developers only need a trained video moiré removal network to automatically remove the moiré in a video, avoiding repetitive labor and saving a large amount of time. However, since the video moiré removal task has rarely been studied and cannot be solved by simply applying image moiré removal methods, it still remains to be investigated.
Disclosure of Invention
In order to fill this gap and overcome the deficiencies of the prior art, the invention provides a video moiré removal method based on a linear sparse attention Transformer, which achieves high-quality video moiré removal with a specially designed video moiré removal network based on the linear sparse attention Transformer.
The invention specifically adopts the following technical scheme:
a video Moire removing method based on linear sparse attention transducer is characterized by comprising the following steps: training a video Moire removing network based on a linear sparse attention transducer to remove Moire of an input video after training is finished;
the video degranulation network based on the linear sparse attention Transformer comprises:
the characteristic extraction module is used for extracting the characteristics of the video frames;
the spatial Transformer module is used for capturing the positions with Moire patterns in a single-frame image by using the spatial attention of the spatial Transformer and removing key points;
the time Transformer module captures complementary information existing among the images of the multiple frames by using the time attention of the time Transformer and carries out image recovery by using the complementary information of the adjacent frames;
and the image reconstruction module is used for decoding the video frame characteristics passing through the space Transformer module and the time Transformer module and recovering the video frame characteristics into the Moire-removed video frame with the same scale as the input video.
Furthermore, the input of the feature extraction module is five adjacent video frames from the same moiré video, where an input video frame is denoted by I_t with size 3×H×W, t ∈ [1,5]; the module consists of four convolution blocks and three pooling layers, the convolution blocks extract image features, and the pooling layers use 2×2 average pooling to reduce the feature scale. Video frame I_t is input into the first convolution block to obtain a feature map F_t^1 of size C×H×W; F_t^1 is fed into a pooling layer and the second convolution block to obtain a feature map F_t^2 of size 2C×(H/2)×(W/2); similarly, F_t^2 is fed into a pooling layer and the third convolution block to obtain F_t^3, and F_t^3 is fed into a pooling layer and the last convolution block to obtain F_t^4; F_t^3 and F_t^4 have sizes 4C×(H/4)×(W/4) and 8C×(H/8)×(W/8), respectively. Each convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer; both activation layers use the ReLU activation function, both convolution layers use 3×3 convolution kernels, the first convolution layer changes the number of channels, and the second convolution layer keeps the number of channels unchanged.
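For illustration, the following is a minimal PyTorch sketch of a feature extraction module with this structure; the class names (ConvBlock, FeatureExtractor) and the default base channel count are assumptions made for the example rather than names from the original disclosure, and the input height and width are assumed to be divisible by 8 so that the three pooling steps divide evenly.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> ReLU -> Conv -> ReLU; the first conv changes the channel count."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class FeatureExtractor(nn.Module):
    """Four convolution blocks and three 2x2 average-pooling layers, as described above."""
    def __init__(self, base_channels=32):
        super().__init__()
        C = base_channels
        self.block1 = ConvBlock(3, C)           # F^1: C  x H   x W
        self.block2 = ConvBlock(C, 2 * C)       # F^2: 2C x H/2 x W/2
        self.block3 = ConvBlock(2 * C, 4 * C)   # F^3: 4C x H/4 x W/4
        self.block4 = ConvBlock(4 * C, 8 * C)   # F^4: 8C x H/8 x W/8
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, frame):                   # frame: (N, 3, H, W)
        f1 = self.block1(frame)
        f2 = self.block2(self.pool(f1))
        f3 = self.block3(self.pool(f2))
        f4 = self.block4(self.pool(f3))
        return f1, f2, f3, f4
```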
Further, the spatial Transformer module consists of nine linear sparse attention demoiréing layers and one absolute position encoding;
wherein the input of the first layer is the feature map F_t^4 from the feature extraction module, the input of each subsequent layer is the output of the previous layer, and the output feature map F_t of the last layer is the final output of the spatial Transformer module; each layer computes the spatial attention of the feature map with linear time complexity;
the absolute position encoding is a learnable matrix with the same scale as F_t^4, and its parameters are initialized with the Xavier initialization method before training;
the linear sparse attention demoiréing layer consists, in sequence, of a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer; both dropout layers use a neuron drop probability of 0.1, and both normalization layers use layer normalization; the multilayer perceptron consists, in sequence, of a first fully connected layer, an activation layer and a second fully connected layer, where the activation layer uses the ReLU activation function; before the input feature map is sent to the spatial self-attention layer, the feature map and the absolute position encoding are added element by element, and the result is sent to the spatial self-attention layer; there is a residual connection between the input feature map (before the absolute position encoding is added) and the output of the first dropout layer, and a residual connection between the output of the first normalization layer and the output of the second dropout layer;
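As one possible reading of this layer layout, the sketch below wires a generic spatial attention operator into the sequence attention, dropout, add and layer-normalize, MLP, dropout, add and layer-normalize, with the learnable absolute position encoding added only to the attention input; the flattening of the feature map into a token sequence and all class names are assumptions, and the attention argument can be, for example, the linear spatial self-attention sketched after the formulas below.

```python
import torch
import torch.nn as nn

class LinearSparseAttentionLayer(nn.Module):
    """One demoiréing layer: spatial self-attention + dropout + LayerNorm,
    then an MLP + dropout + LayerNorm, each wrapped with a residual connection."""
    def __init__(self, dim, attention, drop=0.1, mlp_ratio=4):
        super().__init__()
        self.attn = attention                  # any module mapping (N, L, dim) -> (N, L, dim)
        self.drop1 = nn.Dropout(drop)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.ReLU(inplace=True),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.drop2 = nn.Dropout(drop)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens, pos_embed):
        # tokens: (N, L, dim) flattened spatial positions; pos_embed: (1, L, dim), Xavier-initialized
        x = self.norm1(tokens + self.drop1(self.attn(tokens + pos_embed)))
        x = self.norm2(x + self.drop2(self.mlp(x)))
        return x
```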
the spatial self-attention layer consists of four learnable matrices, namely a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v and a bottleneck matrix W_p; the calculation of this layer is as follows:
Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)
where F_in is the input of the spatial self-attention layer, F_out is the output of the spatial self-attention layer, Q, K and V are the Query, Key and Value matrices respectively, K^T is the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot(·) denotes matrix multiplication; Q, K and W_v are sparse matrices under the constraint of an L2 loss function.
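The formulas above replace the usual Softmax(Q K^T) V with Softmax(Q) · (Softmax(K^T) · V), so the cost of attention grows linearly rather than quadratically with the number of spatial positions. A minimal sketch of that computation follows; treating W_q, W_k, W_v and W_p as linear layers over flattened spatial tokens, and the choice of softmax axes (feature axis for Q, position axis for K, the common efficient-attention convention), are assumptions, and the L2 sparsity constraint is applied through the loss rather than inside this module.

```python
import torch
import torch.nn as nn

class LinearSpatialSelfAttention(nn.Module):
    """Linear-complexity attention: Softmax(Q) @ (Softmax(K)^T @ V)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # Query weight matrix W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # Key weight matrix W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # Value weight matrix W_v
        self.w_p = nn.Linear(dim, dim, bias=False)   # bottleneck matrix W_p

    def forward(self, f_in):                         # f_in: (N, L, dim), L = H*W positions
        q = self.w_q(f_in).softmax(dim=-1)           # softmax over the feature dimension of Q
        k = self.w_k(f_in).softmax(dim=1)            # softmax over positions (columns of K^T)
        v = self.w_v(f_in)
        context = k.transpose(1, 2) @ v              # (N, dim, dim): cost O(L * dim^2)
        h = q @ context                              # (N, L, dim)
        return self.w_p(h)
```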
Further, the input of the temporal Transformer module is the final output F_t of the spatial Transformer module; the module consists of four temporal attention demoiréing layers, one absolute position encoding and one absolute time encoding;
the absolute position encoding shares parameters with the absolute position encoding of the spatial Transformer module;
the absolute time encoding is a learnable matrix with scale 5×8C×1, whose parameters are initialized with the Xavier initialization method before training;
the input of the first temporal attention demoiréing layer is the features F_t of the five video frames, the input of each subsequent layer is the output of the previous layer, and the output feature map F̂_t of the last layer for the t-th frame is the final output of the temporal Transformer module for the t-th frame;
the temporal attention demoiréing layer consists, in sequence, of a temporal self-attention layer, a dropout layer, a normalization layer, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer;
all three dropout layers use a neuron drop probability of 0.1, all three normalization layers use layer normalization, the multilayer perceptron consists, in sequence, of a fully connected layer, an activation layer and a fully connected layer, and the activation layer uses the ReLU activation function;
the structure of the spatial self-attention layer is the same as that of the spatial self-attention layer in the linear sparse attention demoiréing layer;
before the feature maps are input into the temporal self-attention layer, the feature maps of the five input video frames are first concatenated along the temporal dimension, the concatenated feature map is added element by element to the absolute time encoding, and the result is sent to the temporal self-attention layer; before being input into the spatial self-attention layer, the concatenated feature map must be split back into per-frame feature maps and the absolute position encoding of the spatial Transformer added; there is a residual connection between the input feature map (before the absolute time encoding is added) and the output of the first dropout layer, a residual connection between the feature map (before the absolute position encoding is added) and the output of the second dropout layer, and a residual connection between the output of the second normalization layer and the output of the third dropout layer;
the temporal self-attention layer consists of four learnable matrices, namely a Query weight matrix W'_q, a Key weight matrix W'_k, a Value weight matrix W'_v and a bottleneck matrix W'_p; the calculation of this layer is as follows:
Q_t = Dot(W'_q, F_in^t)
K_t = Dot(W'_k, F_in^t)
V_t = Dot(W'_v, F_in^t)
K_a = [K_1, K_2, ..., K_5]
V_a = [V_1, V_2, ..., V_5]
H_t(i,j) = Dot(Softmax(Dot(Q_t(i,j), (K_a(i,j))^T)), V_a(i,j))
F_out = Dot(W'_p, H)
where t denotes the t-th frame, t ∈ [1,5], F_in^t is the input feature belonging to the t-th frame in the temporal self-attention layer, F_out is the output of the temporal self-attention layer, Q_t, K_t and V_t are the Query, Key and Value matrices belonging to the t-th frame respectively, Softmax(·) denotes the softmax calculation over the last dimension of the matrix, Dot(·) denotes matrix multiplication, the superscript T denotes matrix transpose, [·] denotes composing matrices into one matrix, K_a and V_a are the Key and Value matrices of the five frames respectively, H is the complete attention feature map of the temporal self-attention layer, and (i,j) denotes the position of a feature, that is, the feature map is divided into non-overlapping 2×2 patches and (i,j) denotes the position of the patch in which the feature lies; H_t(i,j) denotes the local attention feature at position (i,j) in H for the t-th frame, K_a(i,j) denotes the local Key matrix at position (i,j) in K_a, V_a(i,j) denotes the local Value matrix at position (i,j) in V_a, Q_t(i,j) denotes the local Query matrix at position (i,j) in Q_t, and W'_v is a sparse matrix under the constraint of an L2 loss function.
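To make the window-based temporal attention concrete, the sketch below splits each frame's feature map into non-overlapping 2×2 patches and, for every patch position (i, j), lets the queries of each frame attend to the keys and values gathered from the same patch position in all five frames; the tensor layout and class name are assumptions, and H and W are assumed to be even.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Per-2x2-window attention of each frame's queries over keys/values of all T frames."""
    def __init__(self, dim, window=2):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W'_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W'_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W'_v
        self.w_p = nn.Linear(dim, dim, bias=False)   # W'_p
        self.window = window

    def forward(self, feats):                        # feats: (N, T, C, H, W), T = 5 frames
        n, t, c, h, w = feats.shape
        s = self.window
        # split each frame into (H/s)*(W/s) non-overlapping s x s windows
        x = feats.view(n, t, c, h // s, s, w // s, s)
        x = x.permute(0, 3, 5, 1, 4, 6, 2)           # (N, H/s, W/s, T, s, s, C)
        x = x.reshape(n, (h // s) * (w // s), t, s * s, c)
        q = self.w_q(x)                              # per-frame, per-window queries
        k = self.w_k(x).flatten(2, 3)                # keys of all T frames in the window
        v = self.w_v(x).flatten(2, 3)                # values of all T frames in the window
        attn = torch.softmax(q.flatten(2, 3) @ k.transpose(-1, -2), dim=-1)
        out = self.w_p(attn @ v)                     # (N, nWin, T*s*s, C)
        out = out.reshape(n, h // s, w // s, t, s, s, c).permute(0, 3, 6, 1, 4, 2, 5)
        return out.reshape(n, t, c, h, w)
```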
Further, the input of the image reconstruction module is the final output F̂_3 of the temporal Transformer module for the third frame, with size 8C×(H/8)×(W/8); the module consists of three upsampling blocks, three convolution blocks, a 1×1 convolution and a Tanh activation layer. F̂_3 is input into the first upsampling block to obtain a feature map of size 4C×(H/4)×(W/4), which is concatenated channel-wise with the feature map F_3^3 of the feature extraction module and input into the first convolution block; the resulting feature map is input into the second upsampling block to obtain a feature map of size 2C×(H/2)×(W/2), which is concatenated channel-wise with the feature map F_3^2 of the feature extraction module and input into the second convolution block; the resulting feature map is input into the third upsampling block to obtain a feature map of size C×H×W, which is concatenated channel-wise with the feature map F_3^1 of the feature extraction module and input into the third convolution block; finally, the resulting feature map is passed through the 1×1 convolution and the Tanh activation layer to obtain the output frame image Î_3, i.e. the demoiréd frame corresponding to moiré video frame I_3;
the upsampling block consists, in sequence, of an upsampling layer, a convolution layer and an activation layer; the upsampling layer uses bilinear interpolation with a magnification of 2, the convolution layer uses a 3×3 convolution kernel and halves the number of feature-map channels, and the activation layer uses the ReLU activation function; the convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer, where both convolution layers use 3×3 convolution kernels and the activation layers use the ReLU activation function.
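A compact sketch of this decoder path follows: each upsampling block is bilinear 2× upsampling, a 3×3 convolution that halves the channel count, and a ReLU, and at each scale the corresponding encoder feature map is concatenated before a convolution block; the helper conv_block and the class names are assumptions introduced for the example.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv -> ReLU -> Conv -> ReLU with 3x3 kernels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UpBlock(nn.Module):
    """Bilinear 2x upsampling -> 3x3 conv halving channels -> ReLU."""
    def __init__(self, in_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class Reconstructor(nn.Module):
    """Decodes the temporal Transformer output for the middle frame back to image space."""
    def __init__(self, C):
        super().__init__()
        self.up1, self.conv1 = UpBlock(8 * C), conv_block(8 * C, 4 * C)   # concat with F^3 (4C)
        self.up2, self.conv2 = UpBlock(4 * C), conv_block(4 * C, 2 * C)   # concat with F^2 (2C)
        self.up3, self.conv3 = UpBlock(2 * C), conv_block(2 * C, C)       # concat with F^1 (C)
        self.head = nn.Sequential(nn.Conv2d(C, 3, kernel_size=1), nn.Tanh())

    def forward(self, f_hat3, f1, f2, f3):           # f_hat3: (N, 8C, H/8, W/8)
        x = self.conv1(torch.cat([self.up1(f_hat3), f3], dim=1))
        x = self.conv2(torch.cat([self.up2(x), f2], dim=1))
        x = self.conv3(torch.cat([self.up3(x), f1], dim=1))
        return self.head(x)                          # demoiréd frame in [-1, 1]
```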
Further, the loss function used to train the video moiré removal network is constructed as follows:
the overall optimization objective of the network is:
min(L), L = L_C + L_ASL + L_cr + L_sparse
where min(L) means minimizing L, L denotes the total moiré removal loss, L_C denotes the Charbonnier loss between the demoiréd image and the clean image, L_ASL denotes the edge texture loss between the demoiréd image and the clean image, L_cr denotes the color loss between the demoiréd image and the clean image, and L_sparse denotes the loss that constrains the sparse matrices in the spatial Transformer module and the temporal Transformer module;
the Charbonnier loss L_C between the demoiréd image and the clean image is calculated as follows:
L_C = λ_C · sqrt(||Î_3 - O_3||^2 + ε^2)
where Î_3 denotes the demoiréd image corresponding to the third frame of the five input moiré video frames, O_3 denotes the clean image paired with it, ε is a constant controlling precision, and λ_C is the weight of this loss;
the edge texture loss L_ASL between the demoiréd image and the clean image is calculated as follows:
L_ASL = λ_ASL · Σ ||Sobel_*(Î_3) - Sobel_*(O_3)||_1
where ||·||_1 is the absolute-value (L1) operation, Sobel_* denotes improved Sobel filters with different orientations, Sobel_*(·) denotes the convolution operation with these filters, the sum runs over the different orientations, and λ_ASL is the weight of this loss;
the color loss L_cr between the demoiréd image and the clean image is calculated as follows:
L_cr = λ_cr · ||G(Î_3) - G(O_3)||_2^2
where G denotes a Gaussian blur kernel, G(Î_3) denotes the blurred demoiréd image, G(O_3) denotes the blurred clean image, ||·||_2^2 is the squared two-norm operation, and λ_cr is the weight of this loss;
the loss L_sparse constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module is calculated as follows:
L_sparse = λ_sparse · (|Q_*|^2 + |K_*|^2 + |W_v^*|^2 + |W'_v^*|^2)
where Q_* and K_* denote the Query matrices and Key matrices calculated in all spatial self-attention layers of the spatial Transformer module, W_v^* denotes the Value weight matrices of all spatial self-attention layers in the spatial Transformer module and the temporal Transformer module, W'_v^* denotes the Value weight matrices of all temporal self-attention layers in the temporal Transformer module, |·|^2 is the squared absolute-value operation, and λ_sparse is the weight of this loss.
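The sketch below assembles the four terms into one training loss under stated assumptions: a standard Charbonnier term, an edge term using plain horizontal and vertical Sobel kernels as stand-ins for the unspecified improved Sobel filters, a color term on Gaussian-blurred images (the blur is passed in as a callable, e.g. torchvision.transforms.GaussianBlur), and an L2 penalty on the matrices constrained to be sparse; all weights and ε are placeholder values.

```python
import torch
import torch.nn.functional as F

def charbonnier(pred, target, eps=1e-3):
    return torch.sqrt((pred - target).pow(2).sum() + eps ** 2)

def sobel_edges(img):
    # plain horizontal/vertical Sobel kernels stand in for the "improved Sobel filters"
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    k = torch.stack([kx, ky]).unsqueeze(1).repeat(1, img.shape[1], 1, 1) / img.shape[1]
    return F.conv2d(img, k, padding=1)

def total_loss(pred, clean, sparse_mats, blur,
               w_char=1.0, w_edge=1.0, w_color=1.0, w_sparse=1e-4):
    """pred/clean: (N, 3, H, W); sparse_mats: iterable of the constrained tensors (Q, K, W_v, W'_v);
    blur: a Gaussian-blur callable, e.g. torchvision.transforms.GaussianBlur(21, sigma=3)."""
    l_char = w_char * charbonnier(pred, clean)
    l_edge = w_edge * (sobel_edges(pred) - sobel_edges(clean)).abs().sum()
    l_color = w_color * (blur(pred) - blur(clean)).pow(2).sum()
    l_sparse = w_sparse * sum(m.pow(2).sum() for m in sparse_mats)
    return l_char + l_edge + l_color + l_sparse
```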
Further, before training the video moiré removal network based on the linear sparse attention Transformer, the videos in the original dataset are processed to obtain moiré video and clean video pairs, each video is decoded into video frames, and each frame image is preprocessed to obtain the training set; this specifically comprises the following steps:
step S11, obtaining an original dataset in which each moiré video and its corresponding clean video have the same size and the same content and correspond to each other one by one; videos originating from the same video content form a moiré video and clean video pair, and each frame in the pair is guaranteed to correspond to its counterpart;
step S12, randomly flipping all video frames of each moiré video and clean video pair in the same way, then randomly cropping the moiré video frames and clean video frames to H×W while keeping the correspondence between the moiré video and the clean video;
step S13, normalizing all moiré video frames and clean video frames of size H×W: given a frame image I(i,j), the normalized frame image is computed pixel by pixel, where (i,j) represents the position of the pixel.
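A sketch of the paired preprocessing is given below; applying the identical flip and crop to the moiré frames and the clean frames preserves their correspondence, and the final normalization to [-1, 1] is an assumption chosen to match the Tanh output layer rather than the exact formula used by the method.

```python
import random
import torch

def preprocess_pair(moire_frames, clean_frames, crop_h, crop_w):
    """moire_frames/clean_frames: lists of (3, H, W) uint8 tensors from one video pair.
    Applies the same random flip and random crop to both, then normalizes pixel values."""
    flip_h = random.random() < 0.5
    flip_v = random.random() < 0.5
    _, h, w = moire_frames[0].shape
    top = random.randint(0, h - crop_h)
    left = random.randint(0, w - crop_w)

    def transform(frame):
        x = frame[:, top:top + crop_h, left:left + crop_w].float()
        if flip_h:
            x = torch.flip(x, dims=[2])
        if flip_v:
            x = torch.flip(x, dims=[1])
        return x / 127.5 - 1.0          # assumed normalization to [-1, 1]

    return [transform(f) for f in moire_frames], [transform(f) for f in clean_frames]
```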
Further, the training process specifically comprises the following steps:
step S41, randomly selecting a moiré video and clean video pair from the training set, then randomly selecting five adjacent video frames from the moiré video and simultaneously selecting the five corresponding frames from the clean video; the moiré video frames and clean video frames are denoted I_t and O_t respectively, t ∈ [1,5];
step S42, the first training stage of the video moiré removal network based on the linear sparse attention Transformer: the five video frames I_t are input, the demoiréd intermediate frame Î_3 is obtained through the network, the total moiré removal loss L is calculated, the gradient of each parameter in the network is computed by back propagation, and the parameters are updated with the Adam optimization method, with the learning rate decaying from 10^-4 to 10^-5;
step S43, the second training stage of the video moiré removal network based on the linear sparse attention Transformer: the input and training method are the same as in the first stage, except that the color loss weight λ_cr in the total moiré removal loss L is set to 0 and the learning rate decays from 10^-5 to 10^-6 for fine-tuning of the network.
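A schematic of the two-stage schedule follows; the linear learning-rate decay, the batch assembly and the function names are assumptions wrapped around the stated settings (Adam, 10^-4 down to 10^-5 with the full loss in stage one, 10^-5 down to 10^-6 with the color weight set to 0 in stage two).

```python
import torch

def train(model, sample_batch, compute_loss, stage1_iters, stage2_iters):
    """sample_batch() -> (five moiré frames, clean middle frame); compute_loss(pred, clean, w_color)
    wraps the total loss above. The linear learning-rate schedule is an assumption."""
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)

    def run_stage(n_iters, lr_start, lr_end, w_color):
        for it in range(n_iters):
            lr = lr_start + (lr_end - lr_start) * it / max(1, n_iters - 1)
            for group in optim.param_groups:
                group["lr"] = lr
            frames, clean = sample_batch()
            pred = model(frames)                 # demoiréd middle frame
            loss = compute_loss(pred, clean, w_color)
            optim.zero_grad()
            loss.backward()
            optim.step()

    run_stage(stage1_iters, 1e-4, 1e-5, w_color=1.0)   # stage one: full loss
    run_stage(stage2_iters, 1e-5, 1e-6, w_color=0.0)   # stage two: color loss weight set to 0
```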
Further, the specific operation of removing moiré from an input video after training is completed is as follows: for a new moiré video, two blank frames are inserted at the beginning and at the end of the video respectively; the first five frames of the padded video are taken and input into the network to obtain the demoiréd video frame corresponding to the first frame of the original moiré video; then the second to sixth frames are taken and input into the network to obtain the demoiréd video frame corresponding to the second frame of the original moiré video; the same operation is applied subsequently until the demoiréd video frames corresponding to all frames of the moiré video have been obtained.
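The padding-and-sliding-window inference described above can be sketched as follows; treating the inserted blank frames as all-zero tensors is an assumption.

```python
import torch

@torch.no_grad()
def demoire_video(model, frames):
    """frames: list of (3, H, W) normalized moiré frames; returns one demoiréd frame per input."""
    blank = torch.zeros_like(frames[0])
    padded = [blank, blank] + list(frames) + [blank, blank]   # two blank frames at each end
    outputs = []
    for i in range(len(frames)):
        window = torch.stack(padded[i:i + 5]).unsqueeze(0)    # (1, 5, 3, H, W), centered on frame i
        outputs.append(model(window).squeeze(0))              # demoiréd frame for original frame i
    return outputs
```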
An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video moiré removal method based on the linear sparse attention Transformer as described above when executing the program.
A computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the video moiré removal method based on the linear sparse attention Transformer as described above.
Compared with the prior art, the invention and the optimized scheme thereof have the following beneficial effects:
the spatial attention and the temporal attention are used, the spatial information and the temporal information between different frames are effectively utilized to remove Moire and supplement details, and artifacts and flicker are prevented from appearing in the video. The method uses a linear attention computing mode, can effectively reduce the square time complexity of the original Transformer attention computing mode into the linear time complexity, greatly reduces the computing amount of the network, and improves the practical application effect of the network. Meanwhile, a loss function constraint calculation matrix is adopted as a sparse matrix in the linear attention calculation process, so that a more effective and stable moire removing effect is realized. The method can remove the moire in the video under low calculation complexity, generate the high-quality clean video without moire, and improve the visual effect and the performance index of the generated video without moire. Therefore, the invention has strong practicability and wide application prospect.
Drawings
Fig. 1 is a schematic flow chart of implementation of the embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a video moiré removal network according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a spatial Transformer module according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a temporal Transformer module according to an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, this embodiment further details the scheme of the present invention, in the form of steps, with a specific operational example implementing the video moiré removal method based on the linear sparse attention Transformer.
The method can be specifically summarized as the following steps:
Step S1, processing the videos in the original dataset to obtain moiré video and clean video pairs, decoding each video into video frames, and preprocessing each frame image to obtain the training set; step S1 specifically includes the following steps:
Step S11, acquiring an original dataset in which each moiré video and its corresponding clean video have the same size and the same content and correspond to each other one by one; videos originating from the same video content form a moiré video and clean video pair, and each frame in the pair is guaranteed to correspond to its counterpart.
Step S12, randomly flipping all video frames of each moiré video and clean video pair in the same way, then randomly cropping the moiré video frames and clean video frames to H×W while keeping the correspondence between the moiré video and the clean video;
Step S13, normalizing all moiré video frames and clean video frames of size H×W: given a frame image I(i,j), the normalized frame image is computed pixel by pixel, where (i,j) represents the position of the pixel.
Step S2, constructing the video moiré removal network based on the linear sparse attention Transformer. As shown in fig. 2, the video moiré removal network is composed of four parts, namely a feature extraction module, a spatial Transformer module, a temporal Transformer module and an image reconstruction module;
specifically, step S2 includes the following steps:
Step S21, constructing a feature extraction module to extract features of the video frames in preparation for subsequent moiré removal.
As shown in fig. 2, the input of the feature extraction module is five adjacent video frames from the same moiré video, where an input video frame is denoted by I_t with size 3×H×W, t ∈ [1,5]; the module consists of four convolution blocks and three pooling layers, the convolution blocks extract image features, and the pooling layers use 2×2 average pooling to reduce the feature scale. Video frame I_t is input into the first convolution block to obtain a feature map F_t^1 of size C×H×W; F_t^1 is fed into a pooling layer and the second convolution block to obtain a feature map F_t^2; similarly, F_t^2 is fed into a pooling layer and the third convolution block to obtain F_t^3, and F_t^3 is fed into a pooling layer and the last convolution block to obtain F_t^4; F_t^2, F_t^3 and F_t^4 have sizes 2C×(H/2)×(W/2), 4C×(H/4)×(W/4) and 8C×(H/8)×(W/8), respectively. Each convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer; both activation layers use the ReLU activation function and both convolution layers use 3×3 convolution kernels; the first convolution layer changes the number of channels (for example, the first convolution block changes the number of channels from 3 to C, while the other convolution blocks double it), and the second convolution layer keeps the number of channels unchanged.
Step S22, constructing a spatial Transformer module to capture the positions where moiré appears in a single-frame image using the spatial attention of the spatial Transformer and to remove it in a targeted manner, so as to achieve a better moiré removal effect.
As shown in fig. 3, the spatial Transformer module consists of nine linear sparse attention demoiréing layers and one absolute position encoding. The input of the first layer is the feature map F_t^4 from the feature extraction module, the input of each subsequent layer is the output of the previous layer, and the output feature map F_t of the last layer is the final output of the spatial Transformer module. Each layer computes the spatial attention of the feature map with linear time complexity, which helps the network locate and remove the regions where the moiré is severe. The absolute position encoding is a learnable matrix with the same scale as F_t^4, and its parameters are initialized with the Xavier initialization method before training;
the linear sparse attention demoiréing layer consists, in sequence, of a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer; both dropout layers use a neuron drop probability of 0.1, both normalization layers use layer normalization, the multilayer perceptron consists, in sequence, of a fully connected layer, an activation layer and a fully connected layer, and the activation layer uses the ReLU activation function; before the input feature map is sent to the spatial self-attention layer, the feature map and the absolute position encoding are added element by element, and the result is sent to the spatial self-attention layer; there is a residual connection between the input feature map (before the absolute position encoding is added) and the output of the first dropout layer, and a residual connection between the output of the first normalization layer and the output of the second dropout layer;
the spatial self-attention layer consists of four learnable matrices, namely a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v and a bottleneck matrix W_p; the calculation of this layer is as follows:
Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)
where F_in is the input of the spatial self-attention layer, F_out is the output of this layer, Q, K and V are the Query, Key and Value matrices respectively, K^T is the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot(·) denotes matrix multiplication; Q, K and W_v are sparse matrices under the constraint of an L2 loss function.
Step S23, constructing a temporal Transformer module to capture the complementary information existing between multiple frames using the temporal attention of the temporal Transformer, and restoring the image using the complementary information of adjacent frames to further improve the moiré removal effect.
As shown in fig. 4, the input of the temporal Transformer module is the final output F_t of the spatial Transformer module. The module consists of four temporal attention demoiréing layers, one absolute position encoding and one absolute time encoding; the absolute position encoding shares parameters with the absolute position encoding of the spatial Transformer module, and the absolute time encoding is a learnable matrix with scale 5×8C×1, whose parameters are initialized with the Xavier initialization method before training. The input of the first temporal attention demoiréing layer is the features F_t of the five video frames, the input of each subsequent layer is the output of the previous layer, and the output feature map F̂_t of the last layer for the t-th frame is the final output of the temporal Transformer module for the t-th frame. The temporal attention demoiréing layer is similar in structure to the linear sparse attention demoiréing layer; the main difference is that the temporal attention demoiréing layer additionally contains a temporal self-attention layer, which captures complementary information between adjacent video frames, and this temporally complementary information facilitates moiré removal;
the temporal attention demoiréing layer consists, in sequence, of a temporal self-attention layer, a dropout layer, a normalization layer, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer; all three dropout layers use a neuron drop probability of 0.1, all three normalization layers use layer normalization, the multilayer perceptron consists, in sequence, of a fully connected layer, an activation layer and a fully connected layer, the activation layer uses the ReLU activation function, and the structure of the spatial self-attention layer is the same as that of the spatial self-attention layer in the linear sparse attention demoiréing layer; before the feature maps are input into the temporal self-attention layer, the feature maps of the five input video frames are first concatenated along the temporal dimension, the concatenated feature map is added element by element to the absolute time encoding, and the result is sent to the temporal self-attention layer; before being input into the spatial self-attention layer, the concatenated feature map must be split back into per-frame feature maps and the absolute position encoding of the spatial Transformer added; there is a residual connection between the input feature map (before the absolute time encoding is added) and the output of the first dropout layer, a residual connection between the feature map (before the absolute position encoding is added) and the output of the second dropout layer, and a residual connection between the output of the second normalization layer and the output of the third dropout layer;
the temporal self-attention layer consists of four learnable matrices, namely a Query weight matrix W'_q, a Key weight matrix W'_k, a Value weight matrix W'_v and a bottleneck matrix W'_p; the calculation of this layer is as follows:
Q_t = Dot(W'_q, F_in^t)
K_t = Dot(W'_k, F_in^t)
V_t = Dot(W'_v, F_in^t)
K_a = [K_1, K_2, ..., K_5]
V_a = [V_1, V_2, ..., V_5]
H_t(i,j) = Dot(Softmax(Dot(Q_t(i,j), (K_a(i,j))^T)), V_a(i,j))
F_out = Dot(W'_p, H)
where t denotes the t-th frame, t ∈ [1,5], F_in^t is the input feature belonging to the t-th frame in the temporal self-attention layer, F_out is the output of this layer, Q_t, K_t and V_t are the Query, Key and Value matrices belonging to the t-th frame respectively, Softmax(·) denotes the softmax calculation over the last dimension of the matrix, Dot(·) denotes matrix multiplication, the superscript T denotes matrix transpose, [·] denotes composing matrices into one matrix, K_a and V_a are the Key and Value matrices of the five frames respectively, H is the complete attention feature map of the temporal self-attention layer, and (i,j) denotes the position of a feature; specifically, the feature map is divided into non-overlapping 2×2 patches and (i,j) denotes the position of the patch in which the feature lies; H_t(i,j) denotes the local attention feature at position (i,j) in H for the t-th frame, K_a(i,j) denotes the local Key matrix at position (i,j) in K_a, V_a(i,j) denotes the local Value matrix at position (i,j) in V_a, Q_t(i,j) denotes the local Query matrix at position (i,j) in Q_t, and W'_v is a sparse matrix under the constraint of an L2 loss function.
Step S24, constructing an image reconstruction module, and using it to decode the video frame features output by the spatial Transformer module and the temporal Transformer module and restore them into a demoiréd video frame with the same scale as the input video.
As shown in fig. 2, the input of the image reconstruction module is the final output F̂_3 of the temporal Transformer module for the intermediate frame (the third frame), with size 8C×(H/8)×(W/8). The module consists of three upsampling blocks, three convolution blocks, a 1×1 convolution and a Tanh activation layer. F̂_3 is input into the first upsampling block to obtain a feature map of size 4C×(H/4)×(W/4), which is concatenated channel-wise with the feature map F_3^3 of the feature extraction module and input into the first convolution block; the resulting feature map is input into the second upsampling block to obtain a feature map of size 2C×(H/2)×(W/2), which is concatenated channel-wise with the feature map F_3^2 of the feature extraction module and input into the second convolution block; the resulting feature map is input into the third upsampling block to obtain a feature map of size C×H×W, which is concatenated channel-wise with the feature map F_3^1 of the feature extraction module and input into the third convolution block; finally, the resulting feature map is passed through the 1×1 convolution and the Tanh activation layer to obtain the output frame image Î_3, i.e. the demoiréd frame corresponding to moiré video frame I_3;
the upsampling block consists, in sequence, of an upsampling layer, a convolution layer and an activation layer; the upsampling layer uses bilinear interpolation with a magnification of 2, the convolution layer uses a 3×3 convolution kernel and halves the number of feature-map channels, and the activation layer uses the ReLU activation function; the convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer, where both convolution layers use 3×3 convolution kernels and the activation layers also use the ReLU activation function;
and step S3, constructing a loss function for training the video degranulation network.
Step S3 specifically includes the following steps:
step S31, constructing a total optimization target of the whole network; the optimization objectives are as follows:
min(L),
Figure BDA0003687299740000132
where min (L) represents the minimum L, L represents the total loss of democratic network,
Figure BDA0003687299740000133
charbonnier loss representing the democratic image versus the clean image,
Figure BDA0003687299740000134
representing the loss of edge texture in the deglitched image versus the clean image,
Figure BDA0003687299740000135
representing the loss of the sparse matrix in the constrained space Transformer module and the time Transformer module;
step S32, constructing Charbonier loss of the Moire pattern removed image and the clean image;
Figure BDA0003687299740000136
the calculation formula of (a) is as follows:
Figure BDA0003687299740000137
wherein,
Figure BDA0003687299740000138
representing the corresponding de-moir e image of the third frame of the five input moir e video frames, O 3 Representing the clean image with which it is paired, e represents a constant of control precision, λ C A weight representing the loss;
step S33, constructing the edge texture loss of the Moire pattern removed image and the clean image;
Figure BDA0003687299740000139
the calculation formula of (a) is as follows:
Figure BDA00036872997400001310
wherein | | | purple hair 1 Is an absolute value taking operation, Sobel * Improved Sobel filters, Sobel, showing different orientations * () Denotes the convolution operation, λ ASL A weight representing the loss;
step S34, constructing color loss of the Moire pattern removed image and the clean image;
Figure BDA00036872997400001311
the calculation formula of (a) is as follows:
Figure BDA00036872997400001312
wherein G represents a Gaussian blur kernel,
Figure BDA00036872997400001313
representing a blurred, degritted image, G (O) 3 ) A clear image that is blurred is represented,
Figure BDA00036872997400001314
is a squaring operation taking the two norms, λ cr A weight representing the loss;
step S35, constructing loss of sparse matrixes in a constrained space Transformer module and a time Transformer module;
Figure BDA0003687299740000141
the calculation formula of (a) is as follows:
Figure BDA0003687299740000142
wherein Q * And K * Representing the Query matrix and the Key matrix calculated in all the spaces in the space Transformer module from the attention layer,
Figure BDA0003687299740000143
represents the Value weight matrix calculated in all the space self-attention layers in the space Transformer module and the time Transformer module,
Figure BDA0003687299740000144
represents the Value weight matrix calculated from the attention layer for all the time in the time Transformer module,
Figure BDA0003687299740000145
is a squaring operation of the absolute value, λ sparse A weight representing the loss;
and step S4, training the video democratic texture network by adopting the training data set.
Step S4 specifically includes the following steps:
step S41, randomly selecting a pair of Moire pattern video and clean video from the training data set, then randomly selecting five adjacent video frames from the Moire pattern video, and simultaneously selecting five corresponding video frames from the corresponding clean video, wherein the Moire pattern video frame and the clean video frame are respectively marked as I t And O t ,t∈[1,5];
Step S42, training a first stage of a video Moire pattern removing network based on a linear sparse attention transducer; inputting five video frames I t Obtaining the Moire-removed intermediate frame by calculation of the network
Figure BDA0003687299740000146
Calculating the total loss L of the Moire pattern removal, calculating the gradient of each parameter in the network by adopting a back propagation calculation method, and updating the parameters by utilizing an Adam optimization method, wherein the learning rate is 10 -4 Slowly decreases to 10 -5
Step S43, training a second stage of the video Moire pattern removing network based on the linear sparse attention Transformer; the input and training method is the same as the first stage, except that the color loss weight λ in the total loss L of democratic lines is weighted cr Set to 0 with a learning rate of 10 -5 Slowly decreases to 10 -6 Carrying out fine tuning training of the network;
in this embodiment, the whole training process is iterated four million times, and in each iteration process, a plurality of video pairs are randomly sampled and trained as one batch, the first two million and fifty million times are subjected to the first stage training of step S42, and the remaining one hundred and fifty million times are iterated and subjected to the second stage training of step S43;
and step S5, inputting the new Moire pattern video into the trained video Moire pattern removing network, and outputting the clean video without Moire patterns. Specifically, for a new moire video, two blank frames are respectively inserted at the beginning and the end of the video, the first five frames of the video are firstly taken and input into the network for calculation, a moire-removing video frame corresponding to the first frame of the original moire video is obtained, then the second frame to the sixth frame of the video are taken and input into the network, a moire-removing video frame corresponding to the second frame of the original moire video is obtained, and the same operation is subsequently adopted until the moire-removing video frame corresponding to the moire-removing video frame is obtained.
The embodiment also provides a video moiré removal system based on the linear sparse attention Transformer, which comprises a memory, a processor and computer program instructions stored in the memory and executable by the processor; when the computer program instructions are executed by the processor, the above method steps can be implemented.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow of the flowcharts, and combinations of flows in the flowcharts, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.
The foregoing is directed to preferred embodiments of the present invention; other and further embodiments of the invention may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification or equivalent change of the above embodiments made according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.
The present invention is not limited to the above preferred embodiments; anyone, in light of the present invention, can derive various other forms of video moiré removal methods based on a linear sparse attention Transformer.

Claims (10)

1. A video moiré removal method based on a linear sparse attention Transformer, characterized by comprising: training a video demoiréing network based on a linear sparse attention Transformer, and removing moiré from an input video after training is finished;
the video demoiréing network based on the linear sparse attention Transformer comprises:
a feature extraction module for extracting features of the video frames;
a spatial Transformer module for capturing the positions where moiré patterns appear in a single-frame image by means of the spatial attention of the spatial Transformer and focusing removal on those positions;
a temporal Transformer module for capturing complementary information among the multiple frames by means of the temporal attention of the temporal Transformer and recovering the images using the complementary information of adjacent frames;
and an image reconstruction module for decoding the video frame features output by the spatial Transformer module and the temporal Transformer module and recovering them into a de-moiré video frame with the same scale as the input video.
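For illustration only (not part of the claims), the following is a minimal sketch of how the four modules of claim 1 can be wired together, assuming PyTorch implementations of the individual modules such as those sketched after claims 2-5; the class and argument names are assumptions, not the patent's reference code.

import torch
import torch.nn as nn

class VideoDemoireNet(nn.Module):
    def __init__(self, feature_extractor, spatial_transformer, temporal_transformer, reconstructor):
        super().__init__()
        self.features = feature_extractor          # claim 2
        self.spatial = spatial_transformer         # claim 3
        self.temporal = temporal_transformer       # claim 4
        self.reconstruct = reconstructor           # claim 5

    def forward(self, frames):                     # frames: (B, 5, 3, H, W)
        # per-frame encoder features at four scales
        f1, f2, f3, f4 = zip(*[self.features(frames[:, t]) for t in range(5)])
        spatial_out = [self.spatial(f) for f in f4]                     # per-frame spatial attention
        temporal_out = self.temporal(torch.stack(spatial_out, dim=1))   # cross-frame attention
        # decode only the centre (third) frame, using its encoder skip features
        return self.reconstruct(temporal_out[:, 2], f1[2], f2[2], f3[2])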
2. The linear sparse attention Transformer-based video demoiréing method of claim 1, characterized in that: the input of the feature extraction module is five adjacent video frames from the same moiré video, the input video frames being denoted I_t, each of size 3×H×W, t ∈ [1,5]; the module consists of four convolution blocks and three pooling layers, where the convolution blocks are responsible for extracting image features and the pooling layers use 2×2 average pooling to reduce the feature scale; the video frame I_t is input into the first convolution block to obtain a feature map F_t^1 of size C×H×W; F_t^1 is fed into a pooling layer and the second convolution block to obtain a feature map F_t^2; F_t^2 is fed into a pooling layer and the third convolution block to obtain F_t^3; F_t^3 is fed into a pooling layer and the last convolution block to obtain F_t^4; F_t^2, F_t^3 and F_t^4 are of sizes 2C×(H/2)×(W/2), 4C×(H/4)×(W/4) and 8C×(H/8)×(W/8), respectively; each convolution block consists of, in sequence, a convolution layer, an activation layer, a convolution layer and an activation layer; both activation layers use the ReLU activation function, both convolution layers use 3×3 convolution kernels, the first convolution layer changes the number of channels, and the second convolution layer keeps the number of channels unchanged.
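For illustration only, a minimal sketch of the feature extraction module of claim 2 with the C → 2C → 4C → 8C channel progression implied by the later claims; the base channel count and all names are assumptions.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # conv layer -> ReLU -> conv layer -> ReLU; the first conv changes the channel
    # count, the second keeps it unchanged (claim 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

class FeatureExtractor(nn.Module):
    def __init__(self, base_ch=32):
        super().__init__()
        self.block1 = conv_block(3, base_ch)                 # F^1: C x H x W
        self.block2 = conv_block(base_ch, base_ch * 2)       # F^2: 2C x H/2 x W/2
        self.block3 = conv_block(base_ch * 2, base_ch * 4)   # F^3: 4C x H/4 x W/4
        self.block4 = conv_block(base_ch * 4, base_ch * 8)   # F^4: 8C x H/8 x W/8
        self.pool = nn.AvgPool2d(2)                          # 2x2 average pooling

    def forward(self, frame):                                # frame: (B, 3, H, W)
        f1 = self.block1(frame)
        f2 = self.block2(self.pool(f1))
        f3 = self.block3(self.pool(f2))
        f4 = self.block4(self.pool(f3))
        return f1, f2, f3, f4                                # skip features for the decoder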
3. The linear sparse attention Transformer-based video demoiréing method of claim 2, characterized in that: the spatial Transformer module consists of nine linear sparse attention demoiréing layers and an absolute position code;
wherein the input of the first layer is the feature map F_t^4 of the feature extraction module, the input of each subsequent layer is the output of the previous layer, and the output feature map F_t of the last layer is the final output of the spatial Transformer module; each layer computes the spatial attention of the feature map with linear time complexity;
the absolute position code is a learnable matrix with the same scale as F_t^4, and is parameter-initialized with the Xavier initialization method before training;
the linear sparse attention demoiréing layer consists of, in sequence, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer; the neuron drop probability of both dropout layers is set to 0.1, and both normalization layers use layer normalization; the multilayer perceptron consists of, in sequence, a first fully connected layer, an activation layer and a second fully connected layer, the activation layer using the ReLU activation function; before the input feature map is sent to the spatial self-attention layer, the feature map and the absolute position code are added element by element, and the sum is then sent to the spatial self-attention layer; there is a residual connection between the input feature map prior to the absolute position code and the output of the first dropout layer, and a residual connection between the output of the first normalization layer and the output of the second dropout layer;
the spatial self-attention layer consists of four learnable matrices, namely a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v and a bottleneck matrix W_p; the calculation formulas for this layer are as follows:
Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)
where F_in is the input of the spatial self-attention layer, F_out is the output of the spatial self-attention layer, Q, K and V are the Query, Key and Value matrices, K^T denotes the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot() denotes matrix multiplication; Q, K and W_v are sparse matrices under the constraint of an L2 loss function.
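For illustration only, a minimal sketch of the spatial self-attention layer and of one linear sparse attention demoiréing layer. The attention follows the formulas above, with softmax taken over the last dimension of Q and of K^T so that the cost is linear in the number of spatial tokens; the L2 sparsity constraint on Q, K and W_v is applied through the loss term of claim 6 rather than inside this module. The post-norm ordering, the MLP expansion ratio, and all shapes and names are assumptions.

import torch
import torch.nn as nn

class SpatialLinearAttention(nn.Module):
    # linear-complexity spatial self-attention:
    # H = Softmax(Q) . (Softmax(K^T) . V), F_out = W_p . H
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # Query weight matrix W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # Key weight matrix W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # Value weight matrix W_v
        self.w_p = nn.Linear(dim, dim, bias=False)   # bottleneck matrix W_p

    def forward(self, x):                            # x: (B, N, dim), N = H*W spatial tokens
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        q = q.softmax(dim=-1)                        # softmax over the last dimension of Q
        k = k.transpose(1, 2).softmax(dim=-1)        # softmax over the last dimension of K^T
        context = torch.matmul(k, v)                 # (B, dim, dim): cost linear in N
        return self.w_p(torch.matmul(q, context))    # (B, N, dim)

class LinearSparseAttnLayer(nn.Module):
    # one demoireing layer: self-attention -> dropout -> layer norm -> MLP -> dropout -> layer norm
    def __init__(self, dim, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.attention = SpatialLinearAttention(dim)
        self.drop1, self.norm1 = nn.Dropout(drop), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                    # fully connected -> ReLU -> fully connected
            nn.Linear(dim, dim * mlp_ratio), nn.ReLU(inplace=True),
            nn.Linear(dim * mlp_ratio, dim))
        self.drop2, self.norm2 = nn.Dropout(drop), nn.LayerNorm(dim)

    def forward(self, x, pos):                       # x, pos: (B, N, dim)
        # the absolute position code is added element-wise only on the attention path;
        # the first residual comes from the input feature map itself
        h = self.norm1(x + self.drop1(self.attention(x + pos)))
        # the second residual comes from the output of the first normalization layer
        return self.norm2(h + self.drop2(self.mlp(h)))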
4. The linear sparse attention Transformer-based video demoiréing method of claim 3, characterized in that:
the input of the temporal Transformer module is the final output F_t of the spatial Transformer module; the module consists of four temporal attention demoiréing layers, an absolute position code and an absolute time code;
the absolute position code shares parameters with the absolute position code of the spatial Transformer module;
the absolute time code is a learnable matrix of scale 5×8C×1, and is parameter-initialized with the Xavier initialization method before training;
the input of the first temporal attention demoiréing layer is F_t of the five video frames, the input of each subsequent layer is the output of the previous layer, and the output feature map of the last layer for the t-th frame is the final output of the temporal Transformer module for the t-th frame;
the temporal attention demoiréing layer consists of, in sequence, a temporal self-attention layer, a dropout layer, a normalization layer, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer;
the neuron drop probability of all three dropout layers is set to 0.1, all three normalization layers use layer normalization, the multilayer perceptron consists of, in sequence, a fully connected layer, an activation layer and a fully connected layer, and the activation layer uses the ReLU activation function;
the structure of the spatial self-attention layer is the same as that of the spatial self-attention layer in the linear sparse attention demoiréing layer;
before the feature maps are input into the temporal self-attention layer, the feature maps of the five input video frames are first concatenated along the time dimension, the concatenated feature map is added element by element to the absolute time code, and the result is then sent to the temporal self-attention layer; before the feature maps are input into the spatial self-attention layer, the concatenated feature map is split back into individual video frames and the absolute position code of the spatial Transformer is added; there is a residual connection between the input feature map prior to the absolute time code and the output of the first dropout layer, a residual connection between the feature map prior to the absolute position code and the output of the second dropout layer, and a residual connection between the output of the second normalization layer and the output of the third dropout layer;
the temporal self-attention layer consists of four learnable matrices, namely a Query weight matrix W'_q, a Key weight matrix W'_k, a Value weight matrix W'_v and a bottleneck matrix W'_p; the calculation formulas for this layer are as follows:
Q_t = Dot(W'_q, F_in^t)
K_t = Dot(W'_k, F_in^t)
V_t = Dot(W'_v, F_in^t)
K_a = [K_1, K_2, …, K_5]
V_a = [V_1, V_2, …, V_5]
H_t(i,j) = Dot(Softmax(Dot(Q_t(i,j), (K_a(i,j))^T)), V_a(i,j))
F_out = Dot(W'_p, H)
where t denotes the t-th frame, t ∈ [1,5]; F_in^t is the input feature belonging to the t-th frame in the temporal self-attention layer; F_out is the output of the temporal self-attention layer; Q_t, K_t and V_t are respectively the Query, Key and Value matrices belonging to the t-th frame; Softmax() denotes the softmax computation over the last dimension of the matrix; Dot() denotes matrix multiplication; the superscript T denotes matrix transposition; [·] denotes the operation of composing matrices; K_a and V_a are respectively the Key and Value matrices of the five frames; H is the complete attention feature map of the temporal self-attention layer; (i, j) denotes the position of a feature, i.e., the feature map is divided into multiple non-overlapping 2×2 patches and (i, j) denotes the position of the patch to which the feature belongs; H_t(i, j) denotes the local attention feature at position (i, j) in H for the t-th frame, K_a(i, j) denotes the local Key matrix at position (i, j) in K_a, V_a(i, j) denotes the local Value matrix at position (i, j) in V_a, Q_t(i, j) denotes the local Query matrix at position (i, j) in Q_t, and W'_v is a sparse matrix under the constraint of an L2 loss function.
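For illustration only, a minimal sketch of the temporal self-attention of claim 4, assuming features processed as (B, 5, C, H, W) tensors with H and W divisible by the window size; no attention scaling factor is added beyond the claim's formula, and all names are assumptions.

import torch
import torch.nn as nn

class TemporalWindowAttention(nn.Module):
    def __init__(self, dim, window=2):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W'_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W'_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W'_v
        self.w_p = nn.Linear(dim, dim, bias=False)   # bottleneck W'_p
        self.window = window

    def forward(self, x):
        # x: (B, 5, C, H, W), already summed with the absolute time code
        b, t, c, h, w = x.shape
        s = self.window
        # partition every frame into non-overlapping s x s windows
        x = x.view(b, t, c, h // s, s, w // s, s)
        x = x.permute(0, 3, 5, 1, 4, 6, 2).reshape(b, (h // s) * (w // s), t * s * s, c)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # every token of frame t attends over the tokens of all five frames located
        # at the same window position (i, j), as in H_t(i, j)
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)
        out = self.w_p(attn @ v)
        # restore the (B, 5, C, H, W) layout
        out = out.view(b, h // s, w // s, t, s, s, c).permute(0, 3, 6, 1, 4, 2, 5)
        return out.reshape(b, t, c, h, w)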
5. The linear sparse attention Transformer-based video demoiréing method of claim 4, characterized in that:
the input of the image reconstruction module is the final output of the temporal Transformer module for the third frame, of size 8C×(H/8)×(W/8); the module consists of three upsampling blocks, three convolution blocks, a 1×1 convolution and a Tanh activation layer; the input is fed into the first upsampling block to obtain a feature map of size 4C×(H/4)×(W/4), which is concatenated channel-wise with the feature map F_3^3 of the feature extraction module and input into the first convolution block; the resulting feature map is fed into the second upsampling block to obtain a feature map of size 2C×(H/2)×(W/2), which is concatenated channel-wise with the feature map F_3^2 of the feature extraction module and input into the second convolution block; the resulting feature map is fed into the third upsampling block to obtain a feature map of size C×H×W, which is concatenated channel-wise with the feature map F_3^1 of the feature extraction module and input into the third convolution block; the resulting feature map is input into the 1×1 convolution and the Tanh activation layer to obtain the output frame image Ô_3, i.e. the de-moiré frame corresponding to the moiré video frame I_3;
the upsampling block consists of, in sequence, an upsampling layer, a convolution layer and an activation layer; the upsampling layer uses bilinear interpolation upsampling with a magnification of 2, the convolution layer uses a 3×3 convolution kernel and changes the number of feature map channels, reducing it to half of the original, and the activation layer uses the ReLU activation function; the convolution block consists of, in sequence, a convolution layer, an activation layer, a convolution layer and an activation layer, where the convolution layers all use 3×3 convolution kernels and the activation layers use the ReLU activation function.
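For illustration only, a minimal sketch of the image reconstruction module, assuming the C/2C/4C/8C channel progression implied by claims 2 and 4; conv_block repeats the convolution-block helper of the feature-extraction sketch, and all names are assumptions.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

def up_block(in_ch):
    # bilinear x2 upsampling -> 3x3 conv halving the channel count -> ReLU
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1),
        nn.ReLU(inplace=True))

class Reconstructor(nn.Module):
    def __init__(self, base_ch=32):
        super().__init__()
        c = base_ch
        self.up1, self.conv1 = up_block(8 * c), conv_block(8 * c, 4 * c)  # concat with F^3 (4C)
        self.up2, self.conv2 = up_block(4 * c), conv_block(4 * c, 2 * c)  # concat with F^2 (2C)
        self.up3, self.conv3 = up_block(2 * c), conv_block(2 * c, c)      # concat with F^1 (C)
        self.head = nn.Sequential(nn.Conv2d(c, 3, kernel_size=1), nn.Tanh())

    def forward(self, t4, f1, f2, f3):       # t4: temporal Transformer output, 8C x H/8 x W/8
        x = self.conv1(torch.cat([self.up1(t4), f3], dim=1))
        x = self.conv2(torch.cat([self.up2(x), f2], dim=1))
        x = self.conv3(torch.cat([self.up3(x), f1], dim=1))
        return self.head(x)                  # de-moire frame for the centre (third) frame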
6. The linear sparse attention Transformer-based video demoiréing method of claim 5, characterized in that the loss function used to train the video demoiréing network is constructed as follows:
the overall optimization objective of the network is:
min(L), L = L_C + L_ASL + L_cr + L_sparse
where min(L) denotes minimizing L, L denotes the total demoiréing loss of the network, L_C denotes the Charbonnier loss between the de-moiré image and the clean image, L_ASL denotes the edge texture loss between the de-moiré image and the clean image, L_cr denotes the color loss between the de-moiré image and the clean image, and L_sparse denotes the loss constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module;
charbonier loss of the Moire-removed image and the clean image
Figure FDA0003687299730000055
The calculation formula of (c) is as follows:
Figure FDA0003687299730000056
wherein,
Figure FDA0003687299730000057
representing the corresponding de-moir e image of the third frame of the five input moir e video frames, O 3 Representing the clean image with which it is paired, e represents a constant of control precision, λ C A weight representing the loss;
the edge texture loss L_ASL between the de-moiré image and the clean image is calculated as follows:
L_ASL = λ_ASL · Σ_* || Sobel_*(Ô_3) − Sobel_*(O_3) ||_1
where ||·||_1 denotes the absolute-value (L1) operation, Sobel_* denotes improved Sobel filters of different orientations, Sobel_*() denotes the corresponding convolution operation, and λ_ASL denotes the weight of this loss;
the color loss L_cr between the de-moiré image and the clean image is calculated as follows:
L_cr = λ_cr · || G(Ô_3) − G(O_3) ||_2^2
where G denotes a Gaussian blur kernel, G(Ô_3) denotes the blurred de-moiré image, G(O_3) denotes the blurred clean image, ||·||_2^2 denotes the squared two-norm operation, and λ_cr denotes the weight of this loss;
the loss L_sparse constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module is calculated as follows:
L_sparse = λ_sparse · ( Σ|Q^*|^2 + Σ|K^*|^2 + Σ|W_v^*|^2 + Σ|W_v'^*|^2 )
where Q^* and K^* denote the Query and Key matrices computed in all spatial self-attention layers of the spatial Transformer module, W_v^* denotes the Value weight matrices of all spatial self-attention layers in the spatial Transformer module and the temporal Transformer module, W_v'^* denotes the Value weight matrices of all temporal self-attention layers in the temporal Transformer module, |·|^2 denotes the squared absolute-value operation, and λ_sparse denotes the weight of this loss.
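For illustration only, a hedged sketch of the four loss terms of claim 6. Standard 3×3 Sobel kernels stand in for the patent's improved directional Sobel filters, the Gaussian kernel size, the loss weights and the mean reductions are assumptions, and the sparse term simply sums the squared entries of the tensors passed in for Q^*, K^*, W_v^* and W_v'^*.

import torch
import torch.nn.functional as F

def charbonnier(pred, target, eps=1e-3):
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def sobel_edges(img):
    # two standard 3x3 Sobel kernels applied across channels (stand-ins for the
    # patent's improved directional Sobel filters)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    k = torch.stack([kx, ky]).unsqueeze(1).repeat(1, img.size(1), 1, 1) / img.size(1)
    return F.conv2d(img, k, padding=1)

def gaussian_blur(img, ksize=11, sigma=3.0):
    ax = torch.arange(ksize, device=img.device).float() - ksize // 2
    g1d = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    g2d = torch.outer(g1d, g1d)
    g2d = (g2d / g2d.sum()).view(1, 1, ksize, ksize).repeat(img.size(1), 1, 1, 1)
    return F.conv2d(img, g2d, padding=ksize // 2, groups=img.size(1))   # depthwise blur

def total_loss(pred, target, q, k, w_v, w_v_t,
               lam_c=1.0, lam_asl=0.25, lam_cr=1.0, lam_sparse=1e-4):
    l_c = lam_c * charbonnier(pred, target)
    l_asl = lam_asl * (sobel_edges(pred) - sobel_edges(target)).abs().mean()
    l_cr = lam_cr * (gaussian_blur(pred) - gaussian_blur(target)).pow(2).mean()
    l_sparse = lam_sparse * sum(m.pow(2).sum() for m in (q, k, w_v, w_v_t))
    return l_c + l_asl + l_cr + l_sparse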
7. The linear sparse attention Transformer-based video demoiréing method of claim 6, characterized in that: the training set used to train the linear sparse attention Transformer-based video demoiréing network is obtained by processing the videos in an original data set into moiré video and clean video pairs, decoding each video into video frames, and preprocessing every frame of each video to obtain the moiré video frames and clean video frames, specifically comprising the following steps:
step S11, obtaining an original data set in which each moiré video and its corresponding clean video have the same size and the same content and correspond to each other one-to-one; videos with the same content form moiré video and clean video pairs, and each frame within a pair corresponds one-to-one;
step S12, randomly flipping all video frames of each moiré video and clean video pair in the same way, then randomly cropping the moiré video frames and clean video frames to H×W while keeping the correspondence between the moiré video and the clean video;
step S13, normalizing all moiré video frames and clean video frames of size H×W: given a frame image I(i, j), the normalized frame image Î(i, j) is computed by a pixel-wise normalization formula, where (i, j) denotes the position of a pixel.
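For illustration only, a sketch of the preprocessing of steps S12-S13. The crop size and the normalization of pixel values to [-1, 1] (chosen to match the Tanh output of the reconstruction module) are assumptions; the patent specifies only that frames are flipped, cropped to H×W and normalized.

import random
import torch

def preprocess_pair(moire_frames, clean_frames, crop_h=256, crop_w=256):
    # moire_frames, clean_frames: tensors of shape (T, 3, H, W) with values in [0, 255]
    if random.random() < 0.5:                       # identical random horizontal flip
        moire_frames = torch.flip(moire_frames, dims=[-1])
        clean_frames = torch.flip(clean_frames, dims=[-1])
    _, _, h, w = moire_frames.shape                 # identical random crop to crop_h x crop_w
    top, left = random.randint(0, h - crop_h), random.randint(0, w - crop_w)
    moire_frames = moire_frames[..., top:top + crop_h, left:left + crop_w]
    clean_frames = clean_frames[..., top:top + crop_h, left:left + crop_w]
    normalize = lambda x: x / 127.5 - 1.0           # assumed pixel-wise normalization to [-1, 1]
    return normalize(moire_frames), normalize(clean_frames)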
8. The linear sparse attention Transformer-based video demoiréing method of claim 7, characterized in that the training process specifically comprises the following steps:
step S41, randomly selecting a pair of moiré video and clean video from the training set, then randomly selecting five adjacent video frames from the moiré video together with the five corresponding video frames from the corresponding clean video, the moiré video frames and clean video frames being denoted I_t and O_t respectively, t ∈ [1,5];
step S42, the first training stage of the linear sparse attention Transformer-based video demoiréing network: the five video frames I_t are input, the network computes the de-moiré intermediate frame Ô_3, the total demoiréing loss L is calculated, the gradient of each parameter in the network is computed by back-propagation, and the parameters are updated with the Adam optimization method, the learning rate being decayed from 10^-4 to 10^-5;
step S43, the second training stage of the linear sparse attention Transformer-based video demoiréing network: the input and training method are the same as in the first stage, except that the color loss weight λ_cr in the total demoiréing loss L is set to 0 and the learning rate is decayed from 10^-5 to 10^-6 for fine-tuning of the network.
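For illustration only, a sketch of the two-stage schedule of steps S42-S43, assuming a model returning the de-moiré centre frame, a data loader yielding (five moiré frames, clean third frame) pairs, the total_loss helper sketched after claim 6, and an assumed model.sparse_tensors() accessor for the L2-constrained matrices; the epoch counts and the cosine decay of the learning rate are assumptions.

import torch

def train_stage(model, train_loader, epochs, lr_start, lr_end, color_weight):
    opt = torch.optim.Adam(model.parameters(), lr=lr_start)
    total_steps = epochs * len(train_loader)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=lr_end)
    for _ in range(epochs):
        for moire_frames, clean_frame in train_loader:        # (B, 5, 3, H, W), (B, 3, H, W)
            pred = model(moire_frames)                        # de-moire frame for the third input
            q, k, w_v, w_v_t = model.sparse_tensors()         # assumed accessor for sparse matrices
            loss = total_loss(pred, clean_frame, q, k, w_v, w_v_t, lam_cr=color_weight)
            opt.zero_grad(); loss.backward(); opt.step(); sched.step()

# stage 1: full loss, learning rate decayed from 1e-4 to 1e-5
# train_stage(model, loader, epochs=50, lr_start=1e-4, lr_end=1e-5, color_weight=1.0)
# stage 2: color loss weight set to 0, learning rate decayed from 1e-5 to 1e-6
# train_stage(model, loader, epochs=10, lr_start=1e-5, lr_end=1e-6, color_weight=0.0)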
9. The linear sparse attention Transformer-based video demoiréing method of claim 8, characterized in that the specific operation of removing moiré from an input video after training is completed is as follows: for a new moiré video, two blank frames are inserted at the beginning of the video and two at the end; the first five frames are taken and fed into the network, yielding the de-moiré video frame corresponding to the first frame of the original moiré video; the second to sixth frames are then fed into the network, yielding the de-moiré video frame corresponding to the second frame of the original moiré video; the same operation is repeated until the de-moiré video frames corresponding to all frames of the original moiré video have been obtained.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the linear sparse attention Transformer-based video demoiréing method as recited in any one of claims 1-9.
CN202210649880.XA 2022-06-10 2022-06-10 Video Moire removing method based on linear sparse attention Transformer Pending CN114881888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210649880.XA CN114881888A (en) Video Moire removing method based on linear sparse attention Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210649880.XA CN114881888A (en) Video Moire removing method based on linear sparse attention Transformer

Publications (1)

Publication Number Publication Date
CN114881888A true CN114881888A (en) 2022-08-09

Family

ID=82680890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210649880.XA Pending CN114881888A (en) 2022-06-10 2022-06-10 Video Moire removing method based on linear sparse attention transducer

Country Status (1)

Country Link
CN (1) CN114881888A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021134874A1 (en) * 2019-12-31 2021-07-08 深圳大学 Training method for deep residual network for removing a moire pattern of two-dimensional code
CN111539884A (en) * 2020-04-21 2020-08-14 温州大学 Neural network video deblurring method based on multi-attention machine mechanism fusion
CN112598602A (en) * 2021-01-06 2021-04-02 福建帝视信息科技有限公司 Mask-based method for removing Moire of deep learning video
CN113065645A (en) * 2021-04-30 2021-07-02 华为技术有限公司 Twin attention network, image processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NIE Kehui; LIU Wenzhe; TONG Tong; DU Min; GAO Qinquan: "Video compression artifact removal algorithm based on adaptive separable convolution kernels", Journal of Computer Applications, no. 05, 10 May 2019 (2019-05-10), pages 233 - 239 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596779A (en) * 2023-04-24 2023-08-15 天津大学 Transform-based Raw video denoising method
CN116596779B (en) * 2023-04-24 2023-12-01 天津大学 Transform-based Raw video denoising method
CN116831581A (en) * 2023-06-15 2023-10-03 中南大学 Remote physiological sign extraction-based driver state monitoring method and system
CN116634209A (en) * 2023-07-24 2023-08-22 武汉能钠智能装备技术股份有限公司 Breakpoint video recovery system and method based on hot plug
CN116634209B (en) * 2023-07-24 2023-11-17 武汉能钠智能装备技术股份有限公司 Breakpoint video recovery system and method based on hot plug
CN117808706A (en) * 2023-12-28 2024-04-02 山东财经大学 Video rain removing method, system, equipment and storage medium
CN117725844A (en) * 2024-02-08 2024-03-19 厦门蝉羽网络科技有限公司 Large model fine tuning method, device, equipment and medium based on learning weight vector
CN117725844B (en) * 2024-02-08 2024-04-16 厦门蝉羽网络科技有限公司 Large model fine tuning method, device, equipment and medium based on learning weight vector

Similar Documents

Publication Publication Date Title
CN114881888A (en) Video Moire removing method based on linear sparse attention Transformer
CN107403415B (en) Compressed depth map quality enhancement method and device based on full convolution neural network
Yu et al. A unified learning framework for single image super-resolution
CN109272452B (en) Method for learning super-resolution network based on group structure sub-band in wavelet domain
CN106709875A (en) Compressed low-resolution image restoration method based on combined deep network
CN114693558A (en) Image Moire removing method and system based on progressive fusion multi-scale strategy
CN113284051A (en) Face super-resolution method based on frequency decomposition multi-attention machine system
CN114723630A (en) Image deblurring method and system based on cavity double-residual multi-scale depth network
Liu et al. Research on super-resolution reconstruction of remote sensing images: A comprehensive review
CN114418853A (en) Image super-resolution optimization method, medium and device based on similar image retrieval
CN108122262B (en) Sparse representation single-frame image super-resolution reconstruction algorithm based on main structure separation
Hai et al. Advanced retinexnet: a fully convolutional network for low-light image enhancement
CN116797456A (en) Image super-resolution reconstruction method, system, device and storage medium
CN113096032B (en) Non-uniform blurring removal method based on image region division
CN117333398A (en) Multi-scale image denoising method and device based on self-supervision
Guo et al. Orthogonally regularized deep networks for image super-resolution
CN115272131B (en) Image mole pattern removing system and method based on self-adaptive multispectral coding
CN110895790A (en) Scene image super-resolution method based on posterior degradation information estimation
CN115760638A (en) End-to-end deblurring super-resolution method based on deep learning
CN115456891A (en) Under-screen camera image restoration method based on U-shaped dynamic network
Ling et al. PRNet: Pyramid Restoration Network for RAW Image Super-Resolution
CN114331853A (en) Single image restoration iteration framework based on target vector updating module
Xu et al. Joint learning of super-resolution and perceptual image enhancement for single image
Xu et al. Swin transformer and ResNet based deep networks for low-light image enhancement
CN117291855B (en) High resolution image fusion method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination