CN114881888A - Video moiré removal method based on linear sparse attention Transformer - Google Patents

Video moiré removal method based on linear sparse attention Transformer

Info

Publication number
CN114881888A
Authority
CN
China
Prior art keywords
layer
video
attention
matrix
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210649880.XA
Other languages
Chinese (zh)
Inventor
牛玉贞
林志华
刘文犀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202210649880.XA
Publication of CN114881888A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a video moiré removal method based on a linear sparse attention Transformer, which trains a video moiré removal network based on the linear sparse attention Transformer and uses the trained network to remove moiré from an input video. The video moiré removal network based on the linear sparse attention Transformer comprises: a feature extraction module, which extracts the features of the video frames; a spatial Transformer module, which uses the spatial attention of the spatial Transformer to capture the positions where moiré appears in a single-frame image and remove it in a targeted manner; a temporal Transformer module, which uses the temporal attention of the temporal Transformer to capture the complementary information existing among multiple frames and restores the image using the complementary information of adjacent frames; and an image reconstruction module, which decodes the video frame features output by the spatial Transformer module and the temporal Transformer module and restores them into demoiréd video frames with the same scale as the input video.

Description

Video moiré removal method based on linear sparse attention Transformer
Technical Field
The invention belongs to the technical field of video processing and computer vision, and particularly relates to a video moiré removal method based on a linear sparse attention Transformer.
Background
With the rapid development of mobile devices and multimedia technology, smartphones have become indispensable tools in daily life, and mobile photography has become increasingly popular as imaging quality has improved. Images and videos are an indispensable part of modern communication and information transmission and are of great significance to many aspects of society. Digital screens are ubiquitous in modern life, such as television screens at home, computers, and large LED screens in public places, and capturing these screens with a mobile phone to quickly store information is common practice; sometimes capturing images and videos is the only practical way to store the information. However, when shooting digital screens, moiré patterns often appear and contaminate the underlying clean images and videos. Moiré is caused by mutual interference between the camera Color Filter Array (CFA) and the sub-pixel layout of the screen, resulting in color-distorted stripes in the captured images and videos that seriously degrade their visual quality. Advances in computer vision and hardware upgrades make it feasible to address this problem, so many researchers have begun to study moiré removal for images, but moiré removal for videos has rarely been studied.
Removing moiré is a challenging task because moiré is irregular in shape and color and spans both low and high frequencies. Unlike other image and video restoration tasks, such as image or video denoising, image demosaicing and image or video super-resolution, the moiré removal task must cope with complex low-frequency and high-frequency moiré fringes while also restoring the details of images and videos; at the same time, moiré fringes also cause color distortion in the captured images. Moiré formation is closely related to the camera imaging process, especially the frequency of the Color Filter Array (CFA), so many methods have been proposed that aim to improve the imaging pipeline to eliminate moiré. However, these methods have high computational complexity and are not suitable for practical application. In 2018, Sun et al. created a large-scale moiré removal benchmark, the TIP2018 dataset, containing over one hundred thousand image pairs, and proposed a novel multi-resolution fully convolutional network to remove moiré, which greatly promoted the development of the image moiré removal task. Compared with image moiré removal, removing moiré from video is more difficult: simply removing moiré frame by frame introduces artifacts and flicker into the video, the temporal coherence between frames cannot be guaranteed, and the resulting performance is unsatisfactory. Therefore, a new method for the video moiré removal task is urgently needed.
Removing moiré from video has important practical significance: given the huge number of digital videos, removing moiré manually would consume enormous labor and time. A video moiré removal algorithm solves exactly this problem: developers only need a trained video moiré removal network to automatically remove the moiré in a video, avoiding repetitive labor and saving a large amount of time. However, since the video moiré removal task has rarely been studied and cannot be solved by simply applying image moiré removal methods, it still remains to be investigated.
Disclosure of Invention
In order to fill this gap and overcome the deficiencies of the prior art, the invention provides a video moiré removal method based on a linear sparse attention Transformer, which achieves high-quality video moiré removal with a specially designed video moiré removal network based on the linear sparse attention Transformer.
The invention specifically adopts the following technical scheme:
a video Moire removing method based on linear sparse attention transducer is characterized by comprising the following steps: training a video Moire removing network based on a linear sparse attention transducer to remove Moire of an input video after training is finished;
the video degranulation network based on the linear sparse attention Transformer comprises:
the characteristic extraction module is used for extracting the characteristics of the video frames;
the spatial Transformer module is used for capturing the positions with Moire patterns in a single-frame image by using the spatial attention of the spatial Transformer and removing key points;
the time Transformer module captures complementary information existing among the images of the multiple frames by using the time attention of the time Transformer and carries out image recovery by using the complementary information of the adjacent frames;
and the image reconstruction module is used for decoding the video frame characteristics passing through the space Transformer module and the time Transformer module and recovering the video frame characteristics into the Moire-removed video frame with the same scale as the input video.
Furthermore, the input of the feature extraction module is five adjacent video frames from the same moiré video, where an input video frame is denoted by I_t with size 3×H×W, t ∈ [1,5]; the module consists of four convolution blocks and three pooling layers, the convolution blocks extract image features, and the pooling layers use 2×2 average pooling to reduce the feature scale. Video frame I_t is input into the first convolution block to obtain a feature map F_t^1 of size C×H×W; F_t^1 is fed into a pooling layer and the second convolution block to obtain a feature map F_t^2 of size 2C×(H/2)×(W/2); similarly, F_t^2 is fed into a pooling layer and the third convolution block to obtain F_t^3, and F_t^3 is fed into a pooling layer and the last convolution block to obtain F_t^4; F_t^3 and F_t^4 have sizes 4C×(H/4)×(W/4) and 8C×(H/8)×(W/8), respectively. Each convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer; both activation layers use the ReLU activation function, both convolution layers use 3×3 convolution kernels, the first convolution layer changes the number of channels, and the second convolution layer keeps the number of channels unchanged.
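For illustration, the following is a minimal PyTorch sketch of a feature extraction module with this structure; the class names (ConvBlock, FeatureExtractor) and the default base channel count are assumptions made for the example rather than names from the original disclosure, and the input height and width are assumed to be divisible by 8 so that the three pooling steps divide evenly.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> ReLU -> Conv -> ReLU; the first conv changes the channel count."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class FeatureExtractor(nn.Module):
    """Four convolution blocks and three 2x2 average-pooling layers, as described above."""
    def __init__(self, base_channels=32):
        super().__init__()
        C = base_channels
        self.block1 = ConvBlock(3, C)           # F^1: C  x H   x W
        self.block2 = ConvBlock(C, 2 * C)       # F^2: 2C x H/2 x W/2
        self.block3 = ConvBlock(2 * C, 4 * C)   # F^3: 4C x H/4 x W/4
        self.block4 = ConvBlock(4 * C, 8 * C)   # F^4: 8C x H/8 x W/8
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, frame):                   # frame: (N, 3, H, W)
        f1 = self.block1(frame)
        f2 = self.block2(self.pool(f1))
        f3 = self.block3(self.pool(f2))
        f4 = self.block4(self.pool(f3))
        return f1, f2, f3, f4
```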
Further, the spatial Transformer module consists of nine linear sparse attention demoiréing layers and one absolute position encoding;
wherein the input of the first layer is the feature map F_t^4 from the feature extraction module, the input of each subsequent layer is the output of the previous layer, and the output feature map F_t of the last layer is the final output of the spatial Transformer module; each layer computes the spatial attention of the feature map with linear time complexity;
the absolute position encoding is a learnable matrix with the same scale as F_t^4, and its parameters are initialized with the Xavier initialization method before training;
the linear sparse attention demoiréing layer consists, in sequence, of a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer; both dropout layers use a neuron drop probability of 0.1, and both normalization layers use layer normalization; the multilayer perceptron consists, in sequence, of a first fully connected layer, an activation layer and a second fully connected layer, where the activation layer uses the ReLU activation function; before the input feature map is sent to the spatial self-attention layer, the feature map and the absolute position encoding are added element by element, and the result is sent to the spatial self-attention layer; there is a residual connection between the input feature map (before the absolute position encoding is added) and the output of the first dropout layer, and a residual connection between the output of the first normalization layer and the output of the second dropout layer;
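As one possible reading of this layer layout, the sketch below wires a generic spatial attention operator into the sequence attention, dropout, add and layer-normalize, MLP, dropout, add and layer-normalize, with the learnable absolute position encoding added only to the attention input; the flattening of the feature map into a token sequence and all class names are assumptions, and the attention argument can be, for example, the linear spatial self-attention sketched after the formulas below.

```python
import torch
import torch.nn as nn

class LinearSparseAttentionLayer(nn.Module):
    """One demoiréing layer: spatial self-attention + dropout + LayerNorm,
    then an MLP + dropout + LayerNorm, each wrapped with a residual connection."""
    def __init__(self, dim, attention, drop=0.1, mlp_ratio=4):
        super().__init__()
        self.attn = attention                  # any module mapping (N, L, dim) -> (N, L, dim)
        self.drop1 = nn.Dropout(drop)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.ReLU(inplace=True),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.drop2 = nn.Dropout(drop)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens, pos_embed):
        # tokens: (N, L, dim) flattened spatial positions; pos_embed: (1, L, dim), Xavier-initialized
        x = self.norm1(tokens + self.drop1(self.attn(tokens + pos_embed)))
        x = self.norm2(x + self.drop2(self.mlp(x)))
        return x
```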
the spatial self-attention layer consists of four learnable matrices, namely a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v and a bottleneck matrix W_p; the calculation of this layer is as follows:
Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)
where F_in is the input of the spatial self-attention layer, F_out is the output of the spatial self-attention layer, Q, K and V are the Query, Key and Value matrices respectively, K^T is the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot(·) denotes matrix multiplication; Q, K and W_v are sparse matrices under the constraint of an L2 loss function.
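The formulas above replace the usual Softmax(Q K^T) V with Softmax(Q) · (Softmax(K^T) · V), so the cost of attention grows linearly rather than quadratically with the number of spatial positions. A minimal sketch of that computation follows; treating W_q, W_k, W_v and W_p as linear layers over flattened spatial tokens, and the choice of softmax axes (feature axis for Q, position axis for K, the common efficient-attention convention), are assumptions, and the L2 sparsity constraint is applied through the loss rather than inside this module.

```python
import torch
import torch.nn as nn

class LinearSpatialSelfAttention(nn.Module):
    """Linear-complexity attention: Softmax(Q) @ (Softmax(K)^T @ V)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # Query weight matrix W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # Key weight matrix W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # Value weight matrix W_v
        self.w_p = nn.Linear(dim, dim, bias=False)   # bottleneck matrix W_p

    def forward(self, f_in):                         # f_in: (N, L, dim), L = H*W positions
        q = self.w_q(f_in).softmax(dim=-1)           # softmax over the feature dimension of Q
        k = self.w_k(f_in).softmax(dim=1)            # softmax over positions (columns of K^T)
        v = self.w_v(f_in)
        context = k.transpose(1, 2) @ v              # (N, dim, dim): cost O(L * dim^2)
        h = q @ context                              # (N, L, dim)
        return self.w_p(h)
```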
Further, the input of the temporal Transformer module is the final output F_t of the spatial Transformer module; the module consists of four temporal attention demoiréing layers, one absolute position encoding and one absolute time encoding;
the absolute position encoding shares parameters with the absolute position encoding of the spatial Transformer module;
the absolute time encoding is a learnable matrix with scale 5×8C×1, whose parameters are initialized with the Xavier initialization method before training;
the input of the first temporal attention demoiréing layer is the features F_t of the five video frames, the input of each subsequent layer is the output of the previous layer, and the output feature map F̂_t of the last layer for the t-th frame is the final output of the temporal Transformer module for the t-th frame;
the temporal attention demoiréing layer consists, in sequence, of a temporal self-attention layer, a dropout layer, a normalization layer, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer;
all three dropout layers use a neuron drop probability of 0.1, all three normalization layers use layer normalization, the multilayer perceptron consists, in sequence, of a fully connected layer, an activation layer and a fully connected layer, and the activation layer uses the ReLU activation function;
the structure of the spatial self-attention layer is the same as that of the spatial self-attention layer in the linear sparse attention demoiréing layer;
before the feature maps are input into the temporal self-attention layer, the feature maps of the five input video frames are first concatenated along the temporal dimension, the concatenated feature map is added element by element to the absolute time encoding, and the result is sent to the temporal self-attention layer; before being input into the spatial self-attention layer, the concatenated feature map must be split back into per-frame feature maps and the absolute position encoding of the spatial Transformer added; there is a residual connection between the input feature map (before the absolute time encoding is added) and the output of the first dropout layer, a residual connection between the feature map (before the absolute position encoding is added) and the output of the second dropout layer, and a residual connection between the output of the second normalization layer and the output of the third dropout layer;
the temporal self-attention layer consists of four learnable matrices, namely a Query weight matrix W'_q, a Key weight matrix W'_k, a Value weight matrix W'_v and a bottleneck matrix W'_p; the calculation of this layer is as follows:
Q_t = Dot(W'_q, F_in^t)
K_t = Dot(W'_k, F_in^t)
V_t = Dot(W'_v, F_in^t)
K_a = [K_1, K_2, ..., K_5]
V_a = [V_1, V_2, ..., V_5]
H_t(i,j) = Dot(Softmax(Dot(Q_t(i,j), (K_a(i,j))^T)), V_a(i,j))
F_out = Dot(W'_p, H)
where t denotes the t-th frame, t ∈ [1,5], F_in^t is the input feature belonging to the t-th frame in the temporal self-attention layer, F_out is the output of the temporal self-attention layer, Q_t, K_t and V_t are the Query, Key and Value matrices belonging to the t-th frame respectively, Softmax(·) denotes the softmax calculation over the last dimension of the matrix, Dot(·) denotes matrix multiplication, the superscript T denotes matrix transpose, [·] denotes composing matrices into one matrix, K_a and V_a are the Key and Value matrices of the five frames respectively, H is the complete attention feature map of the temporal self-attention layer, and (i,j) denotes the position of a feature, that is, the feature map is divided into non-overlapping 2×2 patches and (i,j) denotes the position of the patch in which the feature lies; H_t(i,j) denotes the local attention feature at position (i,j) in H for the t-th frame, K_a(i,j) denotes the local Key matrix at position (i,j) in K_a, V_a(i,j) denotes the local Value matrix at position (i,j) in V_a, Q_t(i,j) denotes the local Query matrix at position (i,j) in Q_t, and W'_v is a sparse matrix under the constraint of an L2 loss function.
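To make the window-based temporal attention concrete, the sketch below splits each frame's feature map into non-overlapping 2×2 patches and, for every patch position (i, j), lets the queries of each frame attend to the keys and values gathered from the same patch position in all five frames; the tensor layout and class name are assumptions, and H and W are assumed to be even.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Per-2x2-window attention of each frame's queries over keys/values of all T frames."""
    def __init__(self, dim, window=2):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W'_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W'_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W'_v
        self.w_p = nn.Linear(dim, dim, bias=False)   # W'_p
        self.window = window

    def forward(self, feats):                        # feats: (N, T, C, H, W), T = 5 frames
        n, t, c, h, w = feats.shape
        s = self.window
        # split each frame into (H/s)*(W/s) non-overlapping s x s windows
        x = feats.view(n, t, c, h // s, s, w // s, s)
        x = x.permute(0, 3, 5, 1, 4, 6, 2)           # (N, H/s, W/s, T, s, s, C)
        x = x.reshape(n, (h // s) * (w // s), t, s * s, c)
        q = self.w_q(x)                              # per-frame, per-window queries
        k = self.w_k(x).flatten(2, 3)                # keys of all T frames in the window
        v = self.w_v(x).flatten(2, 3)                # values of all T frames in the window
        attn = torch.softmax(q.flatten(2, 3) @ k.transpose(-1, -2), dim=-1)
        out = self.w_p(attn @ v)                     # (N, nWin, T*s*s, C)
        out = out.reshape(n, h // s, w // s, t, s, s, c).permute(0, 3, 6, 1, 4, 2, 5)
        return out.reshape(n, t, c, h, w)
```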
Further, the input of the image reconstruction module is the final output F̂_3 of the temporal Transformer module for the third frame, with size 8C×(H/8)×(W/8); the module consists of three upsampling blocks, three convolution blocks, a 1×1 convolution and a Tanh activation layer. F̂_3 is input into the first upsampling block to obtain a feature map of size 4C×(H/4)×(W/4), which is concatenated channel-wise with the feature map F_3^3 of the feature extraction module and input into the first convolution block; the resulting feature map is input into the second upsampling block to obtain a feature map of size 2C×(H/2)×(W/2), which is concatenated channel-wise with the feature map F_3^2 of the feature extraction module and input into the second convolution block; the resulting feature map is input into the third upsampling block to obtain a feature map of size C×H×W, which is concatenated channel-wise with the feature map F_3^1 of the feature extraction module and input into the third convolution block; finally, the resulting feature map is passed through the 1×1 convolution and the Tanh activation layer to obtain the output frame image Î_3, i.e. the demoiréd frame corresponding to moiré video frame I_3;
the upsampling block consists, in sequence, of an upsampling layer, a convolution layer and an activation layer; the upsampling layer uses bilinear interpolation with a magnification of 2, the convolution layer uses a 3×3 convolution kernel and halves the number of feature-map channels, and the activation layer uses the ReLU activation function; the convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer, where both convolution layers use 3×3 convolution kernels and the activation layers use the ReLU activation function.
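A compact sketch of this decoder path follows: each upsampling block is bilinear 2× upsampling, a 3×3 convolution that halves the channel count, and a ReLU, and at each scale the corresponding encoder feature map is concatenated before a convolution block; the helper conv_block and the class names are assumptions introduced for the example.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv -> ReLU -> Conv -> ReLU with 3x3 kernels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UpBlock(nn.Module):
    """Bilinear 2x upsampling -> 3x3 conv halving channels -> ReLU."""
    def __init__(self, in_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class Reconstructor(nn.Module):
    """Decodes the temporal Transformer output for the middle frame back to image space."""
    def __init__(self, C):
        super().__init__()
        self.up1, self.conv1 = UpBlock(8 * C), conv_block(8 * C, 4 * C)   # concat with F^3 (4C)
        self.up2, self.conv2 = UpBlock(4 * C), conv_block(4 * C, 2 * C)   # concat with F^2 (2C)
        self.up3, self.conv3 = UpBlock(2 * C), conv_block(2 * C, C)       # concat with F^1 (C)
        self.head = nn.Sequential(nn.Conv2d(C, 3, kernel_size=1), nn.Tanh())

    def forward(self, f_hat3, f1, f2, f3):           # f_hat3: (N, 8C, H/8, W/8)
        x = self.conv1(torch.cat([self.up1(f_hat3), f3], dim=1))
        x = self.conv2(torch.cat([self.up2(x), f2], dim=1))
        x = self.conv3(torch.cat([self.up3(x), f1], dim=1))
        return self.head(x)                          # demoiréd frame in [-1, 1]
```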
Further, the loss function used to train the video moiré removal network is constructed as follows:
the overall optimization objective of the network is:
min(L), L = L_C + L_ASL + L_cr + L_sparse
where min(L) means minimizing L, L denotes the total moiré removal loss, L_C denotes the Charbonnier loss between the demoiréd image and the clean image, L_ASL denotes the edge texture loss between the demoiréd image and the clean image, L_cr denotes the color loss between the demoiréd image and the clean image, and L_sparse denotes the loss that constrains the sparse matrices in the spatial Transformer module and the temporal Transformer module;
the Charbonnier loss L_C between the demoiréd image and the clean image is calculated as follows:
L_C = λ_C · sqrt(||Î_3 - O_3||^2 + ε^2)
where Î_3 denotes the demoiréd image corresponding to the third frame of the five input moiré video frames, O_3 denotes the clean image paired with it, ε is a constant controlling precision, and λ_C is the weight of this loss;
the edge texture loss L_ASL between the demoiréd image and the clean image is calculated as follows:
L_ASL = λ_ASL · Σ ||Sobel_*(Î_3) - Sobel_*(O_3)||_1
where ||·||_1 is the absolute-value (L1) operation, Sobel_* denotes improved Sobel filters with different orientations, Sobel_*(·) denotes the convolution operation with these filters, the sum runs over the different orientations, and λ_ASL is the weight of this loss;
the color loss L_cr between the demoiréd image and the clean image is calculated as follows:
L_cr = λ_cr · ||G(Î_3) - G(O_3)||_2^2
where G denotes a Gaussian blur kernel, G(Î_3) denotes the blurred demoiréd image, G(O_3) denotes the blurred clean image, ||·||_2^2 is the squared two-norm operation, and λ_cr is the weight of this loss;
the loss L_sparse constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module is calculated as follows:
L_sparse = λ_sparse · (|Q_*|^2 + |K_*|^2 + |W_v^*|^2 + |W'_v^*|^2)
where Q_* and K_* denote the Query matrices and Key matrices calculated in all spatial self-attention layers of the spatial Transformer module, W_v^* denotes the Value weight matrices of all spatial self-attention layers in the spatial Transformer module and the temporal Transformer module, W'_v^* denotes the Value weight matrices of all temporal self-attention layers in the temporal Transformer module, |·|^2 is the squared absolute-value operation, and λ_sparse is the weight of this loss.
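The sketch below assembles the four terms into one training loss under stated assumptions: a standard Charbonnier term, an edge term using plain horizontal and vertical Sobel kernels as stand-ins for the unspecified improved Sobel filters, a color term on Gaussian-blurred images (the blur is passed in as a callable, e.g. torchvision.transforms.GaussianBlur), and an L2 penalty on the matrices constrained to be sparse; all weights and ε are placeholder values.

```python
import torch
import torch.nn.functional as F

def charbonnier(pred, target, eps=1e-3):
    return torch.sqrt((pred - target).pow(2).sum() + eps ** 2)

def sobel_edges(img):
    # plain horizontal/vertical Sobel kernels stand in for the "improved Sobel filters"
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    k = torch.stack([kx, ky]).unsqueeze(1).repeat(1, img.shape[1], 1, 1) / img.shape[1]
    return F.conv2d(img, k, padding=1)

def total_loss(pred, clean, sparse_mats, blur,
               w_char=1.0, w_edge=1.0, w_color=1.0, w_sparse=1e-4):
    """pred/clean: (N, 3, H, W); sparse_mats: iterable of the constrained tensors (Q, K, W_v, W'_v);
    blur: a Gaussian-blur callable, e.g. torchvision.transforms.GaussianBlur(21, sigma=3)."""
    l_char = w_char * charbonnier(pred, clean)
    l_edge = w_edge * (sobel_edges(pred) - sobel_edges(clean)).abs().sum()
    l_color = w_color * (blur(pred) - blur(clean)).pow(2).sum()
    l_sparse = w_sparse * sum(m.pow(2).sum() for m in sparse_mats)
    return l_char + l_edge + l_color + l_sparse
```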
Further, before training the video moiré removal network based on the linear sparse attention Transformer, the videos in the original dataset are processed to obtain moiré video and clean video pairs, each video is decoded into video frames, and each frame image is preprocessed to obtain the training set; this specifically comprises the following steps:
step S11, obtaining an original dataset in which each moiré video and its corresponding clean video have the same size and the same content and correspond to each other one by one; videos originating from the same video content form a moiré video and clean video pair, and each frame in the pair is guaranteed to correspond to its counterpart;
step S12, randomly flipping all video frames of each moiré video and clean video pair in the same way, then randomly cropping the moiré video frames and clean video frames to H×W while keeping the correspondence between the moiré video and the clean video;
step S13, normalizing all moiré video frames and clean video frames of size H×W: given a frame image I(i,j), the normalized frame image is computed pixel by pixel, where (i,j) represents the position of the pixel.
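A sketch of the paired preprocessing is given below; applying the identical flip and crop to the moiré frames and the clean frames preserves their correspondence, and the final normalization to [-1, 1] is an assumption chosen to match the Tanh output layer rather than the exact formula used by the method.

```python
import random
import torch

def preprocess_pair(moire_frames, clean_frames, crop_h, crop_w):
    """moire_frames/clean_frames: lists of (3, H, W) uint8 tensors from one video pair.
    Applies the same random flip and random crop to both, then normalizes pixel values."""
    flip_h = random.random() < 0.5
    flip_v = random.random() < 0.5
    _, h, w = moire_frames[0].shape
    top = random.randint(0, h - crop_h)
    left = random.randint(0, w - crop_w)

    def transform(frame):
        x = frame[:, top:top + crop_h, left:left + crop_w].float()
        if flip_h:
            x = torch.flip(x, dims=[2])
        if flip_v:
            x = torch.flip(x, dims=[1])
        return x / 127.5 - 1.0          # assumed normalization to [-1, 1]

    return [transform(f) for f in moire_frames], [transform(f) for f in clean_frames]
```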
Further, the training process specifically comprises the following steps:
step S41, randomly selecting a moiré video and clean video pair from the training set, then randomly selecting five adjacent video frames from the moiré video and simultaneously selecting the five corresponding frames from the clean video; the moiré video frames and clean video frames are denoted I_t and O_t respectively, t ∈ [1,5];
step S42, the first training stage of the video moiré removal network based on the linear sparse attention Transformer: the five video frames I_t are input, the demoiréd intermediate frame Î_3 is obtained through the network, the total moiré removal loss L is calculated, the gradient of each parameter in the network is computed by back propagation, and the parameters are updated with the Adam optimization method, with the learning rate decaying from 10^-4 to 10^-5;
step S43, the second training stage of the video moiré removal network based on the linear sparse attention Transformer: the input and training method are the same as in the first stage, except that the color loss weight λ_cr in the total moiré removal loss L is set to 0 and the learning rate decays from 10^-5 to 10^-6 for fine-tuning of the network.
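A schematic of the two-stage schedule follows; the linear learning-rate decay, the batch assembly and the function names are assumptions wrapped around the stated settings (Adam, 10^-4 down to 10^-5 with the full loss in stage one, 10^-5 down to 10^-6 with the color weight set to 0 in stage two).

```python
import torch

def train(model, sample_batch, compute_loss, stage1_iters, stage2_iters):
    """sample_batch() -> (five moiré frames, clean middle frame); compute_loss(pred, clean, w_color)
    wraps the total loss above. The linear learning-rate schedule is an assumption."""
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)

    def run_stage(n_iters, lr_start, lr_end, w_color):
        for it in range(n_iters):
            lr = lr_start + (lr_end - lr_start) * it / max(1, n_iters - 1)
            for group in optim.param_groups:
                group["lr"] = lr
            frames, clean = sample_batch()
            pred = model(frames)                 # demoiréd middle frame
            loss = compute_loss(pred, clean, w_color)
            optim.zero_grad()
            loss.backward()
            optim.step()

    run_stage(stage1_iters, 1e-4, 1e-5, w_color=1.0)   # stage one: full loss
    run_stage(stage2_iters, 1e-5, 1e-6, w_color=0.0)   # stage two: color loss weight set to 0
```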
Further, the specific operation of removing moiré from an input video after training is completed is as follows: for a new moiré video, two blank frames are inserted at the beginning and at the end of the video respectively; the first five frames of the padded video are taken and input into the network to obtain the demoiréd video frame corresponding to the first frame of the original moiré video; then the second to sixth frames are taken and input into the network to obtain the demoiréd video frame corresponding to the second frame of the original moiré video; the same operation is applied subsequently until the demoiréd video frames corresponding to all frames of the moiré video have been obtained.
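The padding-and-sliding-window inference described above can be sketched as follows; treating the inserted blank frames as all-zero tensors is an assumption.

```python
import torch

@torch.no_grad()
def demoire_video(model, frames):
    """frames: list of (3, H, W) normalized moiré frames; returns one demoiréd frame per input."""
    blank = torch.zeros_like(frames[0])
    padded = [blank, blank] + list(frames) + [blank, blank]   # two blank frames at each end
    outputs = []
    for i in range(len(frames)):
        window = torch.stack(padded[i:i + 5]).unsqueeze(0)    # (1, 5, 3, H, W), centered on frame i
        outputs.append(model(window).squeeze(0))              # demoiréd frame for original frame i
    return outputs
```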
An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video moiré removal method based on the linear sparse attention Transformer as described above when executing the program.
A computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the video moiré removal method based on the linear sparse attention Transformer as described above.
Compared with the prior art, the invention and the optimized scheme thereof have the following beneficial effects:
the spatial attention and the temporal attention are used, the spatial information and the temporal information between different frames are effectively utilized to remove Moire and supplement details, and artifacts and flicker are prevented from appearing in the video. The method uses a linear attention computing mode, can effectively reduce the square time complexity of the original Transformer attention computing mode into the linear time complexity, greatly reduces the computing amount of the network, and improves the practical application effect of the network. Meanwhile, a loss function constraint calculation matrix is adopted as a sparse matrix in the linear attention calculation process, so that a more effective and stable moire removing effect is realized. The method can remove the moire in the video under low calculation complexity, generate the high-quality clean video without moire, and improve the visual effect and the performance index of the generated video without moire. Therefore, the invention has strong practicability and wide application prospect.
Drawings
Fig. 1 is a schematic flow chart of implementation of the embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a video moiré removal network according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a spatial Transformer module according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a temporal Transformer module according to an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, this embodiment further details the scheme of the present invention, in the form of steps, with a specific operational example implementing the video moiré removal method based on the linear sparse attention Transformer.
The method can be specifically summarized as the following steps:
Step S1, processing the videos in the original dataset to obtain moiré video and clean video pairs, decoding each video into video frames, and preprocessing each frame image to obtain the training set; step S1 specifically includes the following steps:
Step S11, acquiring an original dataset in which each moiré video and its corresponding clean video have the same size and the same content and correspond to each other one by one; videos originating from the same video content form a moiré video and clean video pair, and each frame in the pair is guaranteed to correspond to its counterpart.
Step S12, randomly flipping all video frames of each moiré video and clean video pair in the same way, then randomly cropping the moiré video frames and clean video frames to H×W while keeping the correspondence between the moiré video and the clean video;
Step S13, normalizing all moiré video frames and clean video frames of size H×W: given a frame image I(i,j), the normalized frame image is computed pixel by pixel, where (i,j) represents the position of the pixel.
Step S2, constructing the video moiré removal network based on the linear sparse attention Transformer. As shown in fig. 2, the video moiré removal network is composed of four parts, namely a feature extraction module, a spatial Transformer module, a temporal Transformer module and an image reconstruction module;
specifically, step S2 includes the following steps:
Step S21, constructing a feature extraction module to extract features of the video frames in preparation for subsequent moiré removal.
As shown in fig. 2, the input of the feature extraction module is five adjacent video frames from the same moiré video, where an input video frame is denoted by I_t with size 3×H×W, t ∈ [1,5]; the module consists of four convolution blocks and three pooling layers, the convolution blocks extract image features, and the pooling layers use 2×2 average pooling to reduce the feature scale. Video frame I_t is input into the first convolution block to obtain a feature map F_t^1 of size C×H×W; F_t^1 is fed into a pooling layer and the second convolution block to obtain a feature map F_t^2; similarly, F_t^2 is fed into a pooling layer and the third convolution block to obtain F_t^3, and F_t^3 is fed into a pooling layer and the last convolution block to obtain F_t^4; F_t^2, F_t^3 and F_t^4 have sizes 2C×(H/2)×(W/2), 4C×(H/4)×(W/4) and 8C×(H/8)×(W/8), respectively. Each convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer; both activation layers use the ReLU activation function and both convolution layers use 3×3 convolution kernels; the first convolution layer changes the number of channels (for example, the first convolution block changes the number of channels from 3 to C, while the other convolution blocks double it), and the second convolution layer keeps the number of channels unchanged.
Step S22, constructing a spatial Transformer module to capture the positions where moiré appears in a single-frame image using the spatial attention of the spatial Transformer and to remove it in a targeted manner, so as to achieve a better moiré removal effect.
As shown in fig. 3, the spatial Transformer module consists of nine linear sparse attention demoiréing layers and one absolute position encoding. The input of the first layer is the feature map F_t^4 from the feature extraction module, the input of each subsequent layer is the output of the previous layer, and the output feature map F_t of the last layer is the final output of the spatial Transformer module. Each layer computes the spatial attention of the feature map with linear time complexity, which helps the network locate and remove the regions where the moiré is severe. The absolute position encoding is a learnable matrix with the same scale as F_t^4, and its parameters are initialized with the Xavier initialization method before training;
the linear sparse attention demoiréing layer consists, in sequence, of a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer; both dropout layers use a neuron drop probability of 0.1, both normalization layers use layer normalization, the multilayer perceptron consists, in sequence, of a fully connected layer, an activation layer and a fully connected layer, and the activation layer uses the ReLU activation function; before the input feature map is sent to the spatial self-attention layer, the feature map and the absolute position encoding are added element by element, and the result is sent to the spatial self-attention layer; there is a residual connection between the input feature map (before the absolute position encoding is added) and the output of the first dropout layer, and a residual connection between the output of the first normalization layer and the output of the second dropout layer;
the spatial self-attention layer consists of four learnable matrices, namely a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v and a bottleneck matrix W_p; the calculation of this layer is as follows:
Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)
where F_in is the input of the spatial self-attention layer, F_out is the output of this layer, Q, K and V are the Query, Key and Value matrices respectively, K^T is the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot(·) denotes matrix multiplication; Q, K and W_v are sparse matrices under the constraint of an L2 loss function.
Step S23, constructing a temporal Transformer module to capture the complementary information existing between multiple frames using the temporal attention of the temporal Transformer, and restoring the image using the complementary information of adjacent frames to further improve the moiré removal effect.
As shown in fig. 4, the input of the temporal Transformer module is the final output F_t of the spatial Transformer module. The module consists of four temporal attention demoiréing layers, one absolute position encoding and one absolute time encoding; the absolute position encoding shares parameters with the absolute position encoding of the spatial Transformer module, and the absolute time encoding is a learnable matrix with scale 5×8C×1, whose parameters are initialized with the Xavier initialization method before training. The input of the first temporal attention demoiréing layer is the features F_t of the five video frames, the input of each subsequent layer is the output of the previous layer, and the output feature map F̂_t of the last layer for the t-th frame is the final output of the temporal Transformer module for the t-th frame. The temporal attention demoiréing layer is similar in structure to the linear sparse attention demoiréing layer; the main difference is that the temporal attention demoiréing layer additionally contains a temporal self-attention layer, which captures complementary information between adjacent video frames, and this temporally complementary information facilitates moiré removal;
the temporal attention demoiréing layer consists, in sequence, of a temporal self-attention layer, a dropout layer, a normalization layer, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer; all three dropout layers use a neuron drop probability of 0.1, all three normalization layers use layer normalization, the multilayer perceptron consists, in sequence, of a fully connected layer, an activation layer and a fully connected layer, the activation layer uses the ReLU activation function, and the structure of the spatial self-attention layer is the same as that of the spatial self-attention layer in the linear sparse attention demoiréing layer; before the feature maps are input into the temporal self-attention layer, the feature maps of the five input video frames are first concatenated along the temporal dimension, the concatenated feature map is added element by element to the absolute time encoding, and the result is sent to the temporal self-attention layer; before being input into the spatial self-attention layer, the concatenated feature map must be split back into per-frame feature maps and the absolute position encoding of the spatial Transformer added; there is a residual connection between the input feature map (before the absolute time encoding is added) and the output of the first dropout layer, a residual connection between the feature map (before the absolute position encoding is added) and the output of the second dropout layer, and a residual connection between the output of the second normalization layer and the output of the third dropout layer;
the temporal self-attention layer consists of four learnable matrices, namely a Query weight matrix W'_q, a Key weight matrix W'_k, a Value weight matrix W'_v and a bottleneck matrix W'_p; the calculation of this layer is as follows:
Q_t = Dot(W'_q, F_in^t)
K_t = Dot(W'_k, F_in^t)
V_t = Dot(W'_v, F_in^t)
K_a = [K_1, K_2, ..., K_5]
V_a = [V_1, V_2, ..., V_5]
H_t(i,j) = Dot(Softmax(Dot(Q_t(i,j), (K_a(i,j))^T)), V_a(i,j))
F_out = Dot(W'_p, H)
where t denotes the t-th frame, t ∈ [1,5], F_in^t is the input feature belonging to the t-th frame in the temporal self-attention layer, F_out is the output of this layer, Q_t, K_t and V_t are the Query, Key and Value matrices belonging to the t-th frame respectively, Softmax(·) denotes the softmax calculation over the last dimension of the matrix, Dot(·) denotes matrix multiplication, the superscript T denotes matrix transpose, [·] denotes composing matrices into one matrix, K_a and V_a are the Key and Value matrices of the five frames respectively, H is the complete attention feature map of the temporal self-attention layer, and (i,j) denotes the position of a feature; specifically, the feature map is divided into non-overlapping 2×2 patches and (i,j) denotes the position of the patch in which the feature lies; H_t(i,j) denotes the local attention feature at position (i,j) in H for the t-th frame, K_a(i,j) denotes the local Key matrix at position (i,j) in K_a, V_a(i,j) denotes the local Value matrix at position (i,j) in V_a, Q_t(i,j) denotes the local Query matrix at position (i,j) in Q_t, and W'_v is a sparse matrix under the constraint of an L2 loss function.
Step S24, constructing an image reconstruction module, and using it to decode the video frame features output by the spatial Transformer module and the temporal Transformer module and restore them into a demoiréd video frame with the same scale as the input video.
As shown in fig. 2, the input of the image reconstruction module is the final output F̂_3 of the temporal Transformer module for the intermediate frame (the third frame), with size 8C×(H/8)×(W/8). The module consists of three upsampling blocks, three convolution blocks, a 1×1 convolution and a Tanh activation layer. F̂_3 is input into the first upsampling block to obtain a feature map of size 4C×(H/4)×(W/4), which is concatenated channel-wise with the feature map F_3^3 of the feature extraction module and input into the first convolution block; the resulting feature map is input into the second upsampling block to obtain a feature map of size 2C×(H/2)×(W/2), which is concatenated channel-wise with the feature map F_3^2 of the feature extraction module and input into the second convolution block; the resulting feature map is input into the third upsampling block to obtain a feature map of size C×H×W, which is concatenated channel-wise with the feature map F_3^1 of the feature extraction module and input into the third convolution block; finally, the resulting feature map is passed through the 1×1 convolution and the Tanh activation layer to obtain the output frame image Î_3, i.e. the demoiréd frame corresponding to moiré video frame I_3;
the upsampling block consists, in sequence, of an upsampling layer, a convolution layer and an activation layer; the upsampling layer uses bilinear interpolation with a magnification of 2, the convolution layer uses a 3×3 convolution kernel and halves the number of feature-map channels, and the activation layer uses the ReLU activation function; the convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer, where both convolution layers use 3×3 convolution kernels and the activation layers also use the ReLU activation function;
and step S3, constructing a loss function for training the video degranulation network.
Step S3 specifically includes the following steps:
step S31, constructing a total optimization target of the whole network; the optimization objectives are as follows:
min(L),
Figure BDA0003687299740000132
where min (L) represents the minimum L, L represents the total loss of democratic network,
Figure BDA0003687299740000133
charbonnier loss representing the democratic image versus the clean image,
Figure BDA0003687299740000134
representing the loss of edge texture in the deglitched image versus the clean image,
Figure BDA0003687299740000135
representing the loss of the sparse matrix in the constrained space Transformer module and the time Transformer module;
step S32, constructing Charbonier loss of the Moire pattern removed image and the clean image;
Figure BDA0003687299740000136
the calculation formula of (a) is as follows:
Figure BDA0003687299740000137
wherein,
Figure BDA0003687299740000138
representing the corresponding de-moir e image of the third frame of the five input moir e video frames, O 3 Representing the clean image with which it is paired, e represents a constant of control precision, λ C A weight representing the loss;
step S33, constructing the edge texture loss of the Moire pattern removed image and the clean image;
Figure BDA0003687299740000139
the calculation formula of (a) is as follows:
Figure BDA00036872997400001310
wherein | | | purple hair 1 Is an absolute value taking operation, Sobel * Improved Sobel filters, Sobel, showing different orientations * () Denotes the convolution operation, λ ASL A weight representing the loss;
step S34, constructing color loss of the Moire pattern removed image and the clean image;
Figure BDA00036872997400001311
the calculation formula of (a) is as follows:
Figure BDA00036872997400001312
wherein G represents a Gaussian blur kernel,
Figure BDA00036872997400001313
representing a blurred, degritted image, G (O) 3 ) A clear image that is blurred is represented,
Figure BDA00036872997400001314
is a squaring operation taking the two norms, λ cr A weight representing the loss;
step S35, constructing loss of sparse matrixes in a constrained space Transformer module and a time Transformer module;
Figure BDA0003687299740000141
the calculation formula of (a) is as follows:
Figure BDA0003687299740000142
wherein Q * And K * Representing the Query matrix and the Key matrix calculated in all the spaces in the space Transformer module from the attention layer,
Figure BDA0003687299740000143
represents the Value weight matrix calculated in all the space self-attention layers in the space Transformer module and the time Transformer module,
Figure BDA0003687299740000144
represents the Value weight matrix calculated from the attention layer for all the time in the time Transformer module,
Figure BDA0003687299740000145
is a squaring operation of the absolute value, λ sparse A weight representing the loss;
and step S4, training the video democratic texture network by adopting the training data set.
Step S4 specifically includes the following steps:
step S41, randomly selecting a pair of Moire pattern video and clean video from the training data set, then randomly selecting five adjacent video frames from the Moire pattern video, and simultaneously selecting five corresponding video frames from the corresponding clean video, wherein the Moire pattern video frame and the clean video frame are respectively marked as I t And O t ,t∈[1,5];
Step S42, training a first stage of a video Moire pattern removing network based on a linear sparse attention transducer; inputting five video frames I t Obtaining the Moire-removed intermediate frame by calculation of the network
Figure BDA0003687299740000146
Calculating the total loss L of the Moire pattern removal, calculating the gradient of each parameter in the network by adopting a back propagation calculation method, and updating the parameters by utilizing an Adam optimization method, wherein the learning rate is 10 -4 Slowly decreases to 10 -5
Step S43, training a second stage of the video Moire pattern removing network based on the linear sparse attention Transformer; the input and training method is the same as the first stage, except that the color loss weight λ in the total loss L of democratic lines is weighted cr Set to 0 with a learning rate of 10 -5 Slowly decreases to 10 -6 Carrying out fine tuning training of the network;
in this embodiment, the whole training process is iterated four million times, and in each iteration process, a plurality of video pairs are randomly sampled and trained as one batch, the first two million and fifty million times are subjected to the first stage training of step S42, and the remaining one hundred and fifty million times are iterated and subjected to the second stage training of step S43;
and step S5, inputting the new Moire pattern video into the trained video Moire pattern removing network, and outputting the clean video without Moire patterns. Specifically, for a new moire video, two blank frames are respectively inserted at the beginning and the end of the video, the first five frames of the video are firstly taken and input into the network for calculation, a moire-removing video frame corresponding to the first frame of the original moire video is obtained, then the second frame to the sixth frame of the video are taken and input into the network, a moire-removing video frame corresponding to the second frame of the original moire video is obtained, and the same operation is subsequently adopted until the moire-removing video frame corresponding to the moire-removing video frame is obtained.
The embodiment also provides a video moiré removal system based on the linear sparse attention Transformer, which comprises a memory, a processor and computer program instructions stored in the memory and executable by the processor; when the computer program instructions are executed by the processor, the above method steps can be implemented.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow of the flowcharts, and combinations of flows in the flowcharts, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.
The foregoing is directed to preferred embodiments of the present invention; other and further embodiments of the invention may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification or equivalent change of the above embodiments made according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.
The present invention is not limited to the above preferred embodiments; anyone, in light of the present invention, can derive various other forms of video moiré removal methods based on a linear sparse attention Transformer.

Claims (10)

1. A video moiré removal method based on a linear sparse attention Transformer, characterized by comprising: training a video demoiréing network based on a linear sparse attention Transformer, and removing moiré from an input video after training is finished;
the video demoiréing network based on the linear sparse attention Transformer comprises:
a feature extraction module for extracting features of the video frames;
a spatial Transformer module for capturing the positions where moiré patterns appear in a single-frame image by means of the spatial attention of the spatial Transformer and focusing removal on those positions;
a temporal Transformer module for capturing complementary information among the multiple frames by means of the temporal attention of the temporal Transformer and recovering the images using the complementary information of adjacent frames;
and an image reconstruction module for decoding the video frame features output by the spatial Transformer module and the temporal Transformer module and recovering them into a de-moiré video frame with the same scale as the input video.
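For illustration only (not part of the claims), the following is a minimal sketch of how the four modules of claim 1 can be wired together, assuming PyTorch implementations of the individual modules such as those sketched after claims 2-5; the class and argument names are assumptions, not the patent's reference code.

import torch
import torch.nn as nn

class VideoDemoireNet(nn.Module):
    def __init__(self, feature_extractor, spatial_transformer, temporal_transformer, reconstructor):
        super().__init__()
        self.features = feature_extractor          # claim 2
        self.spatial = spatial_transformer         # claim 3
        self.temporal = temporal_transformer       # claim 4
        self.reconstruct = reconstructor           # claim 5

    def forward(self, frames):                     # frames: (B, 5, 3, H, W)
        # per-frame encoder features at four scales
        f1, f2, f3, f4 = zip(*[self.features(frames[:, t]) for t in range(5)])
        spatial_out = [self.spatial(f) for f in f4]                     # per-frame spatial attention
        temporal_out = self.temporal(torch.stack(spatial_out, dim=1))   # cross-frame attention
        # decode only the centre (third) frame, using its encoder skip features
        return self.reconstruct(temporal_out[:, 2], f1[2], f2[2], f3[2])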
2. The linear sparse attention Transformer-based video demoiréing method of claim 1, characterized in that: the input of the feature extraction module is five adjacent video frames from the same moiré video, the input video frames being denoted I_t, each of size 3×H×W, t ∈ [1,5]; the module consists of four convolution blocks and three pooling layers, where the convolution blocks are responsible for extracting image features and the pooling layers use 2×2 average pooling to reduce the feature scale; the video frame I_t is input into the first convolution block to obtain a feature map F_t^1 of size C×H×W; F_t^1 is fed into a pooling layer and the second convolution block to obtain a feature map F_t^2; F_t^2 is fed into a pooling layer and the third convolution block to obtain F_t^3; F_t^3 is fed into a pooling layer and the last convolution block to obtain F_t^4; F_t^2, F_t^3 and F_t^4 are of sizes 2C×(H/2)×(W/2), 4C×(H/4)×(W/4) and 8C×(H/8)×(W/8), respectively; each convolution block consists of, in sequence, a convolution layer, an activation layer, a convolution layer and an activation layer; both activation layers use the ReLU activation function, both convolution layers use 3×3 convolution kernels, the first convolution layer changes the number of channels, and the second convolution layer keeps the number of channels unchanged.
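For illustration only, a minimal sketch of the feature extraction module of claim 2 with the C → 2C → 4C → 8C channel progression implied by the later claims; the base channel count and all names are assumptions.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # conv layer -> ReLU -> conv layer -> ReLU; the first conv changes the channel
    # count, the second keeps it unchanged (claim 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

class FeatureExtractor(nn.Module):
    def __init__(self, base_ch=32):
        super().__init__()
        self.block1 = conv_block(3, base_ch)                 # F^1: C x H x W
        self.block2 = conv_block(base_ch, base_ch * 2)       # F^2: 2C x H/2 x W/2
        self.block3 = conv_block(base_ch * 2, base_ch * 4)   # F^3: 4C x H/4 x W/4
        self.block4 = conv_block(base_ch * 4, base_ch * 8)   # F^4: 8C x H/8 x W/8
        self.pool = nn.AvgPool2d(2)                          # 2x2 average pooling

    def forward(self, frame):                                # frame: (B, 3, H, W)
        f1 = self.block1(frame)
        f2 = self.block2(self.pool(f1))
        f3 = self.block3(self.pool(f2))
        f4 = self.block4(self.pool(f3))
        return f1, f2, f3, f4                                # skip features for the decoder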
3. The linear sparse attention Transformer-based video demoiréing method of claim 2, characterized in that: the spatial Transformer module consists of nine linear sparse attention demoiréing layers and an absolute position code;
wherein the input of the first layer is the feature map F_t^4 of the feature extraction module, the input of each subsequent layer is the output of the previous layer, and the output feature map F_t of the last layer is the final output of the spatial Transformer module; each layer computes the spatial attention of the feature map with linear time complexity;
the absolute position code is a learnable matrix with the same scale as F_t^4, and is parameter-initialized with the Xavier initialization method before training;
the linear sparse attention demoiréing layer consists of, in sequence, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer; the neuron drop probability of both dropout layers is set to 0.1, and both normalization layers use layer normalization; the multilayer perceptron consists of, in sequence, a first fully connected layer, an activation layer and a second fully connected layer, the activation layer using the ReLU activation function; before the input feature map is sent to the spatial self-attention layer, the feature map and the absolute position code are added element by element, and the sum is then sent to the spatial self-attention layer; there is a residual connection between the input feature map prior to the absolute position code and the output of the first dropout layer, and a residual connection between the output of the first normalization layer and the output of the second dropout layer;
the spatial self-attention layer consists of four learnable matrices, namely a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v and a bottleneck matrix W_p; the calculation formulas for this layer are as follows:
Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)
where F_in is the input of the spatial self-attention layer, F_out is the output of the spatial self-attention layer, Q, K and V are the Query, Key and Value matrices, K^T denotes the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot() denotes matrix multiplication; Q, K and W_v are sparse matrices under the constraint of an L2 loss function.
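For illustration only, a minimal sketch of the spatial self-attention layer and of one linear sparse attention demoiréing layer. The attention follows the formulas above, with softmax taken over the last dimension of Q and of K^T so that the cost is linear in the number of spatial tokens; the L2 sparsity constraint on Q, K and W_v is applied through the loss term of claim 6 rather than inside this module. The post-norm ordering, the MLP expansion ratio, and all shapes and names are assumptions.

import torch
import torch.nn as nn

class SpatialLinearAttention(nn.Module):
    # linear-complexity spatial self-attention:
    # H = Softmax(Q) . (Softmax(K^T) . V), F_out = W_p . H
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # Query weight matrix W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # Key weight matrix W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # Value weight matrix W_v
        self.w_p = nn.Linear(dim, dim, bias=False)   # bottleneck matrix W_p

    def forward(self, x):                            # x: (B, N, dim), N = H*W spatial tokens
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        q = q.softmax(dim=-1)                        # softmax over the last dimension of Q
        k = k.transpose(1, 2).softmax(dim=-1)        # softmax over the last dimension of K^T
        context = torch.matmul(k, v)                 # (B, dim, dim): cost linear in N
        return self.w_p(torch.matmul(q, context))    # (B, N, dim)

class LinearSparseAttnLayer(nn.Module):
    # one demoireing layer: self-attention -> dropout -> layer norm -> MLP -> dropout -> layer norm
    def __init__(self, dim, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.attention = SpatialLinearAttention(dim)
        self.drop1, self.norm1 = nn.Dropout(drop), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                    # fully connected -> ReLU -> fully connected
            nn.Linear(dim, dim * mlp_ratio), nn.ReLU(inplace=True),
            nn.Linear(dim * mlp_ratio, dim))
        self.drop2, self.norm2 = nn.Dropout(drop), nn.LayerNorm(dim)

    def forward(self, x, pos):                       # x, pos: (B, N, dim)
        # the absolute position code is added element-wise only on the attention path;
        # the first residual comes from the input feature map itself
        h = self.norm1(x + self.drop1(self.attention(x + pos)))
        # the second residual comes from the output of the first normalization layer
        return self.norm2(h + self.drop2(self.mlp(h)))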
4. The linear sparse attention Transformer-based video demoiréing method of claim 3, characterized in that:
the input of the temporal Transformer module is the final output F_t of the spatial Transformer module; the module consists of four temporal attention demoiréing layers, an absolute position code and an absolute time code;
the absolute position code shares parameters with the absolute position code of the spatial Transformer module;
the absolute time code is a learnable matrix of scale 5×8C×1, and is parameter-initialized with the Xavier initialization method before training;
the input of the first temporal attention demoiréing layer is F_t of the five video frames, the input of each subsequent layer is the output of the previous layer, and the output feature map of the last layer for the t-th frame is the final output of the temporal Transformer module for the t-th frame;
the temporal attention demoiréing layer consists of, in sequence, a temporal self-attention layer, a dropout layer, a normalization layer, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer;
the neuron drop probability of all three dropout layers is set to 0.1, all three normalization layers use layer normalization, the multilayer perceptron consists of, in sequence, a fully connected layer, an activation layer and a fully connected layer, and the activation layer uses the ReLU activation function;
the structure of the spatial self-attention layer is the same as that of the spatial self-attention layer in the linear sparse attention demoiréing layer;
before the feature maps are input into the temporal self-attention layer, the feature maps of the five input video frames are first concatenated along the time dimension, the concatenated feature map is added element by element to the absolute time code, and the result is then sent to the temporal self-attention layer; before the feature maps are input into the spatial self-attention layer, the concatenated feature map is split back into individual video frames and the absolute position code of the spatial Transformer is added; there is a residual connection between the input feature map prior to the absolute time code and the output of the first dropout layer, a residual connection between the feature map prior to the absolute position code and the output of the second dropout layer, and a residual connection between the output of the second normalization layer and the output of the third dropout layer;
the temporal self-attention layer consists of four learnable matrices, namely a Query weight matrix W'_q, a Key weight matrix W'_k, a Value weight matrix W'_v and a bottleneck matrix W'_p; the calculation formulas for this layer are as follows:
Q_t = Dot(W'_q, F_in^t)
K_t = Dot(W'_k, F_in^t)
V_t = Dot(W'_v, F_in^t)
K_a = [K_1, K_2, …, K_5]
V_a = [V_1, V_2, …, V_5]
H_t(i,j) = Dot(Softmax(Dot(Q_t(i,j), (K_a(i,j))^T)), V_a(i,j))
F_out = Dot(W'_p, H)
where t denotes the t-th frame, t ∈ [1,5]; F_in^t is the input feature belonging to the t-th frame in the temporal self-attention layer; F_out is the output of the temporal self-attention layer; Q_t, K_t and V_t are respectively the Query, Key and Value matrices belonging to the t-th frame; Softmax() denotes the softmax computation over the last dimension of the matrix; Dot() denotes matrix multiplication; the superscript T denotes matrix transposition; [·] denotes the operation of composing matrices; K_a and V_a are respectively the Key and Value matrices of the five frames; H is the complete attention feature map of the temporal self-attention layer; (i, j) denotes the position of a feature, i.e., the feature map is divided into multiple non-overlapping 2×2 patches and (i, j) denotes the position of the patch to which the feature belongs; H_t(i, j) denotes the local attention feature at position (i, j) in H for the t-th frame, K_a(i, j) denotes the local Key matrix at position (i, j) in K_a, V_a(i, j) denotes the local Value matrix at position (i, j) in V_a, Q_t(i, j) denotes the local Query matrix at position (i, j) in Q_t, and W'_v is a sparse matrix under the constraint of an L2 loss function.
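For illustration only, a minimal sketch of the temporal self-attention of claim 4, assuming features processed as (B, 5, C, H, W) tensors with H and W divisible by the window size; no attention scaling factor is added beyond the claim's formula, and all names are assumptions.

import torch
import torch.nn as nn

class TemporalWindowAttention(nn.Module):
    def __init__(self, dim, window=2):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W'_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W'_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W'_v
        self.w_p = nn.Linear(dim, dim, bias=False)   # bottleneck W'_p
        self.window = window

    def forward(self, x):
        # x: (B, 5, C, H, W), already summed with the absolute time code
        b, t, c, h, w = x.shape
        s = self.window
        # partition every frame into non-overlapping s x s windows
        x = x.view(b, t, c, h // s, s, w // s, s)
        x = x.permute(0, 3, 5, 1, 4, 6, 2).reshape(b, (h // s) * (w // s), t * s * s, c)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # every token of frame t attends over the tokens of all five frames located
        # at the same window position (i, j), as in H_t(i, j)
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)
        out = self.w_p(attn @ v)
        # restore the (B, 5, C, H, W) layout
        out = out.view(b, h // s, w // s, t, s, s, c).permute(0, 3, 6, 1, 4, 2, 5)
        return out.reshape(b, t, c, h, w)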
5. The linear sparse attention Transformer-based video demoiréing method of claim 4, characterized in that:
the input of the image reconstruction module is the final output of the temporal Transformer module for the third frame, of size 8C×(H/8)×(W/8); the module consists of three upsampling blocks, three convolution blocks, a 1×1 convolution and a Tanh activation layer; the input is fed into the first upsampling block to obtain a feature map of size 4C×(H/4)×(W/4), which is concatenated channel-wise with the feature map F_3^3 of the feature extraction module and input into the first convolution block; the resulting feature map is fed into the second upsampling block to obtain a feature map of size 2C×(H/2)×(W/2), which is concatenated channel-wise with the feature map F_3^2 of the feature extraction module and input into the second convolution block; the resulting feature map is fed into the third upsampling block to obtain a feature map of size C×H×W, which is concatenated channel-wise with the feature map F_3^1 of the feature extraction module and input into the third convolution block; the resulting feature map is input into the 1×1 convolution and the Tanh activation layer to obtain the output frame image Ô_3, i.e. the de-moiré frame corresponding to the moiré video frame I_3;
the upsampling block consists of, in sequence, an upsampling layer, a convolution layer and an activation layer; the upsampling layer uses bilinear interpolation upsampling with a magnification of 2, the convolution layer uses a 3×3 convolution kernel and changes the number of feature map channels, reducing it to half of the original, and the activation layer uses the ReLU activation function; the convolution block consists of, in sequence, a convolution layer, an activation layer, a convolution layer and an activation layer, where the convolution layers all use 3×3 convolution kernels and the activation layers use the ReLU activation function.
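For illustration only, a minimal sketch of the image reconstruction module, assuming the C/2C/4C/8C channel progression implied by claims 2 and 4; conv_block repeats the convolution-block helper of the feature-extraction sketch, and all names are assumptions.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

def up_block(in_ch):
    # bilinear x2 upsampling -> 3x3 conv halving the channel count -> ReLU
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1),
        nn.ReLU(inplace=True))

class Reconstructor(nn.Module):
    def __init__(self, base_ch=32):
        super().__init__()
        c = base_ch
        self.up1, self.conv1 = up_block(8 * c), conv_block(8 * c, 4 * c)  # concat with F^3 (4C)
        self.up2, self.conv2 = up_block(4 * c), conv_block(4 * c, 2 * c)  # concat with F^2 (2C)
        self.up3, self.conv3 = up_block(2 * c), conv_block(2 * c, c)      # concat with F^1 (C)
        self.head = nn.Sequential(nn.Conv2d(c, 3, kernel_size=1), nn.Tanh())

    def forward(self, t4, f1, f2, f3):       # t4: temporal Transformer output, 8C x H/8 x W/8
        x = self.conv1(torch.cat([self.up1(t4), f3], dim=1))
        x = self.conv2(torch.cat([self.up2(x), f2], dim=1))
        x = self.conv3(torch.cat([self.up3(x), f1], dim=1))
        return self.head(x)                  # de-moire frame for the centre (third) frame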
6. The linear sparse attention Transformer-based video demoiréing method of claim 5, characterized in that the loss function used to train the video demoiréing network is constructed as follows:
the overall optimization objective of the network is:
min(L), L = L_C + L_ASL + L_cr + L_sparse
where min(L) denotes minimizing L, L denotes the total demoiréing loss of the network, L_C denotes the Charbonnier loss between the de-moiré image and the clean image, L_ASL denotes the edge texture loss between the de-moiré image and the clean image, L_cr denotes the color loss between the de-moiré image and the clean image, and L_sparse denotes the loss constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module;
charbonier loss of the Moire-removed image and the clean image
Figure FDA0003687299730000055
The calculation formula of (c) is as follows:
Figure FDA0003687299730000056
wherein,
Figure FDA0003687299730000057
representing the corresponding de-moir e image of the third frame of the five input moir e video frames, O 3 Representing the clean image with which it is paired, e represents a constant of control precision, λ C A weight representing the loss;
the edge texture loss L_ASL between the de-moiré image and the clean image is calculated as follows:
L_ASL = λ_ASL · Σ_* || Sobel_*(Ô_3) − Sobel_*(O_3) ||_1
where ||·||_1 denotes the absolute-value (L1) operation, Sobel_* denotes improved Sobel filters of different orientations, Sobel_*() denotes the corresponding convolution operation, and λ_ASL denotes the weight of this loss;
the color loss L_cr between the de-moiré image and the clean image is calculated as follows:
L_cr = λ_cr · || G(Ô_3) − G(O_3) ||_2^2
where G denotes a Gaussian blur kernel, G(Ô_3) denotes the blurred de-moiré image, G(O_3) denotes the blurred clean image, ||·||_2^2 denotes the squared two-norm operation, and λ_cr denotes the weight of this loss;
the loss L_sparse constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module is calculated as follows:
L_sparse = λ_sparse · ( Σ|Q^*|^2 + Σ|K^*|^2 + Σ|W_v^*|^2 + Σ|W_v'^*|^2 )
where Q^* and K^* denote the Query and Key matrices computed in all spatial self-attention layers of the spatial Transformer module, W_v^* denotes the Value weight matrices of all spatial self-attention layers in the spatial Transformer module and the temporal Transformer module, W_v'^* denotes the Value weight matrices of all temporal self-attention layers in the temporal Transformer module, |·|^2 denotes the squared absolute-value operation, and λ_sparse denotes the weight of this loss.
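For illustration only, a hedged sketch of the four loss terms of claim 6. Standard 3×3 Sobel kernels stand in for the patent's improved directional Sobel filters, the Gaussian kernel size, the loss weights and the mean reductions are assumptions, and the sparse term simply sums the squared entries of the tensors passed in for Q^*, K^*, W_v^* and W_v'^*.

import torch
import torch.nn.functional as F

def charbonnier(pred, target, eps=1e-3):
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def sobel_edges(img):
    # two standard 3x3 Sobel kernels applied across channels (stand-ins for the
    # patent's improved directional Sobel filters)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    k = torch.stack([kx, ky]).unsqueeze(1).repeat(1, img.size(1), 1, 1) / img.size(1)
    return F.conv2d(img, k, padding=1)

def gaussian_blur(img, ksize=11, sigma=3.0):
    ax = torch.arange(ksize, device=img.device).float() - ksize // 2
    g1d = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    g2d = torch.outer(g1d, g1d)
    g2d = (g2d / g2d.sum()).view(1, 1, ksize, ksize).repeat(img.size(1), 1, 1, 1)
    return F.conv2d(img, g2d, padding=ksize // 2, groups=img.size(1))   # depthwise blur

def total_loss(pred, target, q, k, w_v, w_v_t,
               lam_c=1.0, lam_asl=0.25, lam_cr=1.0, lam_sparse=1e-4):
    l_c = lam_c * charbonnier(pred, target)
    l_asl = lam_asl * (sobel_edges(pred) - sobel_edges(target)).abs().mean()
    l_cr = lam_cr * (gaussian_blur(pred) - gaussian_blur(target)).pow(2).mean()
    l_sparse = lam_sparse * sum(m.pow(2).sum() for m in (q, k, w_v, w_v_t))
    return l_c + l_asl + l_cr + l_sparse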
7. The linear sparse attention Transformer-based video demoiréing method of claim 6, characterized in that: the training set used to train the linear sparse attention Transformer-based video demoiréing network is obtained by processing the videos in an original data set into moiré video and clean video pairs, decoding each video into video frames, and preprocessing every frame of each video to obtain the moiré video frames and clean video frames, specifically comprising the following steps:
step S11, obtaining an original data set in which each moiré video and its corresponding clean video have the same size and the same content and correspond to each other one-to-one; videos with the same content form moiré video and clean video pairs, and each frame within a pair corresponds one-to-one;
step S12, randomly flipping all video frames of each moiré video and clean video pair in the same way, then randomly cropping the moiré video frames and clean video frames to H×W while keeping the correspondence between the moiré video and the clean video;
step S13, normalizing all moiré video frames and clean video frames of size H×W: given a frame image I(i, j), the normalized frame image Î(i, j) is computed by a pixel-wise normalization formula, where (i, j) denotes the position of a pixel.
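For illustration only, a sketch of the preprocessing of steps S12-S13. The crop size and the normalization of pixel values to [-1, 1] (chosen to match the Tanh output of the reconstruction module) are assumptions; the patent specifies only that frames are flipped, cropped to H×W and normalized.

import random
import torch

def preprocess_pair(moire_frames, clean_frames, crop_h=256, crop_w=256):
    # moire_frames, clean_frames: tensors of shape (T, 3, H, W) with values in [0, 255]
    if random.random() < 0.5:                       # identical random horizontal flip
        moire_frames = torch.flip(moire_frames, dims=[-1])
        clean_frames = torch.flip(clean_frames, dims=[-1])
    _, _, h, w = moire_frames.shape                 # identical random crop to crop_h x crop_w
    top, left = random.randint(0, h - crop_h), random.randint(0, w - crop_w)
    moire_frames = moire_frames[..., top:top + crop_h, left:left + crop_w]
    clean_frames = clean_frames[..., top:top + crop_h, left:left + crop_w]
    normalize = lambda x: x / 127.5 - 1.0           # assumed pixel-wise normalization to [-1, 1]
    return normalize(moire_frames), normalize(clean_frames)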
8. The linear sparse attention Transformer-based video demoiréing method of claim 7, characterized in that the training process specifically comprises the following steps:
step S41, randomly selecting a pair of moiré video and clean video from the training set, then randomly selecting five adjacent video frames from the moiré video together with the five corresponding video frames from the corresponding clean video, the moiré video frames and clean video frames being denoted I_t and O_t respectively, t ∈ [1,5];
step S42, the first training stage of the linear sparse attention Transformer-based video demoiréing network: the five video frames I_t are input, the network computes the de-moiré intermediate frame Ô_3, the total demoiréing loss L is calculated, the gradient of each parameter in the network is computed by back-propagation, and the parameters are updated with the Adam optimization method, the learning rate being decayed from 10^-4 to 10^-5;
step S43, the second training stage of the linear sparse attention Transformer-based video demoiréing network: the input and training method are the same as in the first stage, except that the color loss weight λ_cr in the total demoiréing loss L is set to 0 and the learning rate is decayed from 10^-5 to 10^-6 for fine-tuning of the network.
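For illustration only, a sketch of the two-stage schedule of steps S42-S43, assuming a model returning the de-moiré centre frame, a data loader yielding (five moiré frames, clean third frame) pairs, the total_loss helper sketched after claim 6, and an assumed model.sparse_tensors() accessor for the L2-constrained matrices; the epoch counts and the cosine decay of the learning rate are assumptions.

import torch

def train_stage(model, train_loader, epochs, lr_start, lr_end, color_weight):
    opt = torch.optim.Adam(model.parameters(), lr=lr_start)
    total_steps = epochs * len(train_loader)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=lr_end)
    for _ in range(epochs):
        for moire_frames, clean_frame in train_loader:        # (B, 5, 3, H, W), (B, 3, H, W)
            pred = model(moire_frames)                        # de-moire frame for the third input
            q, k, w_v, w_v_t = model.sparse_tensors()         # assumed accessor for sparse matrices
            loss = total_loss(pred, clean_frame, q, k, w_v, w_v_t, lam_cr=color_weight)
            opt.zero_grad(); loss.backward(); opt.step(); sched.step()

# stage 1: full loss, learning rate decayed from 1e-4 to 1e-5
# train_stage(model, loader, epochs=50, lr_start=1e-4, lr_end=1e-5, color_weight=1.0)
# stage 2: color loss weight set to 0, learning rate decayed from 1e-5 to 1e-6
# train_stage(model, loader, epochs=10, lr_start=1e-5, lr_end=1e-6, color_weight=0.0)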
9. The linear sparse attention Transformer-based video demoiréing method of claim 8, characterized in that the specific operation of removing moiré from an input video after training is completed is as follows: for a new moiré video, two blank frames are inserted at the beginning of the video and two at the end; the first five frames are taken and fed into the network, yielding the de-moiré video frame corresponding to the first frame of the original moiré video; the second to sixth frames are then fed into the network, yielding the de-moiré video frame corresponding to the second frame of the original moiré video; the same operation is repeated until the de-moiré video frames corresponding to all frames of the original moiré video have been obtained.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the linear sparse attention Transformer-based video demoiréing method as recited in any one of claims 1-9.
CN202210649880.XA 2022-06-10 2022-06-10 Video Moire removing method based on linear sparse attention Transformer Pending CN114881888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210649880.XA CN114881888A (en) Video Moire removing method based on linear sparse attention Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210649880.XA CN114881888A (en) Video Moire removing method based on linear sparse attention Transformer

Publications (1)

Publication Number Publication Date
CN114881888A true CN114881888A (en) 2022-08-09

Family

ID=82680890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210649880.XA Pending CN114881888A (en) 2022-06-10 2022-06-10 Video Moire removing method based on linear sparse attention transducer

Country Status (1)

Country Link
CN (1) CN114881888A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021134874A1 (en) * 2019-12-31 2021-07-08 深圳大学 Training method for deep residual network for removing a moire pattern of two-dimensional code
CN111539884A (en) * 2020-04-21 2020-08-14 温州大学 Neural network video deblurring method based on multi-attention machine mechanism fusion
CN112598602A (en) * 2021-01-06 2021-04-02 福建帝视信息科技有限公司 Mask-based method for removing Moire of deep learning video
CN113065645A (en) * 2021-04-30 2021-07-02 华为技术有限公司 Twin attention network, image processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NIE Kehui; LIU Wenzhe; TONG Tong; DU Min; GAO Qinquan: "Video compression artifact removal algorithm based on adaptive separable convolution kernels", Journal of Computer Applications, no. 05, 10 May 2019 (2019-05-10), pages 233 - 239 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596779A (en) * 2023-04-24 2023-08-15 天津大学 Transform-based Raw video denoising method
CN116596779B (en) * 2023-04-24 2023-12-01 天津大学 Transform-based Raw video denoising method
CN116831581A (en) * 2023-06-15 2023-10-03 中南大学 Remote physiological sign extraction-based driver state monitoring method and system
CN116634209A (en) * 2023-07-24 2023-08-22 武汉能钠智能装备技术股份有限公司 Breakpoint video recovery system and method based on hot plug
CN116634209B (en) * 2023-07-24 2023-11-17 武汉能钠智能装备技术股份有限公司 Breakpoint video recovery system and method based on hot plug
CN117808706A (en) * 2023-12-28 2024-04-02 山东财经大学 Video rain removing method, system, equipment and storage medium
CN117725844A (en) * 2024-02-08 2024-03-19 厦门蝉羽网络科技有限公司 Large model fine tuning method, device, equipment and medium based on learning weight vector
CN117725844B (en) * 2024-02-08 2024-04-16 厦门蝉羽网络科技有限公司 Large model fine tuning method, device, equipment and medium based on learning weight vector

Similar Documents

Publication Publication Date Title
CN114881888A (en) Video Moire removing method based on linear sparse attention Transformer
CN107403415B (en) Compressed depth map quality enhancement method and device based on full convolution neural network
Yu et al. A unified learning framework for single image super-resolution
CN109272452B (en) Method for learning super-resolution network based on group structure sub-band in wavelet domain
CN106709875A (en) Compressed low-resolution image restoration method based on combined deep network
CN114693558A (en) Image Moire removing method and system based on progressive fusion multi-scale strategy
CN113284051A (en) Face super-resolution method based on frequency decomposition multi-attention machine system
CN114723630A (en) Image deblurring method and system based on cavity double-residual multi-scale depth network
Liu et al. Research on super-resolution reconstruction of remote sensing images: A comprehensive review
CN114418853A (en) Image super-resolution optimization method, medium and device based on similar image retrieval
CN108122262B (en) Sparse representation single-frame image super-resolution reconstruction algorithm based on main structure separation
Hai et al. Advanced retinexnet: a fully convolutional network for low-light image enhancement
CN116797456A (en) Image super-resolution reconstruction method, system, device and storage medium
CN113096032B (en) Non-uniform blurring removal method based on image region division
CN117333398A (en) Multi-scale image denoising method and device based on self-supervision
Guo et al. Orthogonally regularized deep networks for image super-resolution
CN115272131B (en) Image mole pattern removing system and method based on self-adaptive multispectral coding
CN110895790A (en) Scene image super-resolution method based on posterior degradation information estimation
CN115760638A (en) End-to-end deblurring super-resolution method based on deep learning
CN115456891A (en) Under-screen camera image restoration method based on U-shaped dynamic network
Ling et al. PRNet: Pyramid Restoration Network for RAW Image Super-Resolution
CN114331853A (en) Single image restoration iteration framework based on target vector updating module
Xu et al. Joint learning of super-resolution and perceptual image enhancement for single image
Xu et al. Swin transformer and ResNet based deep networks for low-light image enhancement
CN117291855B (en) High resolution image fusion method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination