CN114881888A - Video moiré removal method based on a linear sparse attention Transformer - Google Patents
Video moiré removal method based on a linear sparse attention Transformer
- Publication number: CN114881888A
- Application number: CN202210649880.XA
- Authority: CN (China)
- Prior art keywords: layer, video, attention, matrix, frame
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T5/73 — Image enhancement or restoration; Deblurring; Sharpening
- G06N3/045 — Computing arrangements based on biological models; Neural networks; Combinations of networks
- G06N3/08 — Neural networks; Learning methods
- G06T7/0002 — Image analysis; Inspection of images, e.g. flaw detection
- G06T9/002 — Image coding using neural networks
- G06T2207/10016 — Image acquisition modality; Video; Image sequence
- G06T2207/20081 — Special algorithmic details; Training; Learning
- G06T2207/20084 — Special algorithmic details; Artificial neural networks [ANN]
Abstract
The invention provides a video moiré removal method based on a linear sparse attention Transformer, which trains a linear sparse attention Transformer-based video demoiréing network and, once training is finished, uses it to remove moiré from an input video. The linear sparse attention Transformer-based video demoiréing network comprises: a feature extraction module, which extracts the features of the video frames; a spatial Transformer module, which uses the spatial attention of the spatial Transformer to locate the moiré patterns within a single-frame image and remove them in a targeted way; a temporal Transformer module, which uses the temporal attention of the temporal Transformer to capture the complementary information existing among the multiple frames and restores each image using the complementary information of its adjacent frames; and an image reconstruction module, which decodes the video frame features output by the spatial and temporal Transformer modules and restores them into demoiréd video frames at the same scale as the input video.
Description
Technical Field
The invention belongs to the technical field of video processing and computer vision, and particularly relates to a video moiré removal method based on a linear sparse attention Transformer.
Background
With the rapid development of mobile devices and multimedia technology, smartphones have become indispensable tools in daily life, and mobile photography grows ever more popular as camera quality improves. Images and videos are an indispensable part of modern communication and information transmission, and matter greatly to many aspects of society. Digital screens are ubiquitous in modern life, such as television screens at home, computers, and large LED screens in public places, and capturing these screens with a mobile phone is a common way to store information quickly; sometimes capturing images and videos is the only practical way to do so. However, when photographing digital screens, moiré patterns often appear and contaminate the underlying clean images and videos. Moiré is caused by mutual interference between the camera's color filter array (CFA) and the sub-pixel layout of the screen, producing color-distorted stripes in the captured images and videos and seriously degrading their visual quality. Advances in computer vision and hardware upgrades have made it feasible to address this problem, so many researchers have begun to study removing moiré patterns from images; removing moiré patterns from videos, however, remains rarely studied.
Removing moiré is a challenging task because moiré patterns are irregular in shape and color and span both low and high frequencies. Unlike other image and video restoration tasks, such as image or video denoising, image demosaicing, and image or video super-resolution, the moiré removal task must cope with complex low-frequency and high-frequency moiré fringes while also restoring details in images and videos; moreover, moiré fringes introduce chromatic aberration into the captured images. Moiré formation is closely related to the camera imaging process, especially the frequency of the color filter array (CFA), so many methods have been proposed that aim to improve the imaging pipeline to eliminate moiré. However, these methods have high computational complexity and are not suitable for practical application. In 2018, Sun et al. created a large-scale moiré removal benchmark, the TIP2018 dataset, containing hundreds of thousands of image pairs, and proposed a novel multi-resolution fully convolutional network to remove moiré, which greatly advanced the image moiré removal task. Compared with image moiré removal, removing moiré from video is more difficult: it cannot simply be done frame by frame, because doing so introduces artifacts and flicker into the video, temporal coherence between frames cannot be guaranteed, and the performance is unsatisfactory. A new method for the video moiré removal task is therefore urgently needed.
Removing moiré patterns from video has important practical significance: given the huge number of digital videos, removing moiré manually would consume enormous labor and time. A video moiré removal algorithm solves exactly this problem: developers only need a trained video demoiréing network to remove moiré from videos automatically, avoiding repetitive labor and saving a large amount of time. However, since the video moiré removal task has rarely been studied and cannot be solved by simply applying image moiré removal methods, it still calls for dedicated research.
Disclosure of Invention
To fill this gap and remedy the defects of the prior art, the invention provides a video moiré removal method based on a linear sparse attention Transformer, which builds on a purpose-designed linear sparse attention Transformer-based video demoiréing network to achieve high-quality video moiré removal.
The invention specifically adopts the following technical scheme:
A video moiré removal method based on a linear sparse attention Transformer, characterized by comprising the following steps: training a linear sparse attention Transformer-based video demoiréing network, so as to remove moiré from an input video after training is finished;
the linear sparse attention Transformer-based video demoiréing network comprises:
a feature extraction module, which extracts the features of the video frames;
a spatial Transformer module, which uses the spatial attention of the spatial Transformer to locate the moiré patterns within a single-frame image and remove them in a targeted way;
a temporal Transformer module, which uses the temporal attention of the temporal Transformer to capture the complementary information existing among the multiple frames and restores each image using the complementary information of its adjacent frames;
and an image reconstruction module, which decodes the video frame features output by the spatial and temporal Transformer modules and restores them into demoiréd video frames at the same scale as the input video.
Furthermore, the input of the feature extraction module is five adjacent video frames from the same moiré video, where each input frame is denoted I_t, of size 3 × H × W, t ∈ [1,5]. The module consists of four convolution blocks and three pooling layers: the convolution blocks are responsible for extracting image features, and the pooling layers use 2 × 2 average pooling to reduce the feature scale. Video frame I_t is input into the first convolution block to obtain a feature map F_t^1 of size C × H × W; F_t^1 is fed through a pooling layer and the second convolution block to obtain F_t^2; F_t^2 is fed through a pooling layer and the third convolution block to obtain F_t^3; and F_t^3 is fed through a pooling layer and the final convolution block to obtain F_t^4. F_t^2, F_t^3 and F_t^4 have sizes 2C × H/2 × W/2, 4C × H/4 × W/4 and 8C × H/8 × W/8, respectively.
Each convolution block consists, in order, of a convolution layer, an activation layer, a convolution layer and an activation layer. Both activation layers use the ReLU activation function, and both convolution layers use 3 × 3 convolution kernels; the first convolution layer changes the number of channels, and the second keeps the number of channels unchanged.
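The encoder above halves the spatial scale at each pooling step while the channel count grows from C to 8C. A minimal shape-bookkeeping sketch of this pyramid (doubling the channels at every block is an assumption consistent with the 8C-channel deepest features quoted above; `c=32` is a toy value):

```python
def encoder_shapes(c, h, w, blocks=4):
    """Shapes of F_t^1 .. F_t^4: the first conv block sets C channels at full
    resolution; each later block follows a 2x2 average pool and doubles channels."""
    shapes = [(c, h, w)]
    for _ in range(blocks - 1):
        c, h, w = 2 * c, h // 2, w // 2
        shapes.append((c, h, w))
    return shapes

shapes = encoder_shapes(c=32, h=256, w=256)
print(shapes)  # [(32, 256, 256), (64, 128, 128), (128, 64, 64), (256, 32, 32)]
```

The deepest map, here 256 = 8 × 32 channels at 1/8 resolution, is what the spatial Transformer module consumes.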
Further, the spatial Transformer module consists of nine linear sparse attention demoiréing layers and one absolute position code;
wherein the input of the first layer is the feature map F_t^4 from the feature extraction module, the input of each subsequent layer is the output of the previous layer, and the output feature map F_t of the last layer is the final output of the spatial Transformer module; each layer computes the spatial attention of the feature map in linear time complexity;
the absolute position code is a learnable matrix with the same scale as F_t^4; before training, its parameters are initialized with the Xavier initialization method;
The linear sparse attention demoiréing layer consists, in order, of a spatial self-attention layer, a random deactivation (dropout) layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer. The neuron deactivation probability of both dropout layers is set to 0.1, and both normalization layers use layer normalization. The multilayer perceptron consists, in order, of a first fully connected layer, an activation layer and a second fully connected layer, where the activation layer uses the ReLU activation function. Before the input feature map is sent to the spatial self-attention layer, the feature map and the absolute position code are added element-wise, and the sum is sent into the spatial self-attention layer. A residual connection exists between the absolute-position-coded input feature map and the output of the first dropout layer, and a residual connection exists between the output of the first normalization layer and the output of the second dropout layer;
The spatial self-attention layer consists of four learnable matrices, namely a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v, and a bottleneck matrix W_p. The layer is computed as follows:

Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)

where F_in is the spatial self-attention layer input, F_out is the spatial self-attention layer output, Q, K and V are the Query, Key and Value matrices, K^T denotes the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot(·) denotes matrix multiplication. Q, K and W_v are sparse matrices under the constraint of the L2 loss function.
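The grouping Dot(Softmax(Q), Dot(Softmax(K^T), V)) is what makes the attention linear in the number of tokens: the Key-Value product is formed first, yielding a small d × d context matrix. A minimal NumPy sketch of this factorization follows; the axis choices for the two softmaxes (feature axis for Q, token axis for K) follow the common efficient-attention convention and are an assumption, since the patent does not spell them out:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(Q, K, V):
    # Softmax(Q) over the feature axis, Softmax(K^T) over the token axis,
    # then associate as Q' @ (K'^T @ V): cost O(N * d^2) instead of O(N^2 * d).
    Qp = softmax(Q, axis=-1)          # (N, d)
    Kp = softmax(K, axis=0)           # (N, d), normalized over tokens
    context = Kp.T @ V                # (d, d), independent of N^2
    return Qp @ context               # (N, d)

rng = np.random.default_rng(0)
N, d = 1024, 32                       # N = H*W spatial positions, d channels
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
H = linear_attention(Q, K, V)
print(H.shape)                        # (1024, 32)
```

For N spatial positions this never materializes the N × N attention map, which is the source of the quadratic cost in standard Transformer attention.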
Further, the input of the temporal Transformer module is the final output F_t of the spatial Transformer module. It consists of four temporal attention demoiréing layers, an absolute position code and an absolute time code;
the absolute position code shares parameters with the absolute position code of the spatial Transformer module;
the absolute time code is a learnable matrix of scale 5 × 8C × 1; before training, its parameters are initialized with the Xavier initialization method;
the input of the first temporal attention demoiréing layer is F_t for the five video frames; the input of each subsequent layer is the output of the previous layer; and the output feature map of the last layer for the t-th frame, denoted F̂_t, is the final output of the temporal Transformer module for the t-th frame;
The temporal attention demoiréing layer consists, in order, of a temporal self-attention layer, a random deactivation (dropout) layer, a normalization layer, a spatial self-attention layer, a dropout layer, a normalization layer, a multilayer perceptron, a dropout layer and a normalization layer;
the neuron deactivation probability of all three dropout layers is set to 0.1, all three normalization layers use layer normalization, and the multilayer perceptron consists, in order, of a fully connected layer, an activation layer and a fully connected layer, with the activation layer using the ReLU activation function;
the structure of the spatial self-attention layer is the same as that of the spatial self-attention layer in the linear sparse attention demoiréing layer;
before the feature maps are input into the temporal self-attention layer, the feature maps of the five input video frames are first spliced along the time dimension, the spliced feature map and the absolute time code are added element-wise, and the result is sent into the temporal self-attention layer. Before the feature maps are input into the spatial self-attention layer, the spliced feature map must be split back into per-frame maps, to which the absolute position code of the spatial Transformer is then added. A residual connection exists between the absolute-time-coded input feature map and the output of the first dropout layer, a residual connection exists between the absolute-position-coded feature map and the output of the second dropout layer, and a residual connection exists between the output of the second normalization layer and the output of the third dropout layer;
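The splice-add-split bookkeeping around the two attention layers can be sketched as follows; the shapes, the stacking axis, and the broadcasting of the time code over spatial positions are assumptions consistent with the 5 × 8C × 1 time code described above:

```python
import numpy as np

T, C8, HW = 5, 64, 16 * 16            # 5 frames, 8C channels, (H/8)*(W/8) positions
frames = [np.zeros((C8, HW)) for _ in range(T)]
time_code = np.zeros((T, C8, 1))      # absolute time code, broadcast over positions
pos_code = np.zeros((C8, HW))         # absolute position code shared with the spatial module

# Before the temporal self-attention layer: stack along the time dimension
# and add the time code element-wise (it broadcasts across spatial positions).
stacked = np.stack(frames, axis=0) + time_code        # (5, 8C, HW)

# Before the spatial self-attention layer: split back into per-frame maps
# and add the spatial position code instead.
per_frame = [stacked[t] + pos_code for t in range(T)] # five (8C, HW) maps
print(len(per_frame), per_frame[0].shape)
```

The zero tensors stand in for real features; only the shape plumbing is the point here.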
The temporal self-attention layer consists of four learnable matrices, namely a Query weight matrix W'_q, a Key weight matrix W'_k, a Value weight matrix W'_v, and a bottleneck matrix W'_p. The layer is computed as follows:

K_a = [K_1, K_2, …, K_5]
V_a = [V_1, V_2, …, V_5]
H_t(i,j) = Dot(Softmax(Dot(Q_t(i,j), (K_a(i,j))^T)), V_a(i,j))
F_out = Dot(W'_p, H)

where t denotes the t-th frame, t ∈ [1,5]; F_t^in is the input feature belonging to the t-th frame in the temporal self-attention layer; F_out is the temporal self-attention layer output; Q_t, K_t and V_t are respectively the Query, Key and Value matrices belonging to the t-th frame; Softmax(·) denotes the softmax computed over the last dimension of the matrix; Dot(·) denotes matrix multiplication; the superscript T denotes the matrix transpose; [·] denotes stacking into a composite matrix; K_a and V_a are respectively the Key and Value matrices of the five frames; and H is the complete attention feature map of the temporal self-attention layer. (i,j) denotes the position of a feature: the feature map is divided into several non-overlapping 2 × 2 squares, and (i,j) denotes the position of a feature's square. H_t(i,j) denotes the local attention feature at position (i,j) in H for the t-th frame, K_a(i,j) denotes the local Key matrix at position (i,j) in K_a, V_a(i,j) denotes the local Value matrix at position (i,j) in V_a, and Q_t(i,j) denotes the local Query matrix at position (i,j) in Q_t. W'_v is a sparse matrix under the constraint of the L2 loss function.
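Within each 2 × 2 window, full softmax attention is computed across the five frames, so the quadratic cost is bounded by the tiny window size rather than the whole frame. A NumPy sketch for a single window under assumed toy shapes (gathering the window from the full feature maps is elided):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, win, d = 5, 4, 8                    # 5 frames, 2x2 = 4 positions per window, d channels
rng = np.random.default_rng(1)
Q_t = rng.standard_normal((win, d))    # queries of one window of frame t
K = rng.standard_normal((T, win, d))   # keys of the same window in all 5 frames
V = rng.standard_normal((T, win, d))

K_a = K.reshape(T * win, d)            # K_a = [K_1, ..., K_5] stacked over time
V_a = V.reshape(T * win, d)

# H_t(i,j) = Softmax(Q_t(i,j) K_a(i,j)^T) V_a(i,j): each of the 4 window
# positions attends to all 20 = 5*4 positions of that window across time.
H_t = softmax(Q_t @ K_a.T) @ V_a       # (4, d)
print(H_t.shape)                       # (4, 8)
```

Repeating this per window covers the whole frame at a cost linear in the number of windows.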
Further, the input of the image reconstruction module is the final output F̂_3 of the temporal Transformer module for the third frame, of size 8C × H/8 × W/8. The module consists of three upsampling blocks, three convolution blocks, a 1 × 1 convolution and a Tanh activation layer. F̂_3 is input into the first upsampling block to obtain a feature map R_1 of size 4C × H/4 × W/4; R_1 is spliced channel-wise with the feature map F_3^3 of the feature extraction module and input into the first convolution block to obtain a feature map R_2. R_2 is input into the second upsampling block to obtain R_3 of size 2C × H/2 × W/2; R_3 is spliced channel-wise with the feature map F_3^2 of the feature extraction module and input into the second convolution block to obtain R_4. R_4 is input into the third upsampling block to obtain R_5 of size C × H × W; R_5 is spliced channel-wise with the feature map F_3^1 of the feature extraction module and input into the third convolution block to obtain R_6. Finally, R_6 is input into the 1 × 1 convolution and the Tanh activation layer to obtain the output frame image Î_3, i.e. the demoiréd frame corresponding to moiré video frame I_3;
Each upsampling block consists, in order, of an upsampling layer, a convolution layer and an activation layer: the upsampling layer uses bilinear interpolation with a magnification of 2, the convolution layer uses a 3 × 3 convolution kernel and changes the number of feature map channels, reducing it to half the original number, and the activation layer uses the ReLU activation function. Each convolution block consists, in order, of a convolution layer, an activation layer, a convolution layer and an activation layer, where the convolution layers all use 3 × 3 convolution kernels and the activation layers use the ReLU activation function.
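The decoder path doubles the spatial scale and halves the channels at each step. A toy sketch of that scale bookkeeping, with deliberate simplifications: nearest-neighbor repetition stands in for bilinear interpolation, and a fixed additive mix of channel halves stands in for the learned 3 × 3 convolution:

```python
import numpy as np

def upsample_block(x):
    """Toy stand-in for one upsampling block: 2x nearest-neighbor upsample
    (the patent uses bilinear) followed by a channel-halving mix
    (standing in for the 3x3 convolution), then ReLU."""
    c, h, w = x.shape
    up = x.repeat(2, axis=1).repeat(2, axis=2)        # (c, 2h, 2w)
    proj = up[: c // 2] + up[c // 2 :]                # crude c -> c/2 mix
    return np.maximum(proj, 0.0)                      # ReLU

x = np.ones((8, 4, 4))                                # toy sizes: 8C=8, H/8=4, W/8=4
y = upsample_block(x)
print(y.shape)                                        # (4, 8, 8)
```

The shape transition (8, 4, 4) to (4, 8, 8) mirrors the 8C × H/8 × W/8 to 4C × H/4 × W/4 step of the real module.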
Further, the loss function used to train the video demoiréing network is constructed as follows:

The overall optimization objective of the network is:

min(L), L = L_C + L_ASL + L_cr + L_sparse

where min(L) represents minimizing L, L represents the total demoiréing loss, L_C represents the Charbonnier loss between the demoiréd image and the clean image, L_ASL represents the edge texture loss between the demoiréd image and the clean image, L_cr represents the color loss between the demoiréd image and the clean image, and L_sparse represents the loss constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module.

The Charbonnier loss L_C between the demoiréd image and the clean image is calculated as follows:

L_C = λ_C · sqrt(||Î_3 − O_3||² + ε²)

where Î_3 represents the demoiréd image corresponding to the third of the five input moiré video frames, O_3 represents the clean image paired with it, ε represents a constant controlling precision, and λ_C represents the weight of the loss.

The edge texture loss L_ASL between the demoiréd image and the clean image is calculated as follows:

L_ASL = λ_ASL · Σ_* ||Sobel_*(Î_3) − Sobel_*(O_3)||_1

where ||·||_1 is the absolute-value (L1) operation, Sobel_* denotes improved Sobel filters of different orientations, Sobel_*(·) denotes the convolution operation, and λ_ASL represents the weight of the loss.

The color loss L_cr between the demoiréd image and the clean image is calculated as follows:

L_cr = λ_cr · ||G(Î_3) − G(O_3)||²_2

where G represents a Gaussian blur kernel, G(Î_3) represents the blurred demoiréd image, G(O_3) represents the blurred clean image, ||·||²_2 is the squared two-norm operation, and λ_cr represents the weight of the loss.

The loss L_sparse constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module is calculated as follows:

L_sparse = λ_sparse · (Σ ||Q_*||² + Σ ||K_*||² + Σ ||W_v||² + Σ ||W'_v||²)

where Q_* and K_* represent the Query and Key matrices calculated in all spatial self-attention layers of the spatial Transformer module, W_v represents the Value weight matrices of all spatial self-attention layers in the spatial Transformer module and the temporal Transformer module, W'_v represents the Value weight matrices of all temporal self-attention layers in the temporal Transformer module, ||·||² is the squared norm operation, and λ_sparse represents the weight of the loss.
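A NumPy sketch of assembling the total loss from its terms. Several stand-ins are assumptions: finite differences replace the oriented Sobel filters, the Gaussian-blur color term is omitted, and all λ weights are placeholders rather than the patent's values:

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    # sqrt(||pred - target||^2 + eps^2): a smooth, robust L1-like penalty.
    return np.sqrt(np.sum((pred - target) ** 2) + eps ** 2)

def edge_l1(pred, target):
    # Horizontal/vertical finite differences as a stand-in for the
    # oriented (improved) Sobel filters of the patent.
    dx = lambda im: im[:, 1:] - im[:, :-1]
    dy = lambda im: im[1:, :] - im[:-1, :]
    return np.abs(dx(pred) - dx(target)).sum() + np.abs(dy(pred) - dy(target)).sum()

def sparse_l2(mats):
    # Squared-norm penalty pushing the attention matrices toward small values.
    return sum(np.sum(m ** 2) for m in mats)

rng = np.random.default_rng(2)
pred, clean = rng.random((16, 16)), rng.random((16, 16))
attn_mats = [rng.standard_normal((4, 4)) for _ in range(3)]

lam_C, lam_ASL, lam_sparse = 1.0, 0.5, 1e-4   # placeholder weights
L = (lam_C * charbonnier(pred, clean)
     + lam_ASL * edge_l1(pred, clean)
     + lam_sparse * sparse_l2(attn_mats))
print(L >= 0.0)                                # True
```

Note that with a perfect reconstruction the Charbonnier term bottoms out at ε rather than 0, which is what keeps its gradient well-behaved near zero error.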
Further, training the linear sparse attention Transformer-based video demoiréing network first requires building a training set: the videos in the original dataset are processed to obtain moiré-video and clean-video pairs, each video is decoded into video frames, and every frame image is preprocessed. This specifically comprises the following steps:
Step S11: acquire an original dataset in which each moiré video and its corresponding clean video have the same size and the same content, and the moiré videos and clean videos correspond one-to-one. Videos from the same video content form moiré-video and clean-video pairs, and each frame in a pair corresponds one-to-one in content.

Step S12: randomly flip all video frames of each moiré-video and clean-video pair in the same way, then randomly crop the moiré video frames and clean video frames to H × W, keeping the correspondence between the moiré video and the clean video.

Step S13: normalize all moiré video frames and clean video frames of size H × W. Given a frame image I(i,j), the normalized frame image Î(i,j) is computed by the following formula, where (i,j) represents the position of the pixel.
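The normalization formula itself is lost in this extraction. Since the reconstruction module ends in a Tanh layer, a plausible assumption is scaling 8-bit pixel values to [−1, 1], sketched below; the exact formula is an assumption, not the patent's:

```python
import numpy as np

def normalize_frame(frame_u8):
    # Assumed normalization: map 8-bit pixel values [0, 255] to [-1, 1],
    # matching the Tanh-activated output range of the reconstruction module.
    return frame_u8.astype(np.float32) / 127.5 - 1.0

frame = np.array([[0, 128, 255]], dtype=np.uint8)
out = normalize_frame(frame)
print(out.min(), out.max())            # -1.0 1.0
```

Whatever the exact formula, the inverse mapping must be applied to the network output before saving demoiréd frames as 8-bit images.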
Further, the training process specifically comprises the following steps:
Step S41: randomly select a moiré-video and clean-video pair from the training set; then randomly select five adjacent video frames from the moiré video, and simultaneously select the five corresponding frames of the corresponding clean video. The moiré video frames and clean video frames are denoted I_t and O_t respectively, t ∈ [1,5].

Step S42: train the first stage of the linear sparse attention Transformer-based video demoiréing network. Input the five video frames I_t, obtain the demoiréd intermediate frame Î_3 through the network's computation, and calculate the total demoiréing loss L. Compute the gradient of each parameter in the network by back-propagation and update the parameters with the Adam optimization method, with the learning rate decreasing from 10^-4 to 10^-5.

Step S43: train the second stage of the linear sparse attention Transformer-based video demoiréing network. The input and training method are the same as in the first stage, except that the color loss weight λ_cr in the total demoiréing loss L is set to 0 and the learning rate decreases from 10^-5 to 10^-6, performing fine-tuning training of the network.
Further, the specific operation of removing moiré from an input video after training is completed is as follows: for a new moiré video, insert two blank frames at the beginning of the video and two at the end; take the first five frames of the padded video and input them into the network to obtain the demoiréd video frame corresponding to the first frame of the original moiré video; then take the second through sixth frames and input them into the network to obtain the demoiréd frame corresponding to the second frame of the original moiré video; and repeat the same operation until the demoiréd frame corresponding to the last frame of the original moiré video is obtained.
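The padding-and-slide indexing can be sketched as follows; blank frames are represented here by the index None, and the network call itself is omitted:

```python
def sliding_windows(num_frames, pad=2, win=5):
    """Yield, for each original frame, the 5-frame window centered on it.
    Two blank frames (None) are inserted at the start and end of the video,
    so output frame k uses source frames k-2 .. k+2 where they exist."""
    padded = [None] * pad + list(range(num_frames)) + [None] * pad
    return [padded[k : k + win] for k in range(num_frames)]

windows = sliding_windows(6)
print(windows[0])   # [None, None, 0, 1, 2] -> demoiréd frame 0
print(windows[5])   # [3, 4, 5, None, None] -> demoiréd frame 5
```

Every window is centered on the frame being restored, which matches the network's convention of reconstructing the third of its five input frames.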
An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the linear sparse attention Transformer-based video moiré removal method described above.
A computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the linear sparse attention Transformer-based video moiré removal method described above.
Compared with the prior art, the invention and its optimized scheme have the following beneficial effects:
The method uses spatial attention and temporal attention to exploit both the spatial information within a frame and the temporal information between different frames, removing moiré, supplementing details, and preventing artifacts and flicker from appearing in the video. It uses a linear attention computation that reduces the quadratic time complexity of the original Transformer attention to linear time complexity, greatly reducing the network's computational load and improving its practical applicability. Meanwhile, a loss function constrains the matrices computed in the linear attention to be sparse, achieving a more effective and stable demoiréing result. The method can remove moiré from video at low computational complexity, generate high-quality clean moiré-free video, and improve both the visual quality and the performance metrics of the generated video. The invention therefore has strong practicability and broad application prospects.
Drawings
Fig. 1 is a schematic flow chart of the implementation of the embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the video demoiréing network according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the spatial Transformer module according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the temporal Transformer module according to an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, this embodiment further details the scheme of the present invention, in the form of steps, through a specific operating example that implements the linear sparse attention Transformer-based video demoiréing method.
The method can be specifically summarized as the following steps:
Step S1: process the videos in the original data set to obtain moiré-video/clean-video pairs, decode each video into video frames, and preprocess each frame of each video to obtain a training set. Step S1 specifically includes the following steps:
Step S11: acquire an original data set in which each moiré video has the same size and the same content as its corresponding clean video, with moiré videos and clean videos in one-to-one correspondence; form moiré-video/clean-video pairs from videos of the same content, ensuring that the frames within each pair correspond to one another.
Step S12: apply the same random flipping to all video frames of each moiré/clean video pair, then randomly crop the moiré frames and clean frames to size H × W while preserving the correspondence between the moiré video and the clean video;
Step S13: normalize all moiré video frames and clean video frames of size H × W; given a frame image I(i, j), a normalized frame image is computed, where (i, j) denotes the position of a pixel.
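The normalization formula itself is not reproduced in the text above; as an illustration only, the sketch below assumes a common choice of mapping 8-bit pixel values into [-1, 1]:

```python
import numpy as np

def normalize_frame(frame_u8):
    # Assumed normalization: scale 8-bit pixels from [0, 255] to [-1, 1].
    # The patent's exact formula is not given here; this is illustrative only.
    return frame_u8.astype(np.float32) / 127.5 - 1.0

frame = np.array([[0, 128, 255]], dtype=np.uint8)
norm = normalize_frame(frame)
```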
Step S2: construct the video demoiréing network based on the linear sparse attention Transformer. As shown in fig. 2, the video demoiréing network consists of four parts: a feature extraction module, a spatial Transformer module, a time Transformer module and an image reconstruction module;
specifically, step S2 includes the following steps:
Step S21: construct the feature extraction module, which extracts features from the video frames in preparation for the subsequent moiré removal;
as shown in FIG. 2, the input of the feature extraction module is five adjacent video frames in the same Moire pattern video, wherein the input video frame is I t It is expressed that the size is 3 XHXW, t ∈ [1,5 ]](ii) a The module consists of four convolution blocks and three pooling layers, wherein the convolution blocks are responsible for extracting image features, and the pooling layers adopt 2 multiplied by 2 average pooling layers to reduce feature scales; video frame I t Inputting the data into the first volume block to obtain a feature mapThe size is C × H × W, willFeeding into a pooling layer and a second rolling block to obtain a characteristic diagramIt has a size ofSame, willFeeding into a pooling layer and a third rolling block to obtainWill be provided withFeeding into a pooling layer and a final rolled blockAndrespectively of size And
Each convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer. Both activation layers use the ReLU activation function and both convolution layers use 3 × 3 convolution kernels; the first convolution layer changes the number of channels (for example, the first convolution block changes the channel count from 3 to C, and each subsequent block doubles it), while the second convolution layer keeps the number of channels unchanged.

Step S22: construct the spatial Transformer module, which uses the spatial attention of the spatial Transformer to locate the moiré in a single frame and remove it in a targeted way, achieving a better demoiréing effect.
As shown in FIG. 3, the spatial Transformer module consists of nine linear sparse attention demoiréing layers and one absolute position code. The input of the first layer is the feature map F_t^4 from the feature extraction module; the input of each subsequent layer is the output of the previous layer; and the output feature map F_t of the last layer is the final output of the spatial Transformer module. Each layer computes the spatial attention of the feature map in linear time complexity, helping the network locate and remove the regions where the moiré is severe. The absolute position code is a learnable matrix with the same scale as F_t^4; before training, it is parameter-initialized using the Xavier initialization method;
The linear sparse attention demoiréing layer consists, in sequence, of a spatial self-attention layer, a random inactivation (dropout) layer, a normalization layer, a multi-layer perceptron, a random inactivation layer and a normalization layer. Both random inactivation layers set the neuron inactivation probability to 0.1, both normalization layers use layer normalization, and the multi-layer perceptron consists of a fully connected layer, an activation layer and a fully connected layer in sequence, the activation layer using the ReLU activation function. Before the input feature map is sent to the spatial self-attention layer, it is added element-wise to the absolute position code. A residual connection exists between the input feature map prior to absolute position coding and the output of the first random inactivation layer, and a residual connection exists between the output of the first normalization layer and the output of the second random inactivation layer;
The spatial self-attention layer consists of four learnable matrices: a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v and a bottleneck matrix W_p. The calculation formulas for this layer are as follows:
Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)
where F_in is the input of the spatial self-attention layer, F_out is the output of this layer, Q, K and V are the Query, Key and Value matrices respectively, K^T denotes the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot() denotes matrix multiplication; Q, K and W_v are sparse matrices under the constraint of the L2 loss function.
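The grouping in the formula above is what makes the attention linear in sequence length: computing Softmax(K^T) · V first yields a small d × d matrix, so the overall cost is O(N·d²) rather than the O(N²·d) of standard attention. A minimal NumPy sketch (shapes and seed are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(Q, K, V):
    # Softmax(K^T) @ V is (d, N) @ (N, d) -> (d, d): the expensive N x N
    # attention map is never formed, so cost grows linearly with N.
    context = softmax(K.T, axis=-1) @ V      # (d, d)
    return softmax(Q, axis=-1) @ context     # (N, d)

rng = np.random.default_rng(0)
N, d = 1024, 32
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
H = linear_attention(Q, K, V)
```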
Step S23: construct the time Transformer module, which uses the temporal attention of the time Transformer to capture the complementary information that exists between multiple frames and uses the complementary information of adjacent frames for image restoration, further improving the demoiréing effect.
As shown in FIG. 4, the input of the time Transformer module is the final output F_t of the spatial Transformer module. The module consists of four temporal attention demoiréing layers, one absolute position code and one absolute time code. The absolute position code shares parameters with that of the spatial Transformer module; the absolute time code is a learnable matrix of scale 5 × 8C × 1, parameter-initialized with the Xavier initialization method before training. The input of the first temporal attention demoiréing layer is the F_t of the five video frames; the input of each subsequent layer is the output of the previous layer; and the output feature map of the last layer for the t-th frame is the final output of the time Transformer module for that frame. The temporal attention demoiréing layer is similar in structure to the linear sparse attention demoiréing layer; the main difference is that it additionally contains a temporal self-attention layer, which captures complementary information between adjacent video frames, and this temporally complementary information aids the removal of moiré;
The temporal attention demoiréing layer consists, in sequence, of a temporal self-attention layer, a random inactivation layer, a normalization layer, a spatial self-attention layer, a random inactivation layer, a normalization layer, a multi-layer perceptron, a random inactivation layer and a normalization layer. The three random inactivation layers set the neuron inactivation probability to 0.1, the three normalization layers use layer normalization, and the multi-layer perceptron consists of a fully connected layer, an activation layer and a fully connected layer in sequence, the activation layer using the ReLU activation function; the spatial self-attention layer has the same structure as the spatial self-attention layer in the linear sparse attention demoiréing layer. Before input to the temporal self-attention layer, the feature maps of the five input video frames are first concatenated along the time dimension, the concatenated feature map is added element-wise to the absolute time code, and the result is sent to the temporal self-attention layer; before input to the spatial self-attention layer, the concatenated feature map must be split back into per-frame feature maps, and the absolute position code of the spatial Transformer module is then added. A residual connection exists between the input feature map prior to absolute time coding and the output of the first random inactivation layer, a residual connection exists between the feature map prior to absolute position coding and the output of the second random inactivation layer, and a residual connection exists between the output of the second normalization layer and the output of the third random inactivation layer;
The temporal self-attention layer consists of four learnable matrices: a Query weight matrix W′_q, a Key weight matrix W′_k, a Value weight matrix W′_v and a bottleneck matrix W′_p. The calculation formulas for this layer are as follows:
Q_t = Dot(W′_q, F_in^t)
K_t = Dot(W′_k, F_in^t)
V_t = Dot(W′_v, F_in^t)
K_a = [K_1, K_2, …, K_5]
V_a = [V_1, V_2, …, V_5]
H_t(i, j) = Dot(Softmax(Dot(Q_t(i, j), (K_a(i, j))^T)), V_a(i, j))
F_out = Dot(W′_p, H)
where t denotes the t-th frame, t ∈ [1, 5]; F_in^t is the input feature of the t-th frame in the temporal self-attention layer and F_out is the output of this layer; Q_t, K_t and V_t are the Query, Key and Value matrices of the t-th frame respectively; Softmax() denotes a softmax over the last dimension of a matrix, Dot() denotes matrix multiplication, superscript T denotes matrix transposition, and [·] denotes matrix concatenation; K_a and V_a are the Key and Value matrices gathered over the five frames; H is the complete attention feature map of the temporal self-attention layer; (i, j) denotes the position of a feature: specifically, the feature map is divided into non-overlapping 2 × 2 windows and (i, j) denotes the position of the window containing the feature. H_t(i, j) denotes the local attention feature at position (i, j) in H for the t-th frame, K_a(i, j) the local Key matrix at position (i, j) in K_a, V_a(i, j) the local Value matrix at position (i, j) in V_a, and Q_t(i, j) the local Query matrix at position (i, j) in Q_t; W′_v is a sparse matrix under the constraint of the L2 loss function.
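For one 2 × 2 window, the temporal attention above reduces to ordinary attention between the window's queries in frame t and the keys/values gathered from the same window across all five frames. A shape-level NumPy sketch (window size and feature dimension are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_temporal_attention(Qt, Ka, Va):
    # Qt: (w, d) queries of one 2x2 window in frame t (w = 4 positions);
    # Ka, Va: (5*w, d) keys/values from the same window across 5 frames.
    A = softmax(Qt @ Ka.T, axis=-1)   # (w, 5w) attention weights
    return A @ Va                     # (w, d) local attention feature H_t(i, j)

rng = np.random.default_rng(1)
w, d = 4, 16
Qt = rng.standard_normal((w, d))
Ka = rng.standard_normal((5 * w, d))
Va = rng.standard_normal((5 * w, d))
Ht = window_temporal_attention(Qt, Ka, Va)
```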
Step S24: construct the image reconstruction module, which decodes the video-frame features produced by the spatial Transformer module and the time Transformer module and restores them to a demoiréd video frame at the same scale as the input video.
As shown in FIG. 2, the input of the image reconstruction module is the final output of the time Transformer module for the intermediate (third) frame, of size 8C × H/8 × W/8. The module consists of three upsampling blocks, three convolution blocks, a 1 × 1 convolution and a Tanh activation layer. The input is fed to the first upsampling block to obtain a feature map of size 4C × H/4 × W/4, which is concatenated channel-wise with the feature map F_3^3 of the feature extraction module and input to the first convolution block. The result is input to the second upsampling block to obtain a feature map of size 2C × H/2 × W/2, which is concatenated channel-wise with F_3^2 and input to the second convolution block. That result is input to the third upsampling block to obtain a feature map of size C × H × W, which is concatenated channel-wise with F_3^1 and input to the third convolution block. Finally, the output is passed through the 1 × 1 convolution and the Tanh activation layer to obtain the output frame image, i.e. the demoiréd frame corresponding to moiré video frame I_3;
The upsampling block consists, in sequence, of an upsampling layer, a convolution layer and an activation layer; the upsampling layer uses bilinear interpolation with a magnification of 2, the convolution layer uses a 3 × 3 convolution kernel and halves the number of feature-map channels, and the activation layer uses the ReLU activation function. The convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer; the convolution layers all use 3 × 3 convolution kernels, and the activation layers also use the ReLU activation function;
Step S3: construct the loss functions used to train the video demoiréing network.
Step S3 specifically includes the following steps:
step S31, constructing a total optimization target of the whole network; the optimization objectives are as follows:
min(L),
where min(L) means minimizing L, and L denotes the total demoiréing loss, comprising the Charbonnier loss between the demoiréd image and the clean image, the edge texture loss between the demoiréd image and the clean image, the color loss between them, and the loss constraining the sparse matrices in the spatial Transformer module and the time Transformer module;
Step S32: construct the Charbonnier loss between the demoiréd image and the clean image; its calculation formula is as follows:
where the demoiréd image corresponds to the third of the five input moiré video frames, O_3 is the clean image paired with it, ε is a small constant controlling numerical precision, and λ_C is the weight of this loss;
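The Charbonnier formula itself is not reproduced above; the sketch below assumes the standard form sqrt((x − y)² + ε²), averaged over pixels, with the weight λ_C omitted:

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    # Standard Charbonnier penalty: a smooth, differentiable L1 surrogate;
    # eps keeps the gradient finite when pred == target.
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

a = np.zeros((8, 8))
b = np.ones((8, 8))
loss_same = charbonnier(a, a)   # reduces to eps for identical images
loss_diff = charbonnier(a, b)
```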
Step S33: construct the edge texture loss between the demoiréd image and the clean image; its calculation formula is as follows:
where ‖·‖₁ denotes the absolute-value (L1) operation, Sobel* denotes improved Sobel filters of different orientations, Sobel*() denotes the corresponding convolution operation, and λ_ASL is the weight of this loss;
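A NumPy sketch of an edge texture loss using the two standard Sobel orientations (the patent's "improved" filters may differ, and the weight λ_ASL is omitted):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter2d(img, k):
    # valid-mode 3x3 correlation implemented with shifted slices
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:i + H - 2, j:j + W - 2]
    return out

def edge_loss(pred, target):
    # L1 distance between the Sobel responses of the two images
    return sum(np.abs(filter2d(pred, k) - filter2d(target, k)).mean()
               for k in (SOBEL_X, SOBEL_Y))

img = np.arange(36, dtype=float).reshape(6, 6)
```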
Step S34: construct the color loss between the demoiréd image and the clean image; its calculation formula is as follows:
where G denotes a Gaussian blur kernel, applying G to the demoiréd image gives its blurred version and G(O_3) the blurred clean image, ‖·‖₂² denotes the squared L2 norm, and λ_cr is the weight of this loss;
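A sketch of the color loss: blur both images with a Gaussian kernel (so only low-frequency color content is compared) and take the squared L2 distance. The kernel size, sigma, and weight are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma=1.0, radius=2):
    # separable Gaussian blur: filter rows, then columns
    k = gaussian_kernel1d(sigma, radius)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def color_loss(pred, target, lam=1.0):
    d = blur(pred) - blur(target)
    return lam * float(np.sum(d ** 2))

pred = np.arange(25, dtype=float).reshape(5, 5)
```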
Step S35: construct the loss constraining the sparse matrices in the spatial Transformer module and the time Transformer module; its calculation formula is as follows:
wherein Q * And K * Representing the Query matrix and the Key matrix calculated in all the spaces in the space Transformer module from the attention layer,represents the Value weight matrix calculated in all the space self-attention layers in the space Transformer module and the time Transformer module,represents the Value weight matrix calculated from the attention layer for all the time in the time Transformer module,is a squaring operation of the absolute value, λ sparse A weight representing the loss;
Step S4: train the video demoiréing network with the training data set.
Step S4 specifically includes the following steps:
Step S41: randomly select a moiré/clean video pair from the training data set, then randomly select five adjacent frames from the moiré video and the five corresponding frames from the clean video; the moiré video frames and clean video frames are denoted I_t and O_t respectively, t ∈ [1, 5];
Step S42: train the first stage of the linear sparse attention Transformer-based video demoiréing network. The five video frames I_t are input and the demoiréd intermediate frame is obtained through the network; the total demoiréing loss L is computed, the gradient of each network parameter is computed by back-propagation, and the parameters are updated with the Adam optimization method, the learning rate decaying slowly from 10^-4 to 10^-5;
Step S43: train the second stage of the linear sparse attention Transformer-based video demoiréing network. The input and training method are the same as in the first stage, except that the color-loss weight λ_cr in the total demoiréing loss L is set to 0 and the learning rate decays slowly from 10^-5 to 10^-6, fine-tuning the network;
In this embodiment, the whole training process is iterated four million times; in each iteration, several video pairs are randomly sampled and trained as one batch. The first two million five hundred thousand iterations use the first-stage training of step S42, and the remaining one million five hundred thousand iterations use the second-stage training of step S43;
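The patent only states that the learning rate "slowly decreases" over each stage; a linear decay is one plausible reading, sketched below (the decay law is an assumption):

```python
def lr_schedule(iteration, start=1e-4, end=1e-5, stage_iters=2_500_000):
    # linear decay from `start` to `end` over one training stage
    # (the patent does not specify the decay law; this is illustrative)
    t = min(iteration / stage_iters, 1.0)
    return start + t * (end - start)

lr0 = lr_schedule(0)              # learning rate at the start of stage one
lr_end = lr_schedule(2_500_000)   # learning rate at the end of stage one
```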
Step S5: input a new moiré video into the trained video demoiréing network and output a clean video free of moiré. Specifically, for a new moiré video, two blank frames are inserted at the beginning of the video and two at the end. The first five frames are input into the network to obtain the demoiréd frame corresponding to the first frame of the original moiré video; then frames two through six are input to obtain the demoiréd frame corresponding to the second frame; and the same operation is repeated until demoiréd frames have been obtained for every frame of the original video.
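The sliding-window inference of step S5 amounts to simple index bookkeeping; in the sketch below, None stands for an inserted blank frame at the video boundaries:

```python
def five_frame_windows(num_frames):
    # The window for output frame t covers input frames t-2 .. t+2;
    # None marks a blank frame padded at the start or end of the video.
    return [[f if 0 <= f < num_frames else None
             for f in range(t - 2, t + 3)]
            for t in range(num_frames)]

windows = five_frame_windows(4)
```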
This embodiment also provides a linear sparse attention Transformer-based video demoiréing system, comprising a memory, a processor, and computer program instructions stored in the memory and executable by the processor, the computer program instructions, when executed by the processor, implementing the above method steps.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow of the flowcharts, and combinations of flows in the flowcharts, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.
The foregoing is directed to preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification, equivalent change or variation of the above embodiments made according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.
The present invention is not limited to the above preferred embodiments; anyone, in light of the present invention, may derive various other forms of linear sparse attention Transformer-based video demoiréing methods.
Claims (10)
1. A video demoiréing method based on a linear sparse attention Transformer, characterized by comprising the following steps: training a video demoiréing network based on a linear sparse attention Transformer, and removing moiré from an input video after training is finished;
the video demoiréing network based on the linear sparse attention Transformer comprises:
a feature extraction module, used for extracting features of the video frames;
a spatial Transformer module, used for capturing, with the spatial attention of the spatial Transformer, the positions where moiré appears in a single-frame image and removing it in a targeted way;
a time Transformer module, which captures, with the temporal attention of the time Transformer, the complementary information existing among multiple frames and carries out image recovery using the complementary information of adjacent frames;
and an image reconstruction module, used for decoding the video-frame features produced by the spatial Transformer module and the time Transformer module and restoring them to a demoiréd video frame at the same scale as the input video.
2. The linear sparse attention Transformer-based video demoiréing method of claim 1, characterized in that: the input of the feature extraction module is five adjacent video frames from the same moiré video, the t-th input frame being denoted I_t, of size 3 × H × W, t ∈ [1, 5]; the module consists of four convolution blocks and three pooling layers, the convolution blocks extracting image features and the pooling layers using 2 × 2 average pooling to reduce the feature scale; frame I_t is input to the first convolution block to obtain a feature map F_t^1 of size C × H × W; F_t^1 is fed through a pooling layer and the second convolution block to obtain F_t^2; F_t^2 is fed through a pooling layer and the third convolution block to obtain F_t^3; and F_t^3 is fed through a pooling layer and the last convolution block to obtain F_t^4; the sizes of F_t^2, F_t^3 and F_t^4 are 2C × H/2 × W/2, 4C × H/4 × W/4 and 8C × H/8 × W/8, respectively;
each convolution block consists, in sequence, of a convolution layer, an activation layer, a convolution layer and an activation layer; both activation layers use the ReLU activation function, both convolution layers use 3 × 3 convolution kernels, the first convolution layer changes the number of channels, and the second convolution layer keeps the number of channels unchanged.
3. The linear sparse attention Transformer-based video demoiréing method of claim 2, characterized in that: the spatial Transformer module consists of nine linear sparse attention demoiréing layers and one absolute position code;
wherein the input of the first layer is the feature map F_t^4 of the feature extraction module, the input of each subsequent layer is the output of the previous layer, and the output feature map F_t of the last layer is the final output of the spatial Transformer module, each layer computing the spatial attention of the feature map in linear time complexity;
the absolute position code is a learnable matrix with the same scale as F_t^4, parameter-initialized with the Xavier initialization method before training;
the linear sparse attention demoiréing layer consists, in sequence, of a spatial self-attention layer, a random inactivation layer, a normalization layer, a multi-layer perceptron, a random inactivation layer and a normalization layer; both random inactivation layers set the neuron inactivation probability to 0.1, and both normalization layers use layer normalization; the multi-layer perceptron consists, in sequence, of a first fully connected layer, an activation layer and a second fully connected layer, the activation layer using the ReLU activation function; before the input feature map is sent to the spatial self-attention layer, it is added element-wise to the absolute position code; a residual connection exists between the input feature map prior to absolute position coding and the output of the first random inactivation layer, and a residual connection exists between the output of the first normalization layer and the output of the second random inactivation layer;
the spatial self-attention layer consists of four learnable matrices: a Query weight matrix W_q, a Key weight matrix W_k, a Value weight matrix W_v and a bottleneck matrix W_p; the calculation formulas for this layer are as follows:
Q = Dot(W_q, F_in)
K = Dot(W_k, F_in)
V = Dot(W_v, F_in)
H = Dot(Softmax(Q), Dot(Softmax(K^T), V))
F_out = Dot(W_p, H)
where F_in is the input of the spatial self-attention layer, F_out is the output of the spatial self-attention layer, Q, K and V are the Query, Key and Value matrices, K^T denotes the transpose of K, H is the attention feature map of the spatial self-attention layer, and Dot() denotes matrix multiplication; Q, K and W_v are sparse matrices under the constraint of the L2 loss function.
4. The linear sparse attention Transformer-based video demoiréing method according to claim 3, characterized in that:
the input of the time Transformer module is the final output F_t of the spatial Transformer module; the module consists of four temporal attention demoiréing layers, one absolute position code and one absolute time code;
the absolute position code shares parameters with the absolute position code of the space Transformer module;
the absolute time code is a learnable matrix of scale 5 × 8C × 1, parameter-initialized with the Xavier initialization method before training;
the input of the first temporal attention demoiréing layer is the F_t of the five video frames; the input of each subsequent layer is the output of the previous layer; and the output feature map of the last layer for the t-th frame is the final output of the time Transformer module for the t-th frame;
the temporal attention demoiréing layer consists, in sequence, of a temporal self-attention layer, a random inactivation layer, a normalization layer, a spatial self-attention layer, a random inactivation layer, a normalization layer, a multi-layer perceptron, a random inactivation layer and a normalization layer;
all three random inactivation layers set the neuron inactivation probability to 0.1, all three normalization layers use layer normalization, and the multi-layer perceptron consists, in sequence, of a fully connected layer, an activation layer and a fully connected layer, the activation layer using the ReLU activation function;
the spatial self-attention layer has the same structure as the spatial self-attention layer in the linear sparse attention demoiréing layer;
before input to the temporal self-attention layer, the feature maps of the five input video frames are first concatenated along the time dimension, the concatenated feature map is added element-wise to the absolute time code, and the result is sent to the temporal self-attention layer; before input to the spatial self-attention layer, the concatenated feature map must be split back into per-frame feature maps, and the absolute position code of the spatial Transformer module is then added; a residual connection exists between the input feature map prior to absolute time coding and the output of the first random inactivation layer, a residual connection exists between the feature map prior to absolute position coding and the output of the second random inactivation layer, and a residual connection exists between the output of the second normalization layer and the output of the third random inactivation layer;
the temporal self-attention layer consists of four learnable matrices: a Query weight matrix W′_q, a Key weight matrix W′_k, a Value weight matrix W′_v and a bottleneck matrix W′_p; the calculation formulas for this layer are as follows:
Q_t = Dot(W′_q, F_in^t)
K_t = Dot(W′_k, F_in^t)
V_t = Dot(W′_v, F_in^t)
K_a = [K_1, K_2, …, K_5]
V_a = [V_1, V_2, …, V_5]
H_t(i, j) = Dot(Softmax(Dot(Q_t(i, j), (K_a(i, j))^T)), V_a(i, j))
F_out = Dot(W′_p, H)
where t denotes the t-th frame, t ∈ [1, 5]; F_in^t is the input feature of the t-th frame in the temporal self-attention layer and F_out is the output of the temporal self-attention layer; Q_t, K_t and V_t are the Query, Key and Value matrices of the t-th frame respectively; Softmax() denotes a softmax over the last dimension of a matrix, Dot() denotes matrix multiplication, superscript T denotes matrix transposition, and [·] denotes matrix concatenation; K_a and V_a are the Key and Value matrices gathered over the five frames; H is the complete attention feature map of the temporal self-attention layer; (i, j) denotes the position of a feature, that is, the feature map is divided into non-overlapping 2 × 2 windows and (i, j) denotes the position of the window containing the feature; H_t(i, j) denotes the local attention feature at position (i, j) in H for the t-th frame, K_a(i, j) the local Key matrix at position (i, j) in K_a, V_a(i, j) the local Value matrix at position (i, j) in V_a, and Q_t(i, j) the local Query matrix at position (i, j) in Q_t; W′_v is a sparse matrix under the constraint of the L2 loss function.
5. The linear sparse attention Transformer-based video Moire removing method of claim 4, wherein:
the input of the image reconstruction module is the final output of the time Transformer module to the third frameHaving a dimension ofConsists of three upsampling blocks, three convolution blocks, a 1 x 1 convolution and a Tanh activation layer; will be provided withInput to the first upsampling block to obtain a feature mapIt has a size ofWill be provided withAnd feature map of feature extraction moduleSplicing according to channels and inputting the spliced signals into a first volume block to obtain a feature mapWill be provided withInputting the data into a second up-sampling block to obtain a characteristic diagramIt has a size ofWill be provided withAnd feature map of feature extraction moduleSplicing according to channels and inputting the spliced result into a second volume block to obtain a feature mapWill be provided withInputting the data into a third up-sampling block to obtain a characteristic diagramThe size of which is CxHxW, isAnd feature map of feature extraction moduleSplicing according to channels and inputting the spliced signals into a third volume block to obtain a feature mapWill be provided withInputting the input image into a 1 × 1 convolution and a Tanh activation layer to obtain an output frame imageI.e. corresponding to Moire video frame I 3 Removing moire pattern frames;
the upsampling block consists, in order, of an upsampling layer, a convolution layer and an activation layer; the upsampling layer uses bilinear interpolation with a magnification of 2; the convolution layer is a convolution with a 3 × 3 kernel and changes the number of feature map channels, reducing it to half the original number; the activation layer uses the ReLU activation function. The convolution block consists, in order, of a convolution layer, an activation layer, a convolution layer and an activation layer; the convolution layers are all convolutions with 3 × 3 kernels, and the activation layers use the ReLU activation function.
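The two blocks can be sketched in PyTorch as follows; this is a plausible reading of the description, with channel counts and module names chosen for illustration rather than taken from the patent:

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    # Bilinear x2 upsampling, 3x3 conv halving the channel count, ReLU
    def __init__(self, in_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(self.up(x)))

class ConvBlock(nn.Module):
    # conv-ReLU-conv-ReLU, all 3x3 kernels
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)
```

In the reconstruction module, each upsampling block's output would be concatenated channel-wise with the matching skip feature map before entering the convolution block.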
6. The linear sparse attention Transformer-based video Moire removing method of claim 5, wherein: the loss function used to train the video Moire removing network is constructed as follows:
the overall optimization objective of the network is as follows:
min(L),
where min(L) denotes minimizing L, and L denotes the total loss of the Moire removing network, comprising the Charbonnier loss between the Moire-removed image and the clean image, the edge texture loss between the Moire-removed image and the clean image, the color loss between the blurred Moire-removed image and the blurred clean image, and the loss constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module;

the Charbonnier loss between the Moire-removed image and the clean image is calculated as follows:

L_C = λ_C · sqrt(||Î_3 − O_3||² + ε²)

where Î_3 denotes the Moire-removed image corresponding to the third of the five input Moire video frames, O_3 denotes the clean image paired with it, ε is a constant controlling precision, and λ_C is the weight of this loss;
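A sketch of the Charbonnier loss as described, in NumPy; the mean reduction and the default ε value are assumptions:

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3, weight=1.0):
    # lambda_C * mean(sqrt((pred - target)^2 + eps^2)):
    # a smooth, robust variant of the L1 loss; eps controls precision
    return weight * np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))
```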
the edge texture loss between the Moire-removed image and the clean image is calculated as follows:

L_ASL = λ_ASL · Σ_* ||Sobel_*(Î_3) − Sobel_*(O_3)||_1

where || · ||_1 is the absolute-value (one-norm) operation, Sobel_* denotes improved Sobel filters of different orientations, Sobel_*( ) denotes convolution with such a filter, and λ_ASL is the weight of this loss;
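A sketch of the edge texture loss with two standard Sobel orientations; the patent's "improved" filters are not specified, so the kernels, the mean reduction and the function names here are assumptions:

```python
import numpy as np

# Two standard Sobel orientations; the patent's improved multi-orientation
# filters are not specified, so these kernels are illustrative.
SOBEL_KERNELS = [
    np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),  # horizontal
    np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),  # vertical
]

def conv2d(img, kernel):
    # Plain 'valid' correlation, adequate for a small sketch
    H, W = img.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def edge_loss(pred, target, weight=1.0):
    # Sum over orientations of the mean absolute difference of Sobel responses
    total = 0.0
    for k in SOBEL_KERNELS:
        total += np.mean(np.abs(conv2d(pred, k) - conv2d(target, k)))
    return weight * total
```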
the color loss between the Moire-removed image and the clean image is calculated as follows:

L_color = λ_cr · ||G(Î_3) − G(O_3)||_2²

where G denotes a Gaussian blur kernel, G(Î_3) denotes the blurred Moire-removed image, G(O_3) denotes the blurred clean image, || · ||_2² is the squared two-norm operation, and λ_cr is the weight of this loss;
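The color loss can be sketched as follows, with an assumed 5 × 5 Gaussian kernel and edge padding; these specifics are illustrative, not stated in the patent:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    # Assumed Gaussian blur kernel G, normalized to sum to 1
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def blur(img, kernel):
    # Convolution with edge padding so the output keeps the input size
    size = kernel.shape[0]
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i+size, j:j+size] * kernel)
    return out

def color_loss(pred, target, weight=1.0):
    # lambda_cr * squared two-norm of the difference of the blurred images;
    # blurring suppresses high-frequency texture so the loss compares color
    diff = blur(pred, gaussian_kernel()) - blur(target, gaussian_kernel())
    return weight * np.sum(diff ** 2)
```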
the loss constraining the sparse matrices in the spatial Transformer module and the temporal Transformer module is calculated as follows:

L_sparse = λ_sparse · (Σ ||Q^*||_1 + Σ ||K^*||_1 + Σ ||W′_v||_1)

where Q^* and K^* denote the Query matrices and Key matrices calculated in all spatial self-attention layers of the spatial Transformer module, W′_v denotes the Value weight matrices calculated in all spatial self-attention layers of the spatial Transformer module and the temporal Transformer module and in all temporal self-attention layers of the temporal Transformer module, || · ||_1 is the one-norm (sum of absolute values) operation, and λ_sparse is the weight of this loss.
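A minimal sketch of a sparsity penalty of this kind, assuming a simple sum of one-norms over the constrained matrices:

```python
import numpy as np

def sparsity_loss(matrices, weight=1.0):
    # lambda_sparse * sum of one-norms, pushing attention matrices
    # toward sparsity so that linear sparse attention stays cheap
    return weight * sum(np.sum(np.abs(m)) for m in matrices)
```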
7. The linear sparse attention Transformer-based video Moire removing method of claim 6, wherein: the training set for the linear sparse attention Transformer-based video Moire removing network is obtained by processing the videos in an original data set into Moire video and clean video pairs, decoding each video into video frames, and preprocessing each frame image, specifically comprising the following steps:

Step S11, obtaining an original data set, wherein each Moire video in the original data set has the same size and the same content as its corresponding clean video, and Moire videos and clean videos correspond one to one; videos of the same content form a Moire video and clean video pair, and each frame in the pair corresponds one to one;

Step S12, randomly flipping all video frames of each Moire video and clean video pair in the same way, then randomly cropping the Moire video frames and clean video frames to size H × W while preserving the correspondence between the Moire video and the clean video;

Step S13, normalizing all Moire video frames and clean video frames of size H × W; given a frame image I(i, j), the normalized frame image is calculated as follows:
where (i, j) represents the position of the pixel.
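The normalization formula itself is not reproduced in this text. A common choice that matches the Tanh output range of the reconstruction module maps 8-bit pixel values to [-1, 1]; the mapping below is therefore an assumption, not the patent's formula:

```python
def normalize_frame(I):
    # Hypothetical per-pixel mapping of 8-bit values [0, 255] to [-1, 1]
    return [[(p / 255.0) * 2.0 - 1.0 for p in row] for row in I]
```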
8. The linear sparse attention Transformer-based video Moire removing method of claim 7, wherein: the training process specifically comprises the following steps:

Step S41, randomly selecting a Moire video and clean video pair from the training set, then randomly selecting five adjacent video frames from the Moire video and the five corresponding video frames from the clean video; the Moire video frames and the clean video frames are denoted I_t and O_t respectively, t ∈ [1, 5];
Step S42, the first stage of training the linear sparse attention Transformer-based video Moire removing network: the five video frames I_t are input, and the network computes the Moire-removed intermediate frame; the total Moire removing loss L is calculated, the gradient of each parameter in the network is computed by back propagation, and the parameters are updated with the Adam optimization method, the learning rate decaying from 10^-4 to 10^-5;
Step S43, the second stage of training the linear sparse attention Transformer-based video Moire removing network: the input and training method are the same as in the first stage, except that the color loss weight λ_cr in the total Moire removing loss L is set to 0 and the learning rate decays from 10^-5 to 10^-6, performing fine-tuning of the network.
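The two training stages can be summarized as a configuration sketch; the key names are illustrative, not from the patent:

```python
# Hypothetical summary of the two-stage schedule described in steps S42-S43
TRAINING_STAGES = [
    # stage 1: full loss, learning rate decayed from 1e-4 to 1e-5
    {"lr_start": 1e-4, "lr_end": 1e-5, "use_color_loss": True},
    # stage 2 (fine-tuning): color loss weight set to 0, lr from 1e-5 to 1e-6
    {"lr_start": 1e-5, "lr_end": 1e-6, "use_color_loss": False},
]
```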
9. The linear sparse attention Transformer-based video Moire removing method of claim 8, wherein: after training is completed, Moire removal on an input video proceeds as follows: for a new Moire video, two blank frames are inserted at the beginning of the video and two at the end; the first five frames are input into the network to obtain the Moire-removed video frame corresponding to the first frame of the original Moire video; the second to sixth frames are then input into the network to obtain the Moire-removed video frame corresponding to the second frame of the original Moire video; the same operation is repeated until the Moire-removed video frame corresponding to the last frame is obtained.
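The sliding-window inference of claim 9 can be sketched generically; `model` and `blank` are stand-ins for the trained network and a blank (e.g. zero) frame:

```python
def demoire_video(frames, model, window=5, blank=None):
    # Pad with blank frames at both ends so the first and last frames can each
    # be the centre of a full window; `model` maps `window` consecutive frames
    # to the demoired centre frame (stand-in for the trained network).
    pad = window // 2
    padded = [blank] * pad + list(frames) + [blank] * pad
    return [model(padded[i:i + window]) for i in range(len(frames))]
```

With a five-frame window this reproduces the claim: two blank frames on each side, then one forward pass per original frame.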
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the linear sparse attention Transformer-based video Moire removing method as recited in any one of claims 1-9 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210649880.XA CN114881888A (en) | 2022-06-10 | 2022-06-10 | Video Moire removing method based on linear sparse attention Transformer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114881888A true CN114881888A (en) | 2022-08-09 |
Family
ID=82680890
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116596779A (en) * | 2023-04-24 | 2023-08-15 | 天津大学 | Transform-based Raw video denoising method |
CN116634209A (en) * | 2023-07-24 | 2023-08-22 | 武汉能钠智能装备技术股份有限公司 | Breakpoint video recovery system and method based on hot plug |
CN116831581A (en) * | 2023-06-15 | 2023-10-03 | 中南大学 | Remote physiological sign extraction-based driver state monitoring method and system |
CN117725844A (en) * | 2024-02-08 | 2024-03-19 | 厦门蝉羽网络科技有限公司 | Large model fine tuning method, device, equipment and medium based on learning weight vector |
CN117808706A (en) * | 2023-12-28 | 2024-04-02 | 山东财经大学 | Video rain removing method, system, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111539884A (en) * | 2020-04-21 | 2020-08-14 | 温州大学 | Neural network video deblurring method based on multi-attention machine mechanism fusion |
CN112598602A (en) * | 2021-01-06 | 2021-04-02 | 福建帝视信息科技有限公司 | Mask-based method for removing Moire of deep learning video |
CN113065645A (en) * | 2021-04-30 | 2021-07-02 | 华为技术有限公司 | Twin attention network, image processing method and device |
WO2021134874A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳大学 | Training method for deep residual network for removing a moire pattern of two-dimensional code |
Non-Patent Citations (1)
Title |
---|
NIE Kehui; LIU Wenzhe; TONG Tong; DU Min; GAO Qinquan: "Video compression artifact removal algorithm based on adaptive separable convolution kernels", Journal of Computer Applications, no. 05, 10 May 2019 (2019-05-10), pages 233-239 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114881888A (en) | Video Moire removing method based on linear sparse attention Transformer | |
CN107403415B (en) | Compressed depth map quality enhancement method and device based on full convolution neural network | |
Yu et al. | A unified learning framework for single image super-resolution | |
CN109272452B (en) | Method for learning super-resolution network based on group structure sub-band in wavelet domain | |
CN106709875A (en) | Compressed low-resolution image restoration method based on combined deep network | |
CN114693558A (en) | Image Moire removing method and system based on progressive fusion multi-scale strategy | |
CN113284051A (en) | Face super-resolution method based on frequency decomposition multi-attention machine system | |
CN114723630A (en) | Image deblurring method and system based on cavity double-residual multi-scale depth network | |
Liu et al. | Research on super-resolution reconstruction of remote sensing images: A comprehensive review | |
CN114418853A (en) | Image super-resolution optimization method, medium and device based on similar image retrieval | |
CN108122262B (en) | Sparse representation single-frame image super-resolution reconstruction algorithm based on main structure separation | |
Hai et al. | Advanced retinexnet: a fully convolutional network for low-light image enhancement | |
CN116797456A (en) | Image super-resolution reconstruction method, system, device and storage medium | |
CN113096032B (en) | Non-uniform blurring removal method based on image region division | |
CN117333398A (en) | Multi-scale image denoising method and device based on self-supervision | |
Guo et al. | Orthogonally regularized deep networks for image super-resolution | |
CN115272131B (en) | Image mole pattern removing system and method based on self-adaptive multispectral coding | |
CN110895790A (en) | Scene image super-resolution method based on posterior degradation information estimation | |
CN115760638A (en) | End-to-end deblurring super-resolution method based on deep learning | |
CN115456891A (en) | Under-screen camera image restoration method based on U-shaped dynamic network | |
Ling et al. | PRNet: Pyramid Restoration Network for RAW Image Super-Resolution | |
CN114331853A (en) | Single image restoration iteration framework based on target vector updating module | |
Xu et al. | Joint learning of super-resolution and perceptual image enhancement for single image | |
Xu et al. | Swin transformer and ResNet based deep networks for low-light image enhancement | |
CN117291855B (en) | High resolution image fusion method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||