CN115131254A - Constant bit rate compressed video quality enhancement method based on dual-domain learning - Google Patents

Constant bit rate compressed video quality enhancement method based on dual-domain learning

Info

Publication number
CN115131254A
CN115131254A
Authority
CN
China
Prior art keywords
frame
video
quality
domain
discrete cosine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210768954.1A
Other languages
Chinese (zh)
Inventor
陈兴颖 (Chen Xingying)
郑博仑 (Zheng Bolun)
张继勇 (Zhang Jiyong)
颜成钢 (Yan Chenggang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210768954.1A
Publication of CN115131254A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00: Image enhancement or restoration
    • G06T5/10: Image enhancement or restoration using non-spatial domain filtering
    • G06T9/00: Image coding
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20: Special algorithmic details
    • G06T2207/20048: Transform domain processing
    • G06T2207/20052: Discrete cosine transform [DCT]
    • G06T2207/20081: Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a constant bit rate compressed video quality enhancement method based on dual-domain learning. First, data are preprocessed to obtain a paired high-quality and low-quality video frame dataset; a multi-frame video enhancement network model is then constructed and trained on the generated dataset; finally, low-quality video frames are input into the trained model to obtain high-quality video frames, and the peak signal-to-noise ratio is calculated. Through inter-frame alignment, inter-frame fusion, and the superposition of convolutionally estimated compression quantization loss in the discrete cosine transform (DCT) domain onto the convolutional feature domain, the method enables low-quality video frames to capture the information lost in the DCT domain. The constructed multi-frame video enhancement network model has a multi-scale structure with up-sampling and down-sampling operations. Within this DCT multi-scale structure, the invention proposes an inverse DCT with a sampling rate of 0.5 to replace pixel-shuffle up-sampling, which effectively reduces part of the quantization loss.

Description

Constant bit rate compressed video quality enhancement method based on dual-domain learning
Technical Field
The invention belongs to the field of compressed video quality enhancement, and particularly relates to a compressed-video quality restoration method based on dual-domain learning.
Background Art
In recent years, multimedia traffic on the Internet has grown rapidly: multimedia and video account for 70% to 80% of mobile data traffic, the share of high-resolution content is rising quickly, and demand for high-definition media keeps increasing. To cope with the huge storage cost and limited bandwidth involved in storing and transmitting multimedia data, lossy compression algorithms are usually applied to images, audio, and video. These irreversible algorithms introduce compression artifacts that degrade the quality of experience, especially for video. Compression artifact removal, which aims to reduce the introduced artifacts and recover the details lost during lossy compression, is therefore a hot topic in the multimedia field. Over the past decades, conventional video compression standards such as H.264 and H.265 have been proposed, but these codecs are hand-crafted and cannot optimize the compression-induced pixel loss in an end-to-end manner.
Research on deep learning for image and video compression shows that deep networks, together with the additional spatio-temporal information they can exploit, have great potential for reducing compressed-video distortion. For example, Lu et al. proposed using optical flow for motion compensation and applied an auto-encoder to compress the optical flow and residual. Zheng et al. proposed an implicit dual-domain convolutional network to reduce JPEG compression artifacts, taking a pixel position labeling map and the quantization table as inputs; unlike conventional dual-domain learning, in which a discrete cosine transform (DCT) is applied to obtain the DCT domain, the DCT-domain loss is estimated directly from convolutionally extracted features without an explicit DCT. Implicit dual-domain convolution performs well in improving the quality of JPEG-compressed images. Zhao et al. proposed enhancing compressed-video quality using the loss in the DCT domain. These works offer many ideas worth drawing on.
For video quality enhancement, the commonly used conventional compression standards H.264 and H.265 cannot meet the current demand for high-quality video restoration, whereas deep-learning-based methods typically learn a non-linear mapping from a large amount of training data to regress the artifact-free image directly and obtain results efficiently.
Disclosure of Invention
To address these problems, the invention provides a constant bit rate compressed video quality enhancement method based on dual-domain learning, which trains on pairs of high-quality video frames and constant-bit-rate compressed video frames to obtain a model that enhances frame quality. The invention applies inverse-DCT up-sampling to multi-scale video quality enhancement networks; compared with conventional up-sampling (pixel shuffle and deconvolution), it greatly improves pixel recovery in the DCT domain, thereby improving frame quality and hence video quality.
The technical scheme adopted by the invention is as follows:
A constant bit rate compressed video quality enhancement method based on dual-domain learning comprises the following steps:
Step 1: preprocess the data to obtain a paired high-quality and low-quality video frame dataset;
Step 2: construct a multi-frame video enhancement network model;
Step 3: train the multi-frame video enhancement network model with the dataset generated in Step 1;
Step 4: input the low-quality video frames into the trained model to obtain high-quality video frames, and calculate the peak signal-to-noise ratio.
The invention has the following beneficial effects:
1. Through inter-frame alignment, inter-frame fusion, and the superposition of convolutionally estimated compression quantization loss in the DCT domain onto the convolutional feature domain, the method enables low-quality video frames to capture the information lost in the DCT domain.
2. The constructed multi-frame video enhancement network model has a multi-scale structure with up-sampling and down-sampling operations. Within the DCT multi-scale structure, the invention proposes an inverse DCT with a sampling rate of 0.5 to replace pixel-shuffle up-sampling, which effectively reduces part of the quantization loss.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an overall structure of a multi-frame video enhancement network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a DCT domain recovery module according to an embodiment of the present invention;
FIG. 4 is a diagram of the network model test results according to an embodiment of the present invention.
Detailed Description
As described in the above technical scheme and with reference to the accompanying drawings, the video quality enhancement method based on dual-domain learning is implemented as follows:
the video quality enhancement based on the two-domain learning comprises the steps of arranging a data set, training a model, debugging network parameters and testing results. We use the NTIRE2021 contest video data set which contains the video collected from YouTube. The data set consists of 200 training videos, each with over 100 consecutive frames. FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
A constant bit rate compressed video quality enhancement method based on dual-domain learning comprises the following steps:
Step 1: preprocess the data to obtain a paired high-quality and low-quality video frame dataset, specifically:
The high-quality videos of the dataset are downloaded from the official NTIRE website, and the original high-quality YUV-format videos are losslessly converted into MKV files and split into frames to serve as the high-quality video frame training set.
The converted high-quality MKV videos are encoded at a frame rate of 30 fps and a fixed bit rate of 800 kbps with an HM16 encoder using H.265-compatible parameters to generate the low-quality videos. In practice, the constant-bit-rate videos are encoded with the FFmpeg open-source tool and then split into frames to generate the low-quality video frame dataset.
Step 2: construct the multi-frame video enhancement network model.
as shown in fig. 2, the multi-frame video enhanced network model includes a feature extraction layer, a multi-frame alignment fusion module, and a dual-domain restoration module. The sequence of video frames entering the network model is a feature extraction layer and a multi-frame alignment fusion module, a down-sampling one-time feature map is obtained through one-time down-sampling, a down-sampling two-time feature map is obtained through down-sampling the down-sampling one-time feature map again, the feature map is combined into the previous scale through a double-domain recovery module and ConvReLU. And then, the feature of the next scale is merged by sampling the one-time feature map, and the result is merged into the original scale through a dual-domain recovery module and a ConvReLU layer. And finally, in the original scale, combining the feature maps, and finally outputting the enhanced video frame through a double-domain recovery module and a ConvReLU layer.
The feature extraction layer consists of several residual blocks and changes the input image from three channels to sixty-four channels.
The multi-frame alignment-and-fusion module is divided into a multi-frame alignment part and a multi-frame fusion part. The alignment part adopts pyramid, cascading, and deformable convolution: the pyramid-and-cascade structure propagates the video frames layer by layer through a three-level pyramid-like network until they are output from the top level, while the deformable convolution module predicts the offsets of the multiple video frames and, through the learnable parameters of the convolution, progressively compensates them toward the ideal alignment. The fusion part adopts a spatio-temporal attention fusion module that assigns different weights to the feature maps along the temporal and spatial dimensions, emphasizes the regions of interest during restoration, and finally fuses the five input video frames into a single frame.
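As an illustrative sketch of the alignment idea (a single pyramid level only, not the full three-level cascade), the snippet below uses torchvision's DeformConv2d: offsets are predicted from the concatenated neighbor and reference features and used to sample the neighbor features toward the reference frame. The 18 offset channels follow the torchvision convention of two offsets per position of a 3×3 kernel; module and variable names are assumptions.

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformAlign(nn.Module):
        """One-level deformable alignment of neighbor features onto the reference frame."""
        def __init__(self, ch: int = 64):
            super().__init__()
            self.offset_pred = nn.Conv2d(2 * ch, 2 * 3 * 3, 3, padding=1)  # (x, y) per kernel tap
            self.dcn = DeformConv2d(ch, ch, 3, padding=1)

        def forward(self, neighbor, reference):
            offset = self.offset_pred(torch.cat([neighbor, reference], dim=1))
            return self.dcn(neighbor, offset)      # neighbor features warped toward the reference

    # usage: align each of the five frames' features to the center frame
    align = DeformAlign(64)
    ref, nbr = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    aligned = align(nbr, ref)                      # (1, 64, 32, 32)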
The dual-domain restoration module comprises a pixel-domain restoration block and a DCT-domain restoration module. The DCT-domain restoration module shown in FIG. 3 is built on the inverse DCT and comprises a convolutional separation layer, a dynamic pooling layer, and an inverse DCT unit: the feature map is separated into a Y channel and a Cr/Cb channel by the convolutional separation layer and then passes through the dynamic pooling layer and the inverse DCT unit. The dynamic pooling layer consists of three adaptive pooling layers that estimate adaptive quantization parameters. The inverse DCT unit performs quantization-estimation analysis and up-sampling on the input quantization parameters and feature map; up-sampling is realized by setting the sampling rate of the inverse DCT unit to 0.5, so that each transform predicts more pixel points, achieving up-sampling while compensating the details of the compressed video frame.
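The 0.5 sampling rate can be read as evaluating the inverse transform on a grid twice as dense as the coefficient block, so that each retained coefficient accounts for several output pixels. Below is a minimal NumPy/SciPy sketch of that fixed-transform reading only; it omits the learned quantization estimation that surrounds the unit in the patent, and the factor of 2 (sqrt(2) per axis) is the amplitude correction required by the orthonormal DCT.

    import numpy as np
    from scipy.fft import dct, idct

    def idct_upsample2x(block: np.ndarray) -> np.ndarray:
        """2x up-sampling of a square block by zero-padding its DCT before inversion."""
        n = block.shape[0]
        coef = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
        padded = np.zeros((2 * n, 2 * n))
        padded[:n, :n] = coef                      # keep the low-frequency coefficients
        up = idct(idct(padded, axis=0, norm='ortho'), axis=1, norm='ortho')
        return up * 2.0                            # orthonormal-DCT amplitude correction

    block = np.arange(64, dtype=float).reshape(8, 8)
    up = idct_upsample2x(block)                    # (16, 16) smooth interpolation of the block
    print(up.shape, round(up.mean(), 6) == round(block.mean(), 6))  # the mean (DC) is preserved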
In the DCT dual-domain restoration module, feature extraction is built by cascading a group of densely connected 3×3 dilated convolution layers, so that the information of the previous scale is integrated each time features of a new scale are extracted. The DCT restoration block separates the feature map into a Y channel and a Cr/Cb channel in the convolutional separation layer; since the human eye is more sensitive to the Y channel, Y-channel quantization estimation is given higher priority than the Cr/Cb channel, and the Y channel is processed with greater emphasis.
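A sketch of such a densely connected dilated-convolution stack follows; the layer count, growth width, and dilation rates are illustrative assumptions, while the dense concatenation is what lets each newly extracted scale integrate the information of the previous ones.

    import torch
    import torch.nn as nn

    class DilatedDenseBlock(nn.Module):
        """Densely connected 3x3 dilated convolutions with a growing receptive field."""
        def __init__(self, ch: int = 64, growth: int = 32, dilations=(1, 2, 4)):
            super().__init__()
            self.layers = nn.ModuleList()
            in_ch = ch
            for d in dilations:
                self.layers.append(nn.Sequential(
                    nn.Conv2d(in_ch, growth, 3, padding=d, dilation=d),
                    nn.ReLU(inplace=True)))
                in_ch += growth
            self.squeeze = nn.Conv2d(in_ch, ch, 1)   # project back to the base width

        def forward(self, x):
            feats = [x]
            for layer in self.layers:
                feats.append(layer(torch.cat(feats, dim=1)))  # each layer sees all predecessors
            return self.squeeze(torch.cat(feats, dim=1))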
The pixel-domain restoration module consists of several residual blocks and runs in parallel with the DCT-domain branch; together, the two modules complete the overall quantization-loss compensation and superposition.
Step 3: train the multi-frame video enhancement network model with the dataset generated in Step 1.
Owing to the limited device memory, and to augment the dataset so that the network generalizes better, five consecutive pairs of high-quality and low-quality video frames are selected as the input of the video enhancement network model each time.
The network is trained with the Adam optimizer, using MSE loss as the loss function; compared with the common L1 loss, MSE loss handles edges better and, when used to train the model, shows good performance and detail sharpening.
During training, a position is randomly selected within each set of five input high-quality and low-quality video frames and cropped into small patches to accelerate training. The initial learning rate is set to 1e-4; whenever the objective evaluation metric does not improve for five epochs, the learning rate is reduced to 0.5 times its previous value, and training stops once the learning rate falls below 1e-6. Reducing the learning rate step by step lets the model be evaluated quickly at each stage: the higher learning rate in the early phase drives the network rapidly into a good loss region, while the lower learning rate in the later phase fine-tunes the network so that the model reaches its best effect.
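This schedule maps directly onto PyTorch's ReduceLROnPlateau. The minimal training-loop sketch below reuses the MultiScaleEnhancer sketch from Step 2; train_loader, val_loader, and the evaluate helper are assumed to exist and are not defined by the patent.

    import torch
    import torch.nn as nn

    model = MultiScaleEnhancer()                      # from the Step-2 sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()
    # halve the learning rate when the validation metric stalls for five epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='max', factor=0.5, patience=5)

    for epoch in range(10000):
        model.train()
        for lq, hq in train_loader:                   # assumed loader of cropped five-frame pairs
            optimizer.zero_grad()
            loss = criterion(model(lq), hq)           # lq: (B, 5, 3, h, w); hq: (B, 3, h, w)
            loss.backward()
            optimizer.step()
        psnr = evaluate(model, val_loader)            # assumed helper returning validation PSNR
        scheduler.step(psnr)
        if optimizer.param_groups[0]['lr'] < 1e-6:
            break                                     # stop once the learning rate decays below 1e-6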
Step 4: input the low-quality video frames into the trained multi-frame video enhancement network model to obtain high-quality video images.
and (4) adding the low-quality images into the trained multi-frame video enhancement network model. Firstly, the image is processed by a feature extraction layer, and three channels of the image are changed into sixty-four channels. The image processed by the feature extraction layer is aligned with the front and rear frames through the variable convolution of the multi-frame alignment part of the multi-frame alignment fusion module, and the correlation of the front and rear frames is enhanced. And then dynamically aggregating the five input video frames into a single frame through a space-time characteristic attention fusion module of a multi-frame fusion part. And finally, respectively recovering quantization losses of the Y channel and the Cr/Cb channel caused by compression through a double-domain recovery module. And in each branch, a dynamic pooling layer and an inverse discrete cosine transform unit for adaptively acquiring the quantization loss of the characteristic space are added, simultaneously, pixel domain enhancement is performed in parallel, and the tasks of quantization compensation and upsampling are completed together to obtain a final high-quality video image.
FIG. 4 shows the network model test results according to an embodiment of the present invention.

Claims (5)

1. A constant bit rate compressed video quality enhancement method based on dual-domain learning, characterized by comprising the following steps:
Step 1: preprocess the data to obtain a paired high-quality and low-quality video frame dataset;
Step 2: construct a multi-frame video enhancement network model;
Step 3: train the multi-frame video enhancement network model with the dataset generated in Step 1;
Step 4: input the low-quality video frames into the trained model to obtain high-quality video frames, and calculate the peak signal-to-noise ratio.
2. The constant bit rate compressed video quality enhancement method based on dual-domain learning according to claim 1, wherein Step 1 is specifically as follows:
the high-quality videos of the dataset are downloaded from the official NTIRE website, and the original high-quality YUV-format videos are losslessly converted into MKV files and split into frames to serve as the high-quality video frame training set;
the converted high-quality MKV videos are encoded at a frame rate of 30 fps and a fixed bit rate of 800 kbps with an HM16 encoder using H.265-compatible parameters to generate the low-quality videos; in practice, the constant-bit-rate videos are encoded with the FFmpeg open-source tool and then split into frames to generate the low-quality video frame dataset.
3. The constant bit rate compressed video quality enhancement method based on dual-domain learning according to claim 2, wherein Step 2 is specifically as follows:
the multi-frame video enhancement network model comprises a feature extraction layer, a multi-frame alignment-and-fusion module, and dual-domain restoration modules; a video frame sequence entering the model first passes through the feature extraction layer and the multi-frame alignment-and-fusion module; one down-sampling then yields a 1/2-scale feature map, and down-sampling that map again yields a 1/4-scale feature map, which is processed by a dual-domain restoration module and a ConvReLU layer and merged into the previous scale; the 1/2-scale feature map is next merged with these restored next-scale features, and the result is merged back into the original scale through a dual-domain restoration module and a ConvReLU layer; finally, at the original scale, the feature maps are combined, and the enhanced video frame is output through a dual-domain restoration module and a ConvReLU layer;
the feature extraction layer consists of several residual blocks and changes the input image from three channels to sixty-four channels;
the multi-frame alignment-and-fusion module is divided into a multi-frame alignment part and a multi-frame fusion part; the alignment part adopts pyramid, cascading, and deformable convolution, where the pyramid-and-cascade structure propagates the video frames layer by layer through a three-level pyramid-like network until they are output from the top level, and the deformable convolution module predicts the offsets of the multiple video frames and, through the learnable parameters of the convolution, progressively compensates them toward the ideal alignment; the fusion part adopts a spatio-temporal attention fusion module that assigns different weights to the feature maps along the temporal and spatial dimensions, emphasizes the regions of interest during restoration, and finally fuses the five input video frames into a single frame;
the dual-domain restoration module comprises a pixel-domain restoration block and a discrete cosine transform (DCT) domain restoration module; the DCT-domain restoration module is built on the inverse DCT and comprises a convolutional separation layer, a dynamic pooling layer, and an inverse DCT unit, wherein the feature map is separated into a Y channel and a Cr/Cb channel by the convolutional separation layer and then passes through the dynamic pooling layer and the inverse DCT unit; the dynamic pooling layer consists of three adaptive pooling layers that estimate adaptive quantization parameters; the inverse DCT unit performs quantization-estimation analysis and up-sampling on the input quantization parameters and feature map, and up-sampling is realized by setting the sampling rate of the inverse DCT unit to 0.5, so that each transform predicts more pixel points, achieving up-sampling while compensating the details of the compressed video frame;
in the DCT dual-domain restoration module, feature extraction is built by cascading a group of densely connected 3×3 dilated convolution layers, so that the information of the previous scale is integrated each time features of a new scale are extracted; the DCT restoration block separates the feature map into a Y channel and a Cr/Cb channel in the convolutional separation layer, and since the human eye is more sensitive to the Y channel, Y-channel quantization estimation is given higher priority than the Cr/Cb channel and the Y channel is processed with greater emphasis;
the pixel-domain restoration module consists of several residual blocks and runs in parallel with the DCT-domain branch; together, the two modules complete the overall quantization-loss compensation and superposition.
4. The constant bit rate compressed video quality enhancement method based on dual-domain learning according to claim 3, wherein Step 3 is specifically as follows:
owing to the limited device memory, and to augment the dataset so that the network generalizes better, five consecutive pairs of high-quality and low-quality video frames are selected as the input of the video enhancement network model each time;
the network is trained with the Adam optimizer, using MSE loss as the loss function;
during training, a position is randomly selected within each set of five input high-quality and low-quality video frames and cropped into small patches to accelerate training; the initial learning rate is set to 1e-4, the learning rate is reduced to 0.5 times its previous value whenever the objective evaluation metric does not improve for five epochs, and training stops once the learning rate falls below 1e-6.
5. The constant bit rate compressed video quality enhancement method based on dual-domain learning according to claim 4, wherein Step 4 is specifically as follows:
the low-quality images are fed into the trained multi-frame video enhancement network model; first, the image passes through the feature extraction layer, which changes its three channels into sixty-four; the extracted features are then aligned with the preceding and following frames through the deformable convolution of the multi-frame alignment part of the alignment-and-fusion module, strengthening the correlation between neighboring frames; next, the spatio-temporal attention fusion module of the multi-frame fusion part dynamically aggregates the five input video frames into a single frame; finally, the dual-domain restoration module recovers the compression-induced quantization losses of the Y channel and the Cr/Cb channel separately, where in each branch a dynamic pooling layer and an inverse discrete cosine transform unit adaptively obtain the quantization loss of the feature space while pixel-domain enhancement runs in parallel, jointly completing the tasks of quantization compensation and up-sampling to obtain the final high-quality video image.
CN202210768954.1A 2022-06-30 2022-06-30 Constant bit rate compressed video quality enhancement method based on dual-domain learning Pending CN115131254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210768954.1A CN115131254A (en) 2022-06-30 2022-06-30 Constant bit rate compressed video quality enhancement method based on dual-domain learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210768954.1A CN115131254A (en) 2022-06-30 2022-06-30 Constant bit rate compressed video quality enhancement method based on dual-domain learning

Publications (1)

Publication Number Publication Date
CN115131254A (en) 2022-09-30

Family

ID=83381248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210768954.1A Pending CN115131254A (en) 2022-06-30 2022-06-30 Constant bit rate compressed video quality enhancement method based on dual-domain learning

Country Status (1)

Country Link
CN (1) CN115131254A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977747A (en) * 2023-08-28 2023-10-31 中国地质大学(北京) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN116977747B (en) * 2023-08-28 2024-01-23 中国地质大学(北京) Small sample hyperspectral classification method based on multipath multi-scale feature twin network

Similar Documents

Publication Title
CN107197260B (en) Video coding post-filter method based on convolutional neural networks
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN110933429B (en) Video compression sensing and reconstruction method and device based on deep neural network
CN108830790B (en) Rapid video super-resolution reconstruction method based on simplified convolutional neural network
CN112203093B (en) Signal processing method based on deep neural network
CN111711817B (en) HEVC intra-frame coding compression performance optimization method combined with convolutional neural network
CN101883284B (en) Video encoding/decoding method and system based on background modeling and optional differential mode
CN113362225B (en) Multi-description compressed image enhancement method based on residual recursive compensation and feature fusion
CN112102163B (en) Continuous multi-frame image super-resolution reconstruction method based on multi-scale motion compensation framework and recursive learning
CN110177282B (en) Interframe prediction method based on SRCNN
CN105306942B (en) A kind of coding method of video encoder, apparatus and system
CN109982092B (en) HEVC inter-frame rapid method based on multi-branch cyclic convolution neural network
CN104704839A (en) Video compression method
CN110062232A (en) A kind of video-frequency compression method and system based on super-resolution
CN113766249A (en) Loop filtering method, device and equipment in video coding and decoding and storage medium
CN113132729B (en) Loop filtering method based on multiple reference frames and electronic device
CN115131254A (en) Constant bit rate compressed video quality enhancement method based on dual-domain learning
CN112188217B (en) JPEG compressed image decompression effect removing method combining DCT domain and pixel domain learning
CN110677644B (en) Video coding and decoding method and video coding intra-frame predictor
CN111726638A (en) HEVC (high efficiency video coding) optimization method combining decompression effect and super-resolution
CN112601095A (en) Method and system for creating fractional interpolation model of video brightness and chrominance
CN110148087B (en) Image compression and reconstruction method based on sparse representation
CN113691792A (en) Video bit depth extension method, device and medium based on 3D convolution
CN117952877B (en) Low-quality image correction method based on hierarchical structure modeling
CN114882133B (en) Image coding and decoding method, system, device and medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination