CN115131254A - Constant bit rate compressed video quality enhancement method based on dual-domain learning - Google Patents

Constant bit rate compressed video quality enhancement method based on dual-domain learning

Info

Publication number
CN115131254A
CN115131254A
Authority
CN
China
Prior art keywords
frame
video
quality
domain
discrete cosine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210768954.1A
Other languages
Chinese (zh)
Inventor
陈兴颖 (Chen Xingying)
郑博仑 (Zheng Bolun)
张继勇 (Zhang Jiyong)
颜成钢 (Yan Chenggang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210768954.1A
Publication of CN115131254A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00: Image enhancement or restoration
    • G06T5/10: Image enhancement or restoration using non-spatial domain filtering
    • G06T9/00: Image coding
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20: Special algorithmic details
    • G06T2207/20048: Transform domain processing
    • G06T2207/20052: Discrete cosine transform [DCT]
    • G06T2207/20081: Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a constant bit rate compressed video quality enhancement method based on dual-domain learning. First, data are preprocessed to obtain a paired high-quality and low-quality video frame dataset; a multi-frame video enhancement network model is then constructed and trained on the generated dataset; finally, low-quality video frames are input into the trained model to obtain high-quality video frames, and the peak signal-to-noise ratio is calculated. Through inter-frame alignment, inter-frame fusion, and the superposition of convolutionally estimated compression quantization loss in the discrete cosine transform (DCT) domain onto the convolutional feature domain, the method enables low-quality video frames to capture the information lost in the DCT domain. The constructed multi-frame video enhancement network model has a multi-scale structure with up-sampling and down-sampling operations. Within this DCT multi-scale structure, the invention proposes an inverse DCT with a sampling rate of 0.5 to replace pixel-shuffle up-sampling, which effectively reduces part of the quantization loss.

Description

Constant bit rate compressed video quality enhancement method based on dual-domain learning
Technical Field
The invention belongs to the field of compressed video quality enhancement, and particularly relates to a compressed-video quality restoration method based on dual-domain learning.
Background Art
In recent years, multimedia traffic on the Internet has grown rapidly: multimedia and video account for 70% to 80% of mobile data traffic, the share of high-resolution content is rising quickly, and demand for high-definition media keeps increasing. To cope with the huge storage cost and limited bandwidth involved in storing and transmitting multimedia data, lossy compression algorithms are usually applied to images, audio, and video. These irreversible algorithms introduce compression artifacts that degrade the quality of experience, especially for video. Compression artifact removal, which aims to reduce the introduced artifacts and recover the details lost during lossy compression, is therefore a hot topic in the multimedia field. Over the past decades, conventional video compression standards such as H.264 and H.265 have been proposed, but these codecs are hand-crafted and cannot optimize the compression-induced pixel loss in an end-to-end manner.
Research on deep learning for image and video compression shows that deep networks, together with the additional spatio-temporal information they can exploit, have great potential for reducing compressed-video distortion. For example, Lu et al. proposed using optical flow for motion compensation and applied an auto-encoder to compress the optical flow and residual. Zheng et al. proposed an implicit dual-domain convolutional network to reduce JPEG compression artifacts, taking a pixel position labeling map and the quantization table as inputs; unlike conventional dual-domain learning, in which a discrete cosine transform (DCT) is applied to obtain the DCT domain, the DCT-domain loss is estimated directly from convolutionally extracted features without an explicit DCT. Implicit dual-domain convolution performs well in improving the quality of JPEG-compressed images. Zhao et al. proposed enhancing compressed-video quality using the loss in the DCT domain. These works offer many ideas worth drawing on.
For video quality enhancement, the commonly used conventional compression standards H.264 and H.265 cannot meet the current demand for high-quality video restoration, whereas deep-learning-based methods typically learn a non-linear mapping from a large amount of training data to regress the artifact-free image directly and obtain results efficiently.
Disclosure of Invention
To address these problems, the invention provides a constant bit rate compressed video quality enhancement method based on dual-domain learning, which trains on pairs of high-quality video frames and constant-bit-rate compressed video frames to obtain a model that enhances frame quality. The invention applies inverse-DCT up-sampling to multi-scale video quality enhancement networks; compared with conventional up-sampling (pixel shuffle and deconvolution), it greatly improves pixel recovery in the DCT domain, thereby improving frame quality and hence video quality.
The technical scheme adopted by the invention is as follows:
A constant bit rate compressed video quality enhancement method based on dual-domain learning comprises the following steps:
Step 1: preprocess the data to obtain a paired high-quality and low-quality video frame dataset;
Step 2: construct a multi-frame video enhancement network model;
Step 3: train the multi-frame video enhancement network model with the dataset generated in Step 1;
Step 4: input the low-quality video frames into the trained model to obtain high-quality video frames, and calculate the peak signal-to-noise ratio.
The invention has the following beneficial effects:
1. Through inter-frame alignment, inter-frame fusion, and the superposition of convolutionally estimated compression quantization loss in the DCT domain onto the convolutional feature domain, the method enables low-quality video frames to capture the information lost in the DCT domain.
2. The constructed multi-frame video enhancement network model has a multi-scale structure with up-sampling and down-sampling operations. Within the DCT multi-scale structure, the invention proposes an inverse DCT with a sampling rate of 0.5 to replace pixel-shuffle up-sampling, which effectively reduces part of the quantization loss.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an overall structure of a multi-frame video enhancement network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a DCT domain recovery module according to an embodiment of the present invention;
FIG. 4 is a diagram of the network model test results according to an embodiment of the present invention.
Detailed Description
As described in the above technical scheme and with reference to the accompanying drawings, the video quality enhancement method based on dual-domain learning is implemented as follows:
the video quality enhancement based on the two-domain learning comprises the steps of arranging a data set, training a model, debugging network parameters and testing results. We use the NTIRE2021 contest video data set which contains the video collected from YouTube. The data set consists of 200 training videos, each with over 100 consecutive frames. FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
A constant bit rate compressed video quality enhancement method based on dual-domain learning comprises the following steps:
Step 1: preprocess the data to obtain a paired high-quality and low-quality video frame dataset, specifically:
The high-quality videos of the dataset are downloaded from the official NTIRE website, and the original high-quality YUV-format videos are losslessly converted into MKV files and split into frames to serve as the high-quality video frame training set.
The converted high-quality MKV videos are encoded at a frame rate of 30 fps and a fixed bit rate of 800 kbps with an HM16 encoder using H.265-compatible parameters to generate the low-quality videos. In practice, the constant-bit-rate videos are encoded with the FFmpeg open-source tool and then split into frames to generate the low-quality video frame dataset.
Step 2: construct the multi-frame video enhancement network model.
as shown in fig. 2, the multi-frame video enhanced network model includes a feature extraction layer, a multi-frame alignment fusion module, and a dual-domain restoration module. The sequence of video frames entering the network model is a feature extraction layer and a multi-frame alignment fusion module, a down-sampling one-time feature map is obtained through one-time down-sampling, a down-sampling two-time feature map is obtained through down-sampling the down-sampling one-time feature map again, the feature map is combined into the previous scale through a double-domain recovery module and ConvReLU. And then, the feature of the next scale is merged by sampling the one-time feature map, and the result is merged into the original scale through a dual-domain recovery module and a ConvReLU layer. And finally, in the original scale, combining the feature maps, and finally outputting the enhanced video frame through a double-domain recovery module and a ConvReLU layer.
The feature extraction layer consists of several residual blocks and changes the input image from three channels to sixty-four channels.
The multi-frame alignment-and-fusion module is divided into a multi-frame alignment part and a multi-frame fusion part. The alignment part adopts pyramid, cascading, and deformable convolution: the pyramid-and-cascade structure propagates the video frames layer by layer through a three-level pyramid-like network until they are output from the top level, while the deformable convolution module predicts the offsets of the multiple video frames and, through the learnable parameters of the convolution, progressively compensates them toward the ideal alignment. The fusion part adopts a spatio-temporal attention fusion module that assigns different weights to the feature maps along the temporal and spatial dimensions, emphasizes the regions of interest during restoration, and finally fuses the five input video frames into a single frame.
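As an illustrative sketch of the alignment idea (a single pyramid level only, not the full three-level cascade), the snippet below uses torchvision's DeformConv2d: offsets are predicted from the concatenated neighbor and reference features and used to sample the neighbor features toward the reference frame. The 18 offset channels follow the torchvision convention of two offsets per position of a 3×3 kernel; module and variable names are assumptions.

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformAlign(nn.Module):
        """One-level deformable alignment of neighbor features onto the reference frame."""
        def __init__(self, ch: int = 64):
            super().__init__()
            self.offset_pred = nn.Conv2d(2 * ch, 2 * 3 * 3, 3, padding=1)  # (x, y) per kernel tap
            self.dcn = DeformConv2d(ch, ch, 3, padding=1)

        def forward(self, neighbor, reference):
            offset = self.offset_pred(torch.cat([neighbor, reference], dim=1))
            return self.dcn(neighbor, offset)      # neighbor features warped toward the reference

    # usage: align each of the five frames' features to the center frame
    align = DeformAlign(64)
    ref, nbr = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    aligned = align(nbr, ref)                      # (1, 64, 32, 32)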
The dual-domain restoration module comprises a pixel-domain restoration block and a DCT-domain restoration module. The DCT-domain restoration module shown in FIG. 3 is built on the inverse DCT and comprises a convolutional separation layer, a dynamic pooling layer, and an inverse DCT unit: the feature map is separated into a Y channel and a Cr/Cb channel by the convolutional separation layer and then passes through the dynamic pooling layer and the inverse DCT unit. The dynamic pooling layer consists of three adaptive pooling layers that estimate adaptive quantization parameters. The inverse DCT unit performs quantization-estimation analysis and up-sampling on the input quantization parameters and feature map; up-sampling is realized by setting the sampling rate of the inverse DCT unit to 0.5, so that each transform predicts more pixel points, achieving up-sampling while compensating the details of the compressed video frame.
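The 0.5 sampling rate can be read as evaluating the inverse transform on a grid twice as dense as the coefficient block, so that each retained coefficient accounts for several output pixels. Below is a minimal NumPy/SciPy sketch of that fixed-transform reading only; it omits the learned quantization estimation that surrounds the unit in the patent, and the factor of 2 (sqrt(2) per axis) is the amplitude correction required by the orthonormal DCT.

    import numpy as np
    from scipy.fft import dct, idct

    def idct_upsample2x(block: np.ndarray) -> np.ndarray:
        """2x up-sampling of a square block by zero-padding its DCT before inversion."""
        n = block.shape[0]
        coef = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
        padded = np.zeros((2 * n, 2 * n))
        padded[:n, :n] = coef                      # keep the low-frequency coefficients
        up = idct(idct(padded, axis=0, norm='ortho'), axis=1, norm='ortho')
        return up * 2.0                            # orthonormal-DCT amplitude correction

    block = np.arange(64, dtype=float).reshape(8, 8)
    up = idct_upsample2x(block)                    # (16, 16) smooth interpolation of the block
    print(up.shape, round(up.mean(), 6) == round(block.mean(), 6))  # the mean (DC) is preserved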
In the DCT dual-domain restoration module, feature extraction is built by cascading a group of densely connected 3×3 dilated convolution layers, so that the information of the previous scale is integrated each time features of a new scale are extracted. The DCT restoration block separates the feature map into a Y channel and a Cr/Cb channel in the convolutional separation layer; since the human eye is more sensitive to the Y channel, Y-channel quantization estimation is given higher priority than the Cr/Cb channel, and the Y channel is processed with greater emphasis.
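A sketch of such a densely connected dilated-convolution stack follows; the layer count, growth width, and dilation rates are illustrative assumptions, while the dense concatenation is what lets each newly extracted scale integrate the information of the previous ones.

    import torch
    import torch.nn as nn

    class DilatedDenseBlock(nn.Module):
        """Densely connected 3x3 dilated convolutions with a growing receptive field."""
        def __init__(self, ch: int = 64, growth: int = 32, dilations=(1, 2, 4)):
            super().__init__()
            self.layers = nn.ModuleList()
            in_ch = ch
            for d in dilations:
                self.layers.append(nn.Sequential(
                    nn.Conv2d(in_ch, growth, 3, padding=d, dilation=d),
                    nn.ReLU(inplace=True)))
                in_ch += growth
            self.squeeze = nn.Conv2d(in_ch, ch, 1)   # project back to the base width

        def forward(self, x):
            feats = [x]
            for layer in self.layers:
                feats.append(layer(torch.cat(feats, dim=1)))  # each layer sees all predecessors
            return self.squeeze(torch.cat(feats, dim=1))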
The pixel-domain restoration module consists of several residual blocks and runs in parallel with the DCT-domain branch; together, the two modules complete the overall quantization-loss compensation and superposition.
Step 3: train the multi-frame video enhancement network model with the dataset generated in Step 1.
Owing to the limited device memory, and to augment the dataset so that the network generalizes better, five consecutive pairs of high-quality and low-quality video frames are selected as the input of the video enhancement network model each time.
The network is trained with the Adam optimizer, using MSE loss as the loss function; compared with the common L1 loss, MSE loss handles edges better and, when used to train the model, shows good performance and detail sharpening.
During training, a position is randomly selected within each set of five input high-quality and low-quality video frames and cropped into small patches to accelerate training. The initial learning rate is set to 1e-4; whenever the objective evaluation metric does not improve for five epochs, the learning rate is reduced to 0.5 times its previous value, and training stops once the learning rate falls below 1e-6. Reducing the learning rate step by step lets the model be evaluated quickly at each stage: the higher learning rate in the early phase drives the network rapidly into a good loss region, while the lower learning rate in the later phase fine-tunes the network so that the model reaches its best effect.
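This schedule maps directly onto PyTorch's ReduceLROnPlateau. The minimal training-loop sketch below reuses the MultiScaleEnhancer sketch from Step 2; train_loader, val_loader, and the evaluate helper are assumed to exist and are not defined by the patent.

    import torch
    import torch.nn as nn

    model = MultiScaleEnhancer()                      # from the Step-2 sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()
    # halve the learning rate when the validation metric stalls for five epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='max', factor=0.5, patience=5)

    for epoch in range(10000):
        model.train()
        for lq, hq in train_loader:                   # assumed loader of cropped five-frame pairs
            optimizer.zero_grad()
            loss = criterion(model(lq), hq)           # lq: (B, 5, 3, h, w); hq: (B, 3, h, w)
            loss.backward()
            optimizer.step()
        psnr = evaluate(model, val_loader)            # assumed helper returning validation PSNR
        scheduler.step(psnr)
        if optimizer.param_groups[0]['lr'] < 1e-6:
            break                                     # stop once the learning rate decays below 1e-6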
Step 4: input the low-quality video frames into the trained multi-frame video enhancement network model to obtain high-quality video images.
and (4) adding the low-quality images into the trained multi-frame video enhancement network model. Firstly, the image is processed by a feature extraction layer, and three channels of the image are changed into sixty-four channels. The image processed by the feature extraction layer is aligned with the front and rear frames through the variable convolution of the multi-frame alignment part of the multi-frame alignment fusion module, and the correlation of the front and rear frames is enhanced. And then dynamically aggregating the five input video frames into a single frame through a space-time characteristic attention fusion module of a multi-frame fusion part. And finally, respectively recovering quantization losses of the Y channel and the Cr/Cb channel caused by compression through a double-domain recovery module. And in each branch, a dynamic pooling layer and an inverse discrete cosine transform unit for adaptively acquiring the quantization loss of the characteristic space are added, simultaneously, pixel domain enhancement is performed in parallel, and the tasks of quantization compensation and upsampling are completed together to obtain a final high-quality video image.
FIG. 4 shows the network model test results according to an embodiment of the present invention.

Claims (5)

1. A constant bit rate compressed video quality enhancement method based on dual-domain learning, characterized by comprising the following steps:
Step 1: preprocess the data to obtain a paired high-quality and low-quality video frame dataset;
Step 2: construct a multi-frame video enhancement network model;
Step 3: train the multi-frame video enhancement network model with the dataset generated in Step 1;
Step 4: input the low-quality video frames into the trained model to obtain high-quality video frames, and calculate the peak signal-to-noise ratio.
2. The constant bit rate compressed video quality enhancement method based on dual-domain learning according to claim 1, wherein Step 1 is specifically as follows:
the high-quality videos of the dataset are downloaded from the official NTIRE website, and the original high-quality YUV-format videos are losslessly converted into MKV files and split into frames to serve as the high-quality video frame training set;
the converted high-quality MKV videos are encoded at a frame rate of 30 fps and a fixed bit rate of 800 kbps with an HM16 encoder using H.265-compatible parameters to generate the low-quality videos; in practice, the constant-bit-rate videos are encoded with the FFmpeg open-source tool and then split into frames to generate the low-quality video frame dataset.
3. The constant bit rate compressed video quality enhancement method based on dual-domain learning according to claim 2, wherein Step 2 is specifically as follows:
the multi-frame video enhancement network model comprises a feature extraction layer, a multi-frame alignment-and-fusion module, and dual-domain restoration modules; a video frame sequence entering the model first passes through the feature extraction layer and the multi-frame alignment-and-fusion module; one down-sampling then yields a 1/2-scale feature map, and down-sampling that map again yields a 1/4-scale feature map, which is processed by a dual-domain restoration module and a ConvReLU layer and merged into the previous scale; the 1/2-scale feature map is next merged with these restored next-scale features, and the result is merged back into the original scale through a dual-domain restoration module and a ConvReLU layer; finally, at the original scale, the feature maps are combined, and the enhanced video frame is output through a dual-domain restoration module and a ConvReLU layer;
the feature extraction layer consists of several residual blocks and changes the input image from three channels to sixty-four channels;
the multi-frame alignment-and-fusion module is divided into a multi-frame alignment part and a multi-frame fusion part; the alignment part adopts pyramid, cascading, and deformable convolution, where the pyramid-and-cascade structure propagates the video frames layer by layer through a three-level pyramid-like network until they are output from the top level, and the deformable convolution module predicts the offsets of the multiple video frames and, through the learnable parameters of the convolution, progressively compensates them toward the ideal alignment; the fusion part adopts a spatio-temporal attention fusion module that assigns different weights to the feature maps along the temporal and spatial dimensions, emphasizes the regions of interest during restoration, and finally fuses the five input video frames into a single frame;
the dual-domain restoration module comprises a pixel-domain restoration block and a discrete cosine transform (DCT) domain restoration module; the DCT-domain restoration module is built on the inverse DCT and comprises a convolutional separation layer, a dynamic pooling layer, and an inverse DCT unit, wherein the feature map is separated into a Y channel and a Cr/Cb channel by the convolutional separation layer and then passes through the dynamic pooling layer and the inverse DCT unit; the dynamic pooling layer consists of three adaptive pooling layers that estimate adaptive quantization parameters; the inverse DCT unit performs quantization-estimation analysis and up-sampling on the input quantization parameters and feature map, and up-sampling is realized by setting the sampling rate of the inverse DCT unit to 0.5, so that each transform predicts more pixel points, achieving up-sampling while compensating the details of the compressed video frame;
in the DCT dual-domain restoration module, feature extraction is built by cascading a group of densely connected 3×3 dilated convolution layers, so that the information of the previous scale is integrated each time features of a new scale are extracted; the DCT restoration block separates the feature map into a Y channel and a Cr/Cb channel in the convolutional separation layer, and since the human eye is more sensitive to the Y channel, Y-channel quantization estimation is given higher priority than the Cr/Cb channel and the Y channel is processed with greater emphasis;
the pixel-domain restoration module consists of several residual blocks and runs in parallel with the DCT-domain branch; together, the two modules complete the overall quantization-loss compensation and superposition.
4. The constant bit rate compressed video quality enhancement method based on dual-domain learning according to claim 3, wherein Step 3 is specifically as follows:
owing to the limited device memory, and to augment the dataset so that the network generalizes better, five consecutive pairs of high-quality and low-quality video frames are selected as the input of the video enhancement network model each time;
the network is trained with the Adam optimizer, using MSE loss as the loss function;
during training, a position is randomly selected within each set of five input high-quality and low-quality video frames and cropped into small patches to accelerate training; the initial learning rate is set to 1e-4, the learning rate is reduced to 0.5 times its previous value whenever the objective evaluation metric does not improve for five epochs, and training stops once the learning rate falls below 1e-6.
5. The constant bit rate compressed video quality enhancement method based on dual-domain learning according to claim 4, wherein Step 4 is specifically as follows:
the low-quality images are fed into the trained multi-frame video enhancement network model; first, the image passes through the feature extraction layer, which changes its three channels into sixty-four; the extracted features are then aligned with the preceding and following frames through the deformable convolution of the multi-frame alignment part of the alignment-and-fusion module, strengthening the correlation between neighboring frames; next, the spatio-temporal attention fusion module of the multi-frame fusion part dynamically aggregates the five input video frames into a single frame; finally, the dual-domain restoration module recovers the compression-induced quantization losses of the Y channel and the Cr/Cb channel separately, where in each branch a dynamic pooling layer and an inverse discrete cosine transform unit adaptively obtain the quantization loss of the feature space while pixel-domain enhancement runs in parallel, jointly completing the tasks of quantization compensation and up-sampling to obtain the final high-quality video image.
CN202210768954.1A 2022-06-30 2022-06-30 Constant bit rate compressed video quality enhancement method based on dual-domain learning Pending CN115131254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210768954.1A CN115131254A (en) 2022-06-30 2022-06-30 Constant bit rate compressed video quality enhancement method based on dual-domain learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210768954.1A CN115131254A (en) 2022-06-30 2022-06-30 Constant bit rate compressed video quality enhancement method based on dual-domain learning

Publications (1)

Publication Number Publication Date
CN115131254A (en) 2022-09-30

Family

ID=83381248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210768954.1A Pending CN115131254A (en) 2022-06-30 2022-06-30 Constant bit rate compressed video quality enhancement method based on dual-domain learning

Country Status (1)

Country Link
CN (1) CN115131254A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977747A (en) * 2023-08-28 2023-10-31 中国地质大学(北京) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN116977747B (en) * 2023-08-28 2024-01-23 中国地质大学(北京) Small sample hyperspectral classification method based on multipath multi-scale feature twin network

Similar Documents

Publication Title
CN107197260B (en) Video coding post-filter method based on convolutional neural networks
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN110933429B (en) Video compression sensing and reconstruction method and device based on deep neural network
CN108830790B (en) Rapid video super-resolution reconstruction method based on simplified convolutional neural network
CN112203093B (en) Signal processing method based on deep neural network
CN111711817B (en) HEVC intra-frame coding compression performance optimization method combined with convolutional neural network
CN101883284B (en) Video encoding/decoding method and system based on background modeling and optional differential mode
CN113362225B (en) Multi-description compressed image enhancement method based on residual recursive compensation and feature fusion
CN112102163B (en) Continuous multi-frame image super-resolution reconstruction method based on multi-scale motion compensation framework and recursive learning
CN110177282B (en) Interframe prediction method based on SRCNN
CN105306942B (en) A kind of coding method of video encoder, apparatus and system
CN109982092B (en) HEVC inter-frame rapid method based on multi-branch cyclic convolution neural network
CN104704839A (en) Video compression method
CN110062232A (en) A kind of video-frequency compression method and system based on super-resolution
CN113766249A (en) Loop filtering method, device and equipment in video coding and decoding and storage medium
CN113132729B (en) Loop filtering method based on multiple reference frames and electronic device
CN115131254A (en) Constant bit rate compressed video quality enhancement method based on dual-domain learning
CN112188217B (en) JPEG compressed image decompression effect removing method combining DCT domain and pixel domain learning
CN110677644B (en) Video coding and decoding method and video coding intra-frame predictor
CN111726638A (en) HEVC (high efficiency video coding) optimization method combining decompression effect and super-resolution
CN112601095A (en) Method and system for creating fractional interpolation model of video brightness and chrominance
CN110148087B (en) Image compression and reconstruction method based on sparse representation
CN113691792A (en) Video bit depth extension method, device and medium based on 3D convolution
CN117952877B (en) Low-quality image correction method based on hierarchical structure modeling
CN114882133B (en) Image coding and decoding method, system, device and medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination