CN115278249B - Video block-level rate distortion optimization method and system based on visual self-attention network

Info

Publication number: CN115278249B (granted); application published as CN115278249A
Application number: CN202210735183.6A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: layer, block, post, attention, video
Filing and priority date: 2022-06-27
Legal status: Active (granted)
Inventors: 刘家瑛, 李书家, 王德昭, 黄浩峰, 郭宗明
Assignee (current and original): Peking University

Application CN202210735183.6A filed by Peking University; publication of application CN115278249A, followed by grant and publication of CN115278249B.

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/147Data rate or code amount at the encoder output according to rate distortion criteria
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation


Abstract

The invention discloses a video block-level rate-distortion optimization method and system based on a visual self-attention network, belonging to the field of digital video enhancement. The method introduces a multi-head self-attention mechanism into the video compression post-processing task and exploits the strong modeling capability of self-attention to learn the mapping from corrupted frames to lossless frames. Three post-processing models based on visual self-attention networks are constructed, and several network architectures are introduced so that block-level rate-distortion optimization can be applied to different content, thereby efficiently removing the artifacts and compression noise produced when the video is decoded and reconstructed.

Description

Video block-level rate distortion optimization method and system based on visual self-attention network
Technical Field
The invention belongs to the field of digital video enhancement, and in particular relates to a block-level rate-distortion optimization method and system for lossy-compressed video based on a multi-head self-attention mechanism.
Background
Lossy video compression algorithms often introduce severe artifacts, such as blocking artifacts caused by block-based coding strategies and ringing artifacts caused by the loss of high-frequency information when video frames are reconstructed. At low bit rates in particular, the large number of artifacts in the reconstructed video greatly reduces the objective quality of the video content and degrades the subjective experience of the user. How to make full use of the video coding information available at the decoder to train a high-performance post-processing model for removing artifacts from video frames therefore has broad practical significance and application value.
Current video post-processing methods fall mainly into in-loop filtering and out-of-loop filtering. In-loop filtering refers to filtering performed inside the coding loop of a video coding algorithm, such as the deblocking filter and the sample adaptive offset (SAO) filter in Versatile Video Coding, as well as a number of deep-learning-based in-loop filters that take the original frame as the target and use a convolutional neural network to process the corrupted in-loop frame, so that the finally reconstructed video frame has higher quality. Out-of-loop filtering post-processes the reconstructed video frames produced by the video coding algorithm directly, without involving any process inside the coding loop; it can likewise be divided into models based on signal processing and models based on deep learning.
However, post-processing models based on signal processing usually rely on manually set parameters and can hardly adapt their filtering fully to the characteristics of different video frames. Post-processing models based on convolutional neural networks suffer from content independence between the convolution layers and the image features: the same convolution kernels are applied to different image content, and the strong locality of convolution makes it difficult to model long-range dependencies within a video frame.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a video block-level rate-distortion optimization method and system based on a visual self-attention network. The invention introduces a multi-head self-attention mechanism into the video compression post-processing task, makes full use of the modeling capability of the self-attention mechanism to learn the mapping from corrupted frames to lossless frames, and introduces several network architectures to perform block-level rate-distortion optimization for different content, thereby efficiently removing the artifacts and compression noise produced when the video is decoded and reconstructed.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a video block level rate distortion optimization method based on a visual self-attention network, comprising the steps of:
Constructing post-processing models based on a visual self-attention network, each comprising a shallow feature extraction layer, a deep feature extraction layer and a reconstruction layer; the deep feature extraction layer comprises a plurality of consecutive residual blocks, each containing a plurality of consecutive visual self-attention blocks and a convolution layer; each visual self-attention block comprises two residual branches, the first consisting of a normalization layer and a multi-head self-attention layer and the second consisting of a normalization layer and a two-layer perceptron;
Three such post-processing models are constructed; they differ only in the multi-head self-attention layer, which is respectively a conventional multi-head self-attention layer, a multi-head feature linear transformation layer and a grouped convolution layer. The 3 post-processing models are trained with a training data set;
For an original corrupted frame, first reading the rate-distortion optimization parameter of the video compression coding at the encoder, then processing the corrupted video frame produced by the coding loop with the 3 trained post-processing models: the shallow feature extraction layer extracts shallow features, the deep feature extraction layer extracts deep features from the shallow features, and the reconstruction layer generates 3 post-processed frames;
Combining the 3 post-processed frames with the unprocessed corrupted frame to form 4 candidate frames, recursively partitioning the 4 frames into blocks with the same partitioning method, computing for each block the mean squared error against the corresponding block of the original frame, computing the bit rate consumed by partitioning each block, and selecting the block with the minimum rate-distortion cost to compose the final reconstructed video frame.
Further, the shallow feature extraction layer includes only one convolution layer, and the reconstruction layer includes only one convolution layer.
Further, the training data set comprises the luminance component of the original frames of a video sequence and the luminance component of the corrupted frames obtained by passing the original sequence through a video coding algorithm; the luminance component of the corrupted frame is fed into the model during training.
Further, the deep feature extraction layer includes 6 residual blocks, each comprising 6 visual self-attention blocks and 1 convolution layer.
Further, the perceptron layer uses Gaussian error linear units (GELU) as the activation function.
Further, the partitioning method is as follows: the 4 candidate frames are divided into 512×512 large blocks using a quadtree, and rate-distortion optimization is performed on each; each 512×512 block is divided into 4 128×128 blocks and rate-distortion optimization is performed again; the partitioning is applied recursively until the blocks reach 4×4.
Further, when the frame is divided into 512×512 large blocks, any shortfall is padded with zeros.
Further, each block division uses 5 bits to record the block position, and 2 bits are used to record which candidate frame's corresponding block is selected for the final reconstructed frame.
Further, after partitioning, the block with the minimum rate-distortion cost is selected according to the rate-distortion formula R + λD, where R is the sum of the compression bit rate of the corrupted frame and the extra bits consumed by block partitioning, and D is the mean squared error between each candidate block and the corresponding block of the original frame.
A video block-level rate-distortion optimization system based on a visual self-attention network, comprising 3 post-processing models based on a visual self-attention network and a block-level rate-distortion optimization module;
The 3 post-processing models each comprise a shallow feature extraction layer, a deep feature extraction layer and a reconstruction layer; the deep feature extraction layer comprises a plurality of consecutive residual blocks, each containing a plurality of consecutive visual self-attention blocks and a convolution layer; each visual self-attention block comprises two residual branches, the first consisting of a normalization layer and a multi-head self-attention layer and the second consisting of a normalization layer and a two-layer perceptron; the 3 post-processing models differ only in the multi-head self-attention layer, which is respectively a conventional multi-head self-attention layer, a multi-head feature linear transformation layer and a grouped convolution layer;
The block-level rate-distortion optimization module is used to recursively partition video frames into blocks, compute for each block the mean squared error against the corresponding block of the original frame, compute the bit rate consumed by partitioning each block, and select the block with the minimum rate-distortion cost to compose the final reconstructed video frame;
The 3 post-processing models are trained with a training data set; the corrupted video frames produced by the coding loop are processed with the 3 trained post-processing models, where the shallow feature extraction layer extracts shallow features, the deep feature extraction layer extracts deep features from the shallow features, and the reconstruction layer generates 3 post-processed frames; the 3 post-processed frames plus the unprocessed corrupted frame form 4 candidate frames, which are processed by the block-level rate-distortion optimization module to obtain the final reconstructed video frame.
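For illustration only, the interaction between the trained models and the rate-distortion optimization module can be sketched as follows. All function and variable names here are assumptions rather than part of the invention; the model and rdo_frame implementations referred to are sketched later in the detailed description.

```python
import torch

def enhance_frame(corrupted_luma, original_luma, models, lam, rdo_frame):
    """Illustrative wiring of the system.
    corrupted_luma / original_luma: 2-D numpy arrays in [0, 1];
    models: the 3 trained post-processing models;
    lam: the rate-distortion parameter read from the encoder;
    rdo_frame: the block-level rate-distortion optimization routine."""
    x = torch.from_numpy(corrupted_luma).float()[None, None]      # shape (1, 1, H, W)
    with torch.no_grad():
        post = [m(x)[0, 0].clamp(0, 1).numpy() for m in models]   # 3 post-processed frames
    candidates = post + [corrupted_luma]                           # 4 candidate frames
    return rdo_frame(candidates, original_luma, lam)
```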
Compared with the prior art, the invention has the following positive effects:
By introducing the visual self-attention network into the video post-processing task, a single model of the invention already outperforms common post-processing methods based on signal processing and on convolutional neural networks. In addition, the invention provides a method for performing block-level rate-distortion optimization on the outputs of several visual self-attention network models, which further improves video reconstruction quality beyond that of a single visual self-attention model.
Drawings
FIG. 1 is a block diagram of a visual self-attention network-based post-processing model used in an embodiment of the present invention;
Figs. 2A-2C are block diagrams of the three multi-head self-attention layers.
Detailed Description
In order to make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings. It should be noted that the specific numbers of layers and modules, the functions, and the arrangement of certain layers given in the following embodiments are only a preferred implementation and are not limiting; those skilled in the art may choose the number and arrangement of layers according to actual needs.
The embodiment discloses a video block-level rate-distortion optimization method based on a visual self-attention network, taking post-processing of frames reconstructed from lossy video compression as an example, which specifically comprises the following steps:
step 1: collecting a normal high-definition training data set H; and performing compression reconstruction of a plurality of code rates on the H by using a video impairment compression algorithm to obtain an impairment data set H rec.
Step 2: and building a post-processing model based on the visual self-attention network.
The network structure is shown in Fig. 1. The model is divided into three main modules: a shallow feature extraction layer F_s, a deep feature extraction layer F_d and a reconstruction layer F_r. The shallow feature extraction layer F_s is a single convolution layer with 1 input channel, 180 output channels, kernel size 3, stride 1 and zero padding of width 1.
The deep feature extraction layer F_d consists of 6 residual blocks, each composed of 6 consecutive multi-head self-attention blocks and 1 convolution layer. As shown in Figs. 2A-2C, there are 3 kinds of multi-head self-attention block, corresponding to the 3 post-processing models. Each multi-head self-attention block is composed of 1 residual multi-head self-attention layer and 1 residual two-layer perceptron. Each multi-head self-attention layer uses a window size of 8, a feature dimension of 180 and 6 groups (heads).
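The three layer variants of Figs. 2A-2C are not given in code by the invention; the following is a simplified, non-authoritative sketch of how the three kinds of layer could be realized over 8×8 windows with 180 channels and 6 heads/groups. Window shifting and relative position bias, common in practical visual self-attention networks, are omitted for brevity and are assumptions where used.

```python
import torch
import torch.nn as nn

def window_tokens(x, win=8):
    # (B, C, H, W) -> (B*nWindows, win*win, C); assumes H and W are multiples of win.
    b, c, h, w = x.shape
    x = x.view(b, c, h // win, win, w // win, win)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, c)

def window_merge(t, b, c, h, w, win=8):
    # Inverse of window_tokens.
    t = t.view(b, h // win, w // win, win, win, c)
    return t.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

class WindowSelfAttention(nn.Module):
    """Fig. 2A (assumed form): conventional multi-head self-attention within each window."""
    def __init__(self, dim=180, heads=6, win=8):
        super().__init__()
        self.win = win
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        t = window_tokens(x, self.win)
        t, _ = self.attn(t, t, t)
        return window_merge(t, b, c, h, w, self.win)

class MultiHeadLinearTransform(nn.Module):
    """Fig. 2B (assumed form): per-head linear transform of the features, no attention map."""
    def __init__(self, dim=180, heads=6):
        super().__init__()
        # A grouped 1x1 convolution mixes channels independently within each head.
        self.proj = nn.Conv2d(dim, dim, kernel_size=1, groups=heads)

    def forward(self, x):
        return self.proj(x)

class GroupedConvLayer(nn.Module):
    """Fig. 2C (assumed form): grouped spatial convolution as the token-mixing layer."""
    def __init__(self, dim=180, groups=6):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=groups)

    def forward(self, x):
        return self.conv(x)
```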
The reconstruction layer F_r is a single convolution layer with 180 input channels, 1 output channel, kernel size 3, stride 1 and zero padding of width 1.
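Putting the modules together, a minimal PyTorch-style skeleton of one post-processing model might look as follows. It is a sketch under the dimensions stated above (1 input channel, 180 features, 6 residual blocks of 6 attention blocks each) and reuses the WindowSelfAttention sketch above as the default token-mixing layer; the perceptron expansion ratio and other unstated details are assumptions.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """One visual self-attention block: two residual branches,
    (LayerNorm + token-mixing layer) and (LayerNorm + 2-layer perceptron with GELU)."""
    def __init__(self, dim=180, mixer=None, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer if mixer is not None else WindowSelfAttention(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def _chan_norm(self, x, norm):
        # Apply LayerNorm over the channel dimension of a (B, C, H, W) tensor.
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, x):
        x = x + self.mixer(self._chan_norm(x, self.norm1))
        y = self.norm2(x.permute(0, 2, 3, 1))          # (B, H, W, C) for the perceptron
        return x + self.mlp(y).permute(0, 3, 1, 2)

class ResidualGroup(nn.Module):
    """6 consecutive attention blocks followed by one convolution, with a skip connection."""
    def __init__(self, dim=180, n_blocks=6):
        super().__init__()
        self.blocks = nn.Sequential(*[AttentionBlock(dim) for _ in range(n_blocks)])
        self.conv = nn.Conv2d(dim, dim, 3, 1, 1)

    def forward(self, x):
        return x + self.conv(self.blocks(x))

class PostProcessingModel(nn.Module):
    def __init__(self, dim=180, n_groups=6):
        super().__init__()
        self.shallow = nn.Conv2d(1, dim, 3, 1, 1)                                   # F_s
        self.deep = nn.Sequential(*[ResidualGroup(dim) for _ in range(n_groups)])   # F_d
        self.reconstruct = nn.Conv2d(dim, 1, 3, 1, 1)                               # F_r

    def forward(self, x):
        return self.reconstruct(self.deep(self.shallow(x)))
```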
Step 3: train the 3 visual self-attention network models.
In step 2, 3 visual self-attention models are built, one for each kind of multi-head self-attention block; the loss function of each model is:
L = || F_r(F_d(F_s(H_rec))) - H ||_2
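As a sketch, one optimization step under this loss could look like the following; the optimizer, learning rate and batch handling are illustrative assumptions, and PostProcessingModel refers to the skeleton sketched above.

```python
import torch

model = PostProcessingModel()                                 # one of the three variants
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)     # assumed optimizer and learning rate

def train_step(h_rec, h):
    """h_rec: corrupted luma batch (B, 1, H, W); h: matching original luma batch."""
    optimizer.zero_grad()
    pred = model(h_rec)                                       # F_r(F_d(F_s(H_rec)))
    loss = torch.linalg.vector_norm(pred - h, ord=2)          # L2-norm loss, as in the formula above
    loss.backward()
    optimizer.step()
    return loss.item()
```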
Step 4: post-processing frames are generated on the test sequence.
For a corrupted frame I of the video sequence to be tested, the corresponding original frame is I_gt, and the rate-distortion optimization parameter of the lossy compression model is λ. The corrupted frame I is post-processed with the 3 post-processing models trained in step 3 to obtain 3 post-processed frames I_1, I_2, I_3, which are combined with the unprocessed corrupted frame I to form 4 candidate frames.
Step 5: block-level rate-distortion optimization is performed on the frame to be selected.
The 4 candidate frames are divided into 512×512 blocks, with any shortfall padded with zeros. Each block is then partitioned recursively as a quadtree, with a minimum block size of 4×4. Every split of a block consumes 5 extra bits to record the block position, and 2 bits are used to record which candidate frame's corresponding block is selected for the final reconstructed frame. After partitioning, the block with the minimum rate-distortion cost is selected according to the rate-distortion formula R + λD; once the blocks at all positions have been selected, the final reconstructed video frame is obtained. Here R is the sum of the compression bit rate of the corrupted frame I and the extra bits consumed by block partitioning, and D is the mean squared error between each candidate block and the corresponding block of the original frame.
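A simplified sketch of this block-level rate-distortion optimization is given below. It follows the bit accounting described above (5 bits per split, 2 bits to signal the chosen candidate) and uses a plain quadtree that halves the block at every level; the frame-level compression bit rate is treated as a constant that does not affect per-block decisions, and the greedy recursion and function names are illustrative assumptions. In use, candidates would be the three post-processed frames I_1, I_2, I_3 plus the unprocessed corrupted frame I, and lam would be the λ read from the encoder.

```python
import numpy as np

MIN_BLOCK = 4
SPLIT_BITS = 5     # bits to record the position of a split block
SELECT_BITS = 2    # bits to record which of the 4 candidates is chosen

def rdo_block(candidates, original, y, x, size, lam):
    """Return (cost, reconstruction) for the square block at (y, x) of the given size."""
    ref = original[y:y + size, x:x + size]
    # Cost of not splitting: signal one of the 4 candidates (2 bits) + lambda * MSE.
    blocks = [c[y:y + size, x:x + size] for c in candidates]
    mses = [float(np.mean((b - ref) ** 2)) for b in blocks]
    best = int(np.argmin(mses))
    no_split_cost = SELECT_BITS + lam * mses[best]
    no_split_rec = blocks[best].copy()

    if size <= MIN_BLOCK:
        return no_split_cost, no_split_rec

    # Cost of splitting as a quadtree: 5 bits for the split plus the children's costs.
    half = size // 2
    split_cost = SPLIT_BITS
    split_rec = np.empty_like(no_split_rec)
    for dy in (0, half):
        for dx in (0, half):
            c, r = rdo_block(candidates, original, y + dy, x + dx, half, lam)
            split_cost += c
            split_rec[dy:dy + half, dx:dx + half] = r

    if split_cost < no_split_cost:
        return split_cost, split_rec
    return no_split_cost, no_split_rec

def rdo_frame(candidates, original, lam, block=512):
    """Assemble the reconstruction from 512x512 top-level blocks; frames are
    zero-padded to a multiple of the block size, as described above."""
    h, w = original.shape
    ph, pw = -(-h // block) * block, -(-w // block) * block
    pad = lambda f: np.pad(f, ((0, ph - h), (0, pw - w)))
    cand_p = [pad(c) for c in candidates]
    orig_p = pad(original)
    out = np.zeros_like(orig_p)
    for y in range(0, ph, block):
        for x in range(0, pw, block):
            _, rec = rdo_block(cand_p, orig_p, y, x, block, lam)
            out[y:y + block, x:x + block] = rec
    return out[:h, :w]
```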
Experiment
The method is compared against the video coding reference software VTM 16.2 of the Versatile Video Coding standard. The common test sequences BQSquare, RaceHorses, BasketballPass and BlowingBubbles are encoded and reconstructed in the all-intra (full key-frame) configuration. The proposed method is then applied to post-process the luminance component of the reconstructed video sequences, and the BD-rate of the post-processed sequences is computed using the quality of the sequences reconstructed by VTM 16.2 as the anchor. BD-rate indicates the change in bit rate of the optimized algorithm relative to the original algorithm at the same objective video quality; a negative BD-rate indicates that the coding performance of the optimized algorithm is improved. The results are shown in the table below.
TABLE 1 BD-rate of video sequences after post-processing by the method of the present invention

Sequence          BD-rate
BQSquare          -5.5%
RaceHorses        -5.0%
BasketballPass    -7.5%
BlowingBubbles    -3.5%
As can be seen from Table 1, the BD-rate is negative on every test sequence, demonstrating that the coding performance of the method of the present invention is significantly improved over VTM 16.2.
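BD-rate itself is the standard Bjøntegaard metric rather than something defined by the invention; as a reference, a common way to compute it is sketched below, fitting third-order polynomials to the rate-PSNR curves of anchor and test codecs and integrating the rate difference over the overlapping quality range. This reflects the usual published procedure, not code from the invention.

```python
import numpy as np

def bd_rate(anchor_rate, anchor_psnr, test_rate, test_psnr):
    """Bjontegaard delta rate (%) of the test codec versus the anchor.

    Each argument is a list of (typically four) rate points and the matching PSNR values.
    A negative result means the test codec needs fewer bits at equal quality."""
    la, lt = np.log(anchor_rate), np.log(test_rate)
    # Fit log-rate as a cubic polynomial of PSNR for both curves.
    pa = np.polyfit(anchor_psnr, la, 3)
    pt = np.polyfit(test_psnr, lt, 3)
    lo = max(min(anchor_psnr), min(test_psnr))
    hi = min(max(anchor_psnr), max(test_psnr))
    # Integrate the fitted log-rate curves over the common PSNR interval.
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (it - ia) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100
```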
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A video block level rate distortion optimization method based on a visual self-attention network, comprising the steps of:
Constructing post-processing models based on a visual self-attention network, each comprising a shallow feature extraction layer, a deep feature extraction layer and a reconstruction layer; the deep feature extraction layer comprises a plurality of consecutive residual blocks, each containing a plurality of consecutive visual self-attention blocks and a convolution layer; each visual self-attention block comprises two residual branches, the first consisting of a normalization layer and a multi-head self-attention layer and the second consisting of a normalization layer and a two-layer perceptron;
three such post-processing models are constructed; they differ only in the multi-head self-attention layer, which is respectively a conventional multi-head self-attention layer, a multi-head feature linear transformation layer and a grouped convolution layer; the 3 post-processing models are trained with a training data set;
for an original corrupted frame, first reading the rate-distortion optimization parameter of the video compression coding at the encoder, then processing the corrupted video frame produced by the coding loop with the 3 trained post-processing models: the shallow feature extraction layer extracts shallow features, the deep feature extraction layer extracts deep features from the shallow features, and the reconstruction layer generates 3 post-processed frames;
combining the 3 post-processed frames with the unprocessed corrupted frame to form 4 candidate frames, recursively partitioning the 4 frames into blocks with the same partitioning method, computing for each block the mean squared error against the corresponding block of the original frame, computing the bit rate consumed by partitioning each block, and selecting the block with the minimum rate-distortion cost to compose the final reconstructed video frame.
2. The method of claim 1, wherein the shallow feature extraction layer comprises only one convolution layer and the reconstruction layer comprises only one convolution layer.
3. The method of claim 1, wherein the training data set includes a luminance component of an original frame of the video sequence and a luminance component of a corrupted frame obtained from the original sequence via a video coding algorithm, the luminance component of the corrupted frame being input into the model during training.
4. The method of claim 1, wherein the deep feature extraction layer comprises 6 residual blocks, each residual block comprising 6 visual self-attention blocks and 1 convolution layer.
5. The method of claim 1, wherein the perceptron layer uses Gaussian error linear units (GELU) as the activation function.
6. The method of claim 1, wherein the partitioning method is: dividing the 4 candidate frames into 512×512 large blocks using a quadtree and performing rate-distortion optimization on each; dividing each 512×512 block into 4 128×128 blocks and performing rate-distortion optimization again; and recursively partitioning in this way until the blocks reach 4×4.
7. The method of claim 6, wherein, when the frame is divided into 512×512 large blocks, any shortfall is padded with zeros.
8. The method of claim 6, wherein each block division uses 5 bits of space to record the block position, and 2 bits of space are used to record which candidate frame's corresponding block is selected for the final reconstructed frame.
9. The method of claim 1, wherein, after partitioning, the block with the minimum rate-distortion cost is selected according to the rate-distortion formula R + λD, where R is the sum of the compression bit rate of the corrupted frame and the extra space consumed by block partitioning, and D is the mean squared error between each candidate block and the corresponding block of the original frame.
10. A video block-level rate-distortion optimization system based on a visual self-attention network, comprising 3 post-processing models based on a visual self-attention network and a block-level rate-distortion optimization module;
the 3 post-processing models each comprise a shallow feature extraction layer, a deep feature extraction layer and a reconstruction layer; the deep feature extraction layer comprises a plurality of consecutive residual blocks, each containing a plurality of consecutive visual self-attention blocks and a convolution layer; each visual self-attention block comprises two residual branches, the first consisting of a normalization layer and a multi-head self-attention layer and the second consisting of a normalization layer and a two-layer perceptron; the 3 post-processing models differ only in the multi-head self-attention layer, which is respectively a conventional multi-head self-attention layer, a multi-head feature linear transformation layer and a grouped convolution layer;
the block-level rate-distortion optimization module is used to recursively partition video frames into blocks, compute for each block the mean squared error against the corresponding block of the original frame, compute the bit rate consumed by partitioning each block, and select the block with the minimum rate-distortion cost to compose the final reconstructed video frame;
the 3 post-processing models are trained with a training data set; the corrupted video frames produced by the coding loop are processed with the 3 trained post-processing models, where the shallow feature extraction layer extracts shallow features, the deep feature extraction layer extracts deep features from the shallow features, and the reconstruction layer generates 3 post-processed frames; the 3 post-processed frames plus the unprocessed corrupted frame form 4 candidate frames, which are processed by the block-level rate-distortion optimization module to obtain the final reconstructed video frame.
Application CN202210735183.6A, filed 2022-06-27 with priority date 2022-06-27: Video block-level rate distortion optimization method and system based on visual self-attention network; granted as CN115278249B (Active).

Priority Applications (1)

CN202210735183.6A (priority and filing date 2022-06-27), granted as CN115278249B: Video block-level rate distortion optimization method and system based on visual self-attention network
Publications (2)

CN115278249A (published 2022-11-01, application publication)
CN115278249B (published 2024-06-28, granted patent)

Family

ID=83764321

Family Applications (1)

CN202210735183.6A (filed 2022-06-27, priority date 2022-06-27, Active): Video block-level rate distortion optimization method and system based on visual self-attention network

Country Status (1)

CN: CN115278249B

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109257600A (en) * 2018-11-28 2019-01-22 福建帝视信息科技有限公司 A kind of adaptive minimizing technology of video compression artifact based on deep learning
CN110351568A (en) * 2019-06-13 2019-10-18 天津大学 A kind of filtering video loop device based on depth convolutional network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111405283B (en) * 2020-02-20 2022-09-02 北京大学 End-to-end video compression method, system and storage medium based on deep learning
CN113052764B (en) * 2021-04-19 2022-11-08 东南大学 Video sequence super-resolution reconstruction method based on residual connection



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant