CN108012157B - Method for constructing convolutional neural network for video coding fractional pixel interpolation - Google Patents

Method for constructing convolutional neural network for video coding fractional pixel interpolation

Info

Publication number
CN108012157B
CN108012157B · CN201711207766.7A
Authority
CN
China
Prior art keywords
neural network
convolutional neural
fractional pixel
video coding
pixel interpolation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711207766.7A
Other languages
Chinese (zh)
Other versions
CN108012157A (en)
Inventor
宋利 (Li Song)
张翰 (Han Zhang)
杨小康 (Xiaokang Yang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201711207766.7A priority Critical patent/CN108012157B/en
Publication of CN108012157A publication Critical patent/CN108012157A/en
Application granted granted Critical
Publication of CN108012157B publication Critical patent/CN108012157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117Filters, e.g. for pre-processing or post-processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/587Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/625Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Discrete Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a method for constructing a convolutional neural network for video coding fractional pixel interpolation, which comprises the following steps: collecting images with different contents and resolutions to form an original training data set containing data of different types and coding complexities; preprocessing the original training data set to obtain training data that conform to the characteristics of video coding inter-frame prediction fractional pixel interpolation; constructing a deep convolutional neural network to obtain a network structure suited to video coding inter-frame prediction fractional pixel interpolation; and inputting the preprocessed data into the constructed convolutional neural network and training it with the original training data set as the corresponding ground truth. The invention enables the convolutional neural network to be trained smoothly, and the fractional pixels interpolated by the trained network satisfy the fractional pixel interpolation characteristics of video coding.

Description

Method for constructing convolutional neural network for video coding fractional pixel interpolation
Technical Field
The invention relates to a method in the technical field of image processing, and in particular to a convolutional neural network method suitable for fractional pixel interpolation in video coding inter-frame prediction.
Background
Inter-frame prediction is a key technology in video coding standards. By exploiting the similarity of video content between frames, it effectively removes the temporal redundancy of the video and thereby improves coding compression efficiency. At the same time, because of the discrete sampling performed during digitization, real object motion does not necessarily follow the sampling grid. To further improve the accuracy of motion prediction, object motion in video coding standards is expressed in units of fractional pixels. Pixel values at fractional pixel positions on the sampling grid do not actually exist, so in practice they must be interpolated from the pixel values at the actually existing integer positions.
However, the interpolation filters currently used in video coding to generate fractional pixels are hand-designed on the basis of a priori assumptions. Their parameters are fixed, and as video content grows richer and video resolution keeps increasing, such fixed-parameter filters are no longer suitable in all cases.
Deep learning fits massive amounts of data with a designed neural network to obtain a generally applicable model. Deep-learning-based methods have achieved major breakthroughs in semantic-level problems such as target tracking and pedestrian detection, and have markedly improved results in pixel-level problems such as image super-resolution.
Inter-frame prediction fractional pixel interpolation and image super-resolution are similar to a certain extent: both generate a large image from a genuinely existing small image at a given magnification. However, image super-resolution generates an entire high-resolution image from a low-resolution image, whereas inter-frame prediction fractional pixel interpolation generates only the remaining fractional-position pixels from the actually existing integer-position pixels and must guarantee that the integer-position pixels remain unchanged. In addition, for inter-frame prediction fractional pixel interpolation the pixels at fractional positions do not actually exist, so no real ground truth is available during the training of a convolutional neural network and the training cannot proceed normally.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method for constructing a convolutional neural network suitable for video coding inter-frame prediction fractional pixel interpolation. It exploits the strong performance of convolutional neural networks on the image super-resolution problem while taking the characteristics of video coding inter-frame prediction fractional pixel interpolation into account, and designs both a convolutional neural network suited to this task and a preprocessing operation that allows the training to proceed smoothly, so that the objective quality of video coding reconstructed frames can be improved and the coding efficiency increased.
In order to achieve the above object, the method for constructing a convolutional neural network for video coding fractional pixel interpolation according to the present invention comprises:
collecting images with different contents and different resolutions to form an original training data set containing data with different types and different coding complexities;
preprocessing the collected original training data set to obtain training data that conform to the video coding inter-frame prediction fractional pixel interpolation characteristics and serve as the input data for training the convolutional neural network;
building a deep convolutional neural network, taking the fractional pixel interpolation characteristics of video coding into account, to obtain a convolutional neural network structure suitable for video coding inter-frame prediction fractional pixel interpolation;
and inputting the preprocessed data into the constructed convolutional neural network and training it with the original training data set as the corresponding ground truth, to obtain a convolutional neural network model suitable for video coding inter-frame prediction fractional pixel interpolation.
Preferably, the preprocessing operation is as follows:
a) down-sampling the images in the original training data set by the magnification corresponding to the fractional pixel positions to be generated by interpolation, to obtain the low-resolution training data used in step b);
b) compression-coding the low-resolution training data according to the still-image coding configuration of the video coding standard, to obtain the low-resolution coded reconstructed images used in step c);
c) up-sampling the low-resolution coded reconstructed images by the magnification used in step a), restoring them to the original image size, to obtain the input data for training the convolutional neural network.
More preferably, in c), the upsampling operation on the low-resolution encoded reconstructed image ensures that the pixel values of the integer pixel positions of the high-resolution image after the upsampling are consistent with the low-resolution encoded reconstructed image before the upsampling.
Preferably, in the building of the deep convolutional neural network, the built deep convolutional neural network comprises 20 weight layers and 1 weight masking layer; for the weight masking layer, W_I is the weight of the integer pixel positions and W_H is the weight of the fractional pixel positions, all fractional pixel positions sharing one weight.
More preferably, in the video coding inter-frame prediction fractional pixel interpolation, the pixel values at integer pixel positions remain unchanged and only the fractional pixel positions are generated.
Compared with the prior art, the invention has the beneficial effects that:
In addition to exploiting the strong capability of deep convolutional neural networks to extract features from massive data, the invention takes into account the special characteristics of video coding data and the ways in which inter-frame prediction fractional pixel interpolation differs from image super-resolution. It redesigns the deep convolutional neural network and designs a matching preprocessing operation so that the training of the convolutional neural network can proceed smoothly, thereby obtaining a convolutional neural network model suitable for video coding fractional pixel interpolation, improving the objective quality of the compression-coded reconstructed video, and increasing the video coding efficiency.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of a method of one embodiment of the present invention;
FIG. 2 is a schematic diagram of a convolutional neural network structure according to an embodiment of the present invention;
FIG. 3 is a diagram of integer pixel location, fractional-half pixel location, and fractional-quarter pixel location according to an embodiment of the invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention; all such variations and modifications fall within the scope of the present invention.
The invention provides a method for constructing a convolutional neural network for fractional pixel interpolation of video coding, which comprises the following design ideas as shown in figure 1:
collecting images with different contents and different resolutions to obtain a training data set containing data with different types and different coding complexities;
Preprocessing the collected training data set to obtain the input data for training the convolutional neural network. The preprocessing operation specifically comprises the following steps (a minimal code sketch is given after the list):
a) down-sampling the images in the original training data set by the magnification corresponding to the fractional pixel positions to be generated by interpolation, to obtain the low-resolution training data used in step b);
b) compression-coding the low-resolution training data according to the still-image coding configuration of the video coding standard, to obtain the low-resolution coded reconstructed images used in step c);
c) up-sampling the low-resolution coded reconstructed images by the magnification used in step a), restoring them to the original image size, to obtain the input data for training the convolutional neural network.
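As a concrete illustration of steps a)-c), the following Python sketch outlines the preprocessing pipeline for one-half pixel interpolation (factor 2). The function names are illustrative and not from the patent; the HEVC All-Intra compression of step b) is represented by a placeholder that would wrap an external HM encoder/decoder, and the fractional-position fill uses a crude nearest-neighbour stand-in for the DCT-based filter described later.

import numpy as np

def downsample(image: np.ndarray, factor: int = 2) -> np.ndarray:
    """Step a): keep one sample per factor x factor block (simple decimation)."""
    return image[::factor, ::factor]

def encode_decode_hevc_ai(image: np.ndarray, qp: int = 32) -> np.ndarray:
    """Step b): placeholder for HEVC All-Intra compression and reconstruction of the
    low-resolution image (in practice this would wrap an external HM encoder/decoder);
    the image is returned unchanged here so the sketch stays runnable."""
    return image

def upsample_keep_integers(low_res_rec: np.ndarray, factor: int = 2) -> np.ndarray:
    """Step c): enlarge back to the original size, copying the integer-position samples
    unchanged; fractional positions are filled by a crude nearest-neighbour stand-in
    for the DCT-based filter described later."""
    h, w = low_res_rec.shape
    high = np.zeros((h * factor, w * factor), dtype=low_res_rec.dtype)
    for dy in range(factor):
        for dx in range(factor):
            high[dy::factor, dx::factor] = low_res_rec
    return high  # high[::factor, ::factor] equals low_res_rec exactly

def preprocess(original: np.ndarray, factor: int = 2, qp: int = 32) -> np.ndarray:
    """Full preprocessing chain a) -> b) -> c) producing the CNN training input."""
    low = downsample(original, factor)
    rec = encode_decode_hevc_ai(low, qp)
    return upsample_keep_integers(rec, factor)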
Establishing a deep convolutional neural network suitable for video coding inter-frame prediction fractional pixel interpolation, taking an image obtained through preprocessing operation as the input of the network, simultaneously taking a corresponding image in an original training data set as a corresponding true value, setting training parameters and training the convolutional neural network;
and performing fractional pixel interpolation operation by using the convolutional neural network model obtained by training, and realizing video coding inter-frame prediction fractional pixel interpolation based on the convolutional neural network.
In preprocessing step b), the down-sampled low-resolution image is compression-coded according to the still-image coding configuration of the video coding standard, so that the reconstruction of the low-resolution image becomes an image carrying the characteristics of video coding data.
In preprocessing step c), the up-sampling of the compression-coded low-resolution reconstructed image must ensure that the pixel values at the integer pixel positions of the up-sampled high-resolution image are consistent with the pixel values of the low-resolution image before up-sampling; only the pixel values at the fractional pixel positions are generated.
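This constraint can be written as a simple check (illustrative helper, assuming a 2x layout where even rows and columns hold the integer positions):

import numpy as np

def integer_positions_preserved(high_res: np.ndarray,
                                low_res_rec: np.ndarray,
                                factor: int = 2) -> bool:
    """True if the up-sampled image carries the low-resolution coded reconstruction
    unchanged at every integer pixel position; only fractional positions are new."""
    return np.array_equal(high_res[::factor, ::factor], low_res_rec)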
On the basis of the image super-resolution convolutional neural network, the invention takes into account the inherent characteristic of video coding fractional pixel interpolation, namely that the integer-position pixels remain unchanged and only the fractional-position pixels are generated. The convolutional neural network is redesigned accordingly and, together with the matching preprocessing operation, can be trained smoothly; the fractional pixels obtained by interpolation with the trained convolutional neural network satisfy the fractional pixel interpolation characteristics of video coding, and using them for fractional pixel interpolation improves the video coding efficiency. In addition, the convolutional neural network obtained by the invention can generate the pixel values of all fractional pixel positions simultaneously in one operation.
The invention is applied here to the latest video coding standard, High Efficiency Video Coding (HEVC). A construction method for a convolutional neural network suitable for HEVC inter-frame prediction one-half pixel interpolation is introduced, and the concrete implementation details, such as data preprocessing and the construction of the convolutional neural network structure, are explained in detail. The invention is of course also applicable to other coding standards.
1. Data preprocessing process
In the data preprocessing step that compression-codes the low-resolution images, the down-sampled low-resolution images are coded with the HEVC All-Intra (AI) configuration.
For the up-sampling of the low-resolution compression-coded reconstructed images in the preprocessing, an interpolation filter based on the discrete cosine transform (DCT) is adopted. For the one-half pixel positions, this DCT-based interpolation filter is an 8-tap filter whose tap coefficients are shown in Table 1.
TABLE 1 Tap coefficients of the DCT-based interpolation filter

Index i       -3   -2   -1    0    1    2    3    4
hfilter[i]    -1    4  -11   40   40  -11    4   -1
The one-half pixel position pixels in FIG. 3 are generated with the DCT-based interpolation filter as follows:

b_{0,0} = ( Σ_{i=-3}^{4} A_{i,0} · hfilter[i] ) >> (B − 8)    (1)
h_{0,0} = ( Σ_{j=-3}^{4} A_{0,j} · hfilter[j] ) >> (B − 8)    (2)
j_{0,0} = ( Σ_{j=-3}^{4} b_{0,j} · hfilter[j] ) >> 6          (3)

where b_{0,0}, h_{0,0} and j_{0,0} denote the pixel values at the one-half pixel positions, A_{i,j} denotes the pixel values at the integer pixel positions, hfilter[i] denotes the tap coefficients of the DCT-based interpolation filter, and B denotes the bit depth of the pixel values.
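A small Python sketch of the horizontal half-pel filtering of equation (1) with the Table 1 coefficients is given below; border handling and the final normalisation/clipping are simplified relative to the HEVC specification, and the function name is illustrative.

import numpy as np

# Tap coefficients hfilter[i], i = -3 .. 4, of the DCT-based half-sample filter (Table 1).
HFILTER = np.array([-1, 4, -11, 40, 40, -11, 4, -1], dtype=np.int64)

def half_pel_row(A: np.ndarray, bit_depth: int = 8) -> np.ndarray:
    """Horizontal one-half pixel values b for one row of integer samples A, following
    equation (1). Borders are handled by edge replication for brevity (HEVC pads the
    reference picture); the later >>6 normalisation and clipping are omitted."""
    shift = bit_depth - 8                      # B - 8
    padded = np.pad(A.astype(np.int64), (3, 4), mode="edge")
    b = np.empty(A.shape[0], dtype=np.int64)
    for x in range(A.shape[0]):
        window = padded[x:x + 8]               # A[x-3] .. A[x+4]
        b[x] = int(np.dot(window, HFILTER)) >> shift
    return b

# Example: a constant row stays flat; each half-pel value is 100 * 64 = 6400
# before the later >>6 normalisation step.
row = np.array([100, 100, 100, 100, 100, 100, 100, 100], dtype=np.int64)
print(half_pel_row(row))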
2. Convolutional neural network structure construction
The invention adopts the network of "Accurate Image Super-Resolution Using Very Deep Convolutional Networks", published by J. Kim et al. at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), as the basic framework, and adds a weight masking layer to the original framework, where W_I is the weight of the pixel values at integer positions and W_H is the weight of the pixel values at one-half pixel positions.
As shown in FIG. 2, the convolutional neural network structure constructed in this embodiment comprises 20 convolutional layers and 1 weight masking layer. Each convolutional layer, except the first and the last, contains 64 different filters, each of size 3 × 3 × 64. The first convolutional layer contains 64 filters of size 3 × 3 × 1, and the last convolutional layer contains 1 filter of size 3 × 3 × 64. For the weight masking layer, integer pixel positions and fractional pixel positions use different weights, where W_I is the weight of the integer pixel positions and W_H is the weight of the one-half pixel positions. The input of the convolutional neural network in this embodiment is the high-resolution image of the target size obtained by preprocessing the low-resolution image. The convolutional neural network in this embodiment predicts the residual image between the finally output high-resolution image and the initially input pre-processed image, defined as follows:

R = Y_H − X_ILR    (4)

where Y_H denotes the finally output high-resolution image and X_ILR denotes the initially input pre-processed image.
The residual image predicted by the convolutional neural network is added to the input pre-processed image to obtain the finally output high-resolution image.
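The structure can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: the patent does not give the exact formulation of the weight masking layer, so it is modelled here as two learnable scalars, W_I applied at integer positions and W_H shared by all half-pel positions, multiplying the predicted residual; class and variable names are illustrative.

import torch
import torch.nn as nn

class WeightMask(nn.Module):
    """Weight masking layer: one learnable weight W_I shared by all integer pixel
    positions and one learnable weight W_H shared by all half-pel positions
    (assumed formulation: element-wise scaling of the residual)."""
    def __init__(self):
        super().__init__()
        self.w_int = nn.Parameter(torch.zeros(1))   # W_I
        self.w_half = nn.Parameter(torch.ones(1))   # W_H

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        _, _, h, w = residual.shape
        is_int = torch.zeros(1, 1, h, w, dtype=torch.bool, device=residual.device)
        is_int[..., ::2, ::2] = True                # even rows/cols = integer positions (2x case)
        return residual * torch.where(is_int, self.w_int, self.w_half)

class InterpolationCNN(nn.Module):
    """20 convolutional layers (VDSR-style, 64 filters of 3x3) followed by the weight
    masking layer; the network predicts the residual R = Y_H - X_ILR."""
    def __init__(self, depth: int = 20, channels: int = 64):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, 1, 3, padding=1))
        self.body = nn.Sequential(*layers)
        self.mask = WeightMask()

    def forward(self, x_ilr: torch.Tensor) -> torch.Tensor:
        residual = self.mask(self.body(x_ilr))      # masked residual prediction
        return x_ilr + residual                     # Y_H = X_ILR + R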
3. Training convolutional neural networks
The training of the convolutional neural network adopts the Euclidean distance as the loss function:

L(θ) = (1/N) · Σ_{i=1}^{N} || F(X_i; θ) − Y_i ||²    (5)

where θ denotes the set of parameters that the convolutional neural network needs to learn, X_i denotes a training image, Y_i denotes the corresponding ground-truth image in the original training data set, and F(X_i; θ) denotes the finally output high-resolution image. Since the convolutional neural network in this embodiment predicts the residual image, F(X_i; θ) in equation (5) is expressed as

F(X_i; θ) = X_i + R(X_i; θ)    (6)

where X_i denotes the initially input pre-processed image and R(X_i; θ) denotes the residual image predicted by the network.
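The training then reduces to a standard residual-learning step; a minimal sketch follows, in which the optimizer and learning rate are assumptions and InterpolationCNN refers to the illustrative class above.

import torch
import torch.nn.functional as F

def training_step(model, x_ilr, y_true, optimizer):
    """One optimisation step with the Euclidean loss of equation (5).
    x_ilr:  pre-processed input images X_i, shape (N, 1, H, W)
    y_true: corresponding original (ground-truth) images Y_i"""
    optimizer.zero_grad()
    y_pred = model(x_ilr)              # F(X_i; theta) = X_i + R(X_i; theta), eq. (6)
    loss = F.mse_loss(y_pred, y_true)  # mean squared Euclidean distance, eq. (5)
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage (hyper-parameters assumed, not from the patent):
# model = InterpolationCNN()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = training_step(model, x_batch, y_batch, optimizer)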
The convolutional neural network model suitable for the video coding inter-frame prediction fractional pixel interpolation is obtained through the training.
4. Effects of the implementation
The convolutional neural network model obtained by training in this embodiment is applied to the HEVC coding framework, and both the modified encoder and the standard HEVC encoder are used to code the test sequences. The test sequences are listed in Table 2; all of them are in YUV 4:2:0 format with a bit depth of 8.
Table 2 details of the test sequences
The HEVC encoder used in this embodiment is HM-16.7, the coding configuration is the low-delay P (LDP) common test configuration, and the quantization parameters (QP) used in coding are 22, 27, 32 and 37.
Under the above implementation conditions, the coding test results shown in Table 3 are obtained. The performance index used in Table 3 is BD-Rate, which indicates the percentage of bit rate saved, at the same peak signal-to-noise ratio (PSNR), when inter-frame prediction one-half pixel interpolation is performed with the convolutional neural network trained in this embodiment instead of the standard HEVC encoder. As shown in Table 3, under these conditions the average BD-Rate of the Y, U, V components is −0.9%, −0.1%, respectively. The gain is most significant for the sequence BasketballPass, where the Y, U, V components reach −2.4%, −0.1%, −1.6%. Table 3 shows that, compared with the standard HEVC encoder, performing one-half pixel interpolation of the luma Y component with the trained convolutional neural network brings a clear improvement in coding efficiency. In addition, because the encoder uses a technique that predicts the chroma components from the luma component, the chroma components also obtain a certain coding-performance improvement as the reconstruction quality of the luma component improves.
TABLE 3 test sequence coding Performance (BD-Rate)
To further show that the convolutional neural network constructed in the present invention is better suited to fractional pixel interpolation for video coding inter-frame prediction, Table 4 gives the test results obtained when one-half pixel interpolation is performed directly with a convolutional neural network trained for the image super-resolution problem, again compared against the standard HEVC encoder. As Table 4 shows, directly using the image super-resolution convolutional neural network for fractional pixel interpolation causes a significant coding-performance loss.
TABLE 4 convolutional neural network coding test results using image super resolution (BD-Rate)
In conclusion, the invention designs a dedicated convolutional neural network for video coding inter-frame prediction fractional pixel interpolation together with a matching data preprocessing process, so that the training of the convolutional neural network can proceed smoothly and the fractional pixels generated with the trained network satisfy the specific requirements of fractional pixel interpolation. Fractional pixel interpolation with the convolutional neural network obtained by the invention achieves a notable improvement in coding performance and is better suited to the fractional pixel interpolation part of video coding inter-frame prediction.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (5)

1. A method for constructing a convolutional neural network for video coding fractional pixel interpolation, characterized by comprising the following steps:
collecting images with different contents and different resolutions to form an original training data set containing data with different types and different coding complexities;
preprocessing the collected original training data set to obtain training data that conform to the video coding inter-frame prediction fractional pixel interpolation characteristics and serve as the input data for training the convolutional neural network;
building a deep convolutional neural network, taking the fractional pixel interpolation characteristics of video coding into account, to obtain a convolutional neural network structure suitable for video coding inter-frame prediction fractional pixel interpolation;
inputting the preprocessed data into the built convolutional neural network and training it with the original training data set as the corresponding ground truth, to obtain a convolutional neural network model suitable for video coding inter-frame prediction fractional pixel interpolation;
the preprocessing operation comprising the following steps:
a) down-sampling the images in the original training data set by the magnification corresponding to the fractional pixel positions to be generated by interpolation, to obtain the low-resolution training data used in step b);
b) coding the low-resolution training data according to the still-image coding configuration of the video coding standard, to obtain the low-resolution coded reconstructed images used in step c);
c) up-sampling the low-resolution coded reconstructed images by the magnification used in step a) with an interpolation filter based on the discrete cosine transform, and restoring them to the original image size, to obtain the input data for training the convolutional neural network.
2. The method for constructing a convolutional neural network for video coding fractional pixel interpolation according to claim 1, characterized in that: in step c), the up-sampling of the low-resolution coded reconstructed image ensures that the pixel values at the integer pixel positions of the up-sampled high-resolution image are consistent with the low-resolution coded reconstructed image before up-sampling.
3. The method for constructing a convolutional neural network for video coding fractional pixel interpolation according to any one of claims 1-2, characterized in that: in the building of the deep convolutional neural network, the built deep convolutional neural network comprises 20 weight layers and 1 weight masking layer; for the weight masking layer, W_I is the weight of the integer pixel positions and W_H is the weight of the fractional pixel positions, all fractional pixel positions sharing one weight.
4. The method for constructing a convolutional neural network for video coding fractional pixel interpolation according to claim 3, characterized in that: in the video coding inter-frame prediction fractional pixel interpolation, the pixel values at integer pixel positions are unchanged and only the fractional pixel positions are generated.
5. Use of a convolutional neural network model constructed by the method of any one of claims 1 to 4, characterized in that: the convolutional neural network model is applied to fractional pixel interpolation to realize convolutional-neural-network-based video coding inter-frame prediction fractional pixel interpolation.
CN201711207766.7A 2017-11-27 2017-11-27 Method for constructing convolutional neural network for video coding fractional pixel interpolation Active CN108012157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711207766.7A CN108012157B (en) 2017-11-27 2017-11-27 Method for constructing convolutional neural network for video coding fractional pixel interpolation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711207766.7A CN108012157B (en) 2017-11-27 2017-11-27 Method for constructing convolutional neural network for video coding fractional pixel interpolation

Publications (2)

Publication Number Publication Date
CN108012157A CN108012157A (en) 2018-05-08
CN108012157B true CN108012157B (en) 2020-02-04

Family

ID=62054016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711207766.7A Active CN108012157B (en) 2017-11-27 2017-11-27 Method for constructing convolutional neural network for video coding fractional pixel interpolation

Country Status (1)

Country Link
CN (1) CN108012157B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502954B (en) * 2018-05-17 2023-06-16 杭州海康威视数字技术股份有限公司 Video analysis method and device
CN110794255B (en) * 2018-08-01 2022-01-18 北京映翰通网络技术股份有限公司 Power distribution network fault prediction method and system
CN110794254B (en) * 2018-08-01 2022-04-15 北京映翰通网络技术股份有限公司 Power distribution network fault prediction method and system based on reinforcement learning
CN110933432A (en) * 2018-09-19 2020-03-27 珠海金山办公软件有限公司 Image compression method, image decompression method, image compression device, image decompression device, electronic equipment and storage medium
CN111010568B (en) * 2018-10-06 2023-09-29 华为技术有限公司 Training method and device of interpolation filter, video image coding and decoding method and coder-decoder
CN109361919A (en) * 2018-10-09 2019-02-19 四川大学 A kind of image coding efficiency method for improving combined super-resolution and remove pinch effect
CN109525859B (en) * 2018-10-10 2021-01-15 腾讯科技(深圳)有限公司 Model training method, image sending method, image processing method and related device equipment
KR102525578B1 (en) * 2018-10-19 2023-04-26 삼성전자주식회사 Method and Apparatus for video encoding and Method and Apparatus for video decoding
CN109451308B (en) * 2018-11-29 2021-03-09 北京市商汤科技开发有限公司 Video compression processing method and device, electronic equipment and storage medium
CN109785279B (en) * 2018-12-28 2023-02-10 江苏师范大学 Image fusion reconstruction method based on deep learning
CN111711817B (en) * 2019-03-18 2023-02-10 四川大学 HEVC intra-frame coding compression performance optimization method combined with convolutional neural network
CN111800630A (en) * 2019-04-09 2020-10-20 Tcl集团股份有限公司 Method and system for reconstructing video super-resolution and electronic equipment
CN110072119B (en) * 2019-04-11 2020-04-10 西安交通大学 Content-aware video self-adaptive transmission method based on deep learning network
CN110177282B (en) * 2019-05-10 2021-06-04 杭州电子科技大学 Interframe prediction method based on SRCNN
CN110099280B (en) * 2019-05-24 2020-05-08 浙江大学 Video service quality enhancement method under limitation of wireless self-organizing network bandwidth
CN110519606B (en) * 2019-08-22 2021-12-07 天津大学 Depth video intra-frame intelligent coding method
CN110493596B (en) * 2019-09-02 2021-09-17 西北工业大学 Video coding system and method based on neural network
CN110572710B (en) * 2019-09-25 2021-09-28 北京达佳互联信息技术有限公司 Video generation method, device, equipment and storage medium
US11445198B2 (en) * 2020-09-29 2022-09-13 Tencent America LLC Multi-quality video super resolution with micro-structured masks
CN112601095B (en) * 2020-11-19 2023-01-10 北京影谱科技股份有限公司 Method and system for creating fractional interpolation model of video brightness and chrominance
CN113365079B (en) * 2021-06-01 2023-05-30 闽南师范大学 Super-resolution network-based video coding sub-pixel motion compensation method
CN113822801B (en) * 2021-06-28 2023-08-18 浙江工商大学 Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network
CN114677652B (en) * 2022-05-30 2022-09-16 武汉博观智能科技有限公司 Illegal behavior monitoring method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104112263A (en) * 2014-06-28 2014-10-22 南京理工大学 Method for fusing full-color image and multispectral image based on deep neural network
CN106204449A (en) * 2016-07-06 2016-12-07 安徽工业大学 A kind of single image super resolution ratio reconstruction method based on symmetrical degree of depth network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3259920A1 (en) * 2015-02-19 2017-12-27 Magic Pony Technology Limited Visual processing using temporal and spatial interpolation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104112263A (en) * 2014-06-28 2014-10-22 南京理工大学 Method for fusing full-color image and multispectral image based on deep neural network
CN106204449A (en) * 2016-07-06 2016-12-07 安徽工业大学 A kind of single image super resolution ratio reconstruction method based on symmetrical degree of depth network

Also Published As

Publication number Publication date
CN108012157A (en) 2018-05-08

Similar Documents

Publication Publication Date Title
CN108012157B (en) Method for constructing convolutional neural network for video coding fractional pixel interpolation
Li et al. Learning convolutional networks for content-weighted image compression
CN107463989B (en) A kind of image based on deep learning goes compression artefacts method
Liu et al. Deep learning-based technology in responses to the joint call for proposals on video compression with capability beyond HEVC
Cheng et al. Performance comparison of convolutional autoencoders, generative adversarial networks and super-resolution for image compression
CN1719735A (en) Method or device for coding a sequence of source pictures
DE202012013410U1 (en) Image compression with SUB resolution images
CN110099280B (en) Video service quality enhancement method under limitation of wireless self-organizing network bandwidth
Zhang et al. Video compression artifact reduction via spatio-temporal multi-hypothesis prediction
Xiong et al. Sparse spatio-temporal representation with adaptive regularized dictionary learning for low bit-rate video coding
Wang et al. Multi-scale convolutional neural network-based intra prediction for video coding
US11962786B2 (en) Multi-stage block coding
CN111711817A (en) HEVC intra-frame coding compression performance optimization research combined with convolutional neural network
Hu et al. Fvc: An end-to-end framework towards deep video compression in feature space
CN114466192A (en) Image/video super-resolution
Lian et al. Reversing demosaicking and compression in color filter array image processing: Performance analysis and modeling
Wang et al. Neural network-based enhancement to inter prediction for video coding
CN116437102B (en) Method, system, equipment and storage medium for learning universal video coding
CN112637596B (en) Code rate control system
CN114202463A (en) Video super-resolution method and system for cloud fusion
CN101360236B (en) Wyner-ziv video encoding and decoding method
CN112584158B (en) Video quality enhancement method and system
CN112601095A (en) Method and system for creating fractional interpolation model of video brightness and chrominance
CN101389032A (en) Intra-frame predictive encoding method based on image value interposing
CN113709483B (en) Interpolation filter coefficient self-adaptive generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant