CN110177282B - Interframe prediction method based on SRCNN - Google Patents

Interframe prediction method based on SRCNN

Info

Publication number
CN110177282B
Authority
CN
China
Prior art keywords
image
frame
resolution
super
psnr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910388829.6A
Other languages
Chinese (zh)
Other versions
CN110177282A (en)
Inventor
颜成钢
黄智坤
李志胜
孙垚棋
张继勇
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN201910388829.6A
Publication of CN110177282A
Application granted
Publication of CN110177282B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124: Quantisation
    • H04N19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses an SRCNN-based interframe prediction method in which a super-resolution convolutional neural network is used for interframe prediction of an image sequence. After motion estimation and motion compensation are performed on the image sequence, a feature model is trained in combination with the super-resolution convolutional neural network. Using the parameters in this model, super-resolution reconstruction is performed on an image while motion estimation and motion compensation are applied, yielding an image consistent with the frame that follows the current image. The invention applies deep learning to interframe prediction in video coding, using a convolutional neural network to extract features from, and learn, the motion estimation and motion compensation operations between image sequences. Because a super-resolution network is used, image quality is also enhanced during image reconstruction.

Description

Interframe prediction method based on SRCNN
Technical Field
The invention belongs to the field of interframe prediction in video coding and mainly aims to improve video transmission efficiency; in particular, it relates to an interframe prediction method based on SRCNN.
Background
Super-resolution (Super-Resolution) means converting a low-resolution (Low-Resolution) image into a high-resolution (High-Resolution) image, generally improving image quality and definition. The Super-Resolution Convolutional Neural Network (SRCNN) is a convolutional neural network applied to image super-resolution reconstruction: it extracts features from image patches, applies a nonlinear mapping to those features, and reconstructs a high-resolution image. Convolutional neural networks have been widely used since they were proposed, and their accuracy and reliability are well verified.
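As an illustration of the three-layer structure described above, here is a minimal PyTorch sketch of SRCNN; the 9-1-5 kernel sizes and 64/32 channel widths follow the original SRCNN paper and are assumptions, since this document does not state the exact configuration used:

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN: patch extraction -> non-linear mapping -> reconstruction.

    Kernel sizes (9-1-5) and channel widths (64/32) follow the original SRCNN
    paper; the exact configuration used here is not specified by this document.
    """
    def __init__(self, channels: int = 1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Input: bicubic-upsampled luma image, shape (N, 1, H, W); output same size.
        return self.layers(x)
```

The padding keeps the spatial size unchanged, so the network maps an interpolated low-resolution image to a same-sized, sharper one, which is the role it plays in the reconstruction procedure described later.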
In today's information age, research and statistical data show that roughly 75% of the information humans acquire from the outside world comes through the eyes, where the visual system converts it into images that are transmitted to the brain. As living standards rise, people's requirements for image and video quality keep increasing, and the continuously rising resolution of images and videos poses great challenges for information transmission: sharper images and videos mean larger data volumes and require higher transmission rates. To ensure viewing comfort, the frame rate of movies and similar videos is nowadays generally above 24 frames per second; if every frame were stored and played back frame by frame, the demands on hard-disk capacity would be enormous, and the transmission and display rates of playback equipment would be severely challenged. Played back this way, high-definition video such as 2K or 4K could not exist because of transmission-rate limitations. Video coding technology eliminates redundancy among image sequences as far as possible, greatly compressing the data volume of video; together with existing hardware technology, it has brought ultra-high-definition video into everyday life and satisfies people's visual demands to the greatest extent.
Interframe prediction is one of the most important links in video coding. It achieves image compression by exploiting the correlation between video frames, i.e., temporal correlation, and is widely used in the compression coding of broadcast television, videoconferencing, video telephony and high-definition television. In image transmission, moving images, particularly television images, are the main object of interest. A moving image is a temporal image sequence of successive frames spaced one frame period apart in time; its correlation in time is greater than its correlation in space. Most television images show little detail change between adjacent frames, that is, video images have strong inter-frame correlation, and inter-frame coding that exploits this correlation can achieve a higher compression ratio than intra-frame coding.
In inter-frame predictive coding, there is a certain correlation between the scenes in adjacent frames of a moving picture. The moving image can therefore be divided into blocks or macroblocks; for each block or macroblock, its position in the adjacent frame is searched for, and the relative spatial offset between the two is obtained. This relative offset is commonly called a motion vector, and the process of obtaining it is called motion estimation. The motion vector and the prediction error obtained after motion matching are sent together to the decoder, which locates the corresponding block or macroblock in the decoded adjacent reference frame at the position indicated by the motion vector and adds the prediction error to obtain the block or macroblock of the current frame. Motion estimation removes inter-frame redundancy and thus greatly reduces the number of bits needed for video transmission, which makes it an important component of any video compression system. Starting from the general method of motion estimation, three key issues are discussed: how to parameterize the motion field, how to define the matching criterion, and how to find the best match.
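To make the motion-estimation step concrete, the sketch below performs an exhaustive block-matching search; the block size, search range, and SAD matching criterion are illustrative assumptions, since the text above fixes none of them:

```python
import numpy as np

def block_matching(cur: np.ndarray, ref: np.ndarray,
                   block: int = 16, search: int = 8) -> np.ndarray:
    """Full-search motion estimation: for each block of the current frame,
    find the best match in the reference frame within +/-search pixels,
    using the sum of absolute differences (SAD) as the matching criterion.
    Returns one motion vector (dy, dx) per block."""
    h, w = cur.shape
    mvs = np.zeros((h // block, w // block, 2), dtype=np.int32)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cur_blk = cur[by:by + block, bx:bx + block].astype(np.int64)
            best_sad, best_mv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue  # candidate block falls outside the frame
                    sad = np.abs(cur_blk - ref[y:y + block, x:x + block]).sum()
                    if best_sad is None or sad < best_sad:
                        best_sad, best_mv = sad, (dy, dx)
            mvs[by // block, bx // block] = best_mv
    return mvs
```

Motion compensation then copies, for each block, the reference block displaced by its motion vector; the difference between this prediction and the actual current frame is the prediction error (residual) sent to the decoder.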
Disclosure of Invention
The invention aims to provide an inter-frame prediction method based on SRCNN that differs from the mainstream HEVC video coding approach. The aim of the invention is to perform inter-frame prediction on an image sequence using a super-resolution convolutional neural network. After motion estimation and motion compensation are performed on the image sequence, a feature model is trained in combination with the super-resolution convolutional neural network. Using the parameters in this model, super-resolution reconstruction can be performed on an image while motion estimation and motion compensation are applied to it, yielding an image essentially consistent with the frame that follows the current image.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: collect a large number of video files of different scenes, and compress the videos with different quantization parameters (QPs);
step 2: extract image sequences from the videos, where the time interval between two consecutive frames is set as t, with t less than 0.1 second;
step 3: set aside part of the image sequences as the verification set. Read the remaining images frame by frame; for every frame of a read sequence except the first, compute the residual between the current frame and the previous frame, combine the previous frame with this residual, and perform motion compensation on the previous frame to obtain its predicted frame. Store the resulting predicted-frame sequence and split it into a training set and a test set at a ratio of 4:1.
step 4: input the training set and the test set, set appropriate hyperparameters, and train a parameter model using the super-resolution convolutional neural network (SRCNN);
step 5: for each image sequence in the verification set, calculate the peak signal-to-noise ratio (PSNR) between the i-th frame and the (i+1)-th frame, denoted PSNR1; read the parameters in the parameter model and process the i-th frame of the acquired image sequence to obtain a reconstructed image I; calculate the PSNR between the reconstructed image I and the i-th frame of the image sequence in the verification set, denoted PSNR2;
compare the two calculated PSNR values: if PSNR2 ≥ PSNR1, the model is considered effective;
if PSNR2 < PSNR1, the model is considered ineffective; let ERR = PSNR1 - PSNR2. If ERR < 5, the training hyperparameters are considered to be the problem: return to step 4, adjust the learning-rate hyperparameter, and retrain the parameter model. If ERR ≥ 5, the partitioning strategy of the dataset is considered to be the problem: return to step 3, expand the dataset so that it covers more scenes, re-split the training and test sets, and train and verify again;
if the difference between the two images is large and the PSNR value exceeds the lowest preset threshold, adjust the training and test sets;
if the difference between the two images is small and the PSNR value lies between the optimal preset threshold and the lowest preset threshold, return to step 4 to adjust the parameters of the super-resolution convolutional neural network and retrain the parameter model (this decision rule is sketched in code below).
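The decision rule of step 5 can be condensed into a short sketch; the 5 dB threshold and the PSNR1/PSNR2 comparison are taken directly from the text, while the function name and the returned labels are illustrative:

```python
def validate_model(psnr1: float, psnr2: float) -> str:
    """Decision rule of step 5: psnr1 = PSNR(frame i, frame i+1),
    psnr2 = PSNR(reconstructed image I, frame i)."""
    if psnr2 >= psnr1:
        return "model effective"
    err = psnr1 - psnr2
    if err < 5:
        # hyperparameter problem: back to step 4, adjust learning rate, retrain
        return "return to step 4: adjust learning rate and retrain"
    # dataset partitioning problem: back to step 3, expand dataset, re-split
    return "return to step 3: expand dataset, re-split, retrain"
```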
Reconstruction of an image using the parameter model is specifically realized as follows (a code sketch follows these four steps):
1. Convert the input low-resolution image to the YCbCr color space to obtain a grayscale (luma) image, which serves as the input i of the image reconstruction operation. Downsample image i with a downsampling step of k to obtain a low-dimensional image;
2. Apply bicubic interpolation to the low-dimensional image to enlarge it to the target size, i.e., the size of the input low-resolution image;
3. Read the parameters in the parameter model, including the weights and biases of each network node. Apply a nonlinear mapping to the interpolated image through the three-layer convolutional network to obtain the reconstructed result, image I;
4. Convert image I back to an RGB color image to obtain the reconstructed high-resolution image.
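A minimal sketch of these four steps, using Pillow and a trained three-layer network such as the SRCNN sketch above, might read as follows; the scale factor k, the 0-1 normalization, and the reuse of the original chroma planes when converting back to RGB are assumptions not fixed by the text:

```python
import numpy as np
import torch
from PIL import Image

def reconstruct(path: str, model: torch.nn.Module, k: int = 2) -> Image.Image:
    """Steps 1-4 above: YCbCr conversion, downsampling by k, bicubic
    enlargement back to the target size, nonlinear mapping of the luma
    channel, and conversion back to RGB. k and the chroma handling are
    illustrative assumptions."""
    rgb = Image.open(path).convert("RGB")
    y, cb, cr = rgb.convert("YCbCr").split()          # step 1: to YCbCr (luma = input i)
    w, h = y.size
    low = y.resize((w // k, h // k), Image.BICUBIC)   # step 1: downsample with step k
    up = low.resize((w, h), Image.BICUBIC)            # step 2: bicubic back to target size
    x = torch.from_numpy(np.asarray(up, dtype=np.float32) / 255.0)[None, None]
    with torch.no_grad():
        out = model(x).clamp(0.0, 1.0)                # step 3: three-layer conv mapping
    y_rec = Image.fromarray((out[0, 0].numpy() * 255.0).astype(np.uint8))
    return Image.merge("YCbCr", (y_rec, cb, cr)).convert("RGB")  # step 4: back to RGB
```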
The invention has the following beneficial effects:
the invention has the innovativeness that deep learning is applied to interframe prediction of video coding, and a convolutional neural network is used for carrying out feature extraction and training learning on motion estimation and motion compensation operation among image sequences. Meanwhile, the super-resolution neural network is used, so that the image quality of the image can be enhanced during image reconstruction.
Drawings
FIG. 1 is a schematic diagram of a super-resolution convolutional neural network SRCNN;
FIG. 2 is a feature model training flow diagram of the present invention.
Detailed Description
The invention mainly concerns algorithmic innovation in the interframe prediction stage of video coding. The training flow of the whole model is introduced in detail below, and the specific implementation steps of the invention are explained with reference to the drawings, so that the purpose and effect of the invention become clear.
Fig. 1 is a schematic diagram of the super-resolution convolutional neural network SRCNN. As the figure shows, the network has a simple structure and enhances image quality through nonlinear mapping and image reconstruction. Using this network, the resolution of the images can be improved while inter-frame prediction is performed on the image sequence.
FIG. 2 is a flowchart of feature model training according to the present invention, wherein the specific operations include:
1. Collect a large number of video files in YUV format covering various scenes.
2. Compress the video files using different quantization parameters; the higher the quantization parameter, the stronger the compression. The focus is mainly on quantization parameters between 28 and 42.
3. Extract image sequences from the video files, taking different numbers of images from videos of different durations so that the sampling interval stays consistent (see the sampling sketch after this list). To ensure that consecutive frames do not change much, the extraction interval is kept small and is set according to the length of the video.
4. Perform motion estimation and motion compensation on each extracted image: input the current frame and the next frame, and derive the estimate and compensation for the current frame by comparing the two.
5. Organize the training set and test set from the processed image sequences. The verification set used to validate the model must consist of image sequences that have not undergone motion estimation and motion compensation.
6. Input the training set and test set, set appropriate parameters, and train the model using the super-resolution convolutional neural network SRCNN.
7. Verify whether the trained model is effective by comparing the originally extracted next frame with the image reconstructed from the model parameters. If the two images are almost indistinguishable, the model can be considered effective. If there is an obvious difference, adjust according to the situation: if the difference is large, adjust the dataset and retrain the model; if the difference is small but the imaging quality needs improvement, adjust the network parameters and retrain until the model meets the requirements.
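As referenced in item 3, a minimal frame-sampling sketch (using OpenCV; the 0.04 s default interval is an assumption chosen to stay under the 0.1 s bound of step 2) could look like:

```python
import cv2

def extract_frames(video_path: str, interval_s: float = 0.04) -> list:
    """Sample frames at a fixed time interval so the spacing stays
    consistent across videos of different durations."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, round(fps * interval_s))  # frames to skip between samples
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```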
When comparing the generated image with the next frame of the original image, subjective visual judgment must be combined with objective numerical analysis. Subjectively, the two frames are inspected by eye; if there is little visible difference, the model can provisionally be considered effective. However, since adjacent frames of the original video differ little to begin with, a mathematical tool is also needed to compare the two images. The reconstruction quality can be evaluated objectively with the peak signal-to-noise ratio (PSNR), an objective criterion for image evaluation, expressed as follows:
$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$
where MSE is the mean squared error between the two images and MAX_I is the maximum possible pixel value (255 for 8-bit images). PSNR values between the original image and its next frame, and between the original image and the reconstructed image, are calculated separately. If the two values are close, the model works well and has reconstructed essentially the same picture as the next frame of the original. If the latter PSNR is higher, the procedure is considered to have improved image quality in addition to performing inter-frame prediction.
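For reference, a direct implementation of the formula above might look like the following sketch (assuming 8-bit images, so MAX_I = 255; the function name is illustrative):

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio: 10 * log10(MAX_I^2 / MSE)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

PSNR1 and PSNR2 from step 5 would then be psnr(frame_i, frame_i_plus_1) and psnr(reconstructed_I, frame_i), respectively.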
With PSNR, the accuracy of the model can additionally be verified objectively, which reduces the workload and helps ensure that the scheme is implemented efficiently.

Claims (1)

1. An interframe prediction method based on SRCNN is characterized in that a super-resolution convolutional neural network is used for interframe prediction of an image sequence; after motion estimation and motion compensation operations are carried out on the image sequence, a characteristic model is trained by combining a super-resolution convolutional neural network; performing super-resolution reconstruction on the image by using parameters in the model, and performing motion estimation and motion compensation on the image to obtain an image consistent with the next frame of image of the current image;
the specific implementation comprises the following steps:
step 1: collecting a large number of video files of different scenes, and compressing the video according to different quantization parameters;
step 2: extracting image sequences from a video, wherein the time interval between two consecutive frames is set as t, and t is less than 0.1 second;
step 3: dividing part of the image sequences into a verification set; reading the remaining image sequences frame by frame; for each frame of a read image sequence except the first frame, calculating a residual between the current frame and the previous frame, combining the previous frame with the residual, and performing motion compensation on the previous frame to obtain a predicted frame of the previous frame; storing the calculated predicted-frame image sequence and dividing it into a training set and a test set at a ratio of 4:1;
step 4: inputting the training set and the test set, setting hyperparameters, and training a parameter model using the super-resolution convolutional neural network;
step 5: calculating the peak signal-to-noise ratio (PSNR) between the i-th frame and the (i+1)-th frame of each image sequence in the verification set, denoted PSNR1; reading the parameters in the parameter model to process the i-th frame of the acquired image sequence, obtaining a reconstructed image I; calculating the PSNR between the reconstructed image I and the i-th frame of the image sequence in the verification set, denoted PSNR2;
comparing the two calculated PSNR values: if PSNR2 ≥ PSNR1, the model is considered effective;
if PSNR2 < PSNR1, the model is considered ineffective; letting ERR = PSNR1 - PSNR2: if ERR < 5, the training hyperparameters are considered to be the problem, and the method returns to step 4, adjusts the learning-rate hyperparameter, and retrains the parameter model; if ERR ≥ 5, the partitioning strategy of the dataset is considered to be the problem, and the method returns to step 3, expands the dataset to cover more scenes, re-divides the training set and the test set, and trains and verifies again;
if the difference between the two images is large and the PSNR value exceeds the lowest preset threshold, adjusting the training set and the test set;
if the difference between the two images is small and the PSNR value lies between the optimal preset threshold and the lowest preset threshold, returning to step 4 to adjust the parameters of the super-resolution convolutional neural network and retrain the parameter model;
the reconstruction of an image using a parametric model is specifically realized as follows:
1. converting the input low-resolution image into the YCbCr color space to obtain a grayscale image, which serves as the input image i of the image reconstruction operation; downsampling the input image i with a downsampling step of k to obtain a low-dimensional image;
2. applying bicubic interpolation to the low-dimensional image to enlarge it to the target size, namely the size of the input low-resolution image;
3. reading the parameters in the parameter model, including the weights and biases of each network node; applying a nonlinear mapping to the interpolated image through the three-layer convolutional network to obtain the reconstructed image I;
4. converting the image I back into an RGB color image to obtain the reconstructed high-resolution image.
CN201910388829.6A 2019-05-10 2019-05-10 Interframe prediction method based on SRCNN Active CN110177282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910388829.6A CN110177282B (en) 2019-05-10 2019-05-10 Interframe prediction method based on SRCNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910388829.6A CN110177282B (en) 2019-05-10 2019-05-10 Interframe prediction method based on SRCNN

Publications (2)

Publication Number Publication Date
CN110177282A CN110177282A (en) 2019-08-27
CN110177282B (en) 2021-06-04

Family

ID=67690836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910388829.6A Active CN110177282B (en) 2019-05-10 2019-05-10 Interframe prediction method based on SRCNN

Country Status (1)

Country Link
CN (1) CN110177282B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112155511B (en) * 2020-09-30 2023-06-30 广东唯仁医疗科技有限公司 Method for compensating human eye shake in OCT acquisition process based on deep learning
CN112601095B (en) * 2020-11-19 2023-01-10 北京影谱科技股份有限公司 Method and system for creating fractional interpolation model of video brightness and chrominance
CN113191945B (en) * 2020-12-03 2023-10-27 陕西师范大学 Heterogeneous platform-oriented high-energy-efficiency image super-resolution system and method thereof
CN113592719B (en) * 2021-08-14 2023-11-28 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video processing method and corresponding equipment
CN117313818A (en) * 2023-09-28 2023-12-29 四川大学 Method for training lightweight convolutional neural network and terminal equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133919A (en) * 2017-05-16 2017-09-05 西安电子科技大学 Time dimension video super-resolution method based on deep learning
CN108012157A (en) * 2017-11-27 2018-05-08 上海交通大学 Construction method for the convolutional neural networks of Video coding fractional pixel interpolation
CN108805808A (en) * 2018-04-04 2018-11-13 东南大学 A method of improving video resolution using convolutional neural networks
CN109087243A (en) * 2018-06-29 2018-12-25 中山大学 A kind of video super-resolution generation method generating confrontation network based on depth convolution

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10733714B2 (en) * 2017-11-09 2020-08-04 Samsung Electronics Co., Ltd Method and apparatus for video super resolution using convolutional neural network with two-stage motion compensation

Also Published As

Publication number Publication date
CN110177282A (en) 2019-08-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant