CN108012157A

CN108012157A - Method for constructing a convolutional neural network for fractional-pixel interpolation in video coding

Info

Publication number: CN108012157A
Application number: CN201711207766.7A
Authority: CN
Inventors: 宋利; 张翰; 杨小康
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2017-11-27
Filing date: 2017-11-27
Publication date: 2018-05-08
Anticipated expiration: 2037-11-27
Also published as: CN108012157B

Abstract

The present invention provides a method for constructing a convolutional neural network for fractional-pixel interpolation in video coding, comprising: collecting images with different content and resolutions to form an original training data set that covers different types of data and different coding complexities; preprocessing the original training data set to obtain training data that match the characteristics of fractional-pixel interpolation for inter prediction in video coding; building a deep convolutional neural network whose structure is suited to fractional-pixel interpolation for inter prediction; and feeding the preprocessed data into the constructed network, with the original training data set serving as the corresponding ground truth, to train the network. The invention ensures that the convolutional neural network can be trained, and that the fractional pixels interpolated by the trained network satisfy the requirements of fractional-pixel interpolation in video coding. Performing fractional-pixel interpolation with the invention improves video coding efficiency.

Description

Method for constructing a convolutional neural network for fractional-pixel interpolation in video coding

Technical field

The present invention relates to a method in the technical field of image processing, specifically a convolutional neural network method suited to fractional-pixel interpolation for inter prediction in video coding.

Background art

Inter prediction is a key technique in video coding standards. By exploiting the similarity of video content between frames, it effectively removes temporal redundancy and thereby improves compression efficiency. However, because digitization samples the scene discretely, real object motion does not necessarily follow the sampling grid. To improve the accuracy of motion prediction, video coding standards therefore express object motion in units of fractional pixels. Pixel values at fractional positions of the sampling grid do not actually exist; in practice, they must be interpolated from the pixel values at the integer positions, which do exist.

However, the interpolation filters currently used to generate fractional pixels in video coding are designed by hand on the basis of a priori assumptions. Their parameters are fixed, and as video content grows richer and video resolution keeps increasing, a filter with fixed parameters cannot suit all cases.

Deep learning fits large amounts of data with a designed neural network to obtain a generally applicable model. Deep-learning methods have achieved major breakthroughs in semantic-level problems such as object tracking and pedestrian detection, and have also markedly improved results in pixel-level problems such as image super-resolution.

Fractional-pixel interpolation for inter prediction resembles image super-resolution in that both generate a larger image from an existing smaller one at a given magnification. However, image super-resolution generates the entire high-resolution image from the low-resolution image, whereas fractional-pixel interpolation generates only the remaining fractional-position pixels from the existing integer-position pixels and must leave the integer-position pixels unchanged. In addition, because the pixels at fractional positions do not actually exist, there is no real ground truth to refer to when training a convolutional neural network, so training cannot proceed normally.

Summary of the invention

In view of the above shortcomings of the prior art, the present invention provides a method for constructing a convolutional neural network suited to fractional-pixel interpolation for inter prediction in video coding. The method draws on the strengths of convolutional neural networks that perform well on image super-resolution while taking the characteristics of fractional-pixel interpolation for inter prediction into account. It designs both a convolutional neural network suited to this task and a preprocessing operation that allows training to proceed, thereby improving the objective quality of reconstructed frames and the coding efficiency.

To achieve the above object, the method of the present invention for constructing a convolutional neural network for fractional-pixel interpolation in video coding comprises:

collecting images with different content and different resolutions to form an original training data set containing data of different types and different coding complexities;

preprocessing the collected original training data set to obtain training data that match the characteristics of fractional-pixel interpolation for inter prediction, these data serving as the input data for training the convolutional neural network;

building a deep convolutional neural network that takes the characteristics of fractional-pixel interpolation in video coding into account, to obtain a network structure suited to fractional-pixel interpolation for inter prediction;

feeding the preprocessed data into the constructed convolutional neural network, using the original training data set as the corresponding ground truth, and training the network to obtain a convolutional neural network model suited to fractional-pixel interpolation for inter prediction.

Preferably, the preprocessing operation proceeds as follows:

a) down-sample the images in the original training data set by the factor that corresponds to the fractional-pixel positions to be interpolated, obtaining the low-resolution training data for step b);

b) compress the low-resolution training data using the still-image coding configuration of the video coding standard, obtaining the low-resolution reconstructed images for step c);

c) up-sample the low-resolution reconstructed images by the factor used in step a), restoring the original image size and obtaining the input data for training the convolutional neural network.

More preferably, in step c), the up-sampling of the low-resolution reconstructed image ensures that the pixel values at the integer-pixel positions of the up-sampled high-resolution image are identical to those of the low-resolution reconstructed image before up-sampling.
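The three preprocessing steps can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation of the invention: plain decimation for step a), a coarse quantizer named fake_codec standing in for a real still-image encode and decode in step b), and neighbour averaging for the fractional positions in step c) are all assumptions, and every function name is hypothetical. What the sketch shows is the constraint placed on step c): samples at integer positions of the up-sampled image equal the reconstructed low-resolution samples.

```python
from typing import List, Tuple

Image = List[List[int]]


def downsample_2x(img: Image) -> Image:
    """Step a): keep every second sample in each direction (assumed decimation)."""
    return [row[::2] for row in img[::2]]


def fake_codec(img: Image, step: int = 4) -> Image:
    """Step b) stand-in: coarse quantization instead of a real encode/decode."""
    return [[min(255, (p // step) * step + step // 2) for p in row] for row in img]


def upsample_2x(img: Image) -> Image:
    """Step c): integer positions copy the input, other positions are averaged."""
    h, w = len(img), len(img[0])
    out = [[0] * (2 * w) for _ in range(2 * h)]
    for y in range(2 * h):
        for x in range(2 * w):
            y0, x0 = y // 2, x // 2
            y1 = min(y0 + (y % 2), h - 1)  # neighbour below, clamped at the border
            x1 = min(x0 + (x % 2), w - 1)  # neighbour to the right, clamped
            total = img[y0][x0] + img[y0][x1] + img[y1][x0] + img[y1][x1]
            out[y][x] = (total + 2) // 4   # equals img[y0][x0] at integer positions
    return out


def make_training_pair(original: Image) -> Tuple[Image, Image, Image]:
    """Return (network input, ground truth, low-resolution reconstruction)."""
    low = downsample_2x(original)
    recon = fake_codec(low)
    net_input = upsample_2x(recon)
    return net_input, original, recon
```

The original image serves as the ground truth, so the network input and its target have the same size whenever the original dimensions are even.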

Preferably, the deep convolutional neural network that is built comprises 20 weight layers and 1 weight mask layer. In the weight mask layer, W_I is the weight for integer-pixel positions and W_H is the weight for fractional-pixel positions; all fractional-pixel positions share one weight.

More preferably, in the fractional-pixel interpolation for inter prediction, the pixel values at integer-pixel positions remain unchanged and only the pixels at fractional-pixel positions are generated.
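One way to picture the weight mask layer is as a per-position weight map on the up-sampled grid. The sketch below is an assumption-laden illustration: the text does not state where in the network the mask is applied or how W_I and W_H are obtained, so the element-wise product, the grid layout (integer positions at rows and columns that are multiples of the scale), the function names, and the example values 0.0 and 1.0 are all hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def build_weight_mask(height: int, width: int, w_int: float, w_frac: float,
                      scale: int = 2) -> Map2D:
    """Per-position weights: w_int at integer positions, w_frac everywhere else.

    A position is treated as an integer-pixel position when both its row and
    its column are multiples of `scale` (assumed grid layout).
    """
    return [[w_int if (y % scale == 0 and x % scale == 0) else w_frac
             for x in range(width)]
            for y in range(height)]


def apply_mask(feature: Map2D, mask: Map2D) -> Map2D:
    """Element-wise product of a single-channel map with the mask."""
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature, mask)]


# Illustrative values only: a zero weight at integer positions suppresses any
# change there, while all fractional positions share the single weight 1.0.
mask = build_weight_mask(4, 4, w_int=0.0, w_frac=1.0)
masked = apply_mask([[5.0] * 4 for _ in range(4)], mask)
```

With these illustrative values, a quantity multiplied by the mask vanishes at integer positions, which is one simple way to keep integer-position pixels unchanged.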

Compared with the prior art, the beneficial effects of the invention are as follows:

Beyond exploiting the strong ability of deep convolutional neural networks to extract features from large amounts of data, the invention also takes into account the data characteristics specific to video coding and the features that distinguish fractional-pixel interpolation for inter prediction from image super-resolution. It redesigns the deep convolutional neural network and designs a matching preprocessing operation that ensures training can proceed. The resulting model is suited to fractional-pixel interpolation in video coding, improves the objective quality of the reconstructed video, and improves video coding efficiency.

Brief description of the drawings

Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the drawings:

Fig. 1 is a flowchart of the method according to one embodiment of the invention;

Fig. 2 is a schematic diagram of the convolutional neural network structure according to one embodiment of the invention;

Fig. 3 is a schematic diagram of the integer-pixel, half-pixel, and quarter-pixel positions according to one embodiment of the invention.

Detailed description of the embodiments

The invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of the invention.

The present invention provides a method for constructing a convolutional neural network for fractional-pixel interpolation in video coding. As shown in Fig. 1, the design is as follows:

collect images with different content and different resolutions to obtain a training data set containing data of different types and different coding complexities;

preprocess the collected training data set to obtain the input data for training the convolutional neural network. The preprocessing consists of steps a) to c) described above: down-sampling, compression coding, and up-sampling;

build a deep convolutional neural network suited to fractional-pixel interpolation for inter prediction in video coding, use the images obtained by preprocessing as the network input and the corresponding images in the original training data set as the ground truth, set the training parameters, and train the network;

use the trained convolutional neural network model to perform fractional-pixel interpolation, realizing convolutional-neural-network-based fractional-pixel interpolation for inter prediction in video coding.

In step b) of the preprocessing, the down-sampled low-resolution image is compressed using the still-image coding configuration of the video coding standard, so that the reconstructed low-resolution image carries the characteristics of encoded video data.

In step c) of the preprocessing, the up-sampling of the reconstructed low-resolution image must ensure that the pixel values at the integer-pixel positions of the up-sampled high-resolution image are identical to those of the low-resolution image before up-sampling; only the pixel values at fractional-pixel positions are generated.

Building on convolutional neural networks for image super-resolution, the invention takes into account the inherent property of fractional-pixel interpolation in video coding, namely that integer-position pixels remain unchanged and only fractional-position pixels are generated, and redesigns the network accordingly. Combined with the preprocessing described above, this ensures that the network can be trained and that the fractional pixels interpolated by the trained network satisfy the requirements of fractional-pixel interpolation in video coding, so that fractional-pixel interpolation with the invention improves video coding efficiency. In addition, the network obtained with the invention generates the pixel values of all fractional-pixel positions simultaneously in a single pass.

The invention is applied below to the latest video coding standard, High Efficiency Video Coding (HEVC). The construction of a convolutional neural network for half-pixel interpolation in HEVC inter prediction is described, with details of the data preprocessing and of the network structure. The invention can of course also be applied to other coding standards.

1. Data preprocessing

For the compression step applied to the low-resolution images during preprocessing, the down-sampled low-resolution images are encoded with the HEVC all-intra (AI) configuration.

For the up-sampling of the reconstructed low-resolution images during preprocessing, an interpolation filter based on the discrete cosine transform (DCT) is used. For half-pixel positions, the DCT-based interpolation filter is an 8-tap filter whose tap coefficients are listed in Table 1.

Table 1. Tap coefficients of the DCT-based interpolation filter

Index i	-3	-2	-1	0	1	2	3	4
hfilter[i]	-1	4	-11	40	40	-11	4	-1
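The taps in Table 1 sum to 64, so a filtered value has to be scaled back by 64 to stay in the sample range. The following one-dimensional Python sketch applies the taps to a row of integer samples and interleaves the results with the unchanged integer samples. It is a simplified illustration: border replication, rounding with an offset of 32, and clipping to the sample range are assumptions made here, and the function names are hypothetical.

```python
from typing import List

# Tap coefficients from Table 1, keyed by index i.
HFILTER = {-3: -1, -2: 4, -1: -11, 0: 40, 1: 40, 2: -11, 3: 4, 4: -1}


def half_sample(row: List[int], x: int, bit_depth: int = 8) -> int:
    """Half-pixel value between row[x] and row[x + 1].

    Border samples are replicated, and the result is normalized by 64 with
    rounding and then clipped; these choices are simplifications.
    """
    last = len(row) - 1
    acc = sum(c * row[min(max(x + i, 0), last)] for i, c in HFILTER.items())
    value = (acc + 32) >> 6
    return min(max(value, 0), (1 << bit_depth) - 1)


def upsample_row(row: List[int]) -> List[int]:
    """Interleave the integer samples (unchanged) with the half samples."""
    out: List[int] = []
    for x, p in enumerate(row):
        out.extend([p, half_sample(row, x)])
    return out
```

Because the integer samples are copied through untouched, every second sample of the up-sampled row equals the input, which is the property required of the up-sampling step.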

The half-pixel samples in Fig. 3 are generated with the DCT-based interpolation filter as follows:

where b_{0,j}, h_{i,0}, and j_{0,0} denote pixel values at half-pixel positions, A_{i,j} denotes a pixel value at an integer-pixel position, hfilter[i] denotes the tap coefficients of the DCT-based interpolation filter, and B denotes the bit depth of the pixel values.
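The symbols defined above match the usual description of HEVC half-sample luma interpolation. A sketch of the three relations in that form is given below; the equation numbers (1) to (3), the index ranges, and the shift amounts are assumptions based on that description and on the 8 taps of Table 1, not a transcription of the original formulas.

```latex
\begin{aligned}
b_{0,j} &= \Big(\sum_{i=-3}^{4} A_{i,j}\,\mathrm{hfilter}[i]\Big) \gg (B-8) && (1)\\
h_{i,0} &= \Big(\sum_{j=-3}^{4} A_{i,j}\,\mathrm{hfilter}[j]\Big) \gg (B-8) && (2)\\
j_{0,0} &= \Big(\sum_{j=-3}^{4} b_{0,j}\,\mathrm{hfilter}[j]\Big) \gg 6 && (3)
\end{aligned}
```

In this form, b and h are obtained by filtering integer samples along one direction, and j is obtained by filtering the intermediate b values along the other direction.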

2. Building the convolutional neural network structure

The invention takes as its basic framework the network of J. Kim et al., "Accurate Image Super-Resolution Using Very Deep Convolutional Networks", presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition, and adds a weight mask layer to the original framework, where W_I is the weight for integer-position pixel values and W_H is the weight for half-pixel-position pixel values.

As shown in Fig. 2, the convolutional neural network built in this embodiment comprises 20 convolutional layers and 1 weight mask layer. Except for the first and last convolutional layers, each convolutional layer contains 64 different filters of size 3 × 3 × 64. The first convolutional layer contains 64 filters of size 3 × 3 × 1, and the last convolutional layer contains 1 filter of size 3 × 3 × 64. In the weight mask layer, integer-pixel positions and fractional-pixel positions use different weights, where W_I is the weight for integer-pixel positions and W_H is the weight for half-pixel positions. The input to the network in this embodiment is the high-resolution image of the target size obtained from the low-resolution image by preprocessing. The network predicts the residual image between the final high-resolution output and the preprocessed input image, defined as follows:

R = Y_H - X_ILR    (4)

where Y_H denotes the final high-resolution output image and X_ILR denotes the preprocessed image that is initially input.

Adding the residual image predicted by the convolutional neural network to the preprocessed input image yields the final high-resolution output image.
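The layer dimensions stated above determine the number of convolution weights, and formula (4) determines how the output is assembled. The Python sketch below tabulates the 20 filter banks and adds a residual to a preprocessed input. Bias terms and the weight mask layer are left out of the count, clipping to the sample range is an assumption, and the function names are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int, int]  # (kernel_h, kernel_w, in_channels, out_channels)


def layer_shapes(depth: int = 20, channels: int = 64, k: int = 3) -> List[Shape]:
    """Filter-bank shape of every convolutional layer, first to last."""
    shapes: List[Shape] = [(k, k, 1, channels)]            # 64 filters of 3 x 3 x 1
    shapes += [(k, k, channels, channels)] * (depth - 2)   # 64 filters of 3 x 3 x 64
    shapes.append((k, k, channels, 1))                     # 1 filter of 3 x 3 x 64
    return shapes


def weight_count(shapes: List[Shape]) -> int:
    """Total number of convolution weights (biases not counted)."""
    return sum(kh * kw * cin * cout for kh, kw, cin, cout in shapes)


def add_residual(pre: List[List[int]], residual: List[List[int]],
                 bit_depth: int = 8) -> List[List[int]]:
    """Y_H = X_ILR + R per formula (4), clipped to the sample range (assumed)."""
    hi = (1 << bit_depth) - 1
    return [[min(max(x + r, 0), hi) for x, r in zip(x_row, r_row)]
            for x_row, r_row in zip(pre, residual)]
```

Under these assumptions the 20 layers hold 664704 convolution weights, almost all of them in the 18 middle layers.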

3. Training the convolutional neural network

The convolutional neural network is trained with the Euclidean distance as the loss function:

where θ denotes the set of parameters that the convolutional neural network must learn, X_i denotes a training image, each training image has a corresponding ground-truth image in the original training data set, and F(X_i; θ) denotes the final high-resolution output image. Because the convolutional neural network in this embodiment predicts a residual image, F(X_i; θ) in formula (5) should be expressed as:

where X_ILR denotes the preprocessed image that is initially input.
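Formulas (5) and (6) referred to above can be sketched as follows. This is a plausible form under stated assumptions rather than a transcription: the ground-truth symbol Y_i, the number of training samples N, the 1/N normalization, and the notation R(·; θ) for the residual predicted by the network are all introduced here for illustration.

```latex
\begin{aligned}
L(\theta) &= \frac{1}{N}\sum_{i=1}^{N}\big\lVert F(X_i;\theta) - Y_i \big\rVert_2^{2} && (5)\\
F(X_i;\theta) &= X_{ILR} + R(X_{ILR};\theta) && (6)
\end{aligned}
```

The second relation restates formula (4): the final output is the preprocessed input plus the residual predicted by the network.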

The above training yields a convolutional neural network model suited to fractional-pixel interpolation for inter prediction in video coding.

4. Results

The convolutional neural network model trained in this embodiment is integrated into the HEVC coding framework, and the test sequences are encoded with both the modified encoder and the standard HEVC encoder. The test sequences are listed in Table 2; all are in 4:2:0 YUV format with a bit depth of 8.

Table 2. Test sequence details

The HEVC encoder used in this embodiment is HM-16.7, configured with the low-delay P (LDP) common test configuration, and the quantization parameters (QP) used for encoding are 22, 27, 32, and 37. This embodiment applies the convolutional-neural-network-based fractional-pixel interpolation only to the luma (Y) component; the chroma components still use the standard interpolation filters to generate fractional pixels.

Under these conditions, the coding results shown in Table 3 were obtained. Table 3 reports BD-Rate, which expresses, relative to the standard HEVC encoder and at equal peak signal-to-noise ratio (PSNR), the percentage of bit rate saved when the convolutional neural network trained in this embodiment performs half-pixel interpolation for inter prediction; negative values indicate savings. As Table 3 shows, the average BD-Rate for the Y, U, and V components is -0.9%, -0.1%, and -0.1%, respectively. The gain is largest for the sequence BasketballPass, where the BD-Rate for the Y, U, and V components reaches -2.4%, -0.1%, and -1.6%. Table 3 thus shows that, compared with the standard HEVC encoder, half-pixel interpolation of the luma component with the network trained on the luma (Y) component in this embodiment improves coding efficiency. In addition, because the encoder predicts chroma components from the luma component, the chroma components also gain some coding performance as the reconstruction quality of the luma component improves.

Table 3. Coding performance on the test sequences (BD-Rate)
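For readers unfamiliar with the metric, BD-Rate is commonly computed by fitting a cubic to log bit rate as a function of PSNR for each encoder and averaging the gap between the two curves over their common PSNR range. The Python sketch below follows that common procedure for four rate points per encoder, matching the four QP values used here. It is an illustration, not the evaluation script behind Table 3; the function names are hypothetical and no measured data are included.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (bit rate, PSNR in dB)


def _lagrange(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Evaluate at x the polynomial that passes through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _mean_log_rate(points: List[Point], lo: float, hi: float) -> float:
    """Mean of the log10(rate)-versus-PSNR curve over the interval [lo, hi]."""
    psnr = [p for _, p in points]
    log_rate = [math.log10(r) for r, _ in points]

    def curve(x: float) -> float:
        return _lagrange(psnr, log_rate, x)

    # With four points the curve is a cubic, for which Simpson's rule is exact.
    integral = (hi - lo) / 6.0 * (curve(lo) + 4.0 * curve((lo + hi) / 2.0) + curve(hi))
    return integral / (hi - lo)


def bd_rate(anchor: List[Point], test: List[Point]) -> float:
    """BD-Rate in percent; negative values mean the test encoder saves bit rate."""
    lo = max(min(p for _, p in anchor), min(p for _, p in test))
    hi = min(max(p for _, p in anchor), max(p for _, p in test))
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(anchor, lo, hi)
    return (10.0 ** diff - 1.0) * 100.0
```

As a sanity check on the definition, an encoder that reaches the same PSNR values at 90% of the anchor's bit rate at every point has a BD-Rate of -10%.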

To further show that the convolutional neural network built by the invention is better suited to fractional-pixel interpolation for inter prediction, Table 4 shows the results of performing half-pixel interpolation directly with a convolutional neural network trained for image super-resolution, compared with the standard HEVC encoder. As Table 4 shows, using the image super-resolution network directly for fractional-pixel interpolation causes a clear loss of coding performance.

Table 4. Coding results with the image super-resolution convolutional neural network (BD-Rate)

In summary, the invention designs a dedicated convolutional neural network for fractional-pixel interpolation for inter prediction in video coding, together with a matching data preprocessing procedure, so that the network can be trained and the fractional pixels generated by the trained network meet the specific requirements of fractional-pixel interpolation. Fractional-pixel interpolation with the convolutional neural network obtained by the invention improves coding performance, and the network is better suited to the fractional-pixel interpolation stage of inter prediction in video coding.

Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A method for constructing a convolutional neural network for fractional-pixel interpolation in video coding, characterized in that the method comprises:

collecting images with different content and different resolutions to form an original training data set containing data of different types and different coding complexities;

preprocessing the collected original training data set to obtain training data that match the characteristics of fractional-pixel interpolation for inter prediction in video coding, these data serving as the input data for training the convolutional neural network;

building a deep convolutional neural network that takes the characteristics of fractional-pixel interpolation in video coding into account, to obtain a convolutional neural network structure suited to fractional-pixel interpolation for inter prediction in video coding;

feeding the preprocessed data into the constructed convolutional neural network, using the original training data set as the corresponding ground truth, and training the constructed network to obtain a convolutional neural network model suited to fractional-pixel interpolation for inter prediction in video coding.
2. The method for constructing a convolutional neural network for fractional-pixel interpolation in video coding according to claim 1, characterized in that the preprocessing operation proceeds as follows:

a) down-sampling the images in the original training data set by the factor that corresponds to the fractional-pixel positions to be interpolated, to obtain the low-resolution training data for step b);

b) encoding the low-resolution training data using the still-image coding configuration of the video coding standard, to obtain the low-resolution reconstructed images for step c);

c) up-sampling the low-resolution reconstructed images by the factor used in step a), restoring the original image size and obtaining the input data for training the convolutional neural network.
3. The method for constructing a convolutional neural network for fractional-pixel interpolation in video coding according to claim 2, characterized in that, in step c), the up-sampling of the low-resolution reconstructed image ensures that the pixel values at the integer-pixel positions of the up-sampled high-resolution image are identical to those of the low-resolution reconstructed image before up-sampling.
4. The method for constructing a convolutional neural network for fractional-pixel interpolation in video coding according to any one of claims 1 to 3, characterized in that the deep convolutional neural network that is built comprises 20 weight layers and 1 weight mask layer; in the weight mask layer, W_I is the weight for integer-pixel positions and W_H is the weight for fractional-pixel positions, and all fractional-pixel positions share one weight.
5. The method for constructing a convolutional neural network for fractional-pixel interpolation in video coding according to claim 4, characterized in that, in the fractional-pixel interpolation for inter prediction in video coding, the pixel values at integer-pixel positions remain unchanged and only the pixels at fractional-pixel positions are generated.
6. A use of the convolutional neural network model constructed by the method of any one of claims 1 to 5, characterized in that the convolutional neural network model is used to perform fractional-pixel interpolation, realizing convolutional-neural-network-based fractional-pixel interpolation for inter prediction in video coding.