CN108174218A

CN108174218A - Coding and decoding video frame based on study

Info

Publication number: CN108174218A
Application number: CN201810064012.9A
Authority: CN
Inventors: 陈志波; 何天宇; 金鑫; 刘森
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2018-01-23
Filing date: 2018-01-23
Publication date: 2018-06-15
Anticipated expiration: 2038-01-23
Also published as: CN108174218B

Abstract

The invention discloses a kind of coding and decoding video frame based on study, including：Space-time domain reconstructs memory, for storing the encoded reconstructing video content with after decoding；Space-time domain predicts network, for using the Space-time domain correlation of reconstructing video content, being modeled by convolutional neural networks and Recognition with Recurrent Neural Network to it, exporting the predicted value of present encoding block；Predicted value subtracts each other to form residual error with original value；Iterative analysis device and iteration synthesizer, step by step to inputting residual error encoding and decoding；Binaryzation device represents the output quantization of iterative analysis device for two-value；Entropy coder carries out entropy coding to the coding output after quantization, obtains output code flow afterwards；Entropy decoder carries out entropy decoding to output code flow, exports give iteration synthesizer afterwards.The coding framework realizes the prediction of Space-time domain by the VoxelCNN (Space-time domain prediction network) based on study, and the control of Video coding rate-distortion optimization is realized with the method that residual error iteration encodes.

Description

Coding and decoding video frame based on study

Technical field

The present invention relates to video coding and decoding technology field more particularly to a kind of coding and decoding video frames based on study.

Background technology

Existing image/video coding standard is such as：JPEG, H.261, MPEG-2, H.264, H.265, be based on mixing and compile Code frame.By development for many years, promotion being continuously increased along with complexity of coding efficiency is further compiled in existing mixing Coding efficiency, which is promoted, under code framework also faces more and more challenges.

But hybrid encoding frame usually realizes the Optimized Coding Based of image/video according to didactic method at present, it is more next It is more difficult to meet the complexity such as recognition of face instantly, target following, image retrieval, the media application demand of intelligence.

Invention content

The object of the present invention is to provide a kind of coding and decoding video frames based on study, can realize that video encoding rate is distorted The control of optimization.

The purpose of the present invention is what is be achieved through the following technical solutions：

A kind of coding and decoding video frame based on study, which is characterized in that including：Coding side and decoding end；Wherein encode End includes：Space-time domain reconstruct memory, Space-time domain prediction network, iterative analysis device, iteration synthesizer, binaryzation device, entropy coder And entropy decoder；

The Space-time domain reconstructs memory, for storing the encoded reconstructing video content with after decoding；

The Space-time domain predicts network, for using the Space-time domain correlation of reconstructing video content, passing through convolutional Neural net Network and Recognition with Recurrent Neural Network model it, export the predicted value of present encoding block；

The iterative analysis device, comprising convolutional neural networks and Recognition with Recurrent Neural Network structure, by the pre- survey grid of the Space-time domain The predicted value of network output, as inputting, exports the compression expression for the residual error with the original residual error to be formed of subtracting each other；

The iteration synthesizer comprising convolutional neural networks and Recognition with Recurrent Neural Network structure, receives entropy decoder decoding production The compression expression of raw above-mentioned residual error, and the predicted value of the Space-time domain prediction network output is superimposed, form reconstructing video content；

The iterative analysis device and iteration synthesizer, step by step to inputting residual error encoding and decoding, by increase code stream for cost by Step reduces the distortion level of residual error, realizes the coding of different distortion levels in the case of high-low code flow；

The binaryzation device represents the output quantization of iterative analysis device for two-value；

The entropy coder carries out entropy coding to the coding output after quantization, obtains output code flow afterwards；

The entropy decoder carries out entropy decoding to output code flow, exports give iteration synthesizer afterwards.

As seen from the above technical solution provided by the invention, Space-time domain prediction and residual error iteration coding staff are integrated with Method is realized the prediction of Space-time domain by the VoxelCNN (Space-time domain prediction network) based on study, and is encoded with residual error iteration Method realize Video coding rate-distortion optimization control.

Description of the drawings

In order to illustrate the technical solution of the embodiments of the present invention more clearly, required use in being described below to embodiment Attached drawing be briefly described, it should be apparent that, the accompanying drawings in the following description is only some embodiments of the present invention, for this For the those of ordinary skill in field, without creative efforts, other are can also be obtained according to these attached drawings Attached drawing.

Fig. 1 is a kind of coding and decoding video block schematic illustration based on study provided in an embodiment of the present invention；

Fig. 2 is the main processes schematic diagram of coding and decoding video frame provided in an embodiment of the present invention；

Fig. 3 is movement Interpolation Process schematic diagram provided in an embodiment of the present invention；

Fig. 4 extends process schematic for movement provided in an embodiment of the present invention.

Specific embodiment

With reference to the attached drawing in the embodiment of the present invention, the technical solution in the embodiment of the present invention is carried out clear, complete Ground describes, it is clear that described embodiment is only part of the embodiment of the present invention, instead of all the embodiments.Based on this The embodiment of invention, the every other implementation that those of ordinary skill in the art are obtained without making creative work Example, belongs to protection scope of the present invention.

The embodiment of the present invention provides a kind of coding and decoding video frame based on study, which mainly wraps It includes：Coding side and decoding end；As shown in Figure 1, wherein coding side mainly includes：Space-time domain reconstructs memory, Space-time domain predicts network, Iterative analysis device, iteration synthesizer, binaryzation device, entropy coder and entropy decoder；

The Space-time domain reconstructs memory, for storing the encoded reconstructing video content with after decoding, comprising having solved Decoded piece of the frame and present frame of code.Encoding-decoding process usually according to the forward direction (P-frame) of video time axis or Two-way (B-frame) is carried out, and each usual block-by-block of frame is according to sequence encoding and decoding from left to right, from top to bottom.

The Space-time domain prediction network (VoxelCNN), for using the Space-time domain correlation of reconstructing video content, passing through Convolutional neural networks and Recognition with Recurrent Neural Network model it, export the predicted value of present encoding block；Predicted value and original value phase Subtract to form residual error, and pass through iterative analysis device and be iterated coding with iteration synthesizer, realize rate-distortion optimization.

The entropy decoder after carrying out entropy decoding to output code flow, exports and gives iteration synthesizer.

Entropy coder, entropy decoder can utilize the methods of arithmetic coding/decoding based on context real in the embodiment of the present invention It is existing, i.e., using arithmetic encoder/decoder as entropy encoder/decoder.

In the embodiment of the present invention, the Space-time domain reconstruct memory, Space-time domain prediction network, iteration synthesizer and entropy decoder Form the decoder in coding side.

It will be understood by those skilled in the art that because decoding end is only capable of obtaining reconstructing video content rather than original video content, Therefore coding side includes decoding function to provide reconstructing video content for encoder reference.

In order to make it easy to understand, with reference to specific example as shown in Figure 2 to the main processing in coding and decoding video frame Process is described in detail.

In the embodiment of the present invention, the predicted value of the Space-time domain prediction network calculations encoding block includes movement synthesis with mixing Predict two processes.

1st, movement synthesis.

Movement is synthesized comprising interpolation and movement extension is moved, and is two kinds of different coding patterns, and in operation optionally wherein one Kind mode.

1) movement interpolation is according to adjacent two frame in reconstructing video content to obtain movement locus of object and interpolation is to adjacent two Between frame, as interpolation frame.As shown in figure 3, movement Interpolation Process is as follows：Enable v_x,v_y,x,Wherein (v_x,v_y) represent fortune Dynamic vector,Represent set of integers.Interpolation frame is denoted asAdjacent two frame is denoted as respectively in reconstructing video contentWithIt is logical Cross the motion vector (v that the operation of motion compensation that coding block size is m determines an encoding block centered on coordinate (x, y)_x, v_y), interpolation frameIn encoding block centered on (x, y)Value byIn withCentered on encoding blockDuplication obtains, and in this manner, can obtain a complete interpolation frameAnd as the output of movement interpolation operation.

2) movement, which is extended to, obtains movement locus of object by the front cross frame of reconstructing video content and extends back, so as to obtain Obtain an extension frameAs shown in figure 4, movement extension process is as follows：First in front cross frameWithIn, pass through encoding block ruler The very little operation of motion compensation for m determines the motion vector (v of an encoding block centered on coordinate (x, y)_x,v_y), extend frameIn encoding block centered on (x, y)Value byIn with (x-v_x,y-v_y) centered on encoding blockDuplication obtains, In this manner, a complete extension frame can be obtainedAnd as the output of movement extension operation.

2nd, hybrid predicting.

Hybrid predicting includes convolution and convolution LSTM structures, and (interpolation frame or extension frame being assumed, movement synthesized in Fig. 2 Done in journey movement extension operation, then referred to herein as extension frame), interpolation frame or extend frame front cross frame (With), with And positioned in present frame above present encoding block with decoded piece of left as inputting, video Space-time domain is believed by study The modeling of breath generates the predicted value of present encoding block in present frame；By iterative calculation, according to from left to right, from top to bottom Sequentially, the predicted value of present encoding block can be all generated each time, finally risk totality.

As illustrated in fig. 2, it is assumed that using movement stretched-out coding pattern, then in the case where moving extension mode, will extend two before frame Frame (With) and in present frame decoded piece of present encoding block upper left side (for each frame according to from upper Arrive down, sequence from left to right carries out encoding and decoding) as input；In the case where moving interpolation pattern, by the former frame of interpolation frame with after One frame (With) and decoded piece of present encoding block upper left side is used as input in present frame.Hybrid predicting By learning the modeling to video space time-domain information, the predicted value of present encoding block is generated；By iterative calculation, according on to Under, sequence from left to right can all generate the predicted value of present encoding block, finally risk totality each time.The embodiment of the present invention In, the predicted value of Space-time domain prediction network output subtracts each other to form residual error, and pass through iterative analysis device and synthesize with iteration with original value Device is iterated coding, and the optimization aim of Space-time domain prediction network is：

Wherein, B is the involved totalframes of optimization, and J is the coding number of blocks that each frame is total in reconstructing video content, Original value, the predicted value of j-th of encoding block in the i-th frame are corresponded to respectively.

In the embodiment of the present invention, optimization aim is equivalent to loss function, and Space-time domain prediction role of network is to generate predicted value, And cause this predicted value close to original value.

In the embodiment of the present invention, iterative analysis device is formed with iteration synthesizer comprising the S self-encoding encoders based on convolution S coding stage, constantly with synthesis to realize variable compression ratio, each stage changes iterative analysis for reconstruction value and desired value A compression for inputting residual error is generated for analyzer to express, compression expression forms output code flow, iteration point after quantization The optimization aim of parser and iteration synthesizer is expressed as：

Wherein,For the starting stage (i.e. the 1st stage) input residual error,It represents to input during n-th of stage residual Difference,Represent the output (i.e. the n stage expresses for inputting the compression of residual error) in n-th of stage.

In the embodiment of the present invention, iterative analysis device and iteration synthesizer are combined optimizations, in formulaActuallyThe output of iterative analysis device, binaryzation device, iteration synthesizer is have passed through, therefore parameter here contains iterative analysis All parameters in device, iteration synthesizer.

Said program provided in an embodiment of the present invention solves and is difficult to realize that movement is pre- by integration trainingt in neural network The problems such as survey, proposes VoxelCNN to model the Space-time domain priori of video content simultaneously, and integrated iterative analysis device/synthesizer, Binaryzation device, entropy encoder/decoder etc. realize the coding and decoding video based on study.In verification experimental verification, in no entropy coding In the case of device/decoder, the performance of this method has been more than Moving Picture Experts Group-2 encoder, has been reached and H.264 approximate effect.

The foregoing is only a preferred embodiment of the present invention, but protection scope of the present invention be not limited thereto, Any one skilled in the art is in the technical scope of present disclosure, the change or replacement that can readily occur in, It should be covered by the protection scope of the present invention.Therefore, protection scope of the present invention should be with the protection model of claims Subject to enclosing.

Claims

1. a kind of coding and decoding video frame based on study, which is characterized in that including：Coding side and decoding end；Wherein coding side Including：Space-time domain reconstruct memory, Space-time domain prediction network, iterative analysis device, iteration synthesizer, binaryzation device, entropy coder with And entropy decoder；

The Space-time domain predicts network, for using the Space-time domain correlation of reconstructing video content, by convolutional neural networks with And Recognition with Recurrent Neural Network models it, exports the predicted value of present encoding block；

The iterative analysis device, it is comprising convolutional neural networks and Recognition with Recurrent Neural Network structure, Space-time domain prediction network is defeated The predicted value gone out, as inputting, exports the compression expression for the residual error with the original residual error to be formed of subtracting each other；

The iteration synthesizer comprising convolutional neural networks and Recognition with Recurrent Neural Network structure, receives what entropy decoder decoding generated The compression expression of above-mentioned residual error, and the predicted value of the Space-time domain prediction network output is superimposed, form reconstructing video content；

The iterative analysis device and iteration synthesizer step by step to inputting residual error encoding and decoding, are gradually subtracted by increasing code stream for cost The distortion level of few residual error realizes the coding of different distortion levels in the case of high-low code flow；

A kind of 2. coding and decoding video frame based on study according to claim 1, which is characterized in that the Space-time domain weight Structure memory, Space-time domain prediction network, iteration synthesizer form the decoder in coding side with entropy decoder.

3. a kind of coding and decoding video frame based on study according to claim 1, which is characterized in that the Space-time domain is pre- The predicted value of survey grid network calculation code block includes movement synthesis and two processes of hybrid predicting, wherein：

Movement synthesizes movement interpolation or movement extension operation, and movement interpolation is to be obtained by adjacent two frame of reconstructing video content Movement locus of object and interpolation are between adjacent two frame, as interpolation frame；Movement is extended to by two before reconstructing video content Frame obtains movement locus of object and extends back, so as to obtain an extension frame；

Hybrid predicting includes convolution and convolution LSTM structures, interpolation frame is either extended two before frame, interpolation frame or extension frame Frame and positioned in present frame above present encoding block with decoded piece of left as inputting, by learning to video space The modeling of time-domain information generates the predicted value of present encoding block in present frame；By iterative calculation, each encoding block is finally obtained Predicted value.

A kind of 4. coding and decoding video frame based on study according to claim 3, which is characterized in that movement Interpolation Process It is as follows：Interpolation frame is denoted asAdjacent two frame is denoted as respectively in reconstructing video contentWithIt is m by coding block size Operation of motion compensation determine the motion vector (v of an encoding block centered on coordinate (x, y)_x,v_y), interpolation frameIn with Encoding block centered on (x, y)Value byIn withCentered on encoding blockDuplication obtains, according to Such method obtains a complete interpolation frame

A kind of 5. coding and decoding video frame based on study according to claim 3, which is characterized in that movement extension process It is as follows：In reconstructing video content front cross frameWithIn, determine one by the operation of motion compensation that coding block size is m Motion vector (the v of encoding block centered on coordinate (x, y)_x,v_y), extend frameIn encoding block centered on (x, y) Value byIn with (x-v_x,y-v_y) centered on encoding blockDuplication obtains, and in this manner, obtains one completely Extend frame

6. a kind of coding and decoding video frame based on study according to claim 1 or 3, which is characterized in that Space-time domain is pre- The predicted value of survey grid network output subtracts each other to form residual error with original value, and passes through iterative analysis device and be iterated volume with iteration synthesizer Code, the optimization aim of Space-time domain prediction network are：

Wherein, B is the involved totalframes of optimization, and J is the coding number of blocks that each frame is total in reconstructing video content, Respectively Correspond to original value, the predicted value of j-th of encoding block in the i-th frame.

7. a kind of coding and decoding video frame based on study according to claim 6, which is characterized in that iterative analysis device with Iteration synthesizer includes S coding stage of the S self-encoding encoder compositions based on convolution, and reconstruction value and desired value constantly change With synthesis to realize variable compression ratio, each stage iterative analysis device generates a compaction table for inputting residual error for generation analysis It reaches, compression expression forms output code flow after quantization, and iterative analyzer and the optimization aim of iteration synthesizer are expressed as：

Wherein,For the starting stage input residual error,Represent the residual error inputted during n-th of stage,Represent n-th of rank The output of section.