CN110741640A - Optical flow estimation for motion compensated prediction in video coding - Google Patents

Optical flow estimation for motion compensated prediction in video coding

Info

Publication number
CN110741640A
CN110741640A (Application CN201880036783.5A)
Authority
CN
China
Prior art keywords
reference frame
frame portion
frame
warped
optical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880036783.5A
Other languages
Chinese (zh)
Other versions
CN110741640B (en)
Inventor
许耀武 (Yaowu Xu)
李博晗 (Bohan Li)
韩敬宁 (Jingning Han)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 15/683,684 (US11284107B2)
Priority claimed from US 15/817,369 (US10659788B2)
Application filed by Google LLC
Priority to CN202410264512.2A (published as CN118055253A)
Publication of CN110741640A
Application granted
Publication of CN110741640B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 ... using predictive coding
    • H04N 19/503 ... using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/537 Motion estimation other than block-based
    • H04N 19/10 ... using adaptive coding
    • H04N 19/102 ... characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/103 Selection of coding mode or of prediction mode
    • H04N 19/105 Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
    • H04N 19/134 ... characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/157 Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N 19/159 Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H04N 19/189 ... characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
    • H04N 19/19 ... using optimisation based on Lagrange multipliers
    • H04N 19/513 Processing of motion vectors
    • H04N 19/517 Processing of motion vectors by encoding
    • H04N 19/52 Processing of motion vectors by encoding by predictive encoding
    • H04N 19/573 Motion compensation with multiple frame prediction using two or more reference frames in a given prediction direction
    • H04N 19/577 Motion compensation with bidirectional frame interpolation, i.e. using B-pictures

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The motion field is used to warp some or all of the pixels of the reference frame into pixels of the current frame.

Description

Optical flow estimation for motion compensated prediction in video coding
Background
Digital video streams may represent video using a series of frames or still images. Digital video may be used for a variety of applications including, for example, video conferencing, high-definition video entertainment, video advertising, or sharing of user-generated video.
Some compression techniques use a reference frame to generate a prediction block corresponding to a current block to be encoded. Rather than the values of the current block itself, the difference between the prediction block and the current block may be encoded to reduce the amount of data encoded.
Disclosure of Invention
The present disclosure relates generally to encoding and decoding video data, and more particularly to using block-based optical flow estimation for motion compensated prediction in video compression. Frame-level optical flow estimation is also described, which can interpolate a co-located reference frame for motion compensated prediction in video compression.
A method according to an embodiment of the present disclosure determines a first frame portion of a first frame to be predicted, the first frame being in a video sequence; determines a first reference frame from the video sequence for forward inter prediction of the first frame; determines a second reference frame from the video sequence for backward inter prediction of the first frame; generates an optical flow reference frame portion for inter prediction of the first frame portion by performing optical flow estimation using the first reference frame and the second reference frame; and performs a prediction process for the first frame portion using the optical flow reference frame portion.
An apparatus according to an embodiment of the present disclosure includes a non-transitory storage medium or memory and a processor, the medium including instructions executable by the processor to perform a method including determining a first frame to be predicted in a video sequence, and determining availability of a first reference frame for forward inter prediction of the first frame and a second reference frame for backward inter prediction of the first frame. The method further includes, in response to determining availability of both the first reference frame and the second reference frame: generating respective motion fields for pixels of a first frame portion using the first reference frame and the second reference frame as inputs to an optical flow estimation process; warping a first reference frame portion to the first frame portion using the motion fields to form a first warped reference frame portion, the first reference frame portion including pixels of the first reference frame co-located with pixels of the first frame portion; warping a second reference frame portion to the first frame portion using the motion fields to form a second warped reference frame portion, the second reference frame portion including pixels of the second reference frame co-located with pixels of the first frame portion; and blending the first warped reference frame portion and the second warped reference frame portion to form an optical flow reference frame portion.
Another apparatus according to an embodiment of the present disclosure also includes a non-transitory storage medium or memory and a processor, the medium including instructions executable by the processor to perform a method including generating an optical flow reference frame portion for inter prediction of a block of a first frame of a video sequence using a first reference frame from the video sequence and a second reference frame from the video sequence. Generating the optical flow reference frame portion includes initializing a motion field for pixels of a first frame portion for optical flow estimation at a first processing level of a plurality of processing levels, the motion field being downscaled within the first frame portion for that level, and, for each level of the plurality of levels: warping a first reference frame portion to the first frame portion using the motion field to form a first warped reference frame portion; warping a second reference frame portion to the first frame portion using the motion field to form a second warped reference frame portion; estimating the motion field between the first warped reference frame portion and the second warped reference frame portion using optical flow estimation; and refining the motion field for the next level using the estimated motion field. The first warped reference frame portion and the second warped reference frame portion resulting from the final motion field are blended to form the optical flow reference frame portion, which is used in a prediction process for the first frame portion.
Another apparatus according to embodiments of the disclosure includes a non-transitory storage medium or memory and a processor, the medium including instructions executable by the processor to perform a method including determining a first frame portion of a first frame to be predicted, the first frame being in a video sequence; determining a first reference frame from the video sequence for forward inter prediction of the first frame; determining a second reference frame from the video sequence for backward inter prediction of the first frame; generating an optical flow reference frame portion for inter prediction of the first frame portion by performing optical flow estimation using the first reference frame and the second reference frame; and performing a prediction process for the first frame portion using the optical flow reference frame portion.
These and other aspects of the disclosure are disclosed in the following detailed description of the embodiments, the appended claims, and the accompanying drawings.
Drawings
The description herein makes reference to the following drawings wherein like reference numerals refer to like parts throughout the several views unless otherwise specified.
Fig. 1 is a schematic diagram of a video encoding and decoding system.
Fig. 2 is a block diagram of an example of a computing device that may implement a transmitting station or a receiving station.
Fig. 3 is a diagram of a typical video stream to be encoded and subsequently decoded.
Fig. 4 is a block diagram of an encoder according to an embodiment of the present disclosure.
Fig. 5 is a block diagram of a decoder according to an embodiment of the present disclosure.
FIG. 6 is a block diagram of an example of a reference frame buffer.
Fig. 7 is a diagram of frame groups in the display order of a video sequence.
Fig. 8 is a diagram of an example of a coding order of the frame group of fig. 7.
Fig. 9 is a diagram for explaining the linear projection of motion fields according to the teachings herein.
FIG. 10 is a flow diagram of a process for motion compensated prediction of video frames using at least portions of reference frames generated using optical flow estimation.
FIG. 11 is a flow diagram of a process for generating optical flow reference frame portions.
FIG. 12 is a flow diagram of another process for generating optical flow reference frame portions.
Fig. 13 is a diagram illustrating the processes of fig. 11 and 12.
FIG. 14 is a diagram illustrating object occlusion.
Fig. 15 is a diagram illustrating a technique of optimizing a decoder.
Detailed Description
A video stream may be encoded into a bitstream, which involves compression; the bitstream is then transmitted to a decoder, which may decode or decompress the video stream to prepare it for viewing or further processing.
In some embodiments, a reference frame may be located before or after the current frame in the video stream sequence and may be reconstructed before being used as a reference frame. A number of reference frames may be available for encoding or decoding blocks of the current frame of the video sequence. One is a frame that may be referred to as a golden frame. Another is a most recently encoded or decoded frame. Another is a substitute reference frame that is encoded or decoded before one or more frames in the sequence but displayed after those frames in output display order.
In this technique, the pixels that form the prediction block are obtained directly from one or more of the available reference frames.
To more fully utilize motion information from available bidirectional reference frames (e.g., one or more forward reference frames and one or more backward reference frames), embodiments of the teachings herein describe a reference frame portion co-located with a portion of the current coded frame that uses a per-pixel motion field computed by optical flow to estimate the true motion activity in the video signal.
Further details of interpolating reference frame portions using optical flow estimation for video compression and reconstruction are described herein, initially with reference to a system in which the teachings herein can be implemented.
Fig. 1 is a schematic diagram of a video encoding and decoding system 100. For example, transmitting station 102 may be a computer having an internal hardware configuration such as that described in fig. 2. However, other suitable implementations of transmitting station 102 are possible. For example, the processing of transmitting station 102 may be distributed among multiple devices.
Network 104 may connect transmitting station 102 and receiving station 106 for encoding and decoding of the video stream. Specifically, the video stream may be encoded in transmitting station 102, and the encoded video stream may be decoded in receiving station 106. Network 104 may be, for example, the Internet. Network 104 may also be a local area network (LAN), a wide area network (WAN), a virtual private network (VPN), a cellular telephone network, or any other means of transferring the video stream from transmitting station 102 to receiving station 106 in this example.
In one example, the receiving station 106 may be a computer having an internal hardware configuration such as that depicted in FIG. 2. However, other suitable implementations of the receiving station 106 are possible.
Other implementations of the video encoding and decoding system 100 are possible. For example, an implementation may omit the network 104. In another implementation, the video stream may be encoded and then stored for transmission at a later time to the receiving station 106 or any other device having a non-transitory storage medium or memory. In one implementation, the receiving station 106 receives (e.g., via the network 104, a computer bus, and/or some communication pathway) the encoded video stream and stores the video stream for later decoding. In an example implementation, the real-time transport protocol (RTP) is used to transmit the encoded video over the network 104. In another implementation, a transport protocol other than RTP may be used, for example, a Hypertext Transfer Protocol (HTTP)-based video streaming protocol.
When used in a video conferencing system, for example, transmitting station 102 and/or receiving station 106 may include the ability to both encode and decode a video stream as described below. For example, receiving station 106 may be a video conference participant that receives an encoded video bitstream from a video conference server (e.g., transmitting station 102) to decode and view, and further encodes and transmits its own video bitstream to the video conference server for decoding and viewing by other participants.
FIG. 2 is a block diagram of an example of a computing device 200 that may implement a transmitting station or a receiving station. For example, computing device 200 may implement one or both of transmitting station 102 and receiving station 106 of FIG. 1. Computing device 200 may be in the form of a computing system including multiple computing devices, or in the form of a single computing device, such as a mobile phone, a tablet computer, a laptop computer, a notebook computer, a desktop computer, and the like.
Although the disclosed embodiments may be practiced with a single processor as shown, for example the CPU 202, advantages in speed and efficiency may be achieved by using more than one processor.
In an embodiment, memory 204 in computing device 200 may be a read-only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device or non-transitory storage medium may be used as memory 204. Memory 204 may include code and data 206 accessed by CPU 202 using bus 212. Memory 204 may further include an operating system 208 and applications 210, the applications 210 including at least one program that permits CPU 202 to perform the methods described herein. For example, applications 210 may include applications 1 through N, which further include a video coding application that performs the methods described herein. Computing device 200 may also include secondary storage 214, which may be, for example, a memory card used with a mobile computing device. Because video communication sessions may contain a significant amount of information, they may be stored in whole or in part in secondary storage 214 and loaded into memory 204 as needed for processing.
Computing device 200 may also include one or more output devices, such as a display 218. In one example, display 218 may be a touch-sensitive display that combines a display with a touch-sensitive element operable to sense touch inputs. Display 218 may be coupled to CPU 202 via bus 212. Other output devices that permit a user to program or otherwise use computing device 200 may be provided in addition to or as an alternative to display 218. When the output device is or includes a display, the display may be implemented in various ways, including as a liquid crystal display (LCD), a cathode-ray tube (CRT) display, or a light-emitting diode (LED) display, such as an organic LED (OLED) display.
Computing device 200 may also include or be in communication with an image sensing device 220, for example, a camera or any other image sensing device 220 now existing or later developed that is capable of sensing images, such as images of a user operating computing device 200. The image sensing device 220 may be positioned such that it is facing a user operating the computing device 200. In an example, the position and optical axis of the image sensing device 220 may be configured such that the field of view includes an area directly adjacent to the display 218 and visible from the display 218.
Computing device 200 may also include a sound sensing device 222 or be in communication with sound sensing device 222, such as a microphone or any other sound sensing device now existing or later developed that is capable of sensing sound in the vicinity of computing device 200. The sound sensing device 222 can be positioned toward a user operating the computing device 200 and can be configured to receive sound, such as speech or other utterances made by the user while the user is operating the computing device 200.
Although FIG. 2 depicts the CPU 202 and memory 204 of the computing device 200 as being integrated into a single unit, other configurations may be utilized. The operations of the CPU 202 may be distributed across multiple machines (where a single machine may have one or more processors) that may be coupled directly or across a local or other network. The memory 204 may be distributed across multiple machines, such as a network-based memory or memory in multiple machines performing the operations of the computing device 200. Although depicted here as a single bus, the bus 212 of the computing device 200 may be composed of multiple buses. Further, the secondary storage 214 may be directly coupled to the other components of the computing device 200 or may be accessed via a network, and may comprise an integrated unit such as a memory card or multiple units such as multiple memory cards.
FIG. 3 is a diagram of an example of a video stream 300 to be encoded and subsequently decoded. The video stream 300 includes a video sequence 302. At the next level, the video sequence 302 includes a number of adjacent frames 304. Although three frames are depicted as the adjacent frames 304, the video sequence 302 may include any number of adjacent frames 304. The adjacent frames 304 may then be further subdivided into individual frames, e.g., a frame 306. At the next level, the frame 306 may be divided into a series of planes or segments 308. The segments 308 may be, for example, subsets of frames that permit parallel processing.
Whether or not the frame 306 is divided into segments 308, the frame 306 may be further subdivided into blocks 310, which may contain data corresponding to, for example, 16×16 pixels in the frame 306. The blocks 310 may also be arranged to include data from one or more segments 308 of pixel data.
FIG. 4 is a block diagram of an encoder 400 in accordance with an embodiment of the present disclosure. As described above, encoder 400 may be implemented in transmitting station 102, such as by providing a computer software program stored in a memory, such as memory 204. The computer software program may include machine instructions that, when executed by a processor such as CPU 202, cause transmitting station 102 to encode video data in the manner described in FIG. 4. Encoder 400 may also be implemented as dedicated hardware included in, for example, transmitting station 102. In one particularly desirable embodiment, encoder 400 is a hardware encoder.
The encoder 400 has the following stages for performing various functions in the forward path (represented by the solid connecting lines) to generate an encoded or compressed bitstream 420 using the video stream 300 as input: an intra/inter prediction stage 402, a transform stage 404, a quantization stage 406, and an entropy coding stage 408. The encoder 400 may also include a reconstruction path (represented by dashed connected lines) for reconstructing the frame for encoding of future blocks. In fig. 4, the encoder 400 has the following stages for performing various functions in the reconstruction path: a dequantization stage 410, an inverse transform stage 412, a reconstruction stage 414, and a loop filtering stage 416. Other structural variations of the encoder 400 may be used to encode the video stream 300.
In either case, a prediction block may be formed.
Referring still to FIG. 4, the prediction block is next subtracted from the current block at the intra/inter prediction stage 402 to produce a residual block (also referred to as a residual). The transform stage 404 transforms the residual into transform coefficients in, for example, the frequency domain using a block-based transform. The quantization stage 406 converts the transform coefficients into discrete magnitude values, referred to as quantized transform coefficients, using a quantizer value or quantization level.
The reconstruction path in fig. 4 (shown with dashed lines) may be used to ensure that the encoder 400 and decoder 500 (described below) use the same reference frame to decode the compressed bitstream 420. The reconstruction path performs functions similar to those occurring during the decoding process described in more detail below, including dequantizing the quantized transform coefficients in a dequantization stage 410 and inverse transforming the dequantized transform coefficients in an inverse transform stage 412 to produce a block of derivative residues (also referred to as derivative residues). In the reconstruction stage 414, the predicted block predicted in the intra/inter prediction stage 402 may be added to the derivative residual to create a reconstructed block. Loop filtering stage 416 may be applied to the reconstructed block to reduce distortion such as blocking artifacts.
For example, a non-transform-based encoder may quantize the residual signal directly without the transform stage 404 for certain blocks or frames. In another implementation, an encoder may have the quantization stage 406 and the dequantization stage 410 combined into a common stage.
Fig. 5 is a block diagram of a decoder 500 according to an embodiment of the present disclosure. The decoder 500 may be implemented in the receiving station 106, for example, by providing a computer software program stored in the memory 204. The computer software program may include machine instructions that, when executed by a processor such as CPU202, cause receiving station 106 to decode video data in the manner described in fig. 5. Decoder 500 may also be implemented in hardware included in, for example, transmitting station 102 or receiving station 106.
Similar to the reconstruction path of the encoder 400 described above, in one example the decoder 500 includes the following stages for performing various functions to produce an output video stream 516 from the compressed bitstream 420: an entropy decoding stage 502, a dequantization stage 504, an inverse transform stage 506, an intra/inter prediction stage 508, a reconstruction stage 510, a loop filtering stage 512, and a deblocking filtering stage 514. Other structural variations of the decoder 500 may be used to decode the compressed bitstream 420.
When the compressed bitstream 420 is presented for decoding, data elements within the compressed bitstream 420 may be decoded by the entropy decoding stage 502 to produce a set of quantized transform coefficients. Dequantization stage 504 dequantizes the quantized transform coefficients (e.g., by multiplying the quantized transform coefficients by quantizer values), and inverse transform stage 506 inverse transforms the dequantized transform coefficients to produce derivative residuals that may be the same as the derivative residuals created by inverse transform stage 412 in encoder 400. Using the header information decoded from the compressed bitstream 420, the decoder 500 may use the intra/inter prediction stage 508 to create the same prediction block as was created in the encoder 400, e.g., in the intra/inter prediction stage 402. In the reconstruction stage 510, the prediction block may be added to the derivative residual to create a reconstructed block. Loop filtering stage 512 may be applied to the reconstructed block to reduce blocking artifacts.
Other filtering may be applied to the reconstructed block. In this example, a deblocking filtering stage 514 is applied to the reconstructed block to reduce block distortion, and the result is output as an output video stream 516. The output video stream 516 may also be referred to as a decoded video stream, and the terms will be used interchangeably herein. Other variations of the decoder 500 may be used to decode the compressed bitstream 420. For example, the decoder 500 may generate the output video stream 516 without the deblocking filtering stage 514.
FIG. 6 is a block diagram of an example of a reference frame buffer 600. The reference frame buffer 600 stores reference frames used to encode or decode blocks of frames of a video sequence. In this example, the reference frame buffer 600 includes reference frames identified as a last frame LAST_FRAME 602, a golden frame GOLDEN_FRAME 604, and a substitute reference frame ALTREF_FRAME 606. The header of a reference frame may include a virtual index for the location within the reference frame buffer at which the reference frame is stored. A reference frame mapping may map the virtual index of a reference frame to a physical index of memory at which the reference frame is stored. Where two reference frames are the same frame, those reference frames will have the same physical index even if they have different virtual indices. The number, types, and names of the reference frame locations within the reference frame buffer 600 are merely examples.
The reference frames stored in the reference frame buffer 600 may be used to identify motion vectors for predicting blocks of a frame to be encoded or decoded. Different reference frames may be used depending on the type of prediction used to predict the current block of the current frame. For example, in bi-directional prediction, a block of the current frame may be forward predicted using any frame stored as LAST_FRAME 602 or GOLDEN_FRAME 604, and backward predicted using the frame stored as ALTREF_FRAME 606.
As shown in FIG. 6, the reference frame buffer 600 may store up to eight reference frames, where each stored reference frame may be associated with a different virtual index of the reference frame buffer. Although three of the eight spaces in the reference frame buffer 600 are used by frames designated as LAST_FRAME 602, GOLDEN_FRAME 604, and ALTREF_FRAME 606, five spaces remain available to store other reference frames.
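As an illustration of the virtual-to-physical mapping described above, the following Python sketch models a reference frame buffer. The class, method, and role names are illustrative assumptions, not part of any codec API.

```python
class ReferenceFrameBuffer:
    """Illustrative model of a reference frame buffer with virtual indices."""

    def __init__(self, num_slots=8):
        self.physical = [None] * num_slots   # stored (reconstructed) reference frames
        self.virtual_to_physical = {}        # e.g. "LAST_FRAME" -> physical slot

    def assign(self, role, physical_index, frame):
        # Two roles that refer to the same frame may share one physical slot.
        self.physical[physical_index] = frame
        self.virtual_to_physical[role] = physical_index

    def get(self, role):
        return self.physical[self.virtual_to_physical[role]]


buf = ReferenceFrameBuffer()
buf.assign("LAST_FRAME", 0, "reconstructed frame 704")    # placeholder frames
buf.assign("GOLDEN_FRAME", 1, "reconstructed frame 700")
buf.assign("ALTREF_FRAME", 2, "reconstructed frame 716")
print(buf.get("ALTREF_FRAME"))
```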
In some embodiments, the substitute reference frame designated as ALTREF_FRAME 606 may be a frame of the video sequence that is further away from the current frame in display order but is encoded or decoded before it is displayed.
A substitute reference frame may instead be generated using one or more frames to which filtering has been applied, that have been combined together, or that have been both combined together and filtered.
In this example, the frame 700 is a key frame (also referred to as an intra-predicted frame), which refers to a state in which the predicted blocks within the frame are predicted using intra prediction only. However, the frame 700 may instead be an overlay frame, which is an inter-predicted frame that may be a reconstructed frame of a previous group of frames.
The coding order of each frame group may be different from the display order. This allows frames located after the current frame in the video sequence to be used as reference frames for encoding the current frame. A decoder, such as decoder 500, may share a common group coding structure with an encoder, such as encoder 400. The group coding structure assigns different roles (e.g., last frame, substitute reference frame, etc.) that the respective frames within the group may play in the reference buffer and defines or indicates the coding order for the frames within the group.
The order shown in FIG. 8 is generally referred to herein as the coding order, because the encoding and decoding order is the same. The key or overlay frame 700 is designated as the golden frame in a reference frame buffer, such as GOLDEN_FRAME 604 in the reference frame buffer 600. The frame 700 is intra-predicted in this example, so it does not require a reference frame; an overlay frame that is a reconstructed frame from a previous group likewise does not use a reference frame of the current group of frames. The final frame 716 in the group is designated as an alternate reference frame in a reference frame buffer, such as ALTREF_FRAME 606 in the reference frame buffer 600. In this coding order, the frame 716 is coded out of display order after the frame 700 so as to provide a backward reference frame for each of the remaining frames 702-714, whose blocks may use the frame 700 as a forward reference frame and the frame 716 as a backward reference frame.
As briefly mentioned above, the available reference frame portions may be reference frame portions that are interpolated using optical flow estimation. For example, the reference frame portion may be a block, a slice, or an entire frame. When frame-level optical flow estimation is performed as described herein, the resulting reference frame is referred to herein as a co-located reference frame because the dimensions are the same as the current frame. This interpolated reference frame may also be referred to herein as an optical flow reference frame.
FIG. 9 is a diagram for explaining the linear projection of a motion field according to the teachings herein. Within the hierarchical coding framework, the optical flow (also referred to as a motion field) of a current frame can be estimated using the nearest available reconstructed (e.g., reference) frames before and after the current frame. In FIG. 9, reference frame 1 is a reference frame that can be used for forward prediction of the current frame 900, and reference frame 2 is a reference frame that can be used for backward prediction of the current frame 900. Referring to the examples of FIGS. 6-8, if the current frame 900 is the frame 706, the immediately preceding frame 704 (e.g., the reconstructed frame stored in the reference frame buffer 600 as LAST_FRAME 602) may be used as reference frame 1, while the frame 716 (e.g., the reconstructed frame stored in the reference frame buffer 600 as ALTREF_FRAME 606) may be used as reference frame 2.
Knowing the display indices of the current frame and the reference frames, and assuming that the motion field is linear in time, motion vectors can be projected between pixels in reference frames 1 and 2 to pixels in the current frame 900. In the simple example described with respect to FIGS. 6-8, the display index of the current frame 900 is 3, the display index of reference frame 1 is 0, and reference frame 2 is the frame 716. In FIG. 9, a projected motion vector 904 for a pixel 902 of the current frame 900 is shown. Using the previous example, the display indices of the group of frames of FIG. 7 indicate that the frame 704 is temporally closer to the frame 706 than the frame 716. Thus, the single motion vector 904 shown in FIG. 9 represents different amounts of motion between reference frame 1 and the current frame 900 and between reference frame 2 and the current frame 900. The projected motion field 906, however, is linear across reference frame 1, the current frame 900, and reference frame 2.
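The linear projection can be sketched as follows; this is a minimal illustration under the linear-in-time assumption, and the function and variable names are assumptions rather than part of the disclosure.

```python
def project_motion_linear(mv_r1_to_r2, index_r1, index_r2, index_cur):
    """Split a motion vector spanning reference frame 1 -> reference frame 2 into
    the portions before and after the current frame, assuming linear motion."""
    dx, dy = mv_r1_to_r2
    total = index_r2 - index_r1
    alpha = (index_cur - index_r1) / total          # temporal fraction before the current frame
    before = (alpha * dx, alpha * dy)               # reference frame 1 -> current frame
    after = ((1 - alpha) * dx, (1 - alpha) * dy)    # current frame -> reference frame 2
    return before, after


# Example: current frame at display index 3 between references at indices 0 and 8.
print(project_motion_linear((8.0, -4.0), index_r1=0, index_r2=8, index_cur=3))
```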
Selecting the nearest available reconstructed forward and backward reference frames and assuming that the motion field of the corresponding pixels of the current frame is linear in time allows the generation of interpolated reference frames using optical-flow estimation to be performed at both the encoder and decoder (e.g., in intra/inter prediction stage 402 and intra/inter prediction stage 508) without conveying additional information.
FIG. 10 is a flow diagram of a method or process 1000 for motion compensated prediction of a frame of a video sequence using at least a portion of a reference frame generated using optical flow estimation. A reference frame portion may be, for example, a block, a slice, or an entire frame.
The frame to be predicted may also be referred to as a first, second, third, etc., frame. The labels first, second, etc., do not necessarily denote an order of the frames; rather, unless otherwise noted, the labels are used herein to distinguish one current frame from another.
Although not explicitly shown in FIG. 10, if either a forward or a backward reference frame is not available, the process 1000 ends.
Assuming both forward and backward reference frames are available at 1004, an optical flow reference frame portion may be generated using the reference frames at 1006. Generating the optical flow reference frame portion is described in more detail with reference to FIGS. 11-14. In some embodiments, the optical flow reference frame portion may be stored at a defined location within the reference frame buffer 600. First, optical flow estimation in accordance with the teachings herein is described.
The optical flow estimation can be performed for the respective pixels of the current frame portion by minimizing the following Lagrangian function (1):

J = J_data + λ·J_spatial    (1)
In function (1), J_data is a data penalty based on the brightness-constancy assumption (i.e., the assumption that the intensity value of a small portion of the image remains constant over time despite a change in position). J_spatial is a spatial penalty based on the smoothness of the motion field (i.e., the characteristic that neighboring pixels likely belong to the same object in the image and therefore have substantially the same image motion).
According to an embodiment of the teachings herein, the data penalty may be represented by a data penalty function:
J_data = (E_x·u + E_y·v + E_t)^2    (2)

The horizontal component of the motion field of the current pixel is denoted by u, while the vertical component of the motion field is denoted by v. Informally, E_x, E_y, and E_t are the derivatives of pixel values of the reference frame portions with respect to the horizontal axis x, the vertical axis y, and time t (e.g., as represented by the frame index). The horizontal and vertical axes are defined with respect to the arrays of pixels forming the current frame, such as the current frame 900, and the reference frames, such as reference frames 1 and 2.
In the data penalty function, the derivatives E_x, E_y, and E_t can be calculated according to the following functions (3), (4), and (5):

E_x = ∂E^(r1)/∂x · (index_cur − index_r1)/(index_r2 − index_r1) + ∂E^(r2)/∂x · (index_r2 − index_cur)/(index_r2 − index_r1)    (3)

E_y = ∂E^(r1)/∂y · (index_cur − index_r1)/(index_r2 − index_r1) + ∂E^(r2)/∂y · (index_r2 − index_cur)/(index_r2 − index_r1)    (4)

E_t = E^(r2) − E^(r1)    (5)

The variable E^(r1) is the pixel value at the projected position in reference frame 1 based on the motion field of the current pixel position of the current frame being encoded. Similarly, the variable E^(r2) is the pixel value at the projected position in reference frame 2 based on the motion field of the current pixel position of the current frame being encoded. The variable index_r1 is the display index of reference frame 1, where the display index of a frame is its index in the display order of the video sequence. Similarly, the variable index_r2 is the display index of reference frame 2, and the variable index_cur is the display index of the current frame 900.
The variable ∂E^(r1)/∂x is the horizontal derivative calculated at reference frame 1 using a linear filter. The variable ∂E^(r2)/∂x is the horizontal derivative calculated at reference frame 2 using a linear filter. The variable ∂E^(r1)/∂y is the vertical derivative calculated at reference frame 1 using a linear filter. The variable ∂E^(r2)/∂y is the vertical derivative calculated at reference frame 2 using a linear filter.
In an embodiment of the teachings herein, the linear filter used to calculate the horizontal derivatives is a 7-tap filter with filter coefficients [ -1/60, 9/60, -45/60, 0, 45/60, -9/60, 1/60 ]. The filters may have different frequency profiles, different numbers of taps, or both. The linear filter used to calculate the vertical derivatives may be the same as or different from the linear filter used to calculate the horizontal derivatives.
The spatial penalty may be represented by a spatial penalty function:
J_spatial = (Δu)^2 + (Δv)^2    (6)
In the spatial penalty function (6), Δu is the Laplacian of the horizontal component u of the motion field, and Δv is the Laplacian of the vertical component v of the motion field.
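As a rough numerical illustration of function (1), the sketch below evaluates the data penalty (2) and the spatial penalty (6) for a candidate motion field over a small patch. The finite-difference Laplacian and the random derivative planes are assumptions made only for the example; they are not the filters or data of the disclosure.

```python
import numpy as np
from scipy.ndimage import laplace

def lagrangian_cost(Ex, Ey, Et, u, v, lam):
    """J = J_data + lam * J_spatial summed over all pixels (functions (1), (2), (6))."""
    j_data = (Ex * u + Ey * v + Et) ** 2               # data penalty per pixel
    j_spatial = laplace(u) ** 2 + laplace(v) ** 2      # Laplacian-based smoothness penalty
    return float(np.sum(j_data + lam * j_spatial))


rng = np.random.default_rng(0)
Ex, Ey, Et = (rng.standard_normal((8, 8)) for _ in range(3))
u = np.full((8, 8), 0.5)        # candidate horizontal motion component
v = np.zeros((8, 8))            # candidate vertical motion component
print(lagrangian_cost(Ex, Ey, Et, u, v, lam=100.0))
```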
FIG. 11 is a flow diagram of a method or process 1100 for generating optical flow reference frame portions. In this example, the optical flow reference frame portion is the entire reference frame. Process 1100 may perform step 1006 of process 1000. For example, process 1100 may be implemented as a software program that may be executed by a computing device, such as transmitting station 102 or receiving station 106. For example, a software program may include machine-readable instructions that may be stored in a memory, such as memory 204 or secondary storage 214, and which, when executed by a processor, such as CPU202, may cause the computing device to perform process 1100. Process 1100 may be implemented using dedicated hardware or firmware. As noted above, multiple processors, memories, or both may be used.
To reduce potential errors in the motion of pixels resulting from this problem, estimated motion vectors pointing from the current frame to the reference frames may be used to initialize the optical flow estimation for the current frame in 1102. All pixels in the current frame may be assigned initialized motion vectors.
The motion field mv_cur of the current pixel may be represented, according to the following equation, as the difference between an estimated motion vector mv_r2 pointing from the current pixel to the backward reference frame (in this example, reference frame 2) and an estimated motion vector mv_r1 pointing from the current pixel to the forward reference frame (in this example, reference frame 1):

mv_cur = −mv_r1 + mv_r2

If one of the motion vectors is not available, the available motion vector may be used to infer the initial motion field according to one of the following functions:

mv_cur = −mv_r1 · (index_r2 − index_r1)/(index_cur − index_r1), or

mv_cur = mv_r2 · (index_r2 − index_r1)/(index_r2 − index_cur).
One or more spatial neighbors having initialized motion vectors may be used in the case where no reference motion vector is available for the current pixel.
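A sketch of these initialization rules follows, assuming that the estimated motion vectors toward each reference frame may or may not be available for a given pixel; the names are illustrative.

```python
def init_motion_field(mv_to_r1, mv_to_r2, index_r1, index_r2, index_cur):
    """Initialize mv_cur, the motion spanning reference frame 1 -> reference frame 2.

    mv_to_r1 / mv_to_r2 are estimated motion vectors from the current pixel toward
    the forward / backward reference frame, or None when unavailable.
    """
    if mv_to_r1 is not None and mv_to_r2 is not None:
        return (-mv_to_r1[0] + mv_to_r2[0], -mv_to_r1[1] + mv_to_r2[1])
    if mv_to_r1 is not None:
        scale = (index_r2 - index_r1) / (index_cur - index_r1)
        return (-mv_to_r1[0] * scale, -mv_to_r1[1] * scale)
    if mv_to_r2 is not None:
        scale = (index_r2 - index_r1) / (index_r2 - index_cur)
        return (mv_to_r2[0] * scale, mv_to_r2[1] * scale)
    return None   # fall back to initialized motion vectors of spatial neighbors


print(init_motion_field((-1.0, 0.5), None, index_r1=0, index_r2=8, index_cur=3))
```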
In an example of initializing the motion field at the first processing level in 1102, reference frame 2 may be used to predict the pixels of reference frame 1, where reference frame 1 is the last frame coded before the current frame. The motion vectors are projected onto the current frame using linear projection, in a similar manner as shown in FIG. 9, to produce the motion field mv_cur at the intersecting pixel locations, such as the motion field 906 at the pixel location 902.
FIG. 11 refers to initializing the motion field at a first processing level because there are desirably multiple processing levels for the process 1100, as can be seen with reference to FIG. 13, which is a diagram illustrating the process 1100 of FIG. 11 (and the process 1200 of FIG. 12 discussed below). The following description refers to the motion field, which comprises the respective motion vectors of the pixels.
In a pyramid structure, for example, the reference frames are downscaled to one or more different scales.
The rationale for this process is that it is easier to capture large motion when the image is downscaled. However, downscaling the reference frames themselves using a simple rescaling filter may reduce reference frame quality. To avoid losing detail to rescaling, the pyramid structure scales the derivatives instead of the pixels of the reference frames to estimate the optical flow. This pyramid scheme amounts to a progressive, coarse-to-fine refinement of the optical flow estimate. The scheme is shown in FIG. 13 and is implemented by the process 1100 of FIG. 11 and the process 1200 of FIG. 12.
After initialization, the Lagrangian parameter λ is set for solving the Lagrangian function (1) in 1104. Desirably, the process 1100 uses multiple values of the Lagrangian parameter λ. The first value to which the Lagrangian parameter λ is set in 1104 may be a relatively large value, such as 100. Although it is desirable that the process 1100 use multiple values of the Lagrangian parameter λ in the Lagrangian function (1), a single value may be used, as described with respect to the process 1200 below.
In 1106, the reference frames are warped to the current frame according to the motion field of the current processing level. Warping the reference frames to the current frame may be performed using sub-pixel position rounding. Notably, before the warping is performed, the motion field mv_cur used at the processing level is downscaled from its full-resolution value to the resolution of that level. Downscaling the motion field is discussed in more detail below.
Knowing the optical flow mv_cur, the motion field used to warp reference frame 1 is inferred from the following linear projection assumption (i.e., motion is projected linearly over time):

mv_r1 = (index_cur − index_r1)/(index_r2 − index_r1) · mv_cur
To perform the warping, the horizontal component u_r1 and the vertical component v_r1 of the motion field mv_r1 may be rounded to 1/8-pixel precision for the Y component and 1/16-pixel precision for the U and V components. Other values for sub-pixel position rounding may be used. After rounding, the warped image Ẽ^(r1) is computed by sub-pixel interpolation of the reference pixels given by the motion vectors mv_r1. The sub-pixel interpolation may be performed using a conventional sub-pixel interpolation filter.
The same warping method is applied to reference frame 2 to obtain the warped image Ẽ^(r2), where the motion field is calculated by:

mv_r2 = (index_r2 − index_cur)/(index_r2 − index_r1) · mv_cur
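The warping of a reference frame toward the current frame can be sketched as below. Bilinear sampling stands in for the codec's sub-pixel interpolation filters, edge clamping stands in for boundary handling, and the sign convention for the scale factor is an assumption; this is an illustration, not the disclosed implementation.

```python
import numpy as np

def warp_reference(ref, mv_cur, scale):
    """Warp a reference frame toward the current frame.

    ref:    2-D array of pixel values.
    mv_cur: per-pixel motion field spanning reference frame 1 -> reference frame 2,
            shape (H, W, 2) as (dx, dy).
    scale:  -(index_cur - index_r1)/(index_r2 - index_r1) for reference frame 1, or
            +(index_r2 - index_cur)/(index_r2 - index_r1) for reference frame 2
            (assumed convention).
    """
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    px = np.clip(xs + scale * mv_cur[..., 0], 0, w - 1)   # projected sample positions
    py = np.clip(ys + scale * mv_cur[..., 1], 0, h - 1)
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = px - x0, py - y0
    top = ref[y0, x0] * (1 - fx) + ref[y0, x1] * fx        # bilinear interpolation
    bot = ref[y1, x0] * (1 - fx) + ref[y1, x1] * fx
    return top * (1 - fy) + bot * fy


ref1 = np.arange(16.0).reshape(4, 4)
mv = np.zeros((4, 4, 2))
print(warp_reference(ref1, mv, scale=-0.375))   # zero motion returns the frame unchanged
```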
at the end of the calculation in 1106, there are two warped reference frames. In 1108, two warped reference frames are used to estimate the motion field between them. Estimating the motion field in 1108 may include a number of steps.
First, the derivatives E_x, E_y, and E_t are calculated using the functions (3), (4), and (5). When calculating the derivatives, the frame boundaries of the warped reference frames may be extended by copying the nearest available pixels. In this manner, the pixel values (i.e., E^(r1) and/or E^(r2)) can be obtained when the projected location is outside of the warped reference frame. Then, if there are multiple levels, the derivatives are downscaled to the current level. As shown in FIG. 13, the reference frames are used to compute the derivatives at the original scale in order to capture the details. Downscaling the derivatives at each level l may be accomplished by averaging the derivatives within 2^l × 2^l blocks. Notably, because calculating the derivatives and averaging them are both linear operations, the two operations can be combined into a single linear filter for calculating the derivatives at each level l. This can reduce computational complexity.
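A sketch of downscaling a derivative plane to level l by averaging within 2^l × 2^l blocks (the disclosure notes that this averaging can be folded into the derivative filter itself; the block-reshape trick and divisibility assumption here are illustrative):

```python
import numpy as np

def downscale_derivative(deriv, level):
    """Average a derivative plane within 2**level x 2**level blocks."""
    s = 2 ** level
    h, w = deriv.shape
    assert h % s == 0 and w % s == 0, "illustration assumes dimensions divisible by 2**level"
    return deriv.reshape(h // s, s, w // s, s).mean(axis=(1, 3))


Ex = np.arange(64, dtype=np.float64).reshape(8, 8)   # toy derivative plane
print(downscale_derivative(Ex, level=2).shape)        # -> (2, 2)
```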
Next, the optical flow estimation can be performed according to the Lagrangian function (1). More specifically, by setting the derivatives of the Lagrangian function (1) with respect to the horizontal component u and the vertical component v of the motion field to zero (i.e., ∂J/∂u = 0 and ∂J/∂v = 0), the components u and v of all N pixels of the frame can be solved for with 2×N linear equations. This is possible because the Laplacian can be approximated by a two-dimensional (2D) filter. Instead of solving the linear equations directly, which is accurate but highly complex, an iterative approach that gives faster but less accurate results can be used to minimize the Lagrangian function (1).
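One common iterative scheme of this kind is a Horn–Schunck-style fixed-point update, sketched below. It replaces the Laplacian-based penalty of function (6) with the classical first-order smoothness term and treats λ as the smoothness weight, so it illustrates the speed/accuracy trade-off rather than reproducing the solver of the disclosure.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def iterate_flow(Ex, Ey, Et, u, v, lam, iterations=50):
    """Horn-Schunck-style iterations toward minimizing a data + smoothness cost."""
    for _ in range(iterations):
        u_bar = uniform_filter(u, size=3)    # local average of the horizontal component
        v_bar = uniform_filter(v, size=3)    # local average of the vertical component
        common = (Ex * u_bar + Ey * v_bar + Et) / (lam + Ex ** 2 + Ey ** 2)
        u = u_bar - Ex * common
        v = v_bar - Ey * common
    return u, v


rng = np.random.default_rng(1)
Ex, Ey, Et = (rng.standard_normal((8, 8)) for _ in range(3))
u, v = iterate_flow(Ex, Ey, Et, np.zeros((8, 8)), np.zeros((8, 8)), lam=100.0)
print(float(u.mean()), float(v.mean()))
```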
In 1108, the motion field of the pixels of the current frame is updated or refined using the estimated motion field between warped reference frames. For example, the current motion field of a pixel may be updated by adding an estimated motion field of the pixel on a pixel-by-pixel basis.
Upon estimating the motion field in 1108, a query is made in 1110 to determine whether there is an additional value of the Lagrangian parameter λ available. A smaller value of the Lagrangian parameter λ can account for smaller-scale motion. If there is an additional value, the process 1100 returns to 1104 to set the next, lower value for the Lagrangian parameter λ.
Once there are no remaining values of the Lagrangian parameter λ at 1110, the process 1100 proceeds to 1112 to determine whether there are more processing levels to process. If there are additional processing levels at 1112, the process 1100 proceeds to 1114, where the motion field is upscaled before the next level is processed using each of the available values of the Lagrangian parameter λ starting at 1104.
This process of upscaling the motion field, using it to initialize the optical flow estimation at the next level, and obtaining the motion field continues until the lowest level of the pyramid is reached at 1112 (i.e., until the optical flow estimation is completed for the derivatives computed at the full scale).
Once processing reaches the level at which the reference frames are not downscaled (i.e., they are at their original resolution), the process 1100 proceeds to 1116. The number of levels may be three, for example, such as in the example of FIG. 13. In 1116, the warped reference frames are blended to form the optical flow reference frame E^(cur). Note that the warped reference frames blended in 1116 may be the full-scale reference frames warped again, according to the process described at 1106, using the motion field estimated in 1108 after the motion field is refined at the full-scale level. The blending may be performed using the following linear-in-time assumption:
E^(cur) = [ (index_r2 − index_cur) · Ẽ^(r1) + (index_cur − index_r1) · Ẽ^(r2) ] / (index_r2 − index_r1)
in embodiments, it is contemplated that it is preferable to warp only the pixels in of the reference frame rather than blending valuesr1Representation) is outside the boundary (e.g., outside the size of the frame) and the reference pixel in reference frame 2 is not, then only the pixels in the warped image caused by reference frame 2 are used according to:
Figure BDA0002300175170000232
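The per-pixel blending, including the fallback to a single warped reference when the other projected position is out of bounds, can be sketched as follows; the temporal-distance weighting shown reflects the linear-in-time assumption and, like the names, is an illustration.

```python
def blend_pixel(e_r1_warped, e_r2_warped, r1_in_bounds, r2_in_bounds,
                index_r1, index_r2, index_cur):
    """Blend two warped reference pixels into one optical flow reference pixel."""
    if r1_in_bounds and not r2_in_bounds:
        return e_r1_warped                      # use only reference frame 1
    if r2_in_bounds and not r1_in_bounds:
        return e_r2_warped                      # use only reference frame 2
    d1 = index_cur - index_r1                   # temporal distance to reference frame 1
    d2 = index_r2 - index_cur                   # temporal distance to reference frame 2
    # The closer reference frame receives the larger weight.
    return (d2 * e_r1_warped + d1 * e_r2_warped) / (d1 + d2)


print(blend_pixel(100.0, 110.0, True, True, index_r1=0, index_r2=8, index_cur=3))
```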
occlusion of objects and background typically occurs in video sequences, where portions of objects appear in reference frames, but are hidden in another reference frames.
Even in this case, however, the simple blending method described above may not give a satisfactory interpolation result. This can be demonstrated by referring to FIG. 14, which is a diagram illustrating object occlusion. In this example, the occluded portion of object A is shown in reference frame 1 and is hidden behind object B in reference frame 2. Because the hidden portion of object A is not shown in reference frame 2, the reference pixel from reference frame 2 comes from object B.

For occlusion detection, it can be observed from FIG. 14 that, when occlusion occurs and the motion field is reasonably accurate, the motion vector of the occluded portion of object A points to object A's position in reference frame 2, which can lead to the following situations. The first case is that the warped pixel values Ẽ^(r1) and Ẽ^(r2) are very different, because they come from two different objects. The second case is that a pixel in object B is referenced by multiple motion vectors, for both the occluded portion of object A and object B in the current frame.
With these observations, the following conditions can be established to determine that an occlusion has occurred and that only one warped pixel value should be used for E^(cur), i.e., E^(cur) = Ẽ^(r1), where similar conditions apply to using only E^(cur) = Ẽ^(r2):

the absolute difference between the warped pixel values, |Ẽ^(r1) − Ẽ^(r2)|, is greater than a threshold T_pixel; and

the number of motion vectors referencing the corresponding position in reference frame 2, denoted N^(r2), is greater than a threshold T_ref.

When both conditions are met, E^(cur) = Ẽ^(r1) is used. Assuming the sub-pixel interpolation described above, a reference is counted toward N^(r2) when the referenced sub-pixel location is within one pixel length of the pixel location of interest. That is, if mv_r2 points to a sub-pixel position, each of the four adjacent pixels counts the reference toward the total number of references for the current sub-pixel position. N^(r1) can be defined in a similar manner.
Similarly, an occlusion may be detected in the second reference frame using the first warped reference frame and the second warped reference frame.
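A hedged sketch of the occlusion test as reconstructed above: a pixel falls back to a single warped reference when the two warped values disagree strongly and the projected position in the other reference frame is referenced by many motion vectors. The threshold values and the exact structure are assumptions for illustration only.

```python
def select_pixel(e_r1_warped, e_r2_warped, refs_r1, refs_r2, blended,
                 t_pixel=40.0, t_ref=2):
    """Choose between the blended value and a single warped reference value.

    refs_r1 / refs_r2 approximate N^(r1) / N^(r2): the number of motion vectors
    referencing the projected position in reference frame 1 / reference frame 2.
    """
    disagree = abs(e_r1_warped - e_r2_warped) > t_pixel
    if disagree and refs_r2 > t_ref:
        return e_r1_warped    # occluded in reference frame 2: trust reference frame 1
    if disagree and refs_r1 > t_ref:
        return e_r2_warped    # occluded in reference frame 1: trust reference frame 2
    return blended


print(select_pixel(80.0, 200.0, refs_r1=1, refs_r2=4, blended=125.0))
```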
Experiments have shown that the process 1100 provides significant compression performance gains. These performance gains include, for a low-resolution frame set, a 2.5% gain in PSNR and a 3.3% gain in SSIM, and, for a mid-resolution frame set, a 3.1% gain in PSNR and a 4.0% gain in SSIM. However, and as described above, the optical flow estimation performed according to the Lagrangian function (1) uses 2×N linear equations to solve for the horizontal and vertical components u, v of the motion field for all N pixels of the frame. In other words, the computational complexity of the optical flow estimation is a polynomial function of the frame size, which imposes a burden on decoder complexity. Accordingly, a sub-frame-based (e.g., block-based) optical flow estimation is described next, which may reduce decoder complexity relative to the frame-based optical flow estimation described with respect to FIG. 11.
FIG. 12 is a flow diagram of a method or process 1200 for generating optical flow reference frame portions. In this example, the optical flow reference frame portion is less than the entire reference frame. The co-located frame portions in this example are described with reference to blocks, but other frame portions may be processed according to fig. 12. Process 1200 may perform step 1006 of process 1000. For example, process 1200 may be implemented as a software program that may be executed by a computing device, such as transmitting station 102 or receiving station 106. For example, a software program may include machine-readable instructions that may be stored in a memory, such as memory 204 or secondary storage 214, and which, when executed by a processor, such as CPU202, may cause the computing device to perform process 1200. Process 1200 may be implemented using dedicated hardware or firmware. As noted above, multiple processors, memories, or both may be used.
All pixels in the current frame may be assigned initialized motion vectors in 1202. The motion vectors define an initial motion field that may be used to warp the reference frames to the current frame at the processing levels so as to shorten the length of motion between the reference frames. The initialization in 1202 may be performed using the same process as described with respect to the initialization in 1102 and, therefore, the description is not repeated here.
In 1204, the reference frames, such as reference frames 1 and 2, are warped to the current frame according to the motion field initialized in 1202. The warping in 1204 may be performed using the same process as described with respect to the warping in 1106, except that, desirably, the motion field mv_cur initialized in 1202 is not scaled down from its full resolution value before the reference frames are warped.
Like process 1100, process 1200 may estimate the motion field between the two reference frames using a multi-level process similar to the multi-level process described with respect to FIG. 13.
More specifically, in 1206 the motion field mv_cur for a block at the current (or first) processing level is initialized. The block may be a block of the current frame selected in a scanning order (e.g., raster scanning order) of the current frame. The motion field mv_cur of the block includes the motion fields of the corresponding pixels of the block. In other words, in 1206, all pixels of the current block are assigned an initialization motion vector. The initialized motion vectors are used to warp the reference blocks toward the current block to shorten the length of motion between the reference blocks of the reference frames.
In 1206, the motion field mv_cur is scaled down from its full resolution value to the resolution of that level. In other words, the initialization in 1206 may include scaling down the motion field for the corresponding pixels of the block from the full resolution values initialized in 1202. The scaling down may be performed using any technique, such as the scaling down described above.
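Purely as an illustration of this step, the sketch below scales a per-pixel motion field down to a coarser level by averaging 2×2 neighborhoods and halving the vector magnitudes; the specific averaging-based technique is an assumption, since the description only requires that the full-resolution field be scaled down to the resolution of the level.

```python
import numpy as np

def downscale_motion_field(mv, levels_down=1):
    """Scale a per-pixel motion field (H x W x 2, in pixels) down by a
    factor of 2 per level: average 2x2 neighborhoods spatially and halve
    the vector magnitudes so they are expressed in the coarser grid."""
    for _ in range(levels_down):
        h, w, _ = mv.shape
        # Average each 2x2 block of motion vectors.
        mv = mv[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, 2).mean(axis=(1, 3))
        # Halve magnitudes so the vectors are in units of the coarser grid.
        mv = mv / 2.0
    return mv
```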
In 1208, the reference blocks are warped to the current block. That is, the co-located reference block in each of the reference frames is warped according to the motion field mv_cur. The warping of the reference blocks is performed similarly to the warping at 1106 of process 1100. Using mv_cur, the motion field mv_r1 used to warp the reference block of reference frame 1 is inferred by the following linear projection assumption (e.g., motion is projected linearly over time):
mv_r1 = ((index_cur - index_r1) / (index_r2 - index_r1)) · mv_cur
To perform the warping, the horizontal component u_r1 and the vertical component v_r1 of the motion field mv_r1 may be rounded to 1/8-pixel accuracy for the Y component and to 1/16-pixel accuracy for the U and V components. Other values may be used. After rounding, the warped reference block for reference frame 1 is computed: because the motion vector mv_r1 for a given pixel may point to a sub-pixel position in reference frame 1, the pixel value at that (possibly sub-pixel) position is assigned to the corresponding pixel of the warped block. The sub-pixel interpolation may be performed using a conventional sub-pixel interpolation filter.
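The per-pixel warp for reference frame 1 can be sketched as follows. The function and variable names are illustrative, the motion field mv_r1 is assumed to have been projected as above, rounding is to 1/8-pixel accuracy as described for the Y plane, and plain bilinear interpolation stands in for the conventional sub-pixel interpolation filter.

```python
import numpy as np

def warp_block(ref_plane, mv_r1, block_y, block_x, block_size):
    """Warp one reference block toward the current block.

    ref_plane: 2-D array of reference frame pixel values (Y plane).
    mv_r1: (block_size, block_size, 2) per-pixel motion field (dy, dx).
    Bilinear interpolation stands in for the codec's sub-pixel filter.
    """
    warped = np.zeros((block_size, block_size), dtype=np.float64)
    h, w = ref_plane.shape
    for r in range(block_size):
        for c in range(block_size):
            # Round the motion vector to 1/8-pixel accuracy (Y plane).
            dy = np.round(mv_r1[r, c, 0] * 8) / 8.0
            dx = np.round(mv_r1[r, c, 1] * 8) / 8.0
            y = np.clip(block_y + r + dy, 0, h - 1)
            x = np.clip(block_x + c + dx, 0, w - 1)
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            fy, fx = y - y0, x - x0
            warped[r, c] = ((1 - fy) * (1 - fx) * ref_plane[y0, x0]
                            + (1 - fy) * fx * ref_plane[y0, x1]
                            + fy * (1 - fx) * ref_plane[y1, x0]
                            + fy * fx * ref_plane[y1, x1])
    return warped
```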
The same warping method is performed on the reference block of reference frame 2 to obtain a second warped block, where the motion field is calculated by:

mv_r2 = ((index_r2 - index_cur) / (index_r2 - index_r1)) · mv_cur
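The two linear projections translate directly into a small helper; the names below are illustrative, and the indices are assumed to be the display-order frame indices used in the formulas above.

```python
def project_motion_field(mv_cur, index_cur, index_r1, index_r2):
    """Split the current motion field mv_cur into the two reference-frame
    motion fields under the linear-projection (constant velocity) assumption."""
    span = index_r2 - index_r1
    mv_r1 = (index_cur - index_r1) / span * mv_cur  # toward reference frame 1
    mv_r2 = (index_r2 - index_cur) / span * mv_cur  # toward reference frame 2
    return mv_r1, mv_r2
```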
At the end of the calculation in 1208, there are two warped reference blocks. In 1210, the two warped reference blocks are used to estimate the motion field between them. The processing in 1210 may be similar to the processing described with respect to 1108 of FIG. 11.
More specifically, the two warped reference blocks may be at full resolution. According to the pyramid structure in FIG. 13, the derivatives Ex, Ey, and Et are calculated using the functions (3), (4), and (5). When calculating the derivatives for the frame-level estimate, the frame boundary may be expanded by copying the nearest available pixels to obtain out-of-boundary pixel values, as described with respect to process 1100. However, for other frame portions, neighboring pixels are generally available in the reference frames warped in 1204. For example, for block-based estimation, pixels of neighboring blocks are available in the warped reference frames unless the block itself is at a frame boundary. Thus, for out-of-boundary pixels associated with a warped reference frame portion, pixels in the adjacent portions of the warped reference frames may be used, where applicable, as the pixel values E(r1) and E(r2). If the projected pixel is outside the frame boundary, the nearest available (i.e., within-boundary) pixel may still be used for copying. After the derivatives are calculated, they may be scaled down to the current level. The scaled-down derivative for each level l may be determined by averaging the derivatives within 2^l × 2^l blocks, as previously discussed. The computational complexity can be reduced by combining the two linear operations of calculating the derivatives and averaging the derivatives into a single linear filter, but this is not essential.
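The derivative and scaling steps might be sketched as follows. Functions (3)-(5) are not reproduced in this portion of the description, so simple finite differences between the two warped blocks are used here only as stand-ins, while the 2^l × 2^l averaging follows the level-wise reduction described above.

```python
import numpy as np

def derivatives(warped_r1, warped_r2):
    """Stand-in spatial/temporal derivatives between the two warped blocks
    (the document's functions (3)-(5) define the actual filters)."""
    avg = (warped_r1 + warped_r2) / 2.0
    ex = np.gradient(avg, axis=1)   # horizontal derivative Ex
    ey = np.gradient(avg, axis=0)   # vertical derivative Ey
    et = warped_r2 - warped_r1      # temporal derivative Et
    return ex, ey, et

def downscale_derivative(d, level):
    """Average the derivative over 2^level x 2^level blocks."""
    s = 2 ** level
    h, w = d.shape
    return d[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).mean(axis=(1, 3))
```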
Continuing with the processing in 1210, the optical flow estimation may be performed to estimate the motion field between the warped reference portions using the scaled derivatives as input to the Lagrangian function (1). The horizontal and vertical components u and v of the motion field for all N pixels of the portion, here a block, may be determined by setting the derivatives of the Lagrangian function (1) with respect to the horizontal and vertical components u and v to zero and solving the resulting linear equations. Because the estimation is performed for a block, assumptions are made for the motion vectors outside the block boundary. One option is to assume zero correlation with neighboring blocks and to assume that a motion vector outside the boundary is the same as the motion vector at the boundary position closest to the out-of-boundary pixel position. Another option is to use the initialized motion vector for the current pixel (i.e., the motion field initialized in 1206) as the motion vector for the out-of-boundary pixel position corresponding to the current pixel.
After the motion field is estimated, the current motion field at the current level is updated or refined using the estimated motion field between the warped reference blocks to complete the processing in 1210. For example, the current motion field of a pixel may be updated by adding the estimated motion field of the pixel on a pixel-by-pixel basis.
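One common way to solve the linear equations obtained by setting the derivatives of the Lagrangian function to zero is an iterative, Horn-Schunck style update, sketched below. Because the exact form of the Lagrangian function (1) and the role of λ are defined earlier in the document, the update rule here is only an assumed classical variant; the second helper shows the pixel-by-pixel refinement of the current motion field described above.

```python
import numpy as np

def estimate_flow(ex, ey, et, lam, iterations=50):
    """Assumed Horn-Schunck style solver for the motion increment (u, v).

    lam plays the role of the Lagrangian (smoothness) parameter; its exact
    placement in the update is an assumption about function (1).
    """
    u = np.zeros_like(ex)
    v = np.zeros_like(ex)
    for _ in range(iterations):
        # Average of the four neighbors (wrap-around at edges, for brevity).
        u_bar = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
                 + np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
        v_bar = (np.roll(v, 1, 0) + np.roll(v, -1, 0)
                 + np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        num = ex * u_bar + ey * v_bar + et
        den = lam + ex ** 2 + ey ** 2
        u = u_bar - ex * num / den
        v = v_bar - ey * num / den
    return u, v

def refine_motion_field(mv_cur, u, v):
    """Add the estimated increment to the current motion field, pixel by pixel."""
    mv = mv_cur.copy()
    mv[..., 0] += v   # vertical component
    mv[..., 1] += u   # horizontal component
    return mv
```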
Process 1100 includes an additional loop over values of the Lagrangian parameter λ; in process 1200 as shown, such a loop is omitted. That is, only a single value of the Lagrangian parameter λ is used to estimate the motion field at the current processing level.
In other embodiments, process 1200 may include an additional loop for changing the Lagrangian parameter λ. In embodiments that include such a loop, the Lagrangian parameter λ may be set prior to estimating the motion field in 1210, such that warping the reference blocks in 1208 and estimating and updating the motion field in 1210 are repeated until all values of the Lagrangian parameter λ have been used, as described with respect to the processing in 1104 and 1110 in process 1100.
After the motion field is estimated and updated in 1210, process 1200 proceeds to the query at 1212. When a single value of the Lagrangian parameter λ is used, this occurs after the motion field has been estimated and updated once at the current level in 1210. When multiple values of the Lagrangian parameter λ are used at a processing level, process 1200 proceeds to the query at 1212 after the motion field has been estimated and updated using the final value of the Lagrangian parameter λ in 1210.
If the query in 1212 indicates that there are additional processing levels, process 1200 proceeds to 1214, where the motion field is enlarged before processing the next level beginning at 1206.
This process of enlarging the motion field, using it to initialize the optical flow estimation at the next level, and obtaining the updated motion field continues until the lowest level of the pyramid is reached in 1212 (i.e., until the optical flow estimation is completed for the derivatives computed at full scale).
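The coarse-to-fine loop of 1206 through 1214 can be summarized by the sketch below. Upsampling by repetition and doubling of the vector magnitudes between levels is an assumption, since 1214 only states that the motion field is enlarged, and warp_fn and estimate_fn are placeholders for the warping in 1208 and the estimation in 1210.

```python
import numpy as np

def upscale_motion_field(mv):
    """Enlarge a motion field to the next finer level: repeat each vector
    over a 2x2 area and double its magnitude."""
    return np.repeat(np.repeat(mv, 2, axis=0), 2, axis=1) * 2.0

def block_optical_flow(levels, mv_init, warp_fn, estimate_fn):
    """Coarse-to-fine loop over the pyramid levels (finest level is 0)."""
    mv = mv_init  # already scaled to the coarsest level, as in 1206
    for level in reversed(range(levels)):
        warped_r1, warped_r2 = warp_fn(mv, level)        # 1208
        u, v = estimate_fn(warped_r1, warped_r2, level)  # 1210
        mv = mv + np.stack([v, u], axis=-1)              # refine the field
        if level > 0:
            mv = upscale_motion_field(mv)                # 1214
    return mv
```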
Once processing reaches the level at which the reference frames are not downscaled (i.e., they are at the original resolution), process 1200 proceeds to 1216. For example, the number of levels may be three, such as in the example of FIG. 13. In 1216, the warped reference blocks are blended to form the optical flow reference block (e.g., the E(cur) described previously). Note that the warped reference blocks blended in 1216 may be full-scale reference blocks that are warped again, using the motion field estimated in 1210, according to the process described with respect to 1208. In other words, the full-scale reference blocks may be warped twice: once using the initial enlarged motion field from the previously processed level, and again after refining the motion field at the full-scale level. The optional occlusion detection shown by example in FIG. 14 may be incorporated into 1216 as part of the blending.
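A minimal sketch of the blending in 1216 is given below. Weighting the co-located pixel values by the temporal distances between the frames follows the description given later in the examples; giving the nearer reference the larger weight is an assumption, and the occlusion masks are assumed to come from the detection conditions described earlier.

```python
def blend_warped_blocks(warped_r1, warped_r2, index_cur, index_r1, index_r2,
                        occluded_in_r1=None, occluded_in_r2=None):
    """Blend the two final warped reference blocks into the optical flow
    reference block, weighting by temporal distance and optionally
    overriding occluded pixels with the other warped reference."""
    d1 = index_cur - index_r1            # distance to reference frame 1
    d2 = index_r2 - index_cur            # distance to reference frame 2
    w1 = d2 / (d1 + d2)                  # nearer reference gets more weight
    w2 = d1 / (d1 + d2)
    blended = w1 * warped_r1 + w2 * warped_r2
    if occluded_in_r1 is not None:       # pixels hidden in reference frame 1
        blended[occluded_in_r1] = warped_r2[occluded_in_r1]
    if occluded_in_r2 is not None:       # pixels hidden in reference frame 2
        blended[occluded_in_r2] = warped_r1[occluded_in_r2]
    return blended
```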
After the co-located reference block is generated in 1216, process 1200 proceeds to 1218 to determine whether there is a further frame portion (here, a block) for prediction.
Referring again to FIG. 10, process 1200 may implement 1006 in process 1000. At the end of the processing in 1006, whether performed according to process 1100, process 1200, or any of the variations described herein, there are one or more warped reference frame portions.
An optical flow reference frame may be generated by combining the optical flow reference portions output by process 1200. Combining the optical flow reference portions may include arranging each of the generated optical flow reference portions (e.g., co-located reference blocks) according to the pixel positions of the corresponding current frame portion. The resulting optical flow reference frame may be stored in a reference frame buffer of an encoder, such as reference frame buffer 600 of encoder 400, for use in prediction.
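At the encoder, assembling the full optical flow reference frame from the co-located blocks might look like the following sketch; the dictionary-based interface is illustrative only.

```python
import numpy as np

def assemble_optical_flow_frame(blocks, frame_h, frame_w, block_size):
    """Place each co-located optical flow reference block at the pixel
    position of its corresponding current-frame block."""
    frame = np.zeros((frame_h, frame_w), dtype=np.float64)
    for (row, col), block in blocks.items():      # keys are block indices
        y, x = row * block_size, col * block_size
        frame[y:y + block_size, x:x + block_size] = block
    return frame
```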
Generating the prediction block at the encoder may include selecting a co-located block in the optical flow reference frame as the prediction block.
At the encoder, process 1000 may form part of a rate-distortion loop for the current block that uses various prediction modes, including one or more intra-prediction modes and both single and compound inter-prediction modes using the reference frames available for the current frame. A single inter-prediction mode uses only a single forward or backward reference frame for inter-prediction.
In some embodiments, it may be desirable to limit the use of the optical flow reference frame to single-reference inter-prediction. That is, in any compound reference mode, the optical flow reference frame is excluded as a reference frame.
The prediction process in 1008 may be repeated for all blocks of the current frame until the current frame is encoded.
In a decoder, performing the prediction process using the optical flow reference frame at 1008 may result from determining that the optical flow reference frame is available for decoding the current frame. In some embodiments, the determination is made by examining a flag indicating that at least one block of the current frame is encoded using an optical flow reference frame portion.
The same process of generating the optical flow reference frame for the prediction process that is part of decoding may be performed in a decoder, such as decoder 500, as is performed in an encoder. For example, when the flag indicates that at least one block of the current frame is encoded using optical flow reference frame portions, the entire optical flow reference frame may be generated and stored for use in the prediction process.
In FIG. 15, pixels are shown along a grid 1500, where x represents the pixel location along the first axis of grid 1500 and y represents the pixel location along the second axis of grid 1500. Grid 1500 represents the pixel locations of a portion of the current frame. In order to perform the prediction process at the decoder at 1008, the processing in 1006 and 1008 may be combined.
Once the reference block is located, all blocks that are spanned (i.e., overlapped) by the reference block are identified. This may include extending the reference block size by half the filter length at each boundary to account for the sub-pixel interpolation filter. In FIG. 15, the sub-pixel interpolation filter length L is used to extend the reference block to the boundary represented by the outer dashed line 1506. Relatively often, the motion vector results in a reference block that is not perfectly aligned with the full pixel positions; the shaded area in FIG. 15 represents the full pixel positions.
Once these blocks are identified, process 1200 is performed at 1006 only on the blocks within the current frame that are co-located with the identified blocks, to generate the co-located (optical flow estimated) reference blocks. In the example of FIG. 15, this results in six optical flow reference frame portions.
It is noted that the reference blocks of subsequently decoded blocks, including any extended edges, may overlap with one or more of the blocks identified during the decoding of the current block. In this case, the optical flow estimation needs to be performed only once for any of the identified blocks, further reducing the computational requirements of the decoder.
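The decoder-side shortcut described above might be sketched as follows: the reference block located by the current block's motion vector is extended by half the interpolation filter length on each side, and process 1200 is run only for the co-located blocks it spans that have not already been computed for earlier blocks. The names, including the cache, are illustrative.

```python
def blocks_to_estimate(ref_y, ref_x, block_h, block_w, filter_length,
                       block_size, frame_h, frame_w, cache):
    """Return the (row, col) indices of co-located blocks that still need
    optical flow estimation for one reference block at (ref_y, ref_x)."""
    half = filter_length // 2
    top = max(ref_y - half, 0)
    left = max(ref_x - half, 0)
    bottom = min(ref_y + block_h + half, frame_h) - 1
    right = min(ref_x + block_w + half, frame_w) - 1
    needed = []
    for row in range(top // block_size, bottom // block_size + 1):
        for col in range(left // block_size, right // block_size + 1):
            if (row, col) not in cache:   # reuse previously estimated blocks
                needed.append((row, col))
    return needed
```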
Regardless of how the prediction block is generated in the decoder, the decoded residual for the current block from the encoded bitstream may be combined with the prediction block to form a reconstructed block, as described with respect to the decoder 500 of FIG. 5.
The prediction process at 1008, whether performed after process 1200 or in conjunction with process 1200, may be repeated for all blocks of the current frame that are encoded using optical flow reference frame portions until the current frame is decoded. When processing blocks in decoding order, blocks that are not encoded using optical flow reference frame portions may be decoded conventionally according to the prediction mode decoded for the block from the encoded bitstream.
The complexity of solving the optical flow equations for N pixels in a frame or block can be represented by O(N × M), where M is the number of iterations used to solve the linear equations. M is not related to the number of levels or to the number of values of the Lagrangian parameter λ; rather, M is related to the computational accuracy of solving the linear equations, with a larger value of M resulting in higher accuracy. In view of this complexity, moving from frame-level to sub-frame-level (e.g., block-based) estimation provides several ways of reducing decoder complexity. First, because the constraint on motion field smoothness is relaxed at block boundaries, solving the linear equations converges to a solution more easily, resulting in a smaller M for similar precision. Second, the solution for a motion vector is related to its neighboring motion vectors due to the smoothness penalty factor; motion vectors at block boundaries have fewer neighboring motion vectors, yielding faster computations. Third, and as described above, the optical flow need only be computed for those blocks of the co-located reference frame identified by the coded blocks that use it for inter-prediction, without requiring the entire frame.
For simplicity of explanation, each of processes 1000, 1100, and 1200 is depicted and described as a series of steps or operations.
The above encoding and decoding aspects illustrate some examples of encoding and decoding techniques. However, it is to be understood that encoding and decoding, as those terms are used in the claims, may refer to compression, decompression, transformation, or any other processing or alteration of data.
As used herein, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X includes A or B" is intended to mean any of the natural inclusive permutations.
Embodiments of transmitting station 102 and/or receiving station 106 (and the algorithms, methods, instructions, etc., stored thereon and/or executed thereby, including by encoder 400 and decoder 500) may be implemented in hardware, software, or any combination thereof.
Further, in one aspect, for example, transmitting station 102 or receiving station 106 may be implemented using a general purpose computer or general purpose processor with a computer program that, when executed, carries out any of the respective methods, algorithms, and/or instructions described herein.
For example, transmitting station 102 and receiving station 106 may be implemented on computers in a video conferencing system. Alternatively, transmitting station 102 may be implemented on a server and receiving station 106 may be implemented on a device separate from the server, such as a handheld communication device. In this case, transmitting station 102 may encode the content into an encoded video signal using encoder 400 and communicate the encoded video signal to the communication device. The communication device may then decode the encoded video signal using decoder 500. Alternatively, the communication device may decode content stored locally on the communication device, e.g., content not communicated by transmitting station 102. Other suitable transmitting station 102 and receiving station 106 implementations are also available. For example, receiving station 106 may be a substantially stationary personal computer rather than a portable communication device, and/or a device including encoder 400 may also include decoder 500.
Additionally, all or part of the embodiments of the disclosure may take the form of a computer program product accessible from, for example, a computer-usable or computer-readable medium, such as any device that can tangibly embody, store, communicate, or transport the program for use by or in connection with any processor.
Some implementations are summarized in the following examples:
Example 1: A method comprising: determining a first frame to be predicted in a video sequence; determining a first reference frame from the video sequence for forward inter-prediction of the first frame; determining a second reference frame from the video sequence for backward inter-prediction of the first frame; generating an optical flow reference frame for inter-prediction of the first frame by performing optical flow estimation using the first reference frame and the second reference frame; and performing a prediction process of the first frame using the optical flow reference frame.
Example 2: The method of example 1, wherein generating the optical flow reference frame comprises performing the optical flow estimation for respective pixels of the first frame by minimizing a Lagrangian function.
Example 3: The method of example 1 or 2, wherein the optical flow estimation produces respective motion fields for pixels of the first frame, and generating the optical flow reference frame comprises: warping the first reference frame to the first frame using the motion fields to form a first warped reference frame; warping the second reference frame to the first frame using the motion fields to form a second warped reference frame; and blending the first warped reference frame and the second warped reference frame to form the optical flow reference frame.
Example 4: The method of example 3, wherein blending the first warped reference frame and the second warped reference frame comprises combining co-located pixel values of the first warped reference frame and the second warped reference frame by scaling the co-located pixel values using distances between the first reference frame and the second reference frame and between the current frame and each of the first reference frame and the second reference frame.
Example 5: The method of example 3 or 4, wherein blending the first warped reference frame and the second warped reference frame comprises populating pixel locations of the optical flow reference frame by one of combining co-located pixel values of the first warped reference frame and the second warped reference frame or using a single pixel value from one of the first warped reference frame or the second warped reference frame.
Example 6: The method of any of examples 3-5, further comprising detecting an occlusion in the first reference frame using the first warped reference frame and the second warped reference frame, wherein blending the first warped reference frame and the second warped reference frame comprises populating pixel locations of the optical flow reference frame corresponding to the occlusion with pixel values from the second warped reference frame.
Example 7: The method of any of examples 1-6, wherein performing the prediction process includes using only the optical flow reference frame for single-reference inter-prediction of a block of the first frame.
Example 8: The method of any of examples 1-7, wherein the first reference frame is a reconstructed frame that is closest, in display order of the video sequence, to the first frame that is available for forward inter-prediction of the first frame, and the second reference frame is a reconstructed frame that is closest, in the display order, to the first frame that is available for backward inter-prediction of the first frame.
Example 9: The method of any of examples 1-8, wherein performing the prediction process includes determining a reference block within the optical flow reference frame that is co-located with a first block of the first frame, and encoding a residual of the reference block and the first block.
Example 10: An apparatus comprising a processor and a non-transitory storage medium comprising instructions executable by the processor to perform a method comprising: determining a first frame to be predicted in a video sequence; determining the availability of a first reference frame for forward inter-prediction of the first frame and a second reference frame for backward inter-prediction of the first frame; and, in response to determining the availability of both the first reference frame and the second reference frame: generating respective motion fields for pixels of the first frame using the first reference frame and the second reference frame as inputs to an optical flow estimation; warping the first reference frame to the first frame using the motion fields to form a first warped reference frame; warping the second reference frame to the first frame using the motion fields to form a second warped reference frame; and blending the first warped reference frame and the second warped reference frame to form an optical flow reference frame for inter-prediction of a block of the first frame.
Example 11: The apparatus of example 10, wherein the method further includes performing a prediction process on the first frame using the optical flow reference frame.
Example 12: The apparatus of example 10 or 11, wherein the method further includes using only the optical flow reference frame for single-reference inter-prediction of a block of the first frame.
Example 13: The apparatus of any of examples 11-12, wherein generating the respective motion fields includes computing outputs of a Lagrangian function for respective pixels of the first frame using the first reference frame and the second reference frame.
Example 14: The apparatus of example 13, wherein computing the output of the Lagrangian function comprises computing a first set of motion fields for the pixels of the current frame using a first value of a Lagrangian parameter, and computing a refined set of motion fields for the pixels of the current frame using a second value of the Lagrangian parameter and the first set of motion fields as inputs to the Lagrangian function, wherein the second value of the Lagrangian parameter is less than the first value of the Lagrangian parameter, and the first warped reference frame and the second warped reference frame are warped using the refined set of motion fields.
Example 15: An apparatus comprising a processor and a non-transitory storage medium including instructions executable by the processor to perform a method comprising generating an optical flow reference frame for inter-prediction of a first frame of a video sequence using a first reference frame from the video sequence and a second reference frame of the video sequence by: initializing motion fields for pixels of the first frame for optical flow estimation at a first processing level, the first processing level representing scaled-down motion within the first frame and comprising one level of a plurality of levels; for each level of the plurality of levels: warping the first reference frame to the first frame using the motion fields to form a first warped reference frame, warping the second reference frame to the first frame using the motion fields to form a second warped reference frame, estimating a motion field between the first warped reference frame and the second warped reference frame using the optical flow estimation, and updating the motion fields for the pixels of the first frame using the estimated motion field; and, for a final level of the plurality of levels: warping the first reference frame to the first frame using the updated motion fields to form a final first warped reference frame, warping the second reference frame to the first frame using the updated motion fields to form a final second warped reference frame, and blending the final first warped reference frame and the final second warped reference frame to form the optical flow reference frame.
Example 16: The apparatus of example 15, wherein the optical flow estimation uses a Lagrangian function for respective pixels of the first frame.
Example 17: The apparatus of example 16, wherein the method further comprises, for each level of the plurality of levels: initializing a Lagrangian parameter in the Lagrangian function to a maximum value for a first iteration of warping the first reference frame, warping the second reference frame, estimating the motion field, and updating the motion field; and performing additional iterations of warping the first reference frame, warping the second reference frame, estimating the motion field, and updating the motion field using successively smaller values of a set of possible values of the Lagrangian parameter.
Example 18: The apparatus of example 16 or 17, wherein estimating the motion field comprises computing derivatives of pixel values of the first warped reference frame and the second warped reference frame with respect to a horizontal axis, a vertical axis, and time; scaling down the derivatives in response to the level being different from the final level; and solving linear equations representing the Lagrangian function using the derivatives.
Example 19: The apparatus of any of examples 15-18, wherein the method further comprises inter-predicting the first frame using the optical flow reference frame.
Example 20: The apparatus of any of examples 15-19, wherein the processor and the non-transitory storage medium form a decoder.
On the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims, which scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures as is permitted under the law.

Claims (21)

1. A method, the method comprising:
determining a first frame portion of a first frame to be predicted, the first frame being in a video sequence;
determining a first reference frame from the video sequence for forward inter prediction of the first frame;
determining a second reference frame from the video sequence for backward inter prediction of the first frame;
generating an optical-flow reference frame portion for inter-prediction of the first frame portion by performing optical-flow estimation using the first reference frame and the second reference frame; and
performing a prediction process for the first frame portion using the optical flow reference frame portion.
2. The method of claim 1, wherein generating the optical flow reference frame portion comprises:
performing the optical flow estimation for respective pixels of the first frame portion by minimizing a Lagrangian function.
3. The method of claim 1 or 2, wherein the optical-flow estimation produces respective motion fields for pixels of the first frame portion, and generating the optical-flow reference frame portion comprises:
warping pixels of the first reference frame that are co-located with the first frame portion to the first frame portion using the motion field to form a first warped reference frame portion;
warping pixels of the second reference frame co-located with the first frame portion to the first frame portion using the motion field to form a second warped reference frame portion; and
blending the first warped reference frame portion and the second warped reference frame portion to form the optical flow reference frame portion.
4. The method of claim 3, wherein blending the first warped reference frame portion and the second warped reference frame portion comprises:
combining co-located pixel values of the first warped reference frame portion and the second warped reference frame portion by scaling the co-located pixel values using distances between the first reference frame and the second reference frame and between the current frame and each of the first reference frame and the second reference frame.
5. The method of claim 3 or 4, wherein blending the first warped reference frame portion and the second warped reference frame portion comprises:
populating pixel positions of the optical flow reference frame portion by one of combining co-located pixel values of the first warped reference frame portion and the second warped reference frame portion or using a single pixel value from one of the first warped reference frame portion or the second warped reference frame portion.
6. The method of any of claims 1-5, wherein the first frame portion includes one of a current block of the first frame or the first frame, and the optical-flow reference frame portion is a block when the first frame portion includes the current block and the optical-flow reference frame portion is an entire frame when the first frame portion includes the first frame.
7. The method of any of claims 1-6, wherein the first reference frame is a reconstructed frame that is closest to the first frame in display order of the video sequence that is available for forward inter-prediction of the first frame, and the second reference frame is a reconstructed frame that is closest to the first frame in the display order that is available for backward inter-prediction of the first frame.
8. The method of any of claims 1-7, wherein:
the first frame portion is a current block to be decoded, and
performing the prediction process comprises:
identifying a reference block location using a motion vector used to encode the current block;
adjusting the boundary of the reference block by a sub-pixel interpolation filter length; and
identifying blocks containing pixels within the adjusted boundaries of the reference block; and generating the optical-flow reference frame portion comprises performing optical-flow estimation for the blocks of the first frame co-located with the identified blocks without performing optical-flow estimation for the remaining blocks of the first frame.
9. The method of any of claims 1-8, wherein:
the first frame portion is a current block to be encoded,
generating the optical-flow reference frame portion includes performing optical-flow estimation for each block of the first frame as the current block to generate a corresponding co-located reference block of an optical-flow reference frame, and
performing the prediction process comprises:
forming the optical-flow reference frame by combining the co-located reference blocks in their respective pixel locations;
storing the optical flow reference frame in a reference frame buffer; and
using the optical flow reference frame for a motion search for the current block.
10. An apparatus, the apparatus comprising:
a processor; and
a non-transitory storage medium comprising instructions executable by the processor to perform a method comprising:
determining a first frame to be predicted in the video sequence;
determining availability of a first reference frame for forward inter prediction of the first frame and a second reference frame for backward inter prediction of the first frame;
in response to determining the availability of both the first reference frame and the second reference frame:
generating respective motion fields for pixels of a first frame portion using the first reference frame and the second reference frame as inputs to an optical flow estimation process;
warping a first reference frame portion to the first frame portion using the motion field to form a first warped reference frame portion, the first reference frame portion comprising pixels of the first reference frame co-located with the pixels of the first frame portion;
warping a second reference frame portion to the first frame portion using the motion field to form a second warped reference frame portion, the second reference frame portion comprising pixels of the second reference frame that are co-located with the pixels of the first frame portion; and
blending the first warped reference frame portion and the second warped reference frame portion to form an optical flow reference frame portion for inter-prediction of a block of the first frame.
11. The apparatus of claim 10, wherein the method further comprises:
performing a prediction process for the block of the first frame using the optical flow reference frame portion.
12. The apparatus of claim 10 or 11, wherein the method further comprises:
using only the optical flow reference frame portion for single reference inter prediction of the block of the first frame.
13. The apparatus of any of claims 10-12, wherein generating the respective motion field comprises:
computing outputs of Lagrangian functions for respective pixels of the first frame portion using the first reference frame portion and the second reference frame portion.
14. The apparatus as recited in claim 13, wherein computing the output of the Lagrangian function comprises:
computing a first set of motion fields for the pixels of the first frame portion using a first value of a Lagrangian parameter; and
computing a refined set of motion fields for the pixels of the first frame portion using a second value of the Lagrangian parameter and using the first set of motion fields as an input to the Lagrangian function, wherein the second value of the Lagrangian parameter is less than the first value of the Lagrangian parameter, and the first warped reference frame portion and the second warped reference frame portion are warped using the refined set of motion fields.
15. An apparatus comprising:
a processor; and
a non-transitory storage medium comprising instructions executable by the processor to perform a method comprising:
generating an inter-predicted optical flow reference frame portion for a block of a first frame of a video sequence using a first reference frame from the video sequence and a second reference frame of the video sequence by:
initializing motion fields of pixels of a first frame portion for optical flow estimation in a first processing level, the first processing level representing scaled-down motion within the first frame portion and comprising one level of a plurality of levels;
for each level of the plurality of levels:
warping a first reference frame portion to the first frame portion using the motion field to form a first warped reference frame portion;
warping a second reference frame portion to the first frame portion using the motion field to form a second warped reference frame portion;
estimating a motion field between the first warped reference frame portion and the second warped reference frame portion using the optical flow estimation; and
updating the motion field for pixels of the first frame portion using the motion field between the first warped reference frame portion and the second warped reference frame portion;
for a final level of the plurality of levels:
warping the first reference frame portion to the first frame portion using the updated motion field to form a final first warped reference frame portion;
warping the second reference frame portion to the first frame portion using the updated motion field to form a final second warped reference frame portion; and
blending the final first warped reference frame portion and the final second warped reference frame portion to form the optical flow reference frame portion.
16. The apparatus of claim 15, wherein the optical flow estimation uses Lagrangian functions for respective pixels of the first frame portion.
17. The apparatus of claim 16, wherein the method further includes, for each level of the plurality of levels:
initializing a Lagrangian parameter in the Lagrangian function to a maximum value for a first iteration of warping the first reference frame portion, warping the second reference frame portion, estimating the motion field, and updating the motion field; and
performing additional iterations of warping the first reference frame portion, warping the second reference frame portion, estimating the motion field, and updating the motion field using successively smaller values of a set of possible values of the Lagrangian parameter.
18. The apparatus of claim 16 or 17, wherein estimating the motion field comprises:
calculating derivatives of pixel values of the first warped reference frame portion and the second warped reference frame portion with respect to a horizontal axis, a vertical axis, and time;
scaling down the derivatives in response to the level being different from the final level; and
solving a linear equation representing the Lagrangian function using the derivatives.
19. The apparatus of any one of claims 15-18, wherein the method further comprises:
inter-predicting a current block of the first frame using the optical flow reference frame portion.
20. The apparatus of any of claims 15-19, wherein the processor and the non-transitory storage medium form a decoder.
21. An apparatus, the apparatus comprising:
a processor; and
a non-transitory storage medium comprising instructions executable by the processor to perform a method comprising:
determining a first frame portion of a first frame to be predicted, the first frame being in a video sequence;
determining a first reference frame from the video sequence for forward inter prediction of the first frame;
determining a second reference frame from the video sequence for backward inter prediction of the first frame;
generating an optical-flow reference frame portion for inter-prediction of the first frame portion by performing optical-flow estimation using the first reference frame and the second reference frame; and
performing a prediction process on the first frame portion using the optical flow reference frame portion.
CN201880036783.5A 2017-08-22 2018-05-10 Optical flow estimation for motion compensated prediction in video coding Active CN110741640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410264512.2A CN118055253A (en) 2017-08-22 2018-05-10 Optical flow estimation for motion compensated prediction in video coding

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US15/683,684 US11284107B2 (en) 2017-08-22 2017-08-22 Co-located reference frame interpolation using optical flow estimation
US15/683,684 2017-08-22
US15/817,369 2017-11-20
US15/817,369 US10659788B2 (en) 2017-11-20 2017-11-20 Block-based optical flow estimation for motion compensated prediction in video coding
PCT/US2018/032054 WO2019040134A1 (en) 2017-08-22 2018-05-10 Optical flow estimation for motion compensated prediction in video coding

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202410264512.2A Division CN118055253A (en) 2017-08-22 2018-05-10 Optical flow estimation for motion compensated prediction in video coding

Publications (2)

Publication Number Publication Date
CN110741640A true CN110741640A (en) 2020-01-31
CN110741640B CN110741640B (en) 2024-03-29

Family

ID=62567747

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201880036783.5A Active CN110741640B (en) 2017-08-22 2018-05-10 Optical flow estimation for motion compensated prediction in video coding
CN202410264512.2A Pending CN118055253A (en) 2017-08-22 2018-05-10 Optical flow estimation for motion compensated prediction in video coding

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202410264512.2A Pending CN118055253A (en) 2017-08-22 2018-05-10 Optical flow estimation for motion compensated prediction in video coding

Country Status (5)

Country Link
EP (1) EP3673655A1 (en)
JP (1) JP6905093B2 (en)
KR (2) KR102295520B1 (en)
CN (2) CN110741640B (en)
WO (1) WO2019040134A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111526370A (en) * 2020-04-17 2020-08-11 Oppo广东移动通信有限公司 Video encoding method, video decoding method, video encoding device, video decoding device and electronic equipment
CN111935425A (en) * 2020-08-14 2020-11-13 字节跳动有限公司 Video noise reduction method and device, electronic equipment and computer readable medium
WO2021196582A1 (en) * 2020-03-31 2021-10-07 武汉Tcl集团工业研究院有限公司 Video compression method, video decompression method and intelligent terminal, and storage medium
CN116982311A (en) * 2021-03-19 2023-10-31 高通股份有限公司 Multi-scale optical flow for video compression for learning

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3777170A4 (en) * 2019-03-01 2021-11-10 Alibaba Group Holding Limited Adaptive resolution video coding
KR20220044843A (en) * 2019-09-24 2022-04-11 엘지전자 주식회사 Subpicture-based video encoding/decoding method, apparatus, and method of transmitting a bitstream
CN110572677B (en) * 2019-09-27 2023-10-24 腾讯科技(深圳)有限公司 Video encoding and decoding method and device, storage medium and electronic device
CN112533026A (en) * 2020-11-27 2021-03-19 西安蓝极医疗电子科技有限公司 Video frame interpolation method based on convolutional neural network
US11831909B2 (en) * 2021-03-11 2023-11-28 Qualcomm Incorporated Learned B-frame coding using P-frame coding system
US20220301184A1 (en) * 2021-03-16 2022-09-22 Samsung Electronics Co., Ltd. Accurate optical flow interpolation optimizing bi-directional consistency and temporal smoothness
CN113613003B (en) * 2021-08-30 2024-03-22 北京市商汤科技开发有限公司 Video compression and decompression methods and devices, electronic equipment and storage medium
WO2023191599A1 (en) * 2022-04-01 2023-10-05 주식회사 케이티 Video signal encoding/decoding method, and recording medium for storing bitstream
WO2023200249A1 (en) * 2022-04-12 2023-10-19 한국전자통신연구원 Method, device, and recording medium for image encoding/decoding

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327327A (en) * 2013-06-03 2013-09-25 电子科技大学 Selection method of inter-frame predictive coding units for HEVC
US20140307982A1 (en) * 2013-04-16 2014-10-16 The Government Of The United States Of America, As Represented By The Secretary Of The Navy Multi-frame super-resolution of image sequence with arbitrary motion patterns
WO2017036399A1 (en) * 2015-09-02 2017-03-09 Mediatek Inc. Method and apparatus of motion compensation for video coding based on bi prediction optical flow techniques
US20170094305A1 (en) * 2015-09-28 2017-03-30 Qualcomm Incorporated Bi-directional optical flow for video coding
CN107071440A (en) * 2016-01-29 2017-08-18 谷歌公司 Use the motion-vector prediction of previous frame residual error

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8144778B2 (en) * 2007-02-22 2012-03-27 Sigma Designs, Inc. Motion compensated frame rate conversion system and method
EP2596636A1 (en) 2010-07-21 2013-05-29 Dolby Laboratories Licensing Corporation Reference processing using advanced motion models for video coding
US11109061B2 (en) * 2016-02-05 2021-08-31 Mediatek Inc. Method and apparatus of motion compensation based on bi-directional optical flow techniques for video coding

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140307982A1 (en) * 2013-04-16 2014-10-16 The Government Of The United States Of America, As Represented By The Secretary Of The Navy Multi-frame super-resolution of image sequence with arbitrary motion patterns
CN103327327A (en) * 2013-06-03 2013-09-25 电子科技大学 Selection method of inter-frame predictive coding units for HEVC
WO2017036399A1 (en) * 2015-09-02 2017-03-09 Mediatek Inc. Method and apparatus of motion compensation for video coding based on bi prediction optical flow techniques
US20170094305A1 (en) * 2015-09-28 2017-03-30 Qualcomm Incorporated Bi-directional optical flow for video coding
CN107071440A (en) * 2016-01-29 2017-08-18 谷歌公司 Use the motion-vector prediction of previous frame residual error

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021196582A1 (en) * 2020-03-31 2021-10-07 武汉Tcl集团工业研究院有限公司 Video compression method, video decompression method and intelligent terminal, and storage medium
CN111526370A (en) * 2020-04-17 2020-08-11 Oppo广东移动通信有限公司 Video encoding method, video decoding method, video encoding device, video decoding device and electronic equipment
CN111526370B (en) * 2020-04-17 2023-06-02 Oppo广东移动通信有限公司 Video encoding and decoding methods and devices and electronic equipment
CN111935425A (en) * 2020-08-14 2020-11-13 字节跳动有限公司 Video noise reduction method and device, electronic equipment and computer readable medium
CN111935425B (en) * 2020-08-14 2023-03-24 字节跳动有限公司 Video noise reduction method and device, electronic equipment and computer readable medium
CN116982311A (en) * 2021-03-19 2023-10-31 高通股份有限公司 Multi-scale optical flow for video compression for learning

Also Published As

Publication number Publication date
JP6905093B2 (en) 2021-07-21
KR20200002036A (en) 2020-01-07
KR102295520B1 (en) 2021-08-27
WO2019040134A1 (en) 2019-02-28
KR102400078B1 (en) 2022-05-18
JP2020522200A (en) 2020-07-27
CN110741640B (en) 2024-03-29
KR20210109049A (en) 2021-09-03
CN118055253A (en) 2024-05-17
EP3673655A1 (en) 2020-07-01

Similar Documents

Publication Publication Date Title
CN110741640B (en) Optical flow estimation for motion compensated prediction in video coding
JP6163674B2 (en) Content adaptive bi-directional or functional predictive multi-pass pictures for highly efficient next-generation video coding
US11284107B2 (en) Co-located reference frame interpolation using optical flow estimation
JP6120390B2 (en) Predicted characteristics compensated for next-generation video content
US11876974B2 (en) Block-based optical flow estimation for motion compensated prediction in video coding
KR101789954B1 (en) Content adaptive gain compensated prediction for next generation video coding
US20130121416A1 (en) Reference Processing Using Advanced Motion Models for Video Coding
US20120195376A1 (en) Display quality in a variable resolution video coder/decoder system
CN110741641B (en) Method and apparatus for video compression
CN110692246B (en) Method and apparatus for motion compensated prediction
CN110741638A (en) Motion vector coding using residual block energy distribution
CN110692241A (en) Diversified motion using multiple global motion models
US20240171733A1 (en) Motion field estimation based on motion trajectory derivation
WO2023205371A1 (en) Motion refinement for a co-located reference frame

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant