WO2021102880A1 - Region-of-interest aware adaptive resolution video coding

Info

Publication number: WO2021102880A1
Application number: PCT/CN2019/121847
Authority: WO
Inventors: Tsuishan CHANG; Yuchen SUN; Jian Lou
Original assignee: Alibaba Group Holding Limited
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2021-06-03
Also published as: CN114342407A

Abstract

Methods and systems are provided for supporting resolution-adaptive video coding of output frames for display based on reference frames of different resolutions derived from the same picture data, each frame being compressed by different methods based on regions of interest of the source picture data, conserving computing resources and improving compression efficiency. Methods and systems described herein provide a video encoder which generates a grouping of multiple frames having different resolutions from the source frame; determines a division of each frame of the grouping into multiple regions of different relative visual interest; codes a reference frame of the grouping utilizing different coding modes for at least two different regions of the reference frame; and codes an output frame of the grouping utilizing different coding modes for at least two different regions of the output frame.

Description

REGION-OF-INTEREST AWARE ADAPTIVE RESOLUTION VIDEO CODING

BACKGROUND

In conventional video coding formats, such as the H.264/AVC (Advanced Video Coding) and H.265/HEVC (High Efficiency Video Coding) standards, video frames in a sequence have their size and resolution recorded at the sequence level in a header. Thus, in order to change frame resolution, a new video sequence must be generated, starting with an intra-coded frame, which carries significantly larger bandwidth costs to transmit than inter-coded frames. Consequently, although it is desirable to adaptively transmit a down-sampled, low resolution video over a network when network bandwidth becomes low, reduced or throttled, it is difficult to realize bandwidth savings while using conventional video coding formats, because the bandwidth costs of adaptively down-sampling offset the bandwidth gains.

Research has been conducted into supporting resolution changing while transmitting coded frames. In the implementation of the AV1 codec, developed by AOM, and in the development of the next-generation video codec specification, VVC, new frame types and new motion prediction coding tools are provided. In general, support of coding of frames which references previous frames having different resolutions may be accomplished in a variety of ways to achieve the goal of reducing bandwidth costs.

A consequence of implementing such techniques will be that encoders may output frames of a variety of resolutions into a bitstream, and frames may be coded for display using motion information of reference frames of any other resolution, or many other resolutions. Thus, it is desirable to optimize coding of frame data transmitted in bitstreams so as to take advantage of variable-resolution bitstream data to facilitate motion prediction and motion compensation, reduce consumption of computational resources, and improve compression efficiency.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.

FIG. 1 illustrates an example of dividing a frame into regions according to respective visual interest according to example embodiments of the present disclosure.

FIG. 2 illustrates groupings of multiple frames having different resolutions all derived from a same source frame, as generated by a video encoder according to example embodiments of the present disclosure.

FIG. 3 illustrates a region-of-interest aware coding method according to example embodiments of the present disclosure.

FIG. 4 illustrates an example system for implementing the processes and methods described herein for implementing support for region-of-interest aware coding methods in video encoders.

DETAILED DESCRIPTION

Systems and methods discussed herein are directed to supporting adaptive resolution change in a video encoder, and more specifically to implementing region-of-interest aware coding methods which improve allocation of computational resources and compression efficiency.

According to example embodiments of the present disclosure implemented to be compatible with AVC, HEVC, VP9, and such video coding standards implementing variable-resolution frames, a frame may be subdivided into macroblocks (“MBs”) each having dimensions of 16x16 pixels, which may be further subdivided into partitions. According to example embodiments of the present disclosure implemented to be compatible with the HEVC standard, a frame may be subdivided into coding tree units (“CTUs”), the luminance (“luma”) and chrominance (“chroma”) components of which may be further subdivided into coding tree blocks (“CTBs”) which are further subdivided into coding units (“CUs”). According to example embodiments of the present disclosure implemented as other standards, a frame may be subdivided into units of NxN pixels, which may then be further subdivided into subunits. Each of these largest subdivided units of a frame may generally be referred to as a “block” for the purpose of this disclosure.

According to example embodiments of the present disclosure, motion prediction coding formats may refer to data formats wherein picture data of video frames are compressed by encoding with motion vector information and prediction information of a frame by the inclusion of one or more references to motion information and prediction units (“PUs”) of one or more other frames. Motion information may refer to data describing motion of a block structure of a frame or a unit or subunit thereof, such as motion vectors and references to blocks of a current frame or of another frame. PUs may refer to a unit or multiple subunits corresponding to a block structure among multiple block structures of a frame, such as an MB or a CTU, wherein blocks are partitioned based on the frame data and are coded according to established video codecs. Motion information corresponding to a PU may describe motion prediction as encoded by any motion vector coding tool, including, but not limited to, those described herein.

A video encoder according to motion prediction coding may obtain a frame from a video source and code the frame to obtain a reconstructed frame that may be ultimately output for display. A reconstructed frame and blocks of a reconstructed frame may be intra-coded, wherein at least some motion information of the reconstructed frame refers to motion information elsewhere in the reconstructed frame, or inter-coded, wherein at least some motion information of the reconstructed frame refers to motion information of another frame. In general, frames and blocks thereof according to example embodiments of the present disclosure may be coded according to intra-coded or inter-coded motion prediction unless either is expressly specified.

In accordance with Adaptive Resolution Change (“ARC”) coding techniques as proposed by Alibaba Group, a video encoder may encode frames from a video source in a compressed format in various different resolutions and transmit the encoded frames in a variable-resolution bitstream, while a video decoder may subsequently obtain coded frames from the variable-resolution bitstream and perform motion compensation using the variable-resolution frames to reconstruct the frames for display. Thus, it should be understood that video bitstreams according to example embodiments of the present disclosure may be variable-resolution by implementation of such techniques, details of which need not be reiterated herein.

Variable-resolution bitstreams according to ARC may include multiple frames of different resolutions containing picture data all derived from a same source frame. Such groupings of multiple frames may include an output frame having a resolution same as a resolution of the original frame, and this output frame may be coded to be output for display. Such groupings of multiple frames may further include any number of reference frames each having a resolution different from the resolution of the original frame, and these reference frames may be coded but may not be ultimately output for display.

A video encoder according to example embodiments of the present disclosure may generate groupings of multiple frames of different resolutions all derived from a same source frame by up-sampling or down-sampling picture data of the source frame to various different resolutions. Up-sampling or down-sampling algorithms according to example embodiments of the present disclosure may include interpolation, average, bilinear algorithms, trained algorithms, or any other suitable algorithms.
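As a minimal sketch of how such a grouping might be generated, the following Python uses simple block averaging as the down-sampling algorithm. The function names `downsample_average` and `generate_grouping`, the list-of-rows frame representation, and the integer scale factors are assumptions made for illustration only; the disclosure does not prescribe any particular algorithm or data layout.

```python
from typing import List

Frame = List[List[float]]  # luma samples in row-major order (assumed layout)


def downsample_average(frame: Frame, factor: int) -> Frame:
    """Down-sample by averaging non-overlapping factor x factor blocks."""
    height, width = len(frame), len(frame[0])
    if height % factor or width % factor:
        raise ValueError("frame dimensions must be divisible by factor")
    out: Frame = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [
                frame[y + dy][x + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def generate_grouping(source: Frame, factors: List[int]) -> List[Frame]:
    """Return reference frames in ascending resolution, followed by an
    output frame that keeps the resolution of the source frame."""
    references = [
        downsample_average(source, f) for f in sorted(factors, reverse=True)
    ]
    output_frame = [row[:] for row in source]
    return references + [output_frame]
```

Calling `generate_grouping(source, [2, 4])` would, under these assumptions, yield two reference frames (one quarter and one half of the source dimensions) and one output frame, mirroring the three-frame groupings of FIG. 2.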

Decoders implemented according to ARC may be operative to decode an output frame with reference to any number of reference frames of different resolutions derived from the same original frame. Details of the implementation thereof need not be reiterated herein.

Example embodiments of the present disclosure provide methods of improving efficiency in compressing a frame to be output for display by coding multiple frames of different resolutions all derived from a same source frame as reference frames. According to example embodiments of the present disclosure, a frame to be coded may, prior to coding, be divided, based on picture data of the frame, into multiple regions of pixels according to the respective visual interest, for human viewers, of the picture data in each region. FIG. 1 illustrates an example of dividing a frame 100 into regions according to respective visual interest according to example embodiments of the present disclosure.

A frame may be divided into any number of multiple regions which, for the purpose of description, may be described in their relative visual interest to each other. Visual interest of different regions may be related in terms of relative frequency of picture content elements which tend to draw the attention of human viewers. For example, a first region containing sharp edges in picture content may have more visual interest than a second region containing fuzzy edges or containing no edges in picture content. A first region containing high-frequency component picture content may have more visual interest than a second region containing low-frequency component picture content. A first region containing foreground picture content may have more visual interest than a second region containing mid-ground or background picture content. A first region containing text in its picture content may have more visual interest than a second region not containing text in its picture content.

According to these and any number of other similar criteria for visual interest, a frame may be divided into two regions, three regions, or any other number of multiple regions according to relative visual interest between each of the regions. For example, as FIG. 1 illustrates, a region 102 has a lowest visual interest; a region 104 has a higher visual interest than region 102; and a region 106 has a highest visual interest.
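One of the criteria above (sharp edges versus fuzzy or absent edges) can be illustrated with a simple score. The sketch below is an assumption made for illustration, not a method taken from the disclosure: it scores a region by the mean squared difference between adjacent samples, so that an abrupt step scores higher than a gradual ramp of the same total contrast. The names `edge_score` and `rank_by_interest` are hypothetical.

```python
from typing import Dict, List

Block = List[List[int]]  # samples of one region, row-major (assumed layout)


def edge_score(block: Block) -> float:
    """Mean squared difference between horizontally and vertically
    adjacent samples; abrupt transitions score higher than gradual ones."""
    height, width = len(block), len(block[0])
    total, count = 0, 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += (block[y][x + 1] - block[y][x]) ** 2
                count += 1
            if y + 1 < height:
                total += (block[y + 1][x] - block[y][x]) ** 2
                count += 1
    return total / count if count else 0.0


def rank_by_interest(regions: Dict[int, Block]) -> List[int]:
    """Return region labels ordered from lowest to highest edge score."""
    return sorted(regions, key=lambda label: edge_score(regions[label]))
```

A practical encoder would likely combine several of the listed criteria (frequency content, foreground detection, text detection); this sketch covers only the edge criterion.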

Division of a frame into regions may be performed based on a source frame from which a grouping of multiple frames having different resolutions is generated, and each frame of the grouping of multiple frames may be divided into correspondingly-shaped and located regions, proportional to their respective resolutions. For each frame of such a grouping of frames, the number of regions may be equal to the number of frames in the grouping of frames, counting an output frame and each reference frame. For each different grouping of frames, the number of regions may be the same or different.
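The proportional mapping of a region from the source frame onto a frame of another resolution can be sketched as follows. Rectangular regions, the `Region` type, and the `scale_region` helper are assumptions introduced for illustration; the regions of FIG. 1 need not be rectangles.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    label: int   # e.g. 102, 104 or 106 as in FIG. 1
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def scale_region(
    region: Region,
    source_size: Tuple[int, int],
    target_size: Tuple[int, int],
) -> Region:
    """Map a region defined on the source frame onto a frame of another
    resolution, keeping its relative position and shape."""
    src_w, src_h = source_size
    dst_w, dst_h = target_size
    # Scale both edges, then derive the size, so that adjacent regions
    # remain adjacent after integer rounding.
    left = region.x * dst_w // src_w
    top = region.y * dst_h // src_h
    right = (region.x + region.width) * dst_w // src_w
    bottom = (region.y + region.height) * dst_h // src_h
    return Region(region.label, left, top, right - left, bottom - top)
```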

The determination of a number of regions for dividing a frame may be (as shall be described subsequently) based on the number of multiple frames of different resolutions to be generated; aside from those manners specified herein, manners of division of a frame into regions and other manners of determination of a number of regions for dividing a frame shall be outside the scope of the present disclosure.

FIG. 2 illustrates groupings 200A and 200B of multiple frames having different resolutions all derived from a same source frame, as generated by a video encoder according to example embodiments of the present disclosure. FIG. 2 illustrates boxes drawn around grouping 200A and around grouping 200B to illustrate each grouping, though these boxes should not be viewed as structuring the transmission of frames, such as in a bitstream. Though frames of the groupings 200A and 200B are each shown as a sequence of consecutive frames, they need not necessarily be coded or output into a bitstream in that order or be output as consecutive frames.

FIG. 2 further illustrates inter-coding relationships among the groupings 200A and 200B according to example embodiments of the present disclosure. An arrow as illustrated in FIG. 2 may indicate that a frame is coded with reference to motion information of another frame without necessarily indicating, and without limiting, ordering of the respective frames: for example, in a bitstream, predicted frames (“P-frames”) may only refer to previous frames, but bidirectional frames (“B-frames”) may refer to previous frames and subsequent frames.

FIG. 2 illustrates frames 202, 204, and 206 in grouping 200A (in order of ascending resolution), and frames 208, 210, and 212 in grouping 200B (in order of ascending resolution).

Output frames 206 and 212 are coded to be output for display, while reference frames 202, 204, 208, and 210 are coded but may not be output for display. Any of the illustrated frames may have a resolution different from a resolution of any of the other illustrated frames. Frames illustrated as adjacent to each other in FIG. 2 may be consecutive frames, or may have additional frames therebetween which are not illustrated for the purpose of the present disclosure.

According to example embodiments of the present disclosure, output frames 206 and 212 may each be coded with reference to each of the reference frames of their respective groupings: that is, output frame 206 is coded with reference to reference frames 202 and 204, and output frame 212 is coded with reference to reference frames 208 and 210. Moreover, each reference frame in a grouping may be coded with reference to at least one reference frame having a relatively smaller resolution (aside from a reference frame having a smallest resolution). For example, reference frame 204 is coded with reference to reference frame 202, and reference frame 210 is coded with reference to reference frame 208.

Frames in a grouping may furthermore be coded with reference to frames outside of that grouping. For example, frame 202 may be coded with reference to some arbitrary frame, or may be an intra-frame coded without reference to any other frames, without limitation. Frame 208 may, likewise, be coded with reference to some arbitrary frame (as illustrated in FIG. 2, frame 206), or may be an intra-frame coded without reference to any other frames, without limitation.

According to example embodiments of the present disclosure, a video encoder may generate a grouping of multiple frames having different resolutions all derived from a same source frame as described above, and, for each different source frame, a different grouping of multiple frames (of a same number or a different number) may be generated. The determination of a number of frames of different resolutions to generate may be (as shall be described subsequently) based on the number of regions that each frame of the grouping is divided into; aside from those manners specified herein, other manners of determination of a number of frames of different resolutions to generate, as well as manners of determination of particular different resolutions, shall be outside the scope of the present disclosure.

The video encoder may then code each of the grouping of multiple frames according to a region-of-interest aware method as described herein.

FIG. 3 illustrates a region-of-interest aware coding method 300 according to example embodiments of the present disclosure.

At step 302, a video encoder obtains a source frame from a video source.

At step 304, the video encoder generates a grouping of multiple frames having different resolutions from the source frame. As described above, each frame of the grouping may be derived from the same picture data of the source frame, and the number of the multiple frames may be the same or may be different for each different grouping of multiple frames. Each frame of the grouping may be derived by up-sampling or down-sampling picture data of the source frame to various different resolutions.

The grouping of multiple frames includes an output frame having a same resolution as the source frame. The grouping of multiple frames further includes any number of reference frames having different resolutions from the source frame. Any number of reference frames having different resolutions from each other may be generated in this manner. The number of reference frames of different resolutions to be generated may be predetermined, may be determined subject to a configurable setting of the video encoder, or may be determined subject to a division of each frame of the grouping into multiple regions as described below with regard to step 306.

At step 306, the video encoder determines multiple regions of different relative visual interest applying to each frame of the grouping. As described above, each frame of the grouping may be divided into multiple regions according to respective visual interest of picture data in each respective region for human viewers. Furthermore, as described above, the number of divisions and the manner of division may be different for each grouping. Furthermore, as described above, division of a frame into regions may be performed based on the source frame, and each frame of the grouping may be divided into correspondingly-shaped and located regions, proportional to their respective resolutions.

According to example embodiments of the present disclosure, the number of reference frames of different resolutions to be generated in a grouping may be determined based on the number of regions that each frame of the grouping is divided into, such that an additional reference frame of a different resolution is generated in the grouping for each additional region that each frame of the grouping is divided into. Alternatively, the number of regions that each frame of the grouping is divided into may be determined based on the number of reference frames of different resolutions to be generated in the grouping, such that each frame of the grouping is divided into one additional region for each additional reference frame of a different resolution to be generated in the grouping.

Step 304 and step 306 may be performed in either order relative to each other. That is, the grouping of multiple frames may be generated first, and the determination of regions performed next, with the number of regions that each frame is divided into being determined based on the number of reference frames of different resolutions generated in the grouping of multiple frames. Alternatively, the determination of regions may be performed first, and the grouping of multiple frames generated next, with the number of reference frames of different resolutions generated being determined based on the number of regions that each frame is divided into.

At step 308, the video encoder codes a reference frame of the grouping utilizing a different coding mode for at least one region of visual interest than for a key region of visual interest.

According to example embodiments of the present disclosure, as will be described in further detail below with reference to step 310, an output frame may be coded referencing each reference frame of a different resolution. For each reference frame of a different resolution, the output frame may be coded referencing particular regions of visual interest of that reference frame, and may be coded referencing as few as one region of visual interest of that reference frame. Because each reference frame is divided into correspondingly-shaped and located regions, “same” regions of visual interest may refer to correspondingly-shaped and located regions across frames of different resolutions, and “different” regions of visual interest may refer to regions across frames of different resolutions which are not correspondingly-shaped and located. The output frame may be coded referencing a different region of visual interest for every reference frame; these different regions of visual interest will be referred to as “key regions” herein. While the output frame may be coded referencing more portions of any reference frame than just one region of visual interest, the output frame is coded referencing at least a different key region of each reference frame.

For each reference frame, the video encoder may determine which key region will be referenced in coding the output frame. This determination may be made based on the relative resolution of each reference frame, in order, and the relative visual interest of each region of visual interest, in order. For example, for a reference frame of a lowest resolution, a key region of visual interest of lowest visual interest may be referenced in coding the output frame; for a reference frame of a second-lowest resolution, a key region of visual interest of second-lowest visual interest may be referenced in coding the output frame; and so on.

Thus, with reference to FIG. 1 and FIG. 2, assuming that the regions of interest of FIG. 1 apply to grouping 200A, then for reference frame 202, a region 102 of the reference frame 202 may be referenced in coding the output frame 206; and for reference frame 204, a region 104 of the reference frame 204 may be referenced in coding the output frame 206. Alternatively, assuming that the regions of interest of FIG. 1 apply to grouping 200B, then for reference frame 208, a region 102 of the reference frame 208 may be referenced in coding the output frame 212; and for reference frame 210, a region 104 of the reference frame 210 may be referenced in coding the output frame 212. (However, the regions of interest of FIG. 1 may not apply to both grouping 200A and grouping 200B.)
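The pairing of reference frames with key regions described above (lowest resolution with lowest visual interest, and so on) can be sketched in a few lines. The function name `assign_key_regions`, the use of a pixel count to order frames by resolution, and the check that the number of regions equals the number of frames in the grouping are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def assign_key_regions(
    reference_frames: List[Tuple[int, int]],
    regions_by_interest: List[int],
) -> Dict[int, int]:
    """Pair each reference frame with a key region.

    reference_frames: (frame_id, pixel_count) pairs, in any order.
    regions_by_interest: region labels from lowest to highest interest.
    Returns a mapping of frame_id to key region label.
    """
    # Assumed constraint: one region per frame of the grouping, counting
    # the output frame in addition to the reference frames.
    if len(regions_by_interest) != len(reference_frames) + 1:
        raise ValueError("expected one more region than reference frames")
    ordered = sorted(reference_frames, key=lambda frame: frame[1])
    # zip stops at the last reference frame, so the region of highest
    # interest is left without a reference frame of its own.
    return {
        frame_id: region
        for (frame_id, _), region in zip(ordered, regions_by_interest)
    }
```

For grouping 200A with the regions of FIG. 1, this pairs reference frame 202 with region 102 and reference frame 204 with region 104, leaving region 106 (the highest visual interest) unpaired.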

The video encoder may code the key region utilizing any suitable coding modes as supported by motion prediction coding formats. For example, the key region may be coded utilizing any suitable algorithm for intra-prediction, inter-prediction, various forms of motion vector prediction (such as adaptive motion vector prediction (“AMVP”), advanced temporal motion vector prediction (“ATMVP”), control-point motion vector prediction (“CPMVP”), temporal motion vector prediction (“TMVP”), spatio-temporal motion vector prediction (“STMVP”), sub-block temporal motion vector prediction (“SbTMVP”), and the like), intra block copy (“IBC”), and the like. Coding of the key region may generally not be limited to any one coding mode according to example embodiments of the present disclosure.

However, according to example embodiments of the present disclosure, the video encoder may code each reference frame utilizing a different coding mode for at least one region of visual interest than a coding mode utilized for coding the key region, or utilizing a different coding mode for each region of visual interest other than the key region than a coding mode utilized for coding the key region. In particular, those other regions of visual interest may be coded by a minimally computationally intensive coding mode. For example, blocks of those other regions of visual interest may be coded by skip mode or merge mode, wherein motion information of the region is copied from motion information of blocks of another reference frame. The copied motion information may further be resized and re-sampled in proportion to a difference between a resolution of the current reference frame and a resolution of the other reference frame.
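The resizing of copied motion information in proportion to the resolution difference can be sketched as a per-component scaling of a motion vector. The helper `scale_motion_vector` and the representation of a motion vector as an integer pair are assumptions for illustration; real codecs store vectors in sub-pixel units and apply their own rounding rules.

```python
from fractions import Fraction
from typing import Tuple


def scale_motion_vector(
    mv: Tuple[int, int],
    from_size: Tuple[int, int],
    to_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Scale a motion vector copied from a frame of one resolution so
    that it applies to a frame of another resolution."""
    mv_x, mv_y = mv
    from_w, from_h = from_size
    to_w, to_h = to_size
    # Exact rational arithmetic avoids floating-point drift; the result
    # is rounded to the nearest integer unit.
    scaled_x = round(Fraction(mv_x * to_w, from_w))
    scaled_y = round(Fraction(mv_y * to_h, from_h))
    return scaled_x, scaled_y
```

A vector copied from a 960x540 reference frame into a 1920x1080 frame is doubled in each component under this scheme.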

Since it is known that the output frame is coded with reference to the key region, any other region of visual interest which is not utilized to code the output frame may be coded by minimally computationally intensive coding modes so as to minimize computation overhead of the coding process. According to example embodiments of the present disclosure, each such other region of visual interest may be coded by minimally computationally intensive coding modes.

Thus, with reference to FIG. 1 and FIG. 2, assuming that the regions of interest of FIG. 1 apply to grouping 200A, then for reference frame 202, a region 102 of the reference frame 202 may be a key region and may be coded by any suitable coding mode without limitation, while regions 104 and 106 of the reference frame 202 may be coded by minimally computationally intensive coding modes; and for reference frame 204, a region 104 of the reference frame 204 may be a key region and may be coded by any suitable coding mode without limitation, while regions 102 and 106 of the reference frame 204 may be coded by minimally computationally intensive coding modes. Alternatively, assuming that the regions of interest of FIG. 1 apply to grouping 200B, then for reference frame 208, a region 102 of the reference frame 208 may be a key region and may be coded by any suitable coding mode without limitation, while regions 104 and 106 of the reference frame 208 may be coded by minimally computationally intensive coding modes; and for reference frame 210, a region 104 of the reference frame 210 may be a key region and may be coded by any suitable coding mode without limitation, while regions 102 and 106 of the reference frame 210 may be coded by minimally computationally intensive coding modes. (However, the regions of interest of FIG. 1 may not apply to both grouping 200A and grouping 200B.)
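The per-region mode selection for a reference frame at step 308 can be summarized as a small planning function. The function name `plan_reference_frame_modes` and the mode labels `"any_suitable_mode"` and `"skip_or_merge"` are hypothetical placeholders; an actual encoder would select concrete coding tools.

```python
from typing import Dict, List


def plan_reference_frame_modes(
    regions: List[int], key_region: int
) -> Dict[int, str]:
    """Choose a coding mode class for every region of one reference frame.

    The key region is left unrestricted; every other region is assigned
    a minimally computationally intensive mode.
    """
    if key_region not in regions:
        raise ValueError("key region must be one of the frame's regions")
    return {
        region: "any_suitable_mode" if region == key_region else "skip_or_merge"
        for region in regions
    }
```

For reference frame 204 of grouping 200A, `plan_reference_frame_modes([102, 104, 106], 104)` leaves region 104 unrestricted and assigns skip or merge coding to regions 102 and 106.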

At step 310, the video encoder codes an output frame of the grouping utilizing a different coding mode for at least one region of non-highest visual interest than for a region of highest visual interest.

The video encoder may code the region of highest visual interest utilizing any suitable coding modes as supported by motion prediction coding formats. For example, the region of highest visual interest may be coded utilizing any suitable algorithm for intra-prediction, inter-prediction, various forms of motion vector prediction (such as adaptive motion vector prediction (“AMVP”), advanced temporal motion vector prediction (“ATMVP”), control-point motion vector prediction (“CPMVP”), temporal motion vector prediction (“TMVP”), spatio-temporal motion vector prediction (“STMVP”), sub-block temporal motion vector prediction (“SbTMVP”), and the like), intra block copy (“IBC”), and the like. Coding of the region of highest visual interest may generally not be limited to any one coding mode according to example embodiments of the present disclosure.

However, according to example embodiments of the present disclosure, the video encoder may code the output frame utilizing a different coding mode for at least one region of non-highest visual interest than a coding mode utilized for coding the region of highest visual interest, or utilizing a different coding mode for each region of non-highest visual interest than a coding mode utilized for the region of highest visual interest. In particular, those regions of non-highest visual interest may be coded by a less computationally intensive coding mode than the coding mode utilized for the region of highest visual interest.

For example, blocks of a region of non-highest visual interest of the output frame may be coded by a simplified inter-coded motion prediction mode based on blocks of a key region of one of the reference frames of the grouping. As described above, the output frame is coded referencing at least a different key region of each reference frame; therefore, each region of the output frame is coded referencing a different key region of a different reference frame. Moreover, since each reference frame is derived from a same source frame as the output frame, it may be expected that residual information (generated from computing a difference between the output frame and a reference frame) is not needed to transform motion information of the reference frame to motion information of the source frame. Therefore, simplification of an inter-coded motion prediction mode may be characterized by residual information between the output frame and the reference frame not being computed in coding the output frame and not being subsequently transmitted with the output frame. Furthermore, the coded motion information may further be resized and re-sampled in proportion to a difference between a resolution of the output frame and a resolution of the reference frame. In other regards, the inter-coded motion prediction mode may be implemented according to the knowledge of persons skilled in the art.
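The coding decisions for an output frame at step 310 can be sketched as a plan that records, for each region, the mode class, the reference frame supplying the key region, and whether residual information is computed. The `RegionPlan` type, the `plan_output_frame` function, and the mode labels are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RegionPlan:
    mode: str                       # hypothetical mode label
    reference_frame: Optional[int]  # frame supplying the key region, if any
    uses_residual: bool             # False: residual neither computed nor sent


def plan_output_frame(
    regions_by_interest: List[int],
    key_regions: Dict[int, int],
) -> Dict[int, RegionPlan]:
    """Plan the coding of each region of an output frame.

    regions_by_interest: region labels from lowest to highest interest.
    key_regions: mapping of reference frame id to its key region label.
    """
    frame_for_region = {region: frame for frame, region in key_regions.items()}
    highest = regions_by_interest[-1]
    plan: Dict[int, RegionPlan] = {}
    for region in regions_by_interest:
        if region == highest or region not in frame_for_region:
            # Region of highest interest: no restriction on coding mode.
            plan[region] = RegionPlan("any_suitable_mode", None, True)
        else:
            # Simplified inter prediction from the matching key region,
            # with no residual computed or transmitted.
            plan[region] = RegionPlan(
                "simplified_inter", frame_for_region[region], False
            )
    return plan
```

For output frame 206 with key regions `{202: 102, 204: 104}`, regions 102 and 104 are predicted without residual from reference frames 202 and 204 respectively, while region 106 remains free to use any suitable coding mode.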

At step 312, the video encoder outputs coded frames of the grouping to a bitstream. The coded frames may include the output frame and the reference frames. The coded frames do not retain information regarding division of each frame into regions.

FIG. 4 illustrates an example system 400 for implementing the processes and methods described above for implementing support for region-of-interest aware coding methods in video encoders.

The techniques and mechanisms described herein may be implemented by multiple instances of the system 400 as well as by any other computing device, system, and/or environment. The system 400 shown in FIG. 4 is only one example of a system and is not intended to suggest any limitation as to the scope of use or functionality of any computing device utilized to perform the processes and/or procedures described above. Other well-known computing devices, systems, environments and/or configurations that may be suitable for use with the embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, game consoles, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, implementations using field programmable gate arrays (“FPGAs”) and application specific integrated circuits (“ASICs”), and/or the like.

The system 400 may include one or more processors 402 and system memory 404 communicatively coupled to the processor(s) 402. The processor(s) 402 may execute one or more modules and/or processes to cause the processor(s) 402 to perform a variety of functions. In some embodiments, the processor(s) 402 may include a central processing unit (“CPU”), a graphics processing unit (“GPU”), both CPU and GPU, or other processing units or components known in the art. Additionally, each of the processor(s) 402 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems.

Depending on the exact configuration and type of the system 400, the system memory 404 may be volatile, such as RAM, non-volatile, such as ROM, flash memory, miniature hard drive, memory card, and the like, or some combination thereof. The system memory 404 may include one or more computer-executable modules 406 that are executable by the processor(s) 402.

The modules 406 may include, but are not limited to, an encoder module 408. The encoder module 408 further includes a source frame obtaining submodule 410, a frame generating submodule 412, a region determining submodule 414, a reference frame coding submodule 416, an output frame coding submodule 418, and a frame outputting submodule 420.

The encoder module 408 may be configured to perform motion prediction coding upon frames from a video source by any of the algorithms and processes described herein, including the functionality of each submodule described herein.

The source frame obtaining submodule 410 may be configured to obtain a source frame from a video source, as described above with reference to FIG. 3.

The frame generating submodule 412 may be configured to generate a grouping of multiple frames having different resolutions from the source frame, as described above with reference to FIG. 3.
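As an illustration of how a grouping of frames at different resolutions might be derived from one source frame, the following is a minimal sketch rather than the disclosed implementation. The names `Frame`, `resample`, and `generate_grouping`, the scale factors, and the use of nearest-neighbour sampling are all assumptions made for the example; a practical encoder would typically apply a proper resampling filter.

```python
from typing import List

# Assumption: a frame is modelled as rows of integer luma samples.
Frame = List[List[int]]


def resample(frame: Frame, width: int, height: int) -> Frame:
    """Nearest-neighbour resampling; handles both up- and down-sampling."""
    src_h, src_w = len(frame), len(frame[0])
    return [
        [frame[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def generate_grouping(source: Frame, scales: List[float]) -> List[Frame]:
    """Return one frame per scale factor, all derived from the same source."""
    src_h, src_w = len(source), len(source[0])
    grouping = []
    for s in scales:
        w = max(1, round(src_w * s))
        h = max(1, round(src_h * s))
        grouping.append(resample(source, w, h))
    return grouping


# Hypothetical 8x8 source frame and three illustrative scale factors.
source = [[x + 10 * y for x in range(8)] for y in range(8)]
grouping = generate_grouping(source, [1.0, 0.5, 0.25])
sizes = [(len(f[0]), len(f)) for f in grouping]
print(sizes)
```

In this sketch the frame at scale 1.0 is identical to the source, while the other frames carry the same picture data at reduced resolutions.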

The region determining submodule 414 may be configured to determine a division of each frame of the grouping into multiple regions of different relative visual interest, as described above with reference to FIG. 3.

The reference frame coding submodule 416 may be configured to code a reference frame of the grouping utilizing a different coding mode for at least one region of visual interest than for a key region of visual interest, as described above with reference to FIG. 3.

The output frame coding submodule 418 may be configured to code an output frame of the grouping utilizing a different coding mode for at least one region of non-highest visual interest than for a region of highest visual interest, as described above with reference to FIG. 3.

The frame outputting submodule 420 may be configured to output coded frames of the grouping to a bitstream, as described above with reference to FIG. 3.
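The division of the encoder module 408 into submodules 410 through 420 can be pictured as a simple pipeline. The sketch below is illustrative only: the class `EncoderModule`, its field names, the callable signatures, and the choice to treat the first frame of the grouping as the output frame and to code reference frames before the output frame are assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EncoderModule:
    """Toy stand-in for encoder module 408; each callable plays one submodule."""
    obtain_source_frame: Callable[[], object]                # submodule 410
    generate_grouping: Callable[[object], list]              # submodule 412
    determine_regions: Callable[[list], list]                # submodule 414
    code_reference_frame: Callable[[object, list], str]      # submodule 416
    code_output_frame: Callable[[object, list, list], str]   # submodule 418
    bitstream: List[str] = field(default_factory=list)       # filled by 420

    def encode_one(self) -> None:
        source = self.obtain_source_frame()
        grouping = self.generate_grouping(source)
        regions = self.determine_regions(grouping)
        # Assumption: the first frame is the output frame, the rest are references.
        output, references = grouping[0], grouping[1:]
        coded_refs = [self.code_reference_frame(r, regions) for r in references]
        coded_out = self.code_output_frame(output, references, regions)
        # Frame outputting (submodule 420): append coded frames to the bitstream.
        self.bitstream.extend(coded_refs + [coded_out])


# Hypothetical placeholder submodules that only label what they would code.
encoder = EncoderModule(
    obtain_source_frame=lambda: "src",
    generate_grouping=lambda s: [s + "@1x", s + "@1/2x", s + "@1/4x"],
    determine_regions=lambda g: ["high", "low"],
    code_reference_frame=lambda f, r: f"ref({f})",
    code_output_frame=lambda f, refs, r: f"out({f})",
)
encoder.encode_one()
print(encoder.bitstream)
```

Passing the submodules in as callables keeps the control flow visible while leaving every coding decision to the individual submodule.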

The system 400 may additionally include an input/output (“I/O”) interface 440 for receiving sequences of frames from video source data, and for outputting reconstructed frames into a reference frame buffer and/or a transmission buffer. The system 400 may also include a communication module 450 allowing the system 400 to communicate with other devices (not shown) over a network (not shown). The network may include the Internet, wired media such as a wired network or direct-wired connections, and wireless media such as acoustic, radio frequency (“RF”), infrared, and other wireless media.

Some or all operations of the methods described above can be performed by execution of computer-readable instructions stored on a computer-readable storage medium, as defined below. The term “computer-readable instructions,” as used in the description and claims, includes routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, combinations thereof, and the like.

The computer-readable storage media may include volatile memory (such as random-access memory (“RAM”)) and/or non-volatile memory (such as read-only memory (“ROM”), flash memory, etc.). The computer-readable storage media may also include additional removable storage and/or non-removable storage including, but not limited to, flash memory, magnetic storage, optical storage, and/or tape storage that may provide non-volatile storage of computer-readable instructions, data structures, program modules, and the like.

A non-transient computer-readable storage medium is an example of computer-readable media. Computer-readable media includes at least two types of computer-readable media, namely computer-readable storage media and communications media. Computer-readable storage media includes volatile and non-volatile, removable and non-removable media implemented in any process or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media includes, but is not limited to, phase change memory (“PRAM”), static random-access memory (“SRAM”), dynamic random-access memory (“DRAM”), other types of random-access memory (“RAM”), read-only memory (“ROM”), electrically erasable programmable read-only memory (“EEPROM”), flash memory or other memory technology, compact disk read-only memory (“CD-ROM”), digital versatile disks (“DVD”) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. A computer-readable storage medium employed herein shall not be interpreted as a transitory signal itself, such as a radio wave or other free-propagating electromagnetic wave, electromagnetic waves propagating through a waveguide or other transmission medium (such as light pulses through a fiber optic cable), or electrical signals propagating through a wire.

The computer-readable instructions stored on one or more non-transitory computer-readable storage media, when executed by one or more processors, may perform the operations described above with reference to FIGS. 1-4. Generally, computer-readable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.

By the abovementioned technical solutions, the present disclosure provides resolution-adaptive video coding supported by a region-of-interest aware coding mode of a video encoder, enabling the encoder to code motion information of output frames for display based on reference frames of different resolutions derived from the same picture data, each frame being compressed by different methods based on regions of interest of the source picture data, so as to conserve computing resources and improve compression efficiency. The methods and systems described herein provide a video encoder which obtains a source frame from a video source; generates a grouping of multiple frames having different resolutions from the source frame; determines a division of each frame of the grouping into multiple regions of different relative visual interest; codes a reference frame of the grouping utilizing a different coding mode for at least one region of visual interest than for a key region of visual interest; codes an output frame of the grouping utilizing a different coding mode for at least one region of non-highest visual interest than for a region of highest visual interest; and outputs coded frames of the grouping to a bitstream.
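The example clauses of this disclosure describe one possible different coding mode for regions of non-highest visual interest: an inter-coded motion prediction mode in which residual information between the output frame and the reference frame is not computed. The following toy sketch shows the idea on a one-dimensional row of samples. The functions `best_offset`, `code_block`, and `decode_block`, the exhaustive search, and the sum-of-absolute-differences cost are assumptions chosen for brevity and are not taken from the disclosure.

```python
from typing import List, Optional, Tuple


def best_offset(block: List[int], reference: List[int]) -> int:
    """Exhaustive 1-D motion search minimising the sum of absolute differences."""
    n = len(block)

    def sad(off: int) -> int:
        return sum(abs(b - r) for b, r in zip(block, reference[off:off + n]))

    return min(range(len(reference) - n + 1), key=sad)


def code_block(block: List[int], reference: List[int],
               highest_interest: bool) -> Tuple[int, Optional[List[int]]]:
    """Return (motion offset, residual); the residual is None in the simplified mode."""
    off = best_offset(block, reference)
    if not highest_interest:
        return off, None  # simplified mode: the residual is never computed
    pred = reference[off:off + len(block)]
    return off, [b - p for b, p in zip(block, pred)]


def decode_block(off: int, residual: Optional[List[int]],
                 reference: List[int], n: int) -> List[int]:
    """Reconstruct a block from the prediction plus the residual, if present."""
    pred = reference[off:off + n]
    return pred if residual is None else [p + r for p, r in zip(pred, residual)]


# Hypothetical sample values.
reference = [10, 20, 30, 40, 50, 60]
block = [31, 41, 49]
full = code_block(block, reference, highest_interest=True)
cheap = code_block(block, reference, highest_interest=False)
print(full, decode_block(*full, reference, 3))
print(cheap, decode_block(*cheap, reference, 3))
```

With the residual, the block is reconstructed exactly; without it, the decoder falls back to the motion-compensated prediction alone, which costs less to compute and signal but is only approximate.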

EXAMPLE CLAUSES

A. A method comprising: obtaining a source frame from a video source; generating a grouping of multiple frames having different resolutions from the source frame; determining a plurality of regions of different relative visual interest, the plurality of regions applying to each frame of the grouping; coding a reference frame of the grouping utilizing a different coding mode for at least one region of visual interest among the plurality of regions than for a key region of visual interest among the plurality of regions; and coding an output frame of the grouping utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions than for a region of highest visual interest among the plurality of regions.

B. The method as paragraph A recites, wherein each frame of the grouping is derived by up-sampling or down-sampling picture data of the source frame to a different resolution.

C. The method as paragraph A recites, wherein the grouping comprises an output frame having a same resolution as the source frame and at least one reference frame, each reference frame having a different resolution.

D. The method as paragraph C recites, wherein resolutions of each reference frame are different from the resolution of the source frame.

E. The method as paragraph C recites, wherein a number of the reference frames of different resolutions to be generated in a grouping is determined based on a number of the plurality of regions applying to each frame of the grouping.

F. The method as paragraph C recites, wherein a number of the plurality of regions applying to each frame of the grouping is determined based on a number of the reference frames of different resolutions to be generated in the grouping.

G. The method as paragraph A recites, wherein relative visual interest of each of the plurality of regions is according to respective visual interest of picture data of the source frame in each respective region.

H. The method as paragraph A recites, wherein the output frame is coded referencing, for each reference frame of a different resolution, at least one region of visual interest among the plurality of regions applying to that reference frame.

I. The method as paragraph H recites, wherein the output frame is coded referencing, for each reference frame of a different resolution, at least a different key region among the plurality of regions applying to that reference frame.

J. The method as paragraph I recites, wherein, for each reference frame of a different resolution, a key region to be referenced in coding the output frame is determined based on relative resolutions of each reference frame, in order, and relative visual interest of each region of visual interest, in order.

K. The method as paragraph I recites, wherein each reference frame is coded utilizing a different coding mode for at least one region of visual interest among the plurality of regions than a coding mode utilized for coding the key region.

L. The method as paragraph K recites, wherein each reference frame is coded utilizing a different coding mode for each region of visual interest among the plurality of regions other than the key region than the coding mode utilized for coding the key region.

M. The method as paragraph K recites, wherein the different coding mode is a minimally computationally intensive coding mode.

N. The method as paragraph M recites, wherein the different coding mode is a skip mode or a merge mode.

O. The method as paragraph I recites, wherein the output frame is coded utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions than a coding mode utilized for coding the region of highest visual interest.

P. The method as paragraph O recites, wherein the output frame is coded utilizing a different coding mode for each region of non-highest visual interest among the plurality of regions than the coding mode utilized for coding the region of highest visual interest.

Q. The method as paragraph O recites, wherein the different coding mode is a less computationally intensive coding mode than the coding mode utilized for the region of highest visual interest.

R. The method as paragraph Q recites, wherein the different coding mode is an inter-coded motion prediction mode simplified by residual information between the output frame and the reference frame not being computed.

S. A system comprising: one or more processors; and memory communicatively coupled to the one or more processors, the memory storing computer-executable modules executable by the one or more processors that, when executed by the one or more processors, perform associated operations, the computer-executable modules including: an encoder module further comprising: a frame obtaining submodule configured to obtain a source frame from a video source; a frame generating submodule configured to generate a grouping of multiple frames having different resolutions from the source frame; a region determining submodule configured to determine a plurality of regions of different relative visual interest, the plurality of regions applying to each frame of the grouping; a reference frame coding submodule configured to code a reference frame of the grouping utilizing a different coding mode for at least one region of visual interest among the plurality of regions than for a key region of visual interest among the plurality of regions; and an output frame coding submodule configured to code an output frame of the grouping utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions than for a region of highest visual interest among the plurality of regions.

T. The system as paragraph S recites, wherein the frame generating submodule is configured to derive each frame of the grouping by up-sampling or down-sampling picture data of the source frame to a different resolution.

U. The system as paragraph S recites, wherein the grouping comprises an output frame having a same resolution as the source frame and at least one reference frame, each reference frame having a different resolution.

V. The system as paragraph U recites, wherein resolutions of each reference frame are different from the resolution of the source frame.

W. The system as paragraph U recites, wherein the region determining submodule is configured to determine a number of the reference frames of different resolutions to be generated in a grouping based on a number of the plurality of regions applying to each frame of the grouping.

X. The system as paragraph U recites, wherein the region determining submodule is configured to determine a number of the plurality of regions applying to each frame of the grouping based on a number of the reference frames of different resolutions to be generated in the grouping.

Y. The system as paragraph S recites, wherein relative visual interest of each of the plurality of regions is according to respective visual interest of picture data of the source frame in each respective region.

Z. The system as paragraph S recites, wherein the output frame coding submodule is configured to code the output frame referencing, for each reference frame of a different resolution, at least one region of visual interest among the plurality of regions applying to that reference frame.

AA. The system as paragraph Z recites, wherein the output frame coding submodule is configured to code the output frame referencing, for each reference frame of a different resolution, at least a different key region among the plurality of regions applying to that reference frame.

BB. The system as paragraph AA recites, wherein the output frame coding submodule is configured to determine, for each reference frame of a different resolution, a key region to be referenced in coding the output frame based on relative resolutions of each reference frame, in order, and relative visual interest of each region of visual interest, in order.

CC. The system as paragraph AA recites, wherein the reference frame coding submodule is configured to code each reference frame utilizing a different coding mode for at least one region of visual interest among the plurality of regions than a coding mode utilized for coding the key region.

DD. The system as paragraph CC recites, wherein the reference frame coding submodule is configured to code each reference frame utilizing a different coding mode for each region of visual interest among the plurality of regions other than the key region than the coding mode utilized for coding the key region.

EE. The system as paragraph CC recites, wherein the different coding mode is a minimally computationally intensive coding mode.

FF. The system as paragraph EE recites, wherein the different coding mode is a skip mode or a merge mode.

GG. The system as paragraph Z recites, wherein the output frame coding submodule is configured to code the output frame utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions than a coding mode utilized for coding the region of highest visual interest.

HH. The system as paragraph GG recites, wherein the output frame coding submodule is configured to code the output frame utilizing a different coding mode for each region of non-highest visual interest among the plurality of regions than the coding mode utilized for coding the region of highest visual interest.

II. The system as paragraph GG recites, wherein the different coding mode is a less computationally intensive coding mode than the coding mode utilized for the region of highest visual interest.

JJ. The system as paragraph II recites, wherein the different coding mode is an inter-coded motion prediction mode simplified by residual information between the output frame and the reference frame not being computed.

KK. A computer-readable storage medium storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising: obtaining a source frame from a video source; generating a grouping of multiple frames having different resolutions from the source frame; determining a plurality of regions of different relative visual interest, the plurality of regions applying to each frame of the grouping; coding a reference frame of the grouping utilizing a different coding mode for at least one region of visual interest among the plurality of regions than for a key region of visual interest among the plurality of regions; and coding an output frame of the grouping utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions than for a region of highest visual interest among the plurality of regions.

LL. The computer-readable storage medium as paragraph KK recites, wherein each frame of the grouping is derived by up-sampling or down-sampling picture data of the source frame to a different resolution.

MM. The computer-readable storage medium as paragraph KK recites, wherein the grouping comprises an output frame having a same resolution as the source frame and at least one reference frame, each reference frame having a different resolution.

NN. The computer-readable storage medium as paragraph MM recites, wherein resolutions of each reference frame are different from the resolution of the source frame.

OO. The computer-readable storage medium as paragraph MM recites, wherein a number of the reference frames of different resolutions to be generated in a grouping is determined based on a number of the plurality of regions applying to each frame of the grouping.

PP. The computer-readable storage medium as paragraph MM recites, wherein a number of the plurality of regions applying to each frame of the grouping is determined based on a number of the reference frames of different resolutions to be generated in the grouping.

QQ. The computer-readable storage medium as paragraph KK recites, wherein relative visual interest of each of the plurality of regions is according to respective visual interest of picture data of the source frame in each respective region.

RR. The computer-readable storage medium as paragraph KK recites, wherein the output frame is coded referencing, for each reference frame of a different resolution, at least one region of visual interest among the plurality of regions applying to that reference frame.

SS. The computer-readable storage medium as paragraph RR recites, wherein the output frame is coded referencing, for each reference frame of a different resolution, at least a different key region among the plurality of regions applying to that reference frame.

TT. The computer-readable storage medium as paragraph SS recites, wherein, for each reference frame of a different resolution, a key region to be referenced in coding the output frame is determined based on relative resolutions of each reference frame, in order, and relative visual interest of each region of visual interest, in order.

UU. The computer-readable storage medium as paragraph SS recites, wherein each reference frame is coded utilizing a different coding mode for at least one region of visual interest among the plurality of regions than a coding mode utilized for coding the key region.

VV. The computer-readable storage medium as paragraph UU recites, wherein each reference frame is coded utilizing a different coding mode for each region of visual interest among the plurality of regions other than the key region than the coding mode utilized for coding the key region.

WW. The computer-readable storage medium as paragraph UU recites, wherein the different coding mode is a minimally computationally intensive coding mode.

XX. The computer-readable storage medium as paragraph WW recites, wherein the different coding mode is a skip mode or a merge mode.

YY. The computer-readable storage medium as paragraph SS recites, wherein the output frame is coded utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions than a coding mode utilized for coding the region of highest visual interest.

ZZ. The computer-readable storage medium as paragraph YY recites, wherein the output frame is coded utilizing a different coding mode for each region of non-highest visual interest among the plurality of regions than the coding mode utilized for coding the region of highest visual interest.

AAA. The computer-readable storage medium as paragraph YY recites, wherein the different coding mode is a less computationally intensive coding mode than the coding mode utilized for the region of highest visual interest.

BBB. The computer-readable storage medium as paragraph AAA recites, wherein the different coding mode is an inter-coded motion prediction mode simplified by residual information between the output frame and the reference frame not being computed.
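Paragraphs J, BB, and TT above recite determining the key region of each reference frame from the relative resolutions of the reference frames, in order, and the relative visual interest of the regions, in order. One way to read this is as a pairing of two sorted lists, sketched below. The function `assign_key_regions`, the frame and region labels, the interest scores, and in particular the choice to pair the highest-resolution reference frame with the region of highest interest are assumptions made for illustration only.

```python
from typing import Dict, Tuple


def assign_key_regions(
    reference_resolutions: Dict[str, Tuple[int, int]],
    region_interest: Dict[str, float],
) -> Dict[str, str]:
    """Pair reference frames, ordered by resolution, with regions, ordered by interest."""
    refs = sorted(
        reference_resolutions,
        key=lambda r: reference_resolutions[r][0] * reference_resolutions[r][1],
        reverse=True,
    )
    regions = sorted(region_interest, key=region_interest.get, reverse=True)
    # Assumption: the highest resolution is paired with the highest interest.
    # zip() stops at the shorter list, so surplus frames or regions stay unpaired.
    return dict(zip(refs, regions))


# Hypothetical reference frames (width, height) and region interest scores.
resolutions = {
    "ref_half": (960, 540),
    "ref_quarter": (480, 270),
    "ref_double": (3840, 2160),
}
interest = {"background": 0.1, "face": 0.9, "torso": 0.5}
key_regions = assign_key_regions(resolutions, interest)
print(key_regions)
```

Because each reference frame receives a different key region, the output frame can reference a different key region in each reference frame, consistent with paragraphs I, AA, and SS.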

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.

Claims

1. A method comprising:

generating a grouping of multiple frames having different resolutions;

determining a plurality of regions of different relative visual interest, the plurality of regions applying to each frame of the grouping;

coding a reference frame of the grouping utilizing different coding modes for at least two different regions of the reference frame; and

coding an output frame utilizing different reference frames for at least two different regions of the output frame.

2. The method of claim 1, wherein the grouping comprises at least one reference frame, each reference frame having a different resolution from each other.

3. The method of claim 1, wherein the output frame is coded referencing, for each reference frame of a different resolution, at least one region among a plurality of regions of that reference frame.

4. The method of claim 3, wherein the output frame is coded referencing, for each reference frame of a different resolution, at least a key region of that reference frame different from key regions referenced for each other reference frame.

5. The method of claim 4, wherein each reference frame is coded utilizing a different coding mode for at least one region among the plurality of regions of the reference frame than a coding mode utilized for coding a key region of the reference frame.

6. The method of claim 4, wherein the output frame is coded utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions of the output frame than a coding mode utilized for coding a region of highest visual interest of the output frame.

7. The method of claim 6, wherein the different coding mode is an inter-coded motion prediction mode simplified by residual information between the output frame and the reference frame not being computed.

8. A system comprising:

one or more processors; and

memory communicatively coupled to the one or more processors, the memory storing computer-executable modules executable by the one or more processors that, when executed by the one or more processors, perform associated operations, the computer-executable modules including:

an encoder module further comprising:

a frame generating submodule configured to generate a grouping of multiple frames having different resolutions;

a region determining submodule configured to determine a plurality of regions of different relative visual interest, the plurality of regions applying to each frame of the grouping;

a reference frame coding submodule configured to code a reference frame of the grouping utilizing different coding modes for at least two different regions of the reference frame; and

an output frame coding submodule configured to code an output frame utilizing different coding modes for at least two regions of the output frame.

9. The system of claim 8, wherein the grouping comprises at least one reference frame, each reference frame having a different resolution from each other.

10. The system of claim 8, wherein the output frame coding submodule is configured to code the output frame referencing, for each reference frame of a different resolution, at least one region among a plurality of regions of that reference frame.

11. The system of claim 10, wherein the output frame coding submodule is configured to code the output frame referencing, for each reference frame of a different resolution, at least a key region of that reference frame different from key regions referenced for each other reference frame.

12. The system of claim 11, wherein the reference frame coding submodule is configured to code each reference frame utilizing a different coding mode for at least one region of the reference frame than a coding mode utilized for coding a key region of the reference frame.

13. The system of claim 11, wherein the output frame coding submodule is configured to code the output frame utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions of the output frame than a coding mode utilized for coding a region of highest visual interest of the output frame.

14. The system of claim 13, wherein the different coding mode is an inter-coded motion prediction mode simplified by residual information between the output frame and the reference frame not being computed.

15. A computer-readable storage medium storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising:

generating a grouping of multiple frames having different resolutions;

determining a plurality of regions of different relative visual interest, the plurality of regions applying to each frame of the grouping;

coding a reference frame of the grouping utilizing different coding modes for at least two different regions of the reference frame; and

coding an output frame of the grouping utilizing different coding modes for at least two different regions of the output frame.

16. The computer-readable storage medium of claim 15, wherein the output frame is coded referencing, for each reference frame of a different resolution, at least one region among the plurality of regions of that reference frame.

17. The computer-readable storage medium of claim 16, wherein the output frame is coded referencing, for each reference frame of a different resolution, at least a key region of that reference frame different from key regions referenced for each other reference frame.

18. The computer-readable storage medium of claim 17, wherein each reference frame is coded utilizing a different coding mode for at least one region among the plurality of regions of the reference frame than a coding mode utilized for coding a key region of the reference frame.

19. The computer-readable storage medium of claim 17, wherein the output frame is coded utilizing a different coding mode for at least one region of non-highest visual interest among the plurality of regions of the output frame than a coding mode utilized for coding a region of highest visual interest of the output frame.

20. The computer-readable storage medium of claim 19, wherein the different coding mode is an inter-coded motion prediction mode simplified by residual information between the output frame and the reference frame not being computed.