CN117099369A - System and method for template matching for adaptive MVD resolution - Google Patents

System and method for template matching for adaptive MVD resolution

Info

Publication number
CN117099369A
Authority
CN
China
Prior art keywords
mvd
video block
template
video
motion vector
Legal status
Pending
Application number
CN202380010878.0A
Other languages
Chinese (zh)
Inventor
赵亮
赵欣
刘杉
Current Assignee
Tencent America LLC
Original Assignee
Tencent America LLC
Priority claimed from U.S. Application No. 18/121,438 (published as US 2023/0300363 A1)
Application filed by Tencent America LLC filed Critical Tencent America LLC
Publication of CN117099369A


Abstract

Various implementations described herein include methods and systems for coding video. The method includes: receiving a Motion Vector Difference (MVD) of a video block from a video stream; in response to determining that an adaptive MVD resolution mode is signaled for an inter prediction mode, searching for a first template of a predicted video block of the video block, where the first template is neighboring reconstructed/predicted samples of the predicted video block, a second template is neighboring reconstructed/predicted samples of the video block, and the predicted video block is a reconstructed/predicted forward or backward video block of the video block; locating the first template of the predicted video block that is a best match to the second template of the video block; refining a Motion Vector (MV) of the video block based at least on the second template, the located first template, and the MVD; and reconstructing/processing the video block based at least on the refined MV.

Description

System and method for template matching for adaptive MVD resolution
Cross Reference to Related Applications
The present application claims priority to U.S. Provisional Patent Application No. 63/320,488, entitled "Template Matching for Adaptive MVD Resolution", filed on March 16, 2022, and is a continuation of and claims priority to U.S. Patent Application No. 18/121,438, entitled "Systems and Methods for Template Matching for Adaptive MVD Resolution", filed on March 14, 2023, the entire contents of which are incorporated herein by reference.
Technical Field
The disclosed embodiments relate generally to video coding, including but not limited to systems and methods for template matching for adaptive Motion Vector Difference (MVD) resolution.
Background
Digital video is supported by a variety of electronic devices such as digital televisions, laptop or desktop computers, tablet computers, digital video cameras, digital recording devices, digital media players, video game consoles, smart phones, video teleconferencing devices, video streaming devices, and the like. The electronic device sends and receives digital video data over a communication network or otherwise communicates digital video data and/or stores digital video data on a storage device. Because of the limited bandwidth capacity of the communication network and the limited memory resources of the storage device, video encoding may be used to compress video data according to one or more video encoding standards prior to transmitting or storing the video data.
A variety of video coding standards have been developed. For example, video coding standards include AOMedia Video 1 (AV1), Versatile Video Coding (VVC), Joint Exploration Model (JEM), High Efficiency Video Coding (HEVC/H.265), Advanced Video Coding (AVC/H.264), and Moving Picture Experts Group (MPEG) coding. Video coding typically uses prediction methods (e.g., inter prediction, intra prediction, and the like) that exploit redundancy inherent in video data. Video coding aims to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality.
HEVC (also known as H.265) is a video compression standard designed as part of the MPEG-H project. The HEVC/H.265 standard was published by ITU-T and ISO/IEC in 2013 (version 1), 2014 (version 2), 2015 (version 3), and 2016 (version 4). Versatile Video Coding (VVC, also known as H.266) is a video compression standard intended as a successor to HEVC. The VVC/H.266 standard was published by ITU-T and ISO/IEC in 2020 (version 1) and 2022 (version 2). AV1 is an open video coding format designed as an alternative to HEVC. A validated version 1.0.0 of the specification, with Errata 1, was released on January 8, 2019.
Disclosure of Invention
The present disclosure describes advanced video coding techniques, and more particularly, template matching methods for adaptive MVD resolution.
According to some implementations, a method of video coding is performed by a computing system. The method includes: determining, based on an inter prediction syntax element from a video stream, whether an adaptive Motion Vector Difference (MVD) resolution mode is signaled, the adaptive MVD resolution mode being an inter prediction mode with adaptive MVD pixel resolution; receiving an MVD of a video block from the video stream; in response to determining that the adaptive MVD resolution mode is signaled, searching for a first template of a predicted video block of the video block, where the first template is neighboring reconstructed/predicted samples of the predicted video block, and the predicted video block is a reconstructed/predicted forward or backward video block of the video block; locating the first template of the predicted video block that is a best match to a second template of the video block, the second template being neighboring reconstructed/predicted samples of the video block at the position temporally collocated with the first template; refining a Motion Vector (MV) of the video block based at least on the second template, the located first template, and the MVD; and reconstructing/processing the video block based at least on the refined MV.
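For illustration only, the following is a minimal sketch of the template matching refinement described above, assuming integer-pel motion, a sum-of-absolute-differences (SAD) matching cost, a 4-sample template thickness, NumPy frames indexed as [row, column], and a block that is not at a frame border; the function names, search range, and template shape are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def refine_mv_with_template_matching(cur_frame, ref_frame, block_xy, block_wh,
                                     base_mv, mvd, search_range=2):
    """Refine (base MV + MVD) by matching the current block's template
    (the second template) against candidate templates in the reference
    frame (candidate first templates)."""
    x, y = block_xy          # top-left corner of the current block
    w, h = block_wh
    t = 4                    # template thickness in samples (assumption)

    def template(frame, px, py):
        # Neighboring reconstructed samples above and to the left of a block.
        top = frame[py - t:py, px:px + w]
        left = frame[py:py + h, px - t:px]
        return np.concatenate([top.ravel(), left.ravel()]).astype(np.int64)

    second_template = template(cur_frame, x, y)
    start_mv = (base_mv[0] + mvd[0], base_mv[1] + mvd[1])

    best_mv, best_cost = start_mv, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cand = (start_mv[0] + dx, start_mv[1] + dy)
            first_template = template(ref_frame, x + cand[0], y + cand[1])
            cost = int(np.abs(first_template - second_template).sum())  # SAD
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, cand
    return best_mv  # refined MV used to reconstruct/process the video block
```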
According to some implementations, a computing system, such as a streaming system, a server system, a personal computer system, or other electronic device, is provided. The computing system includes control circuitry and memory storing one or more instruction sets. The one or more sets of instructions include instructions for performing any of the methods described herein. In some implementations, the computing system includes an encoder component and/or a decoder component.
According to some embodiments, a non-transitory computer readable storage medium is provided. The non-transitory computer readable storage medium stores one or more sets of instructions for execution by the computing system. The one or more sets of instructions include instructions for performing any of the methods described herein.
Accordingly, methods, apparatus, and systems for coding video are disclosed. Such methods, apparatus, and systems may supplement or replace conventional methods, apparatus, and systems for coding video.
The features and advantages described in the specification are not necessarily all inclusive and, in particular, some additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims provided in this disclosure. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the subject matter described herein.
Drawings
So that the disclosure may be understood in more detail, a more particular description may be had by reference to the features of the various embodiments, some of which are illustrated in the accompanying drawings. However, the drawings illustrate only relevant features of the disclosure and therefore are not necessarily to be considered limiting, as the description may allow other useful features to be understood by those of ordinary skill in the art upon reading the disclosure.
Fig. 1 is a block diagram illustrating an example communication system in accordance with some embodiments.
Fig. 2A is a block diagram illustrating example elements of an encoder assembly, according to some embodiments.
Fig. 2B is a block diagram illustrating example elements of a decoder component according to some embodiments.
FIG. 3 is a block diagram illustrating an example server system according to some embodiments.
FIG. 4 is a diagram illustrating example Template Matching (TM) according to some embodiments.
Fig. 5 is an exemplary flowchart illustrating a method of coding video according to some embodiments.
According to common practice, the various features shown in the drawings are not necessarily drawn to scale and like reference numerals may be used to designate like features throughout the specification and drawings.
Detailed Description
Fig. 1 is a block diagram illustrating a communication system 100 according to some embodiments. The communication system 100 includes a source device 102 and a plurality of electronic devices 120 (e.g., electronic device 120-1 through electronic device 120-m) communicatively coupled to each other via one or more networks. In some implementations, the communication system 100 is a streaming system, for example, for use with applications that support video, such as video conferencing applications, digital TV applications, and media storage and/or distribution applications.
The source device 102 includes a video source 104 (e.g., a camera component or media storage) and an encoder component 106. In some implementations, the video source 104 is a digital video camera (e.g., configured to create an uncompressed video sample stream). The encoder component 106 generates one or more encoded video bitstreams from the video stream. The video stream from the video source 104 may have a high data volume compared to the encoded video bitstream 108 generated by the encoder component 106. Because the encoded video bitstream 108 has a lower data volume (less data) than the video stream from the video source, the encoded video bitstream 108 requires less bandwidth to transmit and less storage space to store than the video stream from the video source 104. In some implementations, the source device 102 does not include the encoder component 106 (e.g., and is configured to transmit uncompressed video data to the network 110).
One or more networks 110 represent any number of networks that convey information between the source device 102, the server system 112, and/or the electronic devices 120, including, for example, wired (connected) and/or wireless communication networks. The one or more networks 110 may exchange data over circuit-switched and/or packet-switched channels. Representative networks include telecommunication networks, local area networks, wide area networks, and/or the Internet.
One or more networks 110 include a server system 112 (e.g., a distributed/cloud computing system). In some implementations, the server system 112 is or includes a streaming server (e.g., configured to store and/or distribute video content such as an encoded video stream from the source device 102). The server system 112 includes a decoder component 114 (e.g., configured to encode and/or decode video data). In some implementations, the decoder component 114 includes an encoder component and/or a decoder component. In various embodiments, decoder element 114 is instantiated as hardware, software, or a combination thereof. In some implementations, the decoder component 114 is configured to decode the encoded video bitstream 108 and re-encode the video data using different encoding standards and/or methods to generate encoded video data 116. In some implementations, the server system 112 is configured to generate a plurality of video formats and/or encodings from the encoded video bitstream 108.
In some implementations, the server system 112 functions as a Media-Aware Network Element (MANE). For example, the server system 112 may be configured to prune the encoded video bitstream 108 to tailor a potentially different bitstream for one or more of the electronic devices 120. In some embodiments, a MANE is provided separately from the server system 112.
The electronic device 120-1 includes a decoder component 122 and a display 124. In some implementations, the decoder component 122 is configured to decode the encoded video data 116 to generate an outgoing video stream that may be presented on a display or other type of presentation device. In some implementations, one or more of the electronic devices 120 do not include a display component (e.g., communicatively coupled to an external display device and/or include media memory). In some implementations, the electronic device 120 is a streaming client. In some implementations, the electronic device 120 is configured to access the server system 112 to obtain the encoded video data 116.
The source device and/or the plurality of electronic devices 120 are sometimes referred to as "terminal devices" or "user devices". In some implementations, one or more of the source devices 102 and/or the electronic devices 120 are examples of server systems, personal computers, portable devices (e.g., smart phones, tablets, or laptops), wearable devices, video conferencing devices, and/or other types of electronic devices.
In an example operation of the communication system 100, the source device 102 transmits the encoded video bitstream 108 to the server system 112. For example, the source device 102 may encode a stream of pictures captured by the source device. The server system 112 receives the encoded video bitstream 108 and may decode and/or encode the encoded video bitstream 108 using the decoder component 114. For example, the server system 112 may apply to the video data an encoding that is better suited for network transmission and/or storage. The server system 112 can transmit the encoded video data 116 (e.g., one or more encoded video bitstreams) to one or more of the electronic devices 120. Each electronic device 120 may decode the encoded video data 116 to recover, and optionally display, the video pictures.
In some embodiments, the transmission discussed above is a unidirectional data transmission. Unidirectional data transmission is sometimes used for media service applications and the like. In some embodiments, the transmission discussed above is a bi-directional data transmission. Bi-directional data transmission is sometimes used for video conferencing applications, etc. In some implementations, the encoded video bitstream 108 and/or the encoded video data 116 are encoded and/or decoded according to any of the video encoding/compression standards described herein, such as HEVC, VVC, and/or AV 1.
Fig. 2A is a block diagram illustrating example elements of the encoder component 106, in accordance with some embodiments. The encoder component 106 receives a source video sequence from the video source 104. In some implementations, the encoder component includes a receiver (e.g., a transceiver) component configured to receive the source video sequence. In some implementations, the encoder component 106 receives the video sequence from a remote video source (e.g., a video source that is a component of a different device than the encoder component 106). The video source 104 may provide the source video sequence in the form of a digital video sample stream that can have any suitable bit depth (e.g., 8-bit, 10-bit, or 12-bit), any color space (e.g., BT.601 Y CrCb or RGB), and any suitable sampling structure (e.g., Y CrCb 4:2:0 or Y CrCb 4:4:4). In some implementations, the video source 104 is a storage device that stores previously captured/prepared video. In some implementations, the video source 104 is a camera that captures local image information as a video sequence. Video data may be provided as a plurality of individual pictures that impart motion when viewed in sequence. The pictures themselves may be organized as a spatial array of pixels, where each pixel can include one or more samples depending on the sampling structure, color space, and the like in use. A person of ordinary skill in the art can readily understand the relationship between pixels and samples. The description below focuses on samples.
The encoder component 106 is configured to encode and/or compress pictures of the source video sequence into an encoded video sequence 216 in real-time or under other temporal constraints required by the application. Implementing the appropriate encoding speed is a function of the controller 204. In some implementations, the controller 204 controls and is functionally coupled to other functional units as described below. The parameters set by the controller 204 may include rate control related parameters (e.g., lambda values for picture skipping, quantizer and/or rate distortion optimization techniques), picture size, group of pictures (GOP) layout, maximum motion vector search range, etc. Other functions of the controller 204 may be readily identified by those of ordinary skill in the art as they may involve the encoder assembly 106 being optimized for a particular system design.
In some implementations, the encoder component 106 is configured to operate in a coding loop. In a simplified example, the coding loop includes the source encoder 202 (e.g., responsible for creating symbols, such as a symbol stream, based on an input picture to be encoded and one or more reference pictures) and a (local) decoder 210. The decoder 210 reconstructs the symbols to create sample data in a manner similar to a (remote) decoder (when compression between the symbols and the encoded video bitstream is lossless). The reconstructed sample stream (sample data) is input to the reference picture memory 208. Because decoding of the symbol stream produces bit-exact results independent of decoder location (local or remote), the content in the reference picture memory 208 is also bit-exact between the local encoder and the remote encoder. In this way, the prediction part of the encoder "sees" as reference picture samples exactly the same sample values that a decoder would "see" when using prediction during decoding. This principle of reference picture synchronicity (and the drift that results if synchronicity cannot be maintained, e.g., because of channel errors) is known to a person of ordinary skill in the art.
The operation of the decoder 210 can be the same as that of a remote decoder, such as the decoder component 122, which is described in detail below in conjunction with Fig. 2B. Briefly referring to Fig. 2B, however, because symbols are available and the encoding of symbols into an encoded video sequence by the entropy encoder 214 and their decoding by the parser 254 can be lossless, the entropy decoding parts of the decoder component 122, including the buffer memory 252 and the parser 254, may not be fully implemented in the local decoder 210.
An observation that can be made at this point is that any decoder technology, other than the parsing/entropy decoding present in a decoder, must also be present, in substantially identical functional form, in the corresponding encoder. For this reason, the disclosed subject matter focuses on decoder operation. The description of encoder technologies can be abbreviated because they are the inverse of the comprehensively described decoder technologies. A more detailed description is provided below only in certain areas.
As part of its operation, the source encoder 202 may perform motion compensated predictive encoding of an input frame with reference to one or more previously encoded frames from a video sequence designated as reference frames. In this way, the encoding engine 212 encodes differences between blocks of pixels of the input frame and blocks of pixels of a reference frame that may be selected as a prediction reference for the input frame. The controller 204 may manage the encoding operations of the source encoder 202, including, for example, the setting of parameters and sub-group parameters for encoding video data.
Decoder 210 decodes encoded video data for frames that may be designated as reference frames based on the symbols created by source encoder 202. The operation of the encoding engine 212 may advantageously be a lossy process. When the encoded video data is decoded at a video decoder (not shown in fig. 2A), the reconstructed video sequence may be a copy of the source video sequence with some errors. The decoder 210 replicates the decoding process that may be performed on the reference frames by a remote video decoder and may cause the reconstructed reference frames to be stored in the reference picture memory 208. In this way, the encoder component 106 locally stores copies of the reconstructed reference frames that have common content (no transmission errors) with the reconstructed reference frames to be obtained by the remote video decoder.
The predictor 206 may perform prediction searches for the encoding engine 212. That is, for a new frame to be encoded, the predictor 206 may search the reference picture memory 208 for sample data (as candidate reference pixel blocks) or certain metadata, such as reference picture motion vectors and block shapes, that may serve as an appropriate prediction reference for the new picture. The predictor 206 may operate on a sample block-by-pixel block basis to find appropriate prediction references. In some cases, as determined by search results obtained by the predictor 206, an input picture may have prediction references drawn from multiple reference pictures stored in the reference picture memory 208.
The outputs of all the above mentioned functional units may be subjected to entropy encoding in an entropy encoder 214. The entropy encoder 214 converts the symbols generated by the various functional units into an encoded video sequence by losslessly compressing the symbols according to techniques known to those of ordinary skill in the art (e.g., huffman coding, variable length coding, and/or arithmetic coding).
In some implementations, the output of the entropy encoder 214 is coupled to a transmitter. The transmitter may be configured to buffer the encoded video sequence(s) created by the entropy encoder 214 in preparation for transmission via the communication channel 218, which may be a hardware/software link to a storage device that stores the encoded video data. The transmitter may be configured to merge the encoded video data from the source encoder 202 with other data to be transmitted, for example, encoded audio data and/or an auxiliary data stream (sources not shown). In some implementations, the transmitter may transmit additional data along with the encoded video. The source encoder 202 may include such data as part of the encoded video sequence. The additional data may include temporal/spatial/SNR enhancement layers, other forms of redundant data such as redundant pictures and slices, Supplemental Enhancement Information (SEI) messages, Video Usability Information (VUI) parameter set fragments, and the like.
The controller 204 may manage the operation of the encoder component 106. During encoding, the controller 204 may assign to each encoded picture a particular encoded picture type, which may affect the encoding techniques applied to the respective picture. For example, a picture may be assigned as an intra picture (I picture), a predictive picture (P picture), or a bi-directionally predictive picture (B picture). An intra picture may be encoded and decoded without using any other frame in the sequence as a source of prediction. Some video codecs allow different types of intra pictures, including, for example, Independent Decoder Refresh (IDR) pictures. A person of ordinary skill in the art is aware of those variants of I pictures and their respective applications and features, so they are not repeated here. A predictive picture may be encoded and decoded using intra prediction or inter prediction that uses at most one motion vector and reference index to predict the sample values of each block. A bi-directionally predictive picture may be encoded and decoded using intra prediction or inter prediction that uses at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block.
A source picture may typically be spatially subdivided into a plurality of sample blocks (e.g., blocks of 4×4, 8×8, 4×8, or 16×16 samples) and encoded on a block-by-block basis. Blocks may be encoded predictively with reference to other (already encoded) blocks, as determined by the coding assignment applied to the blocks' respective pictures. For example, blocks of an I picture may be encoded non-predictively, or they may be encoded predictively (spatial prediction or intra prediction) with reference to already encoded blocks of the same picture. Pixel blocks of a P picture may be encoded non-predictively, via spatial prediction, or via temporal prediction with reference to one previously encoded reference picture. Blocks of a B picture may be encoded non-predictively, via spatial prediction, or via temporal prediction with reference to one or two previously encoded reference pictures.
Video may be captured as multiple source pictures (video pictures) in a time sequence. Intra picture prediction (often referred to simply as intra prediction) exploits spatial correlation in a given picture, while inter picture prediction exploits (temporal or other) correlation between pictures. In an example, a particular picture in encoding/decoding (which is referred to as a current picture) is partitioned into blocks. When a block in the current picture is similar to a reference block in a previously encoded and still buffered reference picture in video, the block in the current picture may be encoded by a vector called a motion vector. The motion vector points to a reference block in the reference picture and, in the case of using multiple reference pictures, may have a third dimension that identifies the reference picture.
The encoder component 106 can perform encoding operations in accordance with predetermined video encoding techniques or standards, such as any of the techniques or standards described herein. In operation of the encoder component 106, the encoder component 106 can perform various compression operations, including predictive encoding operations that exploit temporal redundancy and spatial redundancy in an input video sequence. Thus, the encoded video data may conform to a syntax specified by the video encoding technique or standard used.
Fig. 2B is a block diagram illustrating example elements of decoder component 122 according to some embodiments. Decoder element 122 in fig. 2B is coupled to channel 218 and display 124. In some implementations, the decoder component 122 includes a transmitter coupled to the loop filter 256 and configured to transmit data (e.g., via a wired or wireless connection) to the display 124.
In some implementations, the decoder component 122 includes a receiver coupled to the channel 218 and configured to receive data from the channel 218 (e.g., via a wired or wireless connection). The receiver may be configured to receive one or more encoded video sequences to be decoded by decoder component 122. In some implementations, the decoding of each encoded video sequence is independent of other encoded video sequences. Each encoded video sequence may be received from a channel 218, which channel 218 may be a hardware/software link to a storage device storing encoded video data. The receiver may receive encoded video data and other data, such as encoded audio data and/or auxiliary data streams, which may be forwarded to their respective use entities (not depicted). The receiver may separate the encoded video sequence from other data. In some implementations, the receiver receives additional (redundant) data along with the encoded video. The additional data may be included as part of the encoded video sequence. The additional data may be used by the decoder component 122 to decode the data and/or more accurately reconstruct the original video data. The additional data may be in the form of, for example, temporal, spatial or SNR enhancement layers, redundant slices, redundant pictures, forward error correction codes, and the like.
According to some implementations, decoder component 122 includes buffer memory 252, parser 254 (sometimes also referred to as an entropy decoder), scaler/inverse transform unit 258, intra picture prediction unit 262, motion compensated prediction unit 260, aggregator 268, loop filter unit 256, reference picture memory 266, and current picture memory 264. In some implementations, decoder element 122 is implemented as an integrated circuit, a series of integrated circuits, and/or other electronic circuitry. In some implementations, the decoder component 122 is implemented at least in part in software.
Buffer memory 252 is coupled between channel 218 and parser 254 (e.g., to prevent network jitter). In some embodiments, buffer memory 252 is separate from decoder element 122. In some embodiments, a separate buffer memory is provided between the output of channel 218 and decoder element 122. In some implementations, a separate buffer memory (e.g., for preventing network jitter) is provided external to the decoder element 122 in addition to the buffer memory 252 (e.g., configured to handle playback timing) internal to the decoder element 122. The buffer memory 252 may not be needed or the buffer memory 252 may be small when receiving data from a store/forward device with sufficient bandwidth and controllability or from an isochronous network. For use over best effort packet networks such as the internet, the buffer memory 252 may be required, or the buffer memory 252 may be relatively large and may advantageously be of an adaptive size and may be implemented at least in part in an operating system or similar element (not depicted) external to the decoder component 122.
The parser 254 is configured to reconstruct symbols 270 from the encoded video sequence. The symbols may include, for example, information for managing the operation of the decoder component 122 and/or information for controlling a rendering device such as the display 124. The control information for the rendering device(s) may be in the form of, for example, Supplemental Enhancement Information (SEI) messages or Video Usability Information (VUI) parameter set fragments (not depicted). The parser 254 parses (entropy-decodes) the encoded video sequence. The encoding of the encoded video sequence can be in accordance with a video coding technology or standard, and can follow principles well known to a person skilled in the art, including variable-length coding, Huffman coding, arithmetic coding with or without context sensitivity, and so forth. The parser 254 may extract from the encoded video sequence a set of subgroup parameters for at least one of the subgroups of pixels in the video decoder, based on at least one parameter corresponding to the group. Subgroups can include Groups of Pictures (GOPs), pictures, tiles, slices, macroblocks, Coding Units (CUs), blocks, Transform Units (TUs), Prediction Units (PUs), and so forth. The parser 254 may also extract from the encoded video sequence information such as transform coefficients, quantizer parameter values, motion vectors, and so forth.
The reconstruction of symbol 270 may involve a number of different units depending on the type of encoded video picture or portion thereof (e.g., inter and intra pictures, inter and intra blocks), and other factors. Which units are involved and how these units are involved may be controlled by the subgroup control information parsed from the encoded video sequence by parser 254. For clarity, the flow of such subgroup control information between the parser 254 and the underlying plurality of units is not depicted.
In addition to the functional blocks already mentioned, the decoder section 122 can be conceptually subdivided into a plurality of functional units as described below. In actual implementation operations under commercial constraints, many of these units interact tightly with each other and may be at least partially integrated with each other. However, for the purposes of describing the disclosed subject matter, the following conceptual subdivision of functional units is maintained.
The scaler/inverse transform unit 258 receives quantized transform coefficients as symbols 270 from the parser 254, along with control information (e.g., which transform to use, block size, quantization factor, and/or quantization scaling matrices). The scaler/inverse transform unit 258 can output blocks comprising sample values that can be input into the aggregator 268.
In some cases, the output samples of the scaler/inverse transform unit 258 pertain to an intra-coded block; that is, a block that does not use predictive information from previously reconstructed pictures but can use predictive information from previously reconstructed parts of the current picture. Such predictive information can be provided by the intra picture prediction unit 262. The intra picture prediction unit 262 may generate a block of the same size and shape as the block under reconstruction, using surrounding already-reconstructed information fetched from the current (partially reconstructed) picture in the current picture memory 264. The aggregator 268 may add, on a per-sample basis, the prediction information that the intra picture prediction unit 262 has generated to the output sample information provided by the scaler/inverse transform unit 258.
In other cases, the output samples of the scaler/inverse transform unit 258 belong to inter-coded and possibly motion compensated blocks. In this case, the motion compensated prediction unit 260 may access the reference picture memory 266 to obtain samples for prediction. After motion compensation of the acquired samples according to the symbols 270 belonging to the block, these samples may be added by an aggregator 268 to the output of the scaler/inverse transform unit 258 (in this case referred to as residual samples or residual signals) to generate output sample information. The address in the reference picture memory 266 from which the motion compensated prediction unit 260 obtains the prediction samples may be controlled by a motion vector. The motion vectors may be available to the motion compensated prediction unit 260 in the form of symbols 270, which may have, for example, X, Y and reference picture components. The motion compensation may also include interpolation of sample values obtained from the reference picture memory 266 when sub-sample accurate motion vectors are used, motion vector prediction mechanisms, and the like.
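As an illustration of this aggregation step, the following minimal sketch assumes integer-pel motion vectors and NumPy pictures (a real decoder interpolates sub-sample positions); the names are illustrative:

```python
import numpy as np

def reconstruct_inter_block(ref_picture, residual, block_xy, mv):
    """Fetch the prediction addressed by the motion vector from the
    reference picture, then add the residual to produce output samples."""
    x, y = block_xy
    h, w = residual.shape
    mvx, mvy = mv  # integer-pel for simplicity
    prediction = ref_picture[y + mvy:y + mvy + h, x + mvx:x + mvx + w]
    return prediction + residual
```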
The output samples of the aggregator 268 may be subjected to various loop filtering techniques in the loop filter unit 256. The video compression techniques may include in-loop filtering techniques controlled by parameters included in the encoded video bitstream and available to the loop filter unit 256 as symbols 270 from the parser 254, but may also be responsive to meta-information obtained during decoding of previous (in decoding order) portions of the encoded picture or encoded video sequence, and to previously reconstructed and loop filtered sample values.
The output of loop filter unit 256 may be a sample stream that may be output to a rendering device such as display 124 and stored in reference picture memory 266 for use in future inter picture prediction.
Some encoded pictures, once fully reconstructed, may be used as reference pictures for future prediction. Once the encoded picture has been fully reconstructed and has been identified as a reference picture (e.g., by the parser 254), the current reference picture may become part of the reference picture memory 266 and a new current picture memory may be reallocated before the reconstruction of the subsequent encoded picture begins.
The decoder component 122 may perform decoding operations according to a predetermined video compression technology, which may be documented in a standard such as any of the standards described herein. The encoded video sequence may conform to the syntax specified by the video compression technology or standard being used, in the sense that the encoded video sequence adheres to the syntax of the video compression technology or standard, as specified in the video compression technology document or standard, and specifically in the profiles document therein. Also, for conformance with some video compression technologies or standards, the complexity of the encoded video sequence may be kept within bounds defined by the level of the video compression technology or standard. In some cases, levels restrict the maximum picture size, maximum frame rate, maximum reconstructed sample rate (measured in, for example, megasamples per second), maximum reference picture size, and so on. In some cases, the limits set by levels can be further restricted through Hypothetical Reference Decoder (HRD) specifications and metadata for HRD buffer management signaled in the encoded video sequence.
Fig. 3 is a block diagram illustrating a server system 112 according to some embodiments. The server system 112 includes control circuitry 302, one or more network interfaces 304, memory 314, a user interface 306, and one or more communication buses 312 for interconnecting these components. In some implementations, the control circuit 302 includes one or more processors (e.g., a CPU, GPU, and/or DPU). In some implementations, the control circuitry includes one or more Field Programmable Gate Arrays (FPGAs), hardware accelerators, and/or one or more integrated circuits (e.g., application specific integrated circuits).
The network interface 304 may be configured to interface with one or more communication networks (e.g., wireless, wired, and/or optical networks). The communication network may be a local area network, wide area network, metropolitan area network, in-vehicle and industrial network, real-time network, delay tolerant network, and the like. Examples of communication networks include local area networks such as ethernet, wireless LAN, cellular networks (including GSM, 3G, 4G, 5G, LTE, etc.), television cable or wireless wide area digital networks (including cable television, satellite television, and terrestrial broadcast television), vehicular and industrial (including CANBus), and the like. Such communications may be uni-directional receive-only (e.g., broadcast television), uni-directional transmit-only (e.g., CANBus to some CANBus devices), or bi-directional (e.g., to other computer systems using a local or wide area digital network). Such communications may include communications to one or more cloud computing networks.
The user interface 306 includes one or more output devices 308 and/or one or more input devices 310. The input device 310 may include one or more of the following: a keyboard, a mouse, a touch pad, a touch screen, a data glove, a joystick, a microphone, a scanner, a camera device, and the like. The output devices 308 may include one or more of the following: an audio output device (e.g., a speaker), a visual output device (e.g., a display), etc.
Memory 314 may include high-speed random access memory (e.g., DRAM, SRAM, DDR RAM, and/or other random access solid state memory devices) and/or non-volatile memory (e.g., one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, and/or other non-volatile solid state memory devices). Memory 314 optionally includes one or more storage devices remotely located from control circuit 302. The memory 314 or alternatively a non-volatile solid state memory device within the memory 314 includes a non-transitory computer readable storage medium. In some implementations, the memory 314 or a non-transitory computer readable storage medium of the memory 314 stores the following programs, modules, instructions, and data structures, or a subset or superset thereof:
an operating system 316 that includes procedures for handling various basic system services and for performing hardware-dependent tasks;
a network communication module 318 for connecting the server system 112 to other computing devices via one or more network interfaces 304 (e.g., via wired and/or wireless connections);
a coding module 320 for performing various functions with respect to encoding and/or decoding data, such as video data. In some implementations, the coding module 320 is an example of the decoder component 114. The coding module 320 includes, but is not limited to, one or more of the following:
a decoding module 322 for performing various functions related to decoding encoded data, such as those previously described with respect to the decoder component 122; and
an encoding module 340 for performing various functions related to encoding data, such as those previously described with respect to the encoder component 106; and
a picture memory 352 for storing pictures and picture data, e.g., for use with coding module 320. In some implementations, the picture memory 352 includes one or more of the following: reference picture memory 208, buffer memory 252, current picture memory 264, and reference picture memory 266.
In some implementations, the decoding module 322 includes a parsing module 324 (e.g., configured to perform the various functions previously described with respect to the parser 254), a transform module 326 (e.g., configured to perform the various functions previously described with respect to the scaler/inverse transform unit 258), a prediction module 328 (e.g., configured to perform the various functions previously described with respect to the motion compensated prediction unit 260 and/or the intra picture prediction unit 262), and a filter module 330 (e.g., configured to perform the various functions previously described with respect to the loop filter unit 256).
In some implementations, the encoding module 340 includes a code module 342 (e.g., configured to perform various functions previously described with respect to the source encoder 202, and/or the encoding engine 212) and a prediction module 344 (e.g., configured to perform various functions previously described with respect to the predictor 206). In some implementations, the decoding module 322 and/or the encoding module 340 include a subset of the modules shown in fig. 3. For example, both the decoding module 322 and the encoding module 340 use a shared prediction module.
Each of the above identified modules stored in the memory 314 corresponds to a set of instructions for performing a function described herein. The above identified modules (e.g., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise rearranged in various implementations. For example, the coding module 320 optionally does not include separate decoding and encoding modules, but rather uses the same set of modules for performing both sets of functions. In some implementations, the memory 314 stores a subset of the modules and data structures identified above. In some implementations, the memory 314 stores additional modules and data structures not described above, such as an audio processing module.
In some implementations, the server system 112 includes web or Hypertext Transfer Protocol (HTTP) servers, File Transfer Protocol (FTP) servers, as well as web pages and applications implemented using Common Gateway Interface (CGI) scripts, PHP Hypertext Preprocessor (PHP), Active Server Pages (ASP), Hyper Text Markup Language (HTML), Extensible Markup Language (XML), Java, JavaScript, Asynchronous JavaScript and XML (AJAX), XHP, Javelin, Wireless Universal Resource File (WURFL), and the like.
Although fig. 3 illustrates a server system 112 according to some embodiments, fig. 3 is intended more as a functional description of various features that may be present in one or more server systems rather than a structural schematic of the embodiments described herein. In practice, and as recognized by one of ordinary skill in the art, items shown separately may be combined, while some items may be separated. For example, some of the items shown separately in fig. 3 may be implemented on a single server, and a single item may be implemented by one or more servers. The actual number of servers used to implement server system 112 and how features are allocated among them will vary from implementation to implementation and, optionally, depend in part on the amount of data traffic that the server system processes during peak usage periods and during average usage periods.
In some implementations, a prediction block (PB) obtained from any of the partitioning schemes (or a coding block (CB), which is also referred to as a PB when it is not further partitioned into prediction blocks) may become an individual block for coding via intra prediction or inter prediction. For inter prediction of a current PB, a residual between the current block and a prediction block is generated, encoded, and included in the encoded bitstream.
In some implementations, inter prediction may be implemented, for example, in a single-reference mode or a compound-reference mode. In some implementations, a skip flag may first be included in the bitstream for a current block (or at a higher level) to indicate whether the current block is inter-coded and is not to be skipped. If the current block is inter-coded, another flag may further be included in the bitstream as a signal to indicate whether the single-reference mode or the compound-reference mode is used for the prediction of the current block. For the single-reference mode, one reference block may be used to generate the prediction block for the current block. For the compound-reference mode, two or more reference blocks may be used to generate the prediction block by, for example, weighted average. The compound-reference mode may be referred to as a more-than-one-reference mode, a two-reference mode, or a multiple-reference mode. The reference block or blocks may be identified using reference frame index or indices, and additionally using corresponding motion vector or motion vectors that indicate shift(s) in location (e.g., in horizontal and vertical pixels) between the reference block(s) and the current block. For example, in the single-reference mode, the inter prediction block for the current block may be generated from a single reference block identified by one motion vector in a reference frame, whereas in the compound-reference mode, the prediction block may be generated by a weighted average of two reference blocks in two reference frames indicated by two reference frame indices and two corresponding motion vectors. The motion vector(s) may be coded and included in the bitstream in various manners.
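As a short sketch of the single-reference and compound-reference modes described above, the following assumes integer-pel motion vectors, NumPy frames, and an equal-weight average for the compound case; the names and weighting are illustrative assumptions:

```python
import numpy as np

def inter_predict_block(ref_frames, block_xy, block_wh, refs):
    """refs: list of (reference frame index, (mvx, mvy)) pairs.
    One pair -> single-reference mode; two or more -> compound mode,
    combined here by a simple equal-weight average."""
    x, y = block_xy
    w, h = block_wh
    preds = []
    for ref_idx, (mvx, mvy) in refs:
        block = ref_frames[ref_idx][y + mvy:y + mvy + h, x + mvx:x + mvx + w]
        preds.append(block.astype(np.float64))
    return np.round(sum(preds) / len(preds))
```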
In some implementations, an encoding or decoding system may maintain a decoded picture buffer (DPB). Some frames/pictures may be maintained in the DPB waiting to be displayed (in a decoding system), and some frames/pictures in the DPB may be used as reference frames to enable inter prediction (in a decoding system or an encoding system). In some implementations, the reference frames in the DPB may be marked as either short-term references or long-term references for the current frame being encoded or decoded. For example, short-term reference frames may include frames that are used for inter prediction for blocks in the current frame or in a predefined number (e.g., 2) of closest subsequent video frames of the current frame in decoding order. Long-term reference frames may include frames in the DPB that may be used to predict image blocks in frames that are more than the predefined number of frames away from the current frame in decoding order. Information about such tags for short-term and long-term reference frames may be referred to as a Reference Picture Set (RPS), and may be added to a header of each frame in the encoded bitstream. Each frame in the encoded video stream may be identified by a Picture Order Counter (POC), which is numbered according to the playback sequence in an absolute manner or relative to a picture group starting from, for example, an I-frame.
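For illustration, a toy model of this DPB bookkeeping follows, with a hypothetical decode-order window standing in for the predefined number of frames; the field and function names are assumptions:

```python
from dataclasses import dataclass

SHORT_TERM_WINDOW = 2  # predefined number of nearest frames (assumption)

@dataclass
class DpbFrame:
    poc: int           # Picture Order Counter (playback order)
    decode_order: int  # position in decoding order

def mark_references(dpb, current_decode_order):
    """Mark each DPB frame as a short-term or long-term reference for the
    current frame; the resulting map plays the role of the RPS info."""
    return {
        f.poc: ("short-term"
                if abs(current_decode_order - f.decode_order) <= SHORT_TERM_WINDOW
                else "long-term")
        for f in dpb
    }
```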
In some example implementations, one or more reference picture lists including identifications of short-term reference frames and long-term reference frames for inter-prediction may be formed based on information in the RPS. For example, a single picture reference list may be formed for unidirectional inter prediction, the single picture reference list being denoted as L0 reference (or reference list 0), and two picture reference lists may be formed for bidirectional inter prediction, the two picture reference lists being denoted as L0 (or reference list 0) and L1 (or reference list 1) for each of the two prediction directions. The reference frames included in the L0 list and the L1 list may be ordered in various predetermined ways. The length of the L0 list and the length of the L1 list may be signaled in the video bitstream. Unidirectional inter prediction may be performed in a single reference mode or in a composite reference mode when multiple references for generating a prediction block by weighted average in the composite prediction mode are on the same side of a block to be predicted. The bi-directional inter prediction may be a compound mode only, because the bi-directional inter prediction involves at least two reference blocks.
In some implementations, a merge mode (MM) for inter prediction may be implemented. Generally, for the merge mode, the motion vector in single-reference prediction or one or more of the motion vectors in compound-reference prediction for the current PB may be derived from one or more other motion vectors, rather than being computed and signaled independently. For example, in an encoding system, the current motion vector(s) for the current PB may be represented by difference(s) between the current motion vector(s) and one or more other already-encoded motion vectors (referred to as reference motion vectors). Such difference(s) in motion vector(s), rather than the entirety of the current motion vector(s), may be encoded and included in the bitstream, and may be linked to the reference motion vector(s). Correspondingly, in a decoding system, the motion vector(s) corresponding to the current PB may be derived based on the decoded motion vector difference(s) and the decoded reference motion vector(s) linked therewith. As a specific form of the general merge mode (MM) inter prediction, such inter prediction based on motion vector difference(s) may be referred to as Merge Mode with Motion Vector Difference (MMVD). MM in general, or MMVD in particular, may thus be implemented to leverage correlations between motion vectors associated with different PBs to improve coding efficiency. For example, neighboring PBs may have similar motion vectors, so the MVD may be small and can be efficiently coded. As another example, motion vectors of similarly located/positioned blocks in space may correlate temporally (between frames).
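On the decoder side, the MMVD relationship above reduces to adding the decoded difference back onto the linked reference motion vector, as in this illustrative sketch:

```python
def derive_mv_from_mmvd(reference_mv, mvd):
    """Decoder-side MMVD: current MV = reference MV + decoded MVD."""
    return (reference_mv[0] + mvd[0], reference_mv[1] + mvd[1])
```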
In some example implementations, an MM flag may be included in the bitstream during the encoding process for indicating whether the current PB is in merge mode. Additionally or alternatively, an MMVD flag may be included during the encoding process and signaled in the bitstream to indicate whether the current PB is in MMVD mode. MM and/or MMVD flags or indicators may be set at a PB level, a Coding Block (CB) level, a Coding Unit (CU) level, a Coding Tree Block (CTB) level, a Coding Tree Unit (CTU) level, a slice level, a picture level, etc. For a particular example, both the MM flag and the MMVD flag may be included for the current CU, and the MMVD flag may be signaled immediately after the skip flag and the MM flag to specify whether the MMVD mode is used for the current CU.
In some example implementations of MMVD, a list of Reference Motion Vector (RMV) or MV predictor candidates for motion vector prediction may be formed for a block being predicted. The list of RMV candidates may contain a predetermined number (e.g., 2) of MV predictor candidate blocks whose motion vectors may be used for predicting the current motion vector. The RMV candidate blocks may include blocks selected from spatially neighboring blocks in the same frame and/or temporal blocks (e.g., an identically located block in a frame preceding or following the current frame). These options represent blocks at spatial or temporal locations relative to the current block that are likely to have similar or identical motion vectors to the current block. The size of the list of MV predictor candidates may be predetermined. For example, the list may contain two or more candidates. To be on the list of RMV candidates, a candidate block may be required, for example, to have the same reference frame (or frames) as the current block, to exist (e.g., a boundary check needs to be performed when the current block is near the edge of the frame), and to be already encoded during the encoding process and/or already decoded during the decoding process. In some implementations, the list of merge candidates may be populated first with spatially neighboring blocks (scanned in a particular predefined order) if available and satisfying the above conditions, and then with temporal blocks if space is still available in the list. The neighboring RMV candidate blocks, for example, may be selected from the left and top blocks of the current block. The list of RMV predictor candidates may be dynamically formed at various levels (sequence, picture, frame, slice, superblock, etc.) as a Dynamic Reference List (DRL). The DRL may be signaled in the bitstream.
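A sketch of the candidate-list construction described above follows, assuming each candidate block exposes illustrative .available, .coded, .ref_frame, and .mv attributes (not part of any actual codec API):

```python
def build_rmv_candidate_list(spatial_neighbors, temporal_blocks,
                             current_ref_frame, list_size=2):
    """Fill the RMV/merge candidate list: spatial neighbors first (in a
    predefined scan order), then temporal blocks if space remains."""
    candidates = []
    for block in list(spatial_neighbors) + list(temporal_blocks):
        if len(candidates) == list_size:
            break
        if (block is not None and block.available and block.coded
                and block.ref_frame == current_ref_frame):
            candidates.append(block.mv)
    return candidates  # signaled at some level as the Dynamic Reference List
```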
In some implementations, the actual MV predictor candidate being used as a reference motion vector for predicting the motion vector of the current block may be signaled. In the case that the RMV candidate list contains two candidates, a one-bit flag, referred to as a merge candidate flag, may be used to indicate the selection of the reference merge candidate. For a current block being predicted in compound mode, each of the multiple motion vectors predicted using an MV predictor may be associated with a reference motion vector from the merge candidate list. The encoder may determine which of the RMV candidates more closely predicts the current coded block, and signal the selection as an index into the DRL.
In some example implementations of MMVD, after an RMV candidate is selected and used as the base motion vector predictor (MVP) for a motion vector to be predicted, a Motion Vector Difference (MVD, or ΔMV, which represents the difference between the motion vector to be predicted and the reference candidate motion vector) may be calculated in the encoding system. Such an MVD may include information representing the magnitude of the MV difference and the direction of the MV difference, both of which may be signaled in the bitstream. The motion difference magnitude and the motion difference direction may be signaled in various manners.
In some example implementations of MMVD, a distance index may be used to specify the magnitude information of the motion vector difference and to indicate one of a set of predefined offsets representing the predefined motion vector difference from the starting point (reference motion vector). The MV offset according to the signaled index can then be added to the horizontal component or the vertical component of the starting (reference) motion vector. Whether the horizontal component or the vertical component of the reference motion vector should be offset can be determined by the direction information of the MVD. An example predefined relationship between the distance index and the predefined offset is specified in table 1.
TABLE 1 exemplary relationship of distance index and predefined MV offset
In some example implementations of MMVD, a direction index may be further signaled and used to represent the direction of the MVD relative to the reference motion vector. In some implementations, the direction may be limited to either the horizontal direction or the vertical direction. An example 2-bit direction index is shown in Table 2. In the example of Table 2, the interpretation of the MVD may vary according to the information of the starting/reference MV. For example, when the starting/reference MV corresponds to a uni-directional prediction block, or to a bi-directional prediction block in which the two reference frame lists point to the same side of the current picture (i.e., the POCs of both reference pictures are greater than the POC of the current picture, or both are less than the POC of the current picture), the sign in Table 2 may specify the sign (direction) of the MV offset added to the starting/reference MV. When the starting/reference MV corresponds to a bi-directional prediction block having two reference pictures on different sides of the current picture (i.e., the POC of one reference picture is greater than the POC of the current picture and the POC of the other reference picture is less than the POC of the current picture), and the difference between the reference POC in picture reference list 0 and the current frame is greater than the difference between the reference POC in picture reference list 1 and the current frame, the sign in Table 2 may specify the sign of the MV offset added to the reference MV corresponding to the reference picture in picture reference list 0, while the sign of the offset for the MV corresponding to the reference picture in picture reference list 1 has the opposite value (opposite sign of the offset). Otherwise, if the difference between the reference POC in picture reference list 1 and the current frame is greater than the difference between the reference POC in picture reference list 0 and the current frame, the sign in Table 2 may specify the sign of the MV offset added to the reference MV associated with picture reference list 1, while the sign of the offset for the reference MV associated with picture reference list 0 has the opposite value. A sketch applying the distance and direction indices follows Table 2.
Table 2 - Example implementation of the sign of MV offset specified by direction index

Direction IDX        00    01    10    11
x-axis (horizontal)  +     -     N/A   N/A
y-axis (vertical)    N/A   N/A   +     -
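By way of non-normative illustration, the following Python sketch applies a signaled distance/direction index pair to a starting (reference) MV. The offset values in the list are assumptions for illustration (Table 1 defines the actual predefined offsets); the sign convention follows Table 2.

```python
# Minimal MMVD offset sketch; the offset values are assumed for illustration.
ASSUMED_OFFSETS = [0.25, 0.5, 1, 2, 4, 8, 16, 32]  # indexed by distance index

def apply_mmvd_offset(ref_mv, distance_idx, direction_idx):
    mvx, mvy = ref_mv
    offset = ASSUMED_OFFSETS[distance_idx]
    if direction_idx == 0b00:      # + on the x (horizontal) component
        mvx += offset
    elif direction_idx == 0b01:    # - on the x (horizontal) component
        mvx -= offset
    elif direction_idx == 0b10:    # + on the y (vertical) component
        mvy += offset
    else:                          # 0b11: - on the y (vertical) component
        mvy -= offset
    return (mvx, mvy)
```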
In some example implementations, the MVD may be scaled according to the difference in POC in each direction. If the POC differences for the two lists are the same, no scaling is needed. Otherwise, if the POC difference for reference list 0 is greater than the POC difference for reference list 1, the MVD for reference list 1 is scaled. If the POC difference for reference list 1 is greater than that for reference list 0, the MVD for reference list 0 may be scaled in the same manner. If the starting MV is uni-directionally predicted, the MVD is added to the available or reference MV. A hedged sketch of this scaling follows.
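In the sketch below, the linear scaling by the ratio of POC differences is an assumption, since the text above only states which list's MVD is scaled.

```python
# Sketch of joint-MVD scaling by POC differences (scaling law assumed linear).
def scale_joint_mvd(mvd, poc_diff_l0, poc_diff_l1):
    mvd_l0, mvd_l1 = mvd, mvd
    if abs(poc_diff_l0) == abs(poc_diff_l1):
        return mvd_l0, mvd_l1                      # equal differences: no scaling
    if abs(poc_diff_l0) > abs(poc_diff_l1):
        scale = poc_diff_l1 / poc_diff_l0          # scale the list-1 MVD
        mvd_l1 = (mvd[0] * scale, mvd[1] * scale)
    else:
        scale = poc_diff_l0 / poc_diff_l1          # scale the list-0 MVD
        mvd_l0 = (mvd[0] * scale, mvd[1] * scale)
    return mvd_l0, mvd_l1
```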
In some example implementations of MVD coding and signaling for bi-directional compound prediction, symmetric MVD coding may be implemented in addition to, or instead of, coding and signaling the two MVDs separately, such that only one MVD needs to be signaled and the other MVD may be derived from the signaled one. In such implementations, motion information including the reference picture indices for both list 0 and list 1 is signaled. However, only the MVD associated with, for example, reference list 0 is signaled; the MVD associated with reference list 1 is not signaled but derived. Specifically, at the slice level, a flag called "mvd_l1_zero_flag" may be included in the bitstream to indicate whether the MVD associated with reference list 1 is zero and thus not signaled. If this flag is 1, indicating that the MVD for reference list 1 is zero (and thus not signaled), then a bi-directional prediction flag, referred to as "BiDirPredFlag", may be set to 0, meaning that there is no bi-directional prediction. Otherwise, if mvd_l1_zero_flag is zero, BiDirPredFlag may be set to 1 if the nearest reference picture in list 0 and the nearest reference picture in list 1 form a forward-and-backward or a backward-and-forward pair of reference pictures and both the list 0 and list 1 reference pictures are short-term reference pictures; otherwise, BiDirPredFlag is set to 0. A BiDirPredFlag of 1 may indicate that a symmetric mode flag is additionally signaled in the bitstream: when BiDirPredFlag is 1, the decoder may extract the symmetric mode flag from the bitstream. For example, the symmetric mode flag may be signaled at the CU level (if needed) and may indicate whether a symmetric MVD coding mode is used for the corresponding CU. When the symmetric mode flag is 1, it indicates that the symmetric MVD coding mode is used: only the reference picture indices for both list 0 and list 1 (referred to as "mvp_l0_flag" and "mvp_l1_flag") and the MVD associated with list 0 (referred to as "MVD0") are signaled, while the other motion vector difference, "MVD1", is derived rather than signaled. For example, MVD1 may be given as -MVD0. Thus, in the example symmetric MVD mode, only one MVD is signaled, as sketched below.
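A minimal sketch of the derivation, assuming a bitstream-reader object with a read_mvd() method (the reader interface is an assumption):

```python
# Symmetric MVD mode: only MVD0 is parsed; MVD1 is derived as -MVD0.
def parse_symmetric_mvd(reader):
    mvd0 = reader.read_mvd()        # the only MVD present in the bitstream
    mvd1 = (-mvd0[0], -mvd0[1])     # derived, not signaled
    return mvd0, mvd1
```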
In some other example implementations of MV prediction, a coordinated scheme may be used to implement general merge mode, MMVD, and some other types of MV prediction, covering both single-reference-mode and compound-reference-mode MV prediction. Various syntax elements may be used to signal the manner in which the MV of the current block is predicted. For example, for a single reference mode, the following MV prediction modes may be signaled:
NEARMV-one of the motion vector predictors (motion vector predictor, MVP) indicated by the DRL (dynamic reference list) index in the list is used directly without any MVD.
NEWMV-uses one of the Motion Vector Predictors (MVPs) in the list signaled by the DRL index as a reference and applies an increment to the MVP (e.g., uses MVD).
GLOBALMV-motion vectors based on frame-level global motion parameters are used.
Also, for a composite reference inter prediction mode using two reference frames corresponding to two MVs to be predicted, the following MV prediction modes may be signaled:
NEAR_NEARMV - for each of the two MVs to be predicted, one of the Motion Vector Predictors (MVPs) in the list signaled by a DRL index is used without an MVD.

NEAR_NEWMV - for predicting the first of the two motion vectors, one of the Motion Vector Predictors (MVPs) in the list signaled by a DRL index is used as a reference MV without an MVD; for predicting the second of the two motion vectors, one of the Motion Vector Predictors (MVPs) in the list signaled by a DRL index is used as a reference MV in combination with an additionally signaled delta MV (an MVD).

NEW_NEARMV - for predicting the second of the two motion vectors, one of the Motion Vector Predictors (MVPs) in the list signaled by a DRL index is used as a reference MV without an MVD; for predicting the first of the two motion vectors, one of the Motion Vector Predictors (MVPs) in the list signaled by a DRL index is used as a reference MV in combination with an additionally signaled delta MV (an MVD).

NEW_NEWMV - one of the Motion Vector Predictors (MVPs) in the list signaled by a DRL index is used as a reference MV, and it is used in combination with an additionally signaled delta MV to predict each of the two MVs.

GLOBAL_GLOBALMV - for each reference, an MV based on its frame-level global motion parameters is used.
Thus, the term "NEAR" above refers to MV prediction that uses a reference MV directly, as in general merge mode, without an MVD, while the term "NEW" refers to MV prediction that, as in MMVD mode, uses a reference MV and offsets it with a signaled MVD. For compound inter prediction, the two reference base motion vectors and the two motion vector deltas may generally be different or independent between the two references, even though they may be correlated, and such correlation may be exploited to reduce the amount of information needed to signal the two motion vector deltas. In such cases, joint signaling of the two MVDs may be implemented and indicated in the bitstream. A sketch of these mode semantics follows.
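The following non-normative sketch summarizes the NEAR/NEW semantics; the tuple-based MV representation is an illustrative assumption.

```python
# NEAR/NEW semantics of the compound modes: whether each of the two MVs
# adds a signaled MVD to its DRL-indexed predictor.
USES_MVD = {                       # (first MV, second MV)
    "NEAR_NEARMV": (False, False),
    "NEAR_NEWMV":  (False, True),
    "NEW_NEARMV":  (True,  False),
    "NEW_NEWMV":   (True,  True),
}

def predict_compound_mvs(mode, mvp0, mvp1, mvd0=(0, 0), mvd1=(0, 0)):
    use0, use1 = USES_MVD[mode]
    mv0 = (mvp0[0] + mvd0[0], mvp0[1] + mvd0[1]) if use0 else mvp0
    mv1 = (mvp1[0] + mvd1[0], mvp1[1] + mvd1[1]) if use1 else mvp1
    return mv0, mv1
```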
The above Dynamic Reference List (DRL) may be used to hold a set of indexed motion vectors that are dynamically maintained and considered as candidate motion vector predictors.
In some example implementations, a predefined resolution of MVDs may be allowed. For example, a motion vector precision (or accuracy) of 1/8 pixel may be allowed. The MVDs described in the various MV prediction modes may be constructed and signaled in various ways. In some implementations, the one or more motion vector differences described above in reference frame list 0 or list 1 can be signaled using various syntax elements.
For example, a syntax element called "mv_joint" may specify which components of the motion vector difference associated with it are non-zero. For an MVD, this is signaled jointly for all combinations of non-zero components; that is, a single "mv_joint" value indicates which of the two components are non-zero. For example, mv_joint may take the following values (a minimal parsing sketch follows the list):
0 may indicate that there is no non-zero MVD along the horizontal or vertical direction;
1 may indicate that there is a non-zero MVD along the horizontal direction only;
2 may indicate that there is a non-zero MVD along the vertical direction only;
3 may indicate that there is a non-zero MVD along both the horizontal and vertical directions.
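A minimal parsing sketch of these values (the constant names are illustrative):

```python
# mv_joint tells the decoder which MVD components to parse further.
MV_JOINT_ZERO, MV_JOINT_HNZVZ, MV_JOINT_HZVNZ, MV_JOINT_HNZVNZ = 0, 1, 2, 3

def nonzero_mvd_components(mv_joint):
    has_horizontal = mv_joint in (MV_JOINT_HNZVZ, MV_JOINT_HNZVNZ)  # 1 or 3
    has_vertical = mv_joint in (MV_JOINT_HZVNZ, MV_JOINT_HNZVNZ)    # 2 or 3
    return has_horizontal, has_vertical
```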
When the "mv_joint" syntax element of an MVD signals that there is no non-zero MVD component, no additional MVD information may be signaled. However, if the "mv_joint" syntax element signals that there are one or two non-zero components, additional syntax elements may be further signaled for each of the non-zero MVD components, as described below.
For example, a syntax element called "mv_sign" may be used to additionally specify whether the corresponding motion vector difference component is positive or negative.
For another example, a syntax element called "mv_class" may be used to specify the level of the motion vector difference, within a predefined set of levels, for the corresponding non-zero MVD component. For example, the predefined levels of motion vector differences may divide the continuous magnitude space of motion vector differences into non-overlapping ranges, with each range corresponding to an MVD level. The signaled MVD level thus indicates the magnitude range of the corresponding MVD component. In the example implementation shown in Table 3 below, a higher level corresponds to motion vector differences of larger magnitude. In Table 3, the symbol (n, m] is used to represent a range of motion vector differences greater than n pixels and less than or equal to m pixels.
TABLE 3 amplitude level of motion vector differences
MV level Magnitude range of MVD
MV_CLASS_0 (0, 2]
MV_CLASS_1 (2, 4]
MV_CLASS_2 (4, 8]
MV_CLASS_3 (8, 16]
MV_CLASS_4 (16, 32]
MV_CLASS_5 (32, 64]
MV_CLASS_6 (64, 128]
MV_CLASS_7 (128, 256]
MV_CLASS_8 (256, 512]
MV_CLASS_9 (512, 1024]
MV_CLASS_10 (1024, 2048]
In some other examples, a syntax element called "mv_bit" may also be used to specify the integer portion of the offset between the non-zero motion vector difference component and the starting magnitude of the correspondingly signaled MV level's magnitude range. The number of bits needed in "mv_bit" to signal the entire range of each MVD level may vary as a function of the MV level. For example, MV_CLASS_0 and MV_CLASS_1 in the implementation of Table 3 may each need only a single bit to indicate an integer pixel offset of 1 or 2 from the start of the range; each higher MV_CLASS in the example implementation of Table 3 may progressively need one more bit for "mv_bit" than the preceding MV_CLASS.
In some other examples, a syntax element called "mv_fr" may be used to specify the first two fractional bits of the motion vector difference for the corresponding non-zero MVD component, while a syntax element called "mv_hp" may specify the third fractional bit (the high-resolution bit). The two "mv_fr" bits essentially provide 1/4-pixel MVD resolution, while the "mv_hp" bit may further provide 1/8-pixel resolution. In some other implementations, more than one "mv_hp" bit may be used to provide MVD pixel resolution finer than 1/8 pixel. In some example implementations, additional flags may be signaled at one or more of various levels to indicate whether MVD resolution of 1/8 pixel or finer is supported. If an MVD resolution does not apply to a particular coding unit, the syntax elements above for the corresponding unsupported MVD resolution may not be signaled. A sketch combining these syntax elements follows.
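The following sketch combines the syntax elements above into one MVD component magnitude. It assumes the Table 3 ranges and one consistent reading of the bit packing (mv_bit as the integer offset within the class range, plus a 3-bit fraction in 1/8-pixel steps); the exact bin coding is not specified here.

```python
# Reconstruct one non-zero MVD component from mv_class, mv_bit, mv_fr,
# mv_hp, and mv_sign (packing assumed; see the lead-in above).
def mvd_component(mv_class, mv_bit, mv_fr, mv_hp, mv_sign):
    # Range start per Table 3: MV_CLASS_0 starts at 0; MV_CLASS_k at 2^k pixels.
    range_start = 0 if mv_class == 0 else (1 << mv_class)
    # Integer offset within the range plus a 1/8-pixel fraction; the "+ 1"
    # places the magnitude inside the half-open range (start, 2 * start].
    magnitude = range_start + mv_bit + (2 * mv_fr + mv_hp + 1) / 8.0
    return -magnitude if mv_sign else magnitude
```

For example, mvd_component(1, 1, 3, 1, 0) evaluates to 4.0, the upper end of the MV_CLASS_1 range (2, 4].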
In some example implementations above, the fractional resolution may be independent of different levels of MVD. In other words, a predefined number of "mv_fr" and "mv_hp" bits for signaling the fractional MVD of the non-zero MVD component may be used to provide similar options for motion vector resolution, regardless of the magnitude of the motion vector difference.
However, in some other example implementations, the resolution of motion vector differences may be differentiated across the various MVD magnitude levels. In particular, a high MVD resolution for large MVD magnitudes at higher MVD levels may not provide a statistically significant improvement in compression efficiency. Thus, the MVDs may be encoded at reduced resolution (integer pixel resolution or coarser fractional pixel resolution) for the larger MVD magnitude ranges corresponding to higher MVD magnitude levels. Likewise, the MVD may be encoded at reduced resolution for generally larger MVD values. Such MVD-level-dependent or MVD-magnitude-dependent MVD resolution may generally be referred to as adaptive MVD resolution, magnitude-dependent adaptive MVD resolution, or magnitude-dependent MVD resolution. The term "resolution" may also be referred to as "pixel resolution". Adaptive MVD resolution may be implemented in various ways, as described below via example implementations, to achieve better overall compression efficiency. The statistical observation is that treating large-magnitude or high-level MVDs at a resolution similar to that of small-magnitude or low-level MVDs, in a non-adaptive manner, may not significantly improve the inter prediction residual coding efficiency of blocks having large-magnitude or high-level MVDs; hence, the number of signaling bits saved by targeting a less precise MVD may exceed the additional bits needed to encode the inter prediction residual resulting from that less precise MVD. In other words, using a higher MVD resolution for large-magnitude or high-level MVDs may not produce much coding gain compared to using a lower MVD resolution.
In some general example implementations, the pixel resolution or precision of the MVD may decrease or remain unchanged as the MVD level increases. A reduced pixel resolution of the MVD corresponds to a coarser MVD (or a larger step from one MVD value to the next). In some implementations, the correspondence between MVD pixel resolution and MVD level may be specified, predefined, or preconfigured, and thus may not need to be signaled in the encoded bitstream.
In some example implementations, the MV levels of table 3 may each be associated with a different MVD pixel resolution.
In some example implementations, each MVD level may be associated with a single allowed resolution. In some other implementations, one or more MVD levels may be associated with two or more selectable MVD pixel resolutions. In that case, the signaling of a current MVD component of such an MVD level in the bitstream may be followed by additional signaling indicating which of the selectable pixel resolutions is used for the current MVD component.
In some example implementations, the adaptively allowed MVD pixel resolutions may include, but are not limited to, 1/64 pixel, 1/32 pixel, 1/16 pixel, 1/8 pixel, 1/4 pixel, 1/2 pixel, 1 pixel, 2 pixels, and 4 pixels (in descending order of resolution). Each of the ascending MVD levels may thus be associated with one of these resolutions in a non-ascending manner. In some implementations, an MVD level may be associated with two or more of the above resolutions, and its highest resolution may be lower than or equal to the lowest resolution of the preceding MVD level. For example, if MV_CLASS_3 of Table 3 is associated with alternative 1-pixel and 2-pixel resolutions, then the highest resolution that MV_CLASS_4 of Table 3 could be associated with is 2 pixels. In some other implementations, the highest allowed resolution of an MV level may be higher than the lowest allowed resolution of the preceding (lower) MV level. However, the average of the allowed resolutions across ascending MV levels may only be non-ascending.
In some implementations, when fractional pixel resolution finer than 1/8 pixel is allowed, the "mv_fr" and "mv_hp" signaling can be correspondingly extended to a total of more than 3 fractional bits.
In some example implementations, fractional pixel resolution may be allowed only for MVD levels lower than or equal to a threshold MVD level. For example, fractional pixel resolution may be allowed only for MV_CLASS_0 of Table 3 and for no other MV level. Likewise, fractional pixel resolution may be allowed only for MVD levels lower than or equal to any other one of the MV levels of Table 3. For the MVD levels above the threshold MVD level, only integer pixel resolutions of the MVD are allowed. In this way, for a signaled MVD having an MVD level at or above the threshold MVD level, fractional-resolution signaling, such as one or more of the "mv_fr" and/or "mv_hp" bits, may not need to be signaled. For MVD levels with resolution coarser than 1 pixel, the number of bits in the "mv_bit" signaling may be further reduced. For example, for MV_CLASS_5 in Table 3, the range of MVD pixel offsets is (32, 64], so 5 bits are needed to signal the entire range with 1-pixel resolution; however, if MV_CLASS_5 is associated with a 2-pixel MVD resolution (a resolution lower than 1-pixel resolution), 4 bits may be needed for "mv_bit" instead of 5, and neither "mv_fr" nor "mv_hp" needs to be signaled after "mv_class" is signaled as MV_CLASS_5. A sketch of this bit-count reduction follows.
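A small sketch of the count, assuming the Table 3 range sizes and power-of-two steps:

```python
import math

# Bits needed by "mv_bit" to cover a class's integer range at a given step.
def mv_bit_length(mv_class, step_in_pixels):
    range_size = 2 if mv_class == 0 else (1 << mv_class)   # e.g. MV_CLASS_5: 32
    num_offsets = range_size // step_in_pixels
    return int(math.log2(num_offsets))

# MV_CLASS_5 covers (32, 64]: 5 bits at 1-pixel steps, 4 bits at 2-pixel steps.
assert mv_bit_length(5, 1) == 5 and mv_bit_length(5, 2) == 4
```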
In some example implementations, fractional pixel resolution may be allowed only for MVDs having integer values below a threshold integer pixel value. For example, fractional pixel resolution may be allowed only for MVDs of less than 5 pixels. Corresponding to this example, fractional resolution may be allowed for MV_CLASS_0 and MV_CLASS_1 of Table 3 and for no other MV level. For another example, fractional pixel resolution may be allowed only for MVDs of less than 7 pixels. Corresponding to this example, fractional resolution may be allowed for MV_CLASS_0 and MV_CLASS_1 of Table 3 (having ranges below 7 pixels) and disallowed for MV_CLASS_3 and higher (having ranges above 7 pixels). For an MVD belonging to MV_CLASS_2, whose pixel range contains 7 pixels, whether fractional pixel resolution of the MVD is allowed may depend on the "mv_bit" value. If the "mv_bit" value is signaled as 1 or 2 (such that the integer portion of the signaled MVD is 5 or 6, calculated as the start of the pixel range of MV_CLASS_2 plus an offset of 1 or 2 as indicated by "mv_bit"), fractional pixel resolution may be allowed. Otherwise, if the "mv_bit" value is signaled as 3 or 4 (such that the integer portion of the signaled MVD is 7 or 8), fractional pixel resolution may not be allowed.
In some other implementations, only a single MVD value may be allowed for MV levels equal to or higher than a threshold MV level. For example, such a threshold MV level may be MV_CLASS_2. Thus, MV_CLASS_2 and above may each be allowed only a single MVD value, with no fractional pixel resolution. The single allowed MVD value for each of these MV levels may be predefined. In some examples, the allowed single value may be the upper end of the corresponding range of these MV levels in Table 3. For example, MV_CLASS_2 through MV_CLASS_10 are at or above the threshold level of MV_CLASS_2, and the single allowed MVD values for these levels may be predefined as 8, 16, 32, 64, 128, 256, 512, 1024, and 2048, respectively. In some other examples, the allowed single value may be the middle value of the corresponding range of these MV levels in Table 3. For example, with MV_CLASS_2 through MV_CLASS_10 above the level threshold, the single allowed MVD values for these levels may be predefined as 6, 12, 24, 48, 96, 192, 384, 768, and 1536, respectively. Any other value within the range may likewise be defined as the single allowed MVD value for the corresponding MVD level.
In the above implementation, when the signaled "mv_class" is equal to or higher than the predefined MVD level threshold, only "mv_class" signaling is sufficient to determine the MVD value. The magnitude and direction of the MVD will then be determined using "mv_class" and "mv_sign".
Thus, when the MVD is signaled for only one reference frame (from reference frame list 0 or list 1, but not both) or jointly for both reference frames, the precision (or resolution) of the MVD may depend on the associated level of motion vector difference and/or the magnitude of the MVD in table 3.
In some other implementations, the pixel resolution or precision of the MVD may decrease or remain unchanged as the MVD magnitude increases. For example, the pixel resolution may depend on the integer portion of the MVD magnitude. In some implementations, fractional pixel resolution may be allowed only for MVD magnitudes less than or equal to a magnitude threshold. For a decoder, the integer portion of the MVD magnitude may first be extracted from the bitstream. The pixel resolution may then be determined, after which it can be decided whether any fractional MVD bits are present in the bitstream and need to be parsed (e.g., if fractional pixel resolution is not allowed for a particular extracted integer MVD magnitude, no fractional MVD bits are included in the bitstream and none need to be extracted). The example implementations above relating to MVD-level-dependent adaptive pixel resolution apply analogously to MVD-magnitude-dependent adaptive pixel resolution. In certain examples, MVD magnitudes at or above the magnitude threshold may be allowed only one predefined value. A sketch of this decoder-side flow follows.
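The sketch below assumes a bitstream-reader interface and a 1-pixel threshold for illustration:

```python
# Amplitude-dependent adaptive MVD parsing: the integer magnitude decides
# whether fractional bits are present in the bitstream at all.
FRACTION_THRESHOLD = 1   # assumed: fractional MVD only when magnitude <= 1 pixel

def parse_adaptive_mvd_component(reader):
    sign = reader.read_bit()
    integer_magnitude = reader.read_integer_magnitude()   # e.g. mv_class + mv_bit
    fraction = 0.0
    if integer_magnitude <= FRACTION_THRESHOLD:
        fraction = reader.read_fraction_bits()            # only present here
    magnitude = integer_magnitude + fraction
    return -magnitude if sign else magnitude
```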
The various example implementations above apply to single reference modes. They also apply to the example NEW_NEARMV, NEAR_NEWMV, and/or NEW_NEWMV modes in compound prediction under MMVD. More generally, these implementations may be applied to the adaptive resolution of any MVD.
In some example implementations, the adaptive MVD resolution is described further below. For NEW_NEARMV and NEAR_NEWMV modes, the accuracy of the MVD depends on the associated level and magnitude of the MVD.
In some examples, the fractional MVD is only allowed if the MVD amplitude is equal to or less than one pixel.
In some examples, only one MVD value is allowed when the associated MV level is equal to or greater than MV_CLASS_1, and the MVD value for each MV level is derived as 4, 8, 16, 32, or 64 for MV level 1 (MV_CLASS_1), 2 (MV_CLASS_2), 3 (MV_CLASS_3), 4 (MV_CLASS_4), or 5 (MV_CLASS_5), respectively.
Table 4 shows the MVD values allowed in each MV level.
Table 4-adaptive MVD in each MV amplitude level
MV level Allowed MVD magnitude
MV_CLASS_0 (0,1],{2}
MV_CLASS_1 {4}
MV_CLASS_2 {8}
MV_CLASS_3 {16}
MV_CLASS_4 {32}
MV_CLASS_5 {64}
MV_CLASS_6 {128}
MV_CLASS_7 {256}
MV_CLASS_8 {512}
MV_CLASS_9 {1024}
MV_CLASS_10 {2048}
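The single allowed magnitudes in Table 4 for MV_CLASS_1 and above follow a doubling pattern, which the following sketch captures:

```python
# Allowed AMVD magnitude per Table 4: 4 << (class - 1) for classes 1..10.
def allowed_amvd_magnitude(mv_class):
    assert 1 <= mv_class <= 10
    return 4 << (mv_class - 1)   # class 1 -> 4, class 2 -> 8, ..., class 10 -> 2048
```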
In some examples, when the current block is encoded in NEW_NEARMV or NEAR_NEWMV mode, one context is used to signal mv_joint or mv_class. Otherwise, another context is used to signal mv_joint or mv_class.
In some example implementations, improvements to adaptive MVD resolution are described below.
In some examples, a new inter coding mode (named AMVDMV) is added for the single reference case. When the AMVDMV mode is selected, it indicates that adaptive MVD resolution (AMVD) is applied to the signaled MVD.
In some examples, a flag (named amvd_flag) is added under the JOINT_NEWMV mode to indicate whether AMVD is applied to the joint MVD coding mode. When adaptive MVD resolution is applied to the joint MVD coding mode, the MVDs for the two reference frames are signaled jointly, and the precision of the MVD is implicitly determined by the MVD magnitude. Otherwise, the MVDs for the two (or more) reference frames are signaled jointly, and conventional MVD coding is applied.
In some implementations, Template Matching (TM) is described further below.
FIG. 4 is a diagram illustrating example Template Matching (TM) according to some embodiments.
In some examples, neighboring reconstructed (or predicted) samples of the current block, the forward prediction block, and/or the backward prediction block are also referred to as the templates of the current block, the forward prediction block, and/or the backward prediction block, respectively. For example, the neighboring reconstructed samples 402 of the current block 404 are the template of the current block, the neighboring reconstructed samples 404 of the backward prediction block P0 406 are the template of the backward prediction block P0, and the neighboring reconstructed samples 408 of the forward prediction block P1 410 are the template of the forward prediction block P1. As shown in FIG. 4, the templates of neighboring reconstructed (or predicted) samples are shown as textured areas, e.g., the top and left samples of a block.
In some examples, the template matching method exploits the correlation between pixels in a block and pixels in its template. The TM method is closely related to the Block Matching (BM) method. In the BM method, a corresponding block of pixels in a reference frame is used to find the best match for a given block of pixels in the current frame: the pixel values of the block being encoded/decoded are compared with the pixel values of each candidate block in the reference frame, and the block with the closest match is selected. Pixels in the current block are then predicted based on the closest matching block of pixels in the reference frame. In contrast, in the TM method, the matching is performed on the template pixels above and to the left of the current block rather than on the block itself, and these template pixels are used to find the best match. In some examples, the search for the best match of the current block's template (in the reference frame) is performed at both the encoder and decoder sides, so the best-match Motion Vector (MV) of the template is not transmitted to the decoder. Once the best matching template for the current block is found, neighboring blocks of the reference template (such as P0 and P1 in FIG. 4) are used as predictors for the current block. A minimal search sketch follows.
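In the sketch below, the frame layout (2D luma arrays), the one-sample-thick L-shaped template, the integer-pixel search, and the SAD cost are illustrative assumptions.

```python
def sad(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def get_template(frame, x, y, block_w, block_h):
    # Top row and left column adjacent to the block at (x, y).
    top = [frame[y - 1][x + i] for i in range(block_w)]
    left = [frame[y + j][x - 1] for j in range(block_h)]
    return top + left

def template_matching_search(cur_frame, ref_frame, cur_pos, ref_center,
                             block_w, block_h, search_range):
    cx, cy = cur_pos
    cur_template = get_template(cur_frame, cx, cy, block_w, block_h)
    rx, ry = ref_center                    # search center in the reference frame
    best_cost, best_offset = float("inf"), (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ref_template = get_template(ref_frame, rx + dx, ry + dy,
                                        block_w, block_h)
            cost = sad(cur_template, ref_template)
            if cost < best_cost:
                best_cost, best_offset = cost, (dx, dy)
    return best_offset   # derived identically at encoder and decoder
```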
In some implementations, when the adaptive MVD resolution method is applied, the accuracy of the MVD depends on the magnitude of the MVD. The accuracy of the MVD decreases as the magnitude of the MVD increases. Thus, when adaptive MVD resolution is applied, the prediction may be less accurate for large MVDs.
In some embodiments, the methods disclosed herein can be used alone or in any order in combination. Further, each of the methods (or embodiments), encoder, and decoder may be implemented by a processing circuit (e.g., one or more processors or one or more integrated circuits). In one example, one or more processors execute a program stored in a non-transitory computer readable medium. The term block may be interpreted as a prediction block, a coding block, or a coding unit (i.e., CU).
In the present disclosure, the direction of the reference frame may be determined by whether the reference frame precedes the current frame in display order or follows the current frame in display order.
In this disclosure, the description of the maximum or highest precision of MVD signaling refers to the finest granularity of MVD precision. For example, 1/16 pixel MVD signaling represents a higher level of precision than that of 1/8 pixel MVD signaling.
In this disclosure, the description of the finest MVD resolution allowed refers to the resolution at which the MVD is being signaled. For example, when adaptive MVD resolution is applied, the MVD may be signaled in 1/4 pixels. However, when template matching is also applied, the actual MVD for motion compensation can be refined to 1/8 pixel or higher precision without additional signaling.
In some implementations, a Motion Vector Predictor (MVP) and a Motion Vector Difference (MVD) are two important parameters for representing a Motion Vector (MV) of a current block. In the inter prediction mode, MVP and MVD are used to represent motion vectors of a current block with respect to a reference block in a previous/subsequent frame.
For example, MVP is typically calculated by using motion vectors of neighboring blocks in the same frame or by using motion vectors of corresponding blocks in a reference frame. The goal of MVP is to predict the motion of the current block based on the motion of neighboring blocks or corresponding blocks in the reference frame.
For example, MVD is the difference between the motion vector of the current block and MVP. The MVD represents a deviation between an actual motion vector of a current block and a predicted motion vector based on a neighboring block or a corresponding block in a reference frame. The MVD is typically encoded along with a motion vector predictor and transmitted to a decoder to enable the decoder to reconstruct the motion vector of the current block.
In some aspects/embodiments, template matching may be used to further refine the MV of the current block when adaptive MVD resolution is applied. The starting point of template matching MV refinement is the MV of the current block, which is the sum of the MVP and the MVD of the current block. MV refinement by template matching is performed at both the encoder and decoder sides, so the difference between the refined MV and the starting point of the refinement is not signaled in the bitstream. A sketch of this flow follows.
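The sketch below reuses the template_matching_search sketch from the template matching discussion above; integer-pixel refinement and the parameter layout are assumptions.

```python
# MV refinement with TM: start from MVP + MVD; the refinement delta is
# re-derived at the decoder rather than signaled.
def refine_mv_with_tm(mvp, mvd, cur_frame, ref_frame, cur_pos,
                      block_w, block_h, search_range=2):
    start_mv = (mvp[0] + mvd[0], mvp[1] + mvd[1])    # starting point of refinement
    ref_center = (cur_pos[0] + int(start_mv[0]), cur_pos[1] + int(start_mv[1]))
    dx, dy = template_matching_search(cur_frame, ref_frame, cur_pos, ref_center,
                                      block_w, block_h, search_range)
    return (start_mv[0] + dx, start_mv[1] + dy)      # refined MV, no extra bits
```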
Fig. 5 is an exemplary flowchart illustrating a method 500 of coding video according to some embodiments. The method 500 may be performed at a computing system (e.g., the server system 112, the source device 102, or the electronic device 120) having control circuitry and memory storing instructions for execution by the control circuitry. In some implementations, the method 500 may be performed by executing instructions stored in a memory of a computing system (e.g., the memory 314). The method 500 may be performed by an encoder (e.g., encoder 106) and/or a decoder (e.g., decoder 122).
Referring to fig. 5, in one aspect, a video decoder (e.g., decoder 122 in fig. 2B) and/or a video encoder (e.g., encoder 106 in fig. 2B) determines whether to signal an adaptive Motion Vector Difference (MVD) resolution mode, which is an inter prediction mode with an adaptive Motion Vector Difference (MVD) pixel resolution, based on an inter prediction syntax element from a video stream (510).
The video decoder/encoder receives a Motion Vector Difference (MVD) from a video block of the video stream (520).
The video decoder/encoder searches for a first template of a predicted video block of the video block in response to determining to signal the adaptive MVD resolution mode, wherein the first template is a neighboring reconstructed/predicted sample of the predicted video block and the predicted video block is a reconstructed/predicted forward or backward video block of the video block (530).
The video decoder/encoder locates a first template of a predicted video block that is a best match to a second template of the video block, the second template being adjacent reconstructed/predicted samples of the video block corresponding to the temporally collocated template of the first template (540).
The video decoder/encoder refines a Motion Vector (MV) of the video block based at least on the second template, the located first template, and the MVD (550).
The video decoder/encoder reconstructs/processes video blocks based at least on the refined MVs (560).
In one embodiment and/or any combination of embodiments disclosed herein, when applying the adaptive MVD resolution, the template-matched search region size may depend on the magnitude of the MVD (or the associated MV magnitude level) of the current block. For example, searching for a first template (530) of a predicted video block of a video block includes determining a search region size based on a magnitude of the MVD and searching based on the search region size.
In one embodiment and/or any combination of embodiments disclosed herein, the search region size increases monotonically or remains constant with increasing magnitude of the MVD for template matching. For example, determining the search area size based on the magnitude of the MVD includes increasing the search area as the magnitude of the MVD increases or leaving the search area unchanged as the magnitude of the MVD increases.
In one embodiment and/or any combination of embodiments disclosed herein, the search area size is the same for all MVDs within one MV level. For example, determining the search region size based on the magnitude of the MVD includes determining the same search region when the MVD is in the same MV level.
In one embodiment, when the MV level of an MVD is equal to or greater than a threshold (such as mv_class_1), the search area size is the same for all MVDs in one MV level. For example, determining the search area size based on the magnitude of the MVD includes determining the same search area when the magnitude of the MVD is equal to or greater than a threshold value.
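One possible mapping consistent with these embodiments is sketched below; the concrete sizes are assumptions.

```python
# MVD-magnitude-dependent TM search range: non-decreasing with the MV level
# and identical for all MVDs within a level at or above the threshold.
def tm_search_range(mv_class):
    if mv_class == 0:
        return 1                   # small MVDs: small refinement window
    return min(2 + mv_class, 8)   # grows with the level, constant within it
```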
In one embodiment and/or any combination of embodiments disclosed herein, the precision/granularity of MV refinement within a given search region of template matching may depend on the magnitude of the MVD or the associated MV level. The precision may include, but is not limited to, 1/64-pixel precision, 1/32-pixel precision, 1/16-pixel precision, 1/8-pixel precision, 1/4-pixel precision, 1/2-pixel precision, 1-pixel precision, 2-pixel precision, 3-pixel precision, 4-pixel precision, and n-pixel precision, where n is an integer.
In one embodiment and/or any combination of embodiments disclosed herein, fractional-precision MV refinement by template matching is allowed only when the magnitude of the MVD is equal to or less than one threshold, or the associated MV level is equal to or less than another threshold. For example, determining the refinement granularity of the MV based on the magnitude of the MVD includes enabling fractional-precision MV refinement only when the magnitude of the MVD is equal to or less than a threshold. In one example, fractional-precision MV refinement by template matching is allowed only when the magnitude of the MVD is equal to or less than 1 pixel sample. In one example, fractional-precision MV refinement by template matching is allowed only when the associated MV level is equal to or less than MV_CLASS_0.
In one embodiment and/or any combination of embodiments disclosed herein, the precision/granularity of MV refinement with template matching may become monotonically coarser as the magnitude of the MVD increases. For example, refining the MVs of the video block (550) includes determining a refinement granularity of the MVs based on the magnitude of the MVDs.
In one embodiment and/or any combination of embodiments disclosed herein, the finest MVD resolution allowed when adaptive MVD resolution is applied depends on whether template matching is applied. For example, when the adaptive MVD resolution mode is signaled, the finest allowed MVD resolution depends on whether a template matching mode is signaled. In one example, the finest MVD resolution allowed when template matching is applied is lower than when template matching is not applied. In one example, when adaptive MVD resolution is applied, if the finest allowed MVD resolution without template matching is 1/8 pixel, the finest allowed MVD resolution with template matching is 1 pixel (1/1) or 1/2 pixel.
In one embodiment and/or any combination of embodiments disclosed herein, MV refinement for template matching is limited to certain predefined directions, such as horizontal, vertical, or diagonal directions. For example, refining the MVs of the video blocks (550) includes restricting the MVs to one or more predetermined directions during refinement. The predefined search direction may be signaled in a high level syntax such as sequence level, frame level, and/or slice level.
In one embodiment and/or any combination of embodiments disclosed herein, the search direction of MV refinement with template matching may depend on the direction of the MVD. For example, searching for a first template (530) of a predicted video block of a video block includes determining a search direction based on a direction of the MVD and searching based on the search direction. In one example, where the direction of the MVD is in the horizontal or vertical direction, the search direction of MV refinement with template matching is also limited to the horizontal or vertical direction. In another example, the search direction of MV refinement with template matching may be the same as or perpendicular to the direction of MVD.
In one embodiment and/or any combination of embodiments disclosed herein, a high level syntax may be signaled to indicate whether template matching applies to the adaptive MVD resolution. For example, prior to searching, based on a second syntax element from the video stream, it is determined whether a template matching pattern is signaled for one or more video blocks, and searching is performed in response to determining that the template matching pattern is signaled. In one example, the high level syntax may be signaled in a sequence level, a frame level, and/or a slice level. The second syntax element is signaled, for example, at one or more of a sequence level, a frame level, and/or a slice level.
Although fig. 5 shows multiple logic stages in a particular order, the stages that are not order dependent may be reordered and other stages may be combined or split. Some reordering or other groupings not specifically mentioned will be apparent to those of ordinary skill in the art, and thus the ordering and groupings presented herein are not exhaustive. Furthermore, it should be appreciated that the stages may be implemented in hardware, firmware, software, or any combination thereof.
In another aspect, some implementations include a computing system (e.g., server system 112) including control circuitry (e.g., control circuitry 302) and memory (e.g., memory 314) coupled to the control circuitry, the memory storing one or more sets of instructions configured to be executed by the control circuitry, the one or more sets of instructions including instructions for performing any of the methods described herein.
In yet another aspect, some implementations include a non-transitory computer-readable storage medium storing one or more sets of instructions for execution by control circuitry of a computing system, the one or more sets of instructions including instructions for performing any of the methods described herein.
It will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the claims. As used in the description of the embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term "if" may be interpreted as meaning "when the prerequisite is true" or "after the prerequisite is true" or "in response to determining that the prerequisite is true" or "in response to detecting that the prerequisite is true" depending on the context. Similarly, the phrase "if it is determined that the prerequisite is true" or "when it is determined that the prerequisite is true" may be interpreted in context to mean "after it is determined that the prerequisite is true" or "in response to determining that the prerequisite is true" or "after it is detected that the prerequisite is true" or "in response to detecting that the prerequisite is true".
The foregoing description, for purposes of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of operation and the practical application, thereby enabling others skilled in the art to practice the invention.

Claims (20)

1. A method of decoding a video stream, the method performed at a computing system having a memory and control circuitry, the method comprising:
determining whether to signal an adaptive Motion Vector Difference (MVD) resolution mode based on a value of an inter prediction syntax element from the video stream, the adaptive MVD resolution mode being an inter prediction mode having an adaptive Motion Vector Difference (MVD) pixel resolution;
receiving a motion vector difference MVD from a video block of the video stream;
responsive to determining to signal the adaptive MVD resolution mode, searching a first template of a predicted video block of the video block, wherein the first template is a neighboring reconstructed sample of the predicted video block and the predicted video block is a reconstructed forward or backward video block of the video block;
locating a first template of the predicted video block that is a best match to a second template of the video block that is a neighboring reconstructed sample of the video block that corresponds to a temporally collocated template of the first template;
refining a Motion Vector (MV) of the video block based at least on the second template, the located first template, and the MVD; and
reconstructing the video block based at least on the refined MVs.
2. The method of claim 1, wherein searching for the first template of the predicted video block of the video block comprises determining a search region size based on the magnitude of the MVD and searching based on the search region size.
3. The method of claim 2, wherein determining the search area size based on the magnitude of the MVD comprises increasing a search area as the magnitude of the MVD increases or leaving the search area unchanged as the magnitude of the MVD increases.
4. The method of claim 2, wherein determining the search region size based on the magnitude of the MVD comprises determining the same search region when the MVDs are in the same MV level.
5. The method of claim 2, wherein determining the search region size based on the magnitude of the MVD comprises determining the same search region when the magnitude of the MVD is equal to or greater than a threshold value.
6. The method of claim 1, wherein refining MVs of the video block comprises determining a refinement granularity of the MVs based on a magnitude of the MVDs.
7. The method of claim 6, wherein determining the refinement granularity of the MVs based on the magnitude of the MVDs comprises achieving fractional-precision MV refinement only when the magnitude of the MVDs is equal to or less than a threshold value.
8. The method of claim 1, wherein refining MVs of the video block comprises restricting the MVs to one or more predetermined directions during refinement.
9. The method of claim 1, wherein searching for the first template of the predicted video block of the video block comprises determining a search direction based on the direction of the MVD and searching based on the search direction.
10. The method of claim 1, wherein prior to searching, determining whether to signal a template matching pattern for one or more video blocks based on a second syntax element from the video stream, and searching is performed in response to determining to signal the template matching pattern.
11. The method of claim 10, wherein the second syntax element is signaled at one or more of a sequence level, a frame level, and/or a slice level.
12. The method of claim 10, wherein when signaling the adaptive MVD resolution mode, the finest MVD resolution allowed depends on whether the template matching mode is signaled.
13. A computing system comprising a memory for storing computer instructions and control circuitry in communication with the memory, wherein the control circuitry, when executing the computer instructions, is configured to cause the computing system to perform a method of decoding a video stream, the method comprising:
determining whether to signal an adaptive Motion Vector Difference (MVD) resolution mode based on a value of an inter prediction syntax element from the video stream, the adaptive MVD resolution mode being an inter prediction mode having an adaptive Motion Vector Difference (MVD) pixel resolution;
receiving a Motion Vector Difference (MVD) from a video block of the video stream;
responsive to determining to signal the adaptive MVD resolution mode, searching a first template of a predicted video block of the video block, wherein the first template is a neighboring reconstructed sample of the predicted video block and the predicted video block is a reconstructed forward or backward video block of the video block;
locating a first template of the predicted video block that is a best match to a second template of the video block that is a neighboring reconstructed sample of the video block that corresponds to a temporally collocated template of the first template;
refining a Motion Vector (MV) of the video block based at least on the second template, the located first template, and the MVD; and
reconstructing the video block based at least on the refined MVs.
14. The computing system of claim 13, wherein searching the first template of predicted video blocks of the video block comprises determining a search region size based on the magnitude of the MVD and searching based on the search region size.
15. The computing system of claim 14, wherein determining the search area size based on the magnitude of the MVD comprises increasing a search area as the magnitude of the MVD increases or leaving the search area unchanged as the magnitude of the MVD increases.
16. The computing system of claim 14, wherein determining the search region size based on the magnitude of the MVD comprises determining the same search region when the MVDs are in the same MV level.
17. The computing system of claim 14, wherein determining the search region size based on the magnitude of the MVD comprises determining the same search region when the magnitude of the MVD is equal to or greater than a threshold value.
18. The computing system of claim 13, wherein refining MVs of the video block comprises determining a refinement granularity of the MVs based on a magnitude of the MVDs.
19. The computing system of claim 13, wherein searching for the first template of the predicted video block of the video block comprises determining a search direction based on the direction of the MVD and searching based on the search direction.
20. A non-transitory computer-readable medium storing computer instructions that, when executed by control circuitry of a computing system, cause the computing system to perform a method of decoding a video stream, the method comprising:
determining whether to signal an adaptive Motion Vector Difference (MVD) resolution mode based on a value of an inter prediction syntax element from the video stream, the adaptive MVD resolution mode being an inter prediction mode having an adaptive Motion Vector Difference (MVD) pixel resolution;
receiving a motion vector difference MVD from a video block of the video stream;
responsive to determining to signal the adaptive MVD resolution mode, searching a first template of a predicted video block of the video block, wherein the first template is a neighboring reconstructed sample of the predicted video block and the predicted video block is a reconstructed forward or backward video block of the video block;
locating a first template of the predicted video block that is a best match to a second template of the video block that is a neighboring reconstructed sample of the video block that corresponds to a temporally collocated template of the first template;
refining a Motion Vector (MV) of the video block based at least on the second template, the located first template, and the MVD; and
reconstructing the video block based at least on the refined MVs.
CN202380010878.0A 2022-03-16 2023-03-15 System and method for template matching for adaptive MVD resolution Pending CN117099369A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/320,488 2022-03-16
US18/121,438 2023-03-14
US18/121,438 US20230300363A1 (en) 2022-03-16 2023-03-14 Systems and methods for template matching for adaptive mvd resolution
PCT/US2023/015310 WO2023177747A1 (en) 2022-03-16 2023-03-15 Systems and methods for template matching for adaptive mvd resolution

Publications (1)

Publication Number Publication Date
CN117099369A true CN117099369A (en) 2023-11-21

Family

ID=88775828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202380010878.0A Pending CN117099369A (en) 2022-03-16 2023-03-15 System and method for template matching for adaptive MVD resolution

Country Status (1)

Country Link
CN (1) CN117099369A (en)

Similar Documents

Publication Publication Date Title
CN113287307B (en) Method and apparatus for video decoding, computer device and medium
CN110944185B (en) Video decoding method and device, computer equipment and storage medium
CN113557728A (en) Video coding and decoding method and device
US20230300363A1 (en) Systems and methods for template matching for adaptive mvd resolution
CN112514385B (en) Video decoding method and device, computer equipment and computer readable medium
CN113196745A (en) Video coding and decoding method and device
KR20230135641A (en) Joint coding for adaptive motion vector difference resolution.
CN115428445A (en) Method and apparatus for video encoding
CN117063471A (en) Joint signaling method for motion vector difference
US20240195983A1 (en) Mmvd signaling improvement
CN113228631A (en) Video coding and decoding method and device
US20230362402A1 (en) Systems and methods for bilateral matching for adaptive mvd resolution
US20230412816A1 (en) Bilateral matching for compound reference mode
CN116830581A (en) Improved signaling method and apparatus for motion vector differences
CN115486075A (en) Video coding and decoding method and device
CN117099369A (en) System and method for template matching for adaptive MVD resolution
US20240031596A1 (en) Adaptive motion vector for warped motion mode of video coding
US20230388535A1 (en) Systems and methods for combining subblock motion compensation and overlapped block motion compensation
US20240129474A1 (en) Systems and methods for cross-component geometric/wedgelet partition derivation
US20240171734A1 (en) Systems and methods for warp extend and warp delta signaling
US20240129461A1 (en) Systems and methods for cross-component sample offset filter information signaling
CN110636296B (en) Video decoding method, video decoding device, computer equipment and storage medium
CN118044200A (en) System and method for temporal motion vector prediction candidate derivation
KR20240116515A (en) Adaptive motion vectors for warped motion modes in video coding.
CN118044193A (en) System and method for sub-block motion vector coding

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40098259

Country of ref document: HK