AU2020201694A1 - Method, apparatus and system for encoding and decoding a coding unit tree - Google Patents

Method, apparatus and system for encoding and decoding a coding unit tree

Info

Publication number
AU2020201694A1
AU2020201694A1
Authority
AU
Australia
Prior art keywords
coding
coding units
transform
colour channel
luma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
AU2020201694A
Inventor
Christopher James ROSEWARNE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2020201694A priority Critical patent/AU2020201694A1/en
Publication of AU2020201694A1 publication Critical patent/AU2020201694A1/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/625Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/186Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component

Landscapes

  • Physics & Mathematics (AREA)
  • Discrete Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

METHOD, APPARATUS AND SYSTEM FOR ENCODING AND DECODING A CODING TREE UNIT A system and method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising: determining the first plurality of coding units for the primary colour channel (1860) and the second plurality of coding units for the at least one secondary colour channel (1864) according to decoded split flags of the first coding tree and the second coding tree; decoding (1860), for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and decoding the first plurality of coding units by applying (1870) the corresponding selected kernel and a DCT-2 kernel to the residual coefficients of each coding unit, and decoding the second plurality of coding units by applying (1880) either a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.

Description

METHOD, APPARATUS AND SYSTEM FOR ENCODING AND DECODING A CODING TREE UNIT
TECHNICAL FIELD
[0001] The present invention relates generally to digital video signal processing and, in particular, to a method, apparatus and system for encoding and decoding a block of video samples. The present invention also relates to a computer program product including a computer readable medium having recorded thereon a computer program for encoding and decoding a block of video samples.
BACKGROUND
[0002] Many applications for video coding currently exist, including applications for transmission and storage of video data. Many video coding standards have also been developed and others are currently in development. Recent developments in video coding standardisation have led to the formation of a group called the "Joint Video Experts Team" (JVET). The Joint Video Experts Team (JVET) includes members of Study Group 16, Question 6 (SG16/Q6) of the Telecommunication Standardisation Sector (ITU-T) of the International Telecommunication Union (ITU), also known as the "Video Coding Experts Group" (VCEG), and members of the International Organisations for Standardisation / International Electrotechnical Commission Joint Technical Committee 1 / Subcommittee 29 / Working Group 11 (ISO/IEC JTC1/SC29/WG11), also known as the "Moving Picture Experts Group" (MPEG).
[0003] The Joint Video Experts Team (JVET) issued a Call for Proposals (CfP), with responses analysed at its 10th meeting in San Diego, USA. The submitted responses demonstrated video compression capability significantly outperforming that of the current state-of-the-art video compression standard, i.e., "high efficiency video coding" (HEVC). On the basis of this outperformance it was decided to commence a project to develop a new video compression standard, to be named 'versatile video coding' (VVC). VVC is anticipated to address ongoing demand for ever-higher compression performance, especially as video formats increase in capability (e.g., with higher resolution and higher frame rate) and address increasing market demand for service delivery over WANs, where bandwidth costs are relatively high. Use cases such as immersive video necessitate real-time encoding and decoding of such higher formats; for example, cube-map projection (CMP) may use an 8K format even though a final rendered 'viewport' utilises a lower resolution. VVC must be implementable in contemporary silicon
processes and offer an acceptable trade-off between the achieved performance and the implementation cost. The implementation cost can be considered, for example, in terms of one or more of silicon area, CPU processor load, memory utilisation and bandwidth. Higher video formats may be processed by dividing the frame area into sections and processing each section in parallel. A bitstream may be constructed from multiple sections of the compressed frame that is still suitable for decoding by a "single-core" decoder, i.e., frame-level constraints, including bit-rate, are apportioned to each section according to application needs.
[0004] Video data includes a sequence of frames of image data, each frame including one or more colour channels. Generally, one primary colour channel and two secondary colour channels are needed. The primary colour channel is generally referred to as the 'luma' channel and the secondary colour channel(s) are generally referred to as the 'chroma' channels. Although video data is typically displayed in an RGB (red-green-blue) colour space, this colour space has a high degree of correlation between the three respective components. The video data representation seen by an encoder or a decoder often uses a colour space such as YCbCr. YCbCr concentrates luminance, mapped to 'luma' according to a transfer function, in a Y (primary) channel and chroma in Cb and Cr (secondary) channels. Due to the use of a decorrelated YCbCr signal, the statistics of the luma channel differ markedly from those of the chroma channels. A primary difference is that after quantisation, the chroma channels contain relatively few significant coefficients for a given block compared to the coefficients for a corresponding luma channel block. Moreover, the Cb and Cr channels may be sampled spatially at a lower rate (subsampled) compared to the luma channel, for example half horizontally and half vertically - known as a '4:2:0 chroma format'. The 4:2:0 chroma format is commonly used in 'consumer' applications, such as internet video streaming, broadcast television, and storage on Blu-Ray™ disks. Subsampling the Cb and Cr channels at half-rate horizontally and not subsampling vertically is known as a '4:2:2 chroma format'. The 4:2:2 chroma format is typically used in professional applications, including capture of footage for cinematic production and the like. The higher sampling rate of the 4:2:2 chroma format makes the resulting video more resilient to editing operations such as colour grading.
Prior to distribution to consumers, 4:2:2 chroma format material is often converted to the 4:2:0 chroma format and then encoded for distribution to consumers. In addition to chroma format, video is also characterised by resolution and frame rate. Example resolutions are ultra-high definition (UHD) with a resolution of 3840x2160 or '8K' with a resolution of 7680x4320 and example frame rates are 60 or 120Hz. Luma sample rates may range from approximately 500 mega samples per second to several giga samples per second. For the 4:2:0 chroma format, the
sample rate of each chroma channel is one quarter the luma sample rate and for the 4:2:2 chroma format, the sample rate of each chroma channel is one half the luma sample rate.
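The sample-rate arithmetic above can be made concrete with a small sketch (the function and enum names here are illustrative, not from the VVC specification):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helper (names are ours): the sample rate of one chroma
// channel relative to the luma sample rate for common chroma formats.
enum class ChromaFormat { k420, k422, k444 };

std::uint64_t chromaSampleRate(std::uint64_t lumaRate, ChromaFormat fmt) {
    switch (fmt) {
        case ChromaFormat::k420: return lumaRate / 4;  // half horizontally and vertically
        case ChromaFormat::k422: return lumaRate / 2;  // half horizontally only
        default:                 return lumaRate;      // 4:4:4: no subsampling
    }
}
```

For example, UHD at 60 Hz has a luma rate of 3840 x 2160 x 60, roughly 498 megasamples per second, so each 4:2:0 chroma channel carries roughly 124 megasamples per second.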
[0005] The VVC standard is a 'block based' codec, in which frames are firstly divided into a square array of regions known as 'coding tree units' (CTUs). Where a frame is not integer divisible into CTUs, the CTUs along the right and bottom edge may be truncated in size to match the frame size. CTUs generally occupy a relatively large area, such as 128x128 luma samples. However, CTUs at the right and bottom edge of each frame may be smaller in area. Associated with each CTU is a 'coding tree' which may be a single tree for both the luma channel and the chroma channels (a 'shared tree') and may include 'forks' into separate trees (or 'dual trees') each for the luma channel and the chroma channels. A coding tree defines a decomposition of the area of the CTU into a set of blocks, also referred to as 'coding units' (CUs). The CUs are processed for encoding or decoding in a particular order. Separate coding trees for luma and chroma generally commence at the 64x64 luma sample granularity; above this a shared tree exists. As a consequence of the use of the 4:2:0 chroma format, a separate coding tree structure commencing at 64x64 luma sample granularity includes a collocated chroma coding tree with a 32x32 chroma sample area. The designation 'unit' indicates applicability across all colour channels of the coding tree from which the block is derived. A single coding tree results in coding units having a luma coding block and two chroma coding blocks. The luma branch of a separate coding tree results in coding units, each having a luma coding block, and the chroma branch of a separate coding tree results in coding units, each having a pair of chroma blocks. The above-mentioned CUs are also associated with 'prediction units' (PUs), and 'transform units' (TUs), each of which apply to all colour channels of the coding tree from which the CU is derived.
Similarly, coding blocks are associated with prediction blocks (PBs) and transform blocks (TBs), each of which apply to a single colour channel. A single tree with CUs spanning the colour channels of 4:2:0 chroma format video data results in chroma coding blocks having half the width and height of the corresponding luma coding blocks.
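The half-width, half-height relationship for 4:2:0 can be sketched as follows (a minimal illustration with our own naming):

```cpp
#include <utility>

// Illustrative sketch: chroma coding-block dimensions derived from a luma
// coding block under the 4:2:0 chroma format (half width, half height).
std::pair<int, int> chromaBlockSize420(int lumaWidth, int lumaHeight) {
    return { lumaWidth / 2, lumaHeight / 2 };
}
```

For instance, a 128x128 luma coding block corresponds to a 64x64 chroma coding block in each of Cb and Cr.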
[0006] Notwithstanding the above distinction between 'units' and 'blocks', the term 'block' may be used as a general term for areas or regions of a frame for which operations are applied to all colour channels.
[0007] For each CU a 'prediction unit' (or 'PU') of the contents (sample values) of the corresponding area of frame data is generated. Further, a representation of the difference (or 'spatial domain' residual) between the prediction and the contents of the area as seen at input to
the encoder is formed. The difference in each colour channel may be transformed and coded as a sequence of residual coefficients, forming one or more TUs for a given CU. The applied transform may be a Discrete Cosine Transform (DCT) or other transform, applied to each block of residual values. This transform is applied separably, that is, the two-dimensional transform is performed in two passes. The block is firstly transformed by applying a one-dimensional transform to each row of samples in the block. Then, the partial result is transformed by applying a one-dimensional transform to each column of the partial result to produce a final block of transform coefficients that substantially decorrelates the residual samples. Transforms of various sizes are supported by the VVC standard, including transforms of rectangular-shaped blocks, with each side dimension being a power of two. Transform coefficients are quantised for entropy encoding into a bitstream. An additional non-separable transform stage may also be applied. Finally, transform application may be bypassed.
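The two-pass separable application described above can be sketched as follows. This uses a floating-point textbook DCT-II as the 1-D kernel; the VVC standard uses scaled integer kernels, so this is only an illustration of the row-then-column structure:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Block = std::vector<std::vector<double>>;

// 1-D DCT-II (unnormalised): out[k] = sum_i in[i] * cos(pi*(i+0.5)*k/n).
static std::vector<double> dct1d(const std::vector<double>& in) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<double> out(n, 0.0);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            out[k] += in[i] * std::cos(pi * (i + 0.5) * k / n);
    return out;
}

// Separable 2-D transform: transform every row, then every column of
// the partial result, as described in the text.
Block dct2d(Block b) {
    for (auto& row : b) row = dct1d(row);          // first pass: rows
    const std::size_t h = b.size(), w = b[0].size();
    for (std::size_t c = 0; c < w; ++c) {          // second pass: columns
        std::vector<double> col(h);
        for (std::size_t r = 0; r < h; ++r) col[r] = b[r][c];
        col = dct1d(col);
        for (std::size_t r = 0; r < h; ++r) b[r][c] = col[r];
    }
    return b;
}
```

A constant residual block transforms to a single non-zero DC coefficient, which is the decorrelation property the text refers to.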
[0008] VVC features an intra-frame prediction and inter-frame prediction. Intra-frame prediction involves the use of previously processed samples in a frame being used to generate a prediction of a current block of samples in the frame. Inter-frame prediction involves generating a prediction of a current block of samples in a frame using a block of samples obtained from a previously decoded frame. The block of samples obtained from a previously decoded frame is offset from the spatial location of the current block according to a motion vector, which often has filtering being applied. Intra-frame prediction blocks can be (i) a uniform sample value ("DC intra prediction"), (ii) a plane having an offset and horizontal and vertical gradient ("planar intra prediction"), (iii) a population of the block with neighbouring samples applied in a particular direction ("angular intra prediction") or (iv) the result of a matrix multiplication using neighbouring samples and selected matrix coefficients. Further discrepancy between a predicted block and the corresponding input samples may be corrected to an extent by encoding a 'residual' into the bitstream. The residual is generally transformed from the spatial domain to the frequency domain to form residual coefficients (in a 'primary transform' domain), which may be further transformed by application of a 'secondary transform' (to produce residual coefficients in a 'secondary transform domain'). Residual coefficients are quantised according to a quantisation parameter, resulting in a loss of accuracy of the reconstruction of the samples produced at the decoder but with a reduction in bitrate in the bitstream.
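Of the intra prediction variants listed, DC intra prediction (i) is the simplest to illustrate. The sketch below fills the block with a rounded mean of the neighbouring reference samples; the exact VVC averaging rules (which depend on block shape) are omitted, and the function name is ours:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative "DC intra prediction": the predicted block is a uniform
// value derived from the mean of the neighbouring reference samples.
std::vector<int> dcIntraPredict(const std::vector<int>& neighbours,
                                int width, int height) {
    const long long sum =
        std::accumulate(neighbours.begin(), neighbours.end(), 0LL);
    const long long n = static_cast<long long>(neighbours.size());
    const int dc = static_cast<int>((sum + n / 2) / n);  // rounded mean
    return std::vector<int>(static_cast<std::size_t>(width) * height, dc);
}
```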
[0009] Implementation costs, for example any of multiplier count, memory usage, level of accuracy, and efficiency of communication and the like, are also important.
SUMMARY
[00010] It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
[00011] One aspect of the present invention provides a method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising: determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree; decoding, for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and decoding the first plurality of coding units by applying the corresponding selected kernel and a DCT-2 kernel to the residual coefficients of each coding unit and decoding the second plurality of coding units by applying either a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.
[00012] In another aspect, the at least one secondary colour channel comprises two secondary colour channels, the DCT-2 transform is applied to one of the two secondary colour channels and the bypass is applied to the other of the two secondary colour channels.
[00013] In another aspect, the decoded index is located in the bitstream immediately after a last position of transform blocks of each of the coding units for the primary colour channel.
[00014] Another aspect of the present invention provides a method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising: determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree; decoding indices to select kernels for the first plurality of coding units only, wherein each index selects one of a plurality of kernels for a corresponding one of the first plurality of coding units; and decoding the first plurality of coding units by applying the selected kernel and a DCT-2
transform to residual coefficients of the corresponding coding units and decoding the second plurality of coding units by applying one of a DCT-2 transform or a bypass operation to residual coefficients of the coding units.
[00015] Another aspect of the present invention provides a method of decoding a coding tree unit of a bitstream of video data, each coding unit of the coding tree unit being for a primary colour channel and for at least one secondary colour channel, the method comprising: determining the coding unit according to decoded split flags of the primary colour channel and the at least one secondary channel; decoding, for the coding units, an index to select a kernel for the primary colour channel, the decoded index being located immediately after a last position of transform blocks of the primary colour channel in the bitstream; and decoding the coding unit by applying the corresponding selected kernel and a DCT-2 transform to residual coefficients of a transform block for the primary colour channel and applying one of a DCT-2 transform or a bypass operation to residual coefficients of the transform blocks for the at least one secondary colour channel.
[00016] In another aspect, the index of one of the first plurality of coding units is determined if the position of a last residual coefficient of the transform block of the coding unit is equal to or less than a threshold last position.
[00017] In another aspect, the threshold last position is 7 if the transform block has a size of 4x4 or 8x8 residual coefficients.
[00018] In another aspect, the threshold last position is 15 if the transform block has a size other than 4x4 or 8x8 residual coefficients.
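The gating rule of paragraphs [00016] to [00018] can be sketched as a single predicate (the function name and parameterisation are ours):

```cpp
#include <cassert>

// Illustrative predicate: the kernel-selection index is decoded only when
// the last significant residual coefficient position does not exceed a
// size-dependent threshold (7 for 4x4 and 8x8 blocks, 15 otherwise).
bool kernelIndexPresent(int lastPos, int tbWidth, int tbHeight) {
    const bool smallBlock = (tbWidth == 4 && tbHeight == 4) ||
                            (tbWidth == 8 && tbHeight == 8);
    const int threshold = smallBlock ? 7 : 15;
    return lastPos <= threshold;
}
```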
[00019] Another aspect of the present invention provides a method of encoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, into a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising: determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel; determining, for each of the first plurality of coding units only, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; encoding the first plurality of coding units into the bitstream by applying a DCT-2 transform followed by the corresponding selected kernel to residual coefficients for each coding
unit; and encoding the second plurality of coding units into the bitstream by applying one of a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.
[00020] In another aspect, the determined index is encoded into the bitstream immediately after a last position of transform blocks of each of the coding units for the primary colour channel.
[00021] In another aspect, the method further comprises determining an intra prediction mode for each of the first plurality of coding units, the intra prediction mode selected from a set of prediction modes, the set being extended by an integer number of modes based on a VTM-8.0 Versatile Video Coding software reference model.
In another aspect, the method further comprises determining an intra prediction mode for each of the first plurality of coding units, the intra prediction mode selected from a set of prediction modes of: const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {5, 5, 5, 5, 4, 4}, for transform blocks sized 4x4, 4x8, 4x16, 4x32, 4x64, 4x128;
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {5, 5, 5, 5, 5, 4}, for transform blocks sized 8x4, 8x8, 8x16, 8x32, 8x64, 8x128;
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {5, 5, 5, 5, 5, 4}, for transform blocks sized 16x4, 16x8, 16x16, 16x32, 16x64, 16x128;
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {5, 5, 5, 5, 5, 4}, for transform blocks sized 32x4, 32x8, 32x16, 32x32, 32x64, 32x128;
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {4, 5, 5, 5, 5, 4}, for transform blocks sized 64x4, 64x8, 64x16, 64x32, 64x64, 64x128; or
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {3, 3, 3, 3, 3, 4}, for transform blocks sized 128x4, 128x8, 128x16, 128x32, 128x64, 128x128.
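If the six rows above are interpreted as one table indexed by log2 of the block width and height (an assumption on our part; the patent lists the rows individually), a lookup can be sketched as:

```cpp
#include <cassert>

// Illustrative lookup over the per-size values of paragraph [00021]:
// rows indexed by log2(width)-2, columns by log2(height)-2. The table
// layout is our interpretation of the six declarations in the text.
const int kNumFastModes[6][6] = {
    {5, 5, 5, 5, 4, 4},  // width 4
    {5, 5, 5, 5, 5, 4},  // width 8
    {5, 5, 5, 5, 5, 4},  // width 16
    {5, 5, 5, 5, 5, 4},  // width 32
    {4, 5, 5, 5, 5, 4},  // width 64
    {3, 3, 3, 3, 3, 4},  // width 128
};

int log2Dim(int d) {
    int l = 0;
    while (d > 1) { d >>= 1; ++l; }
    return l;
}

int numFastModes(int width, int height) {
    return kNumFastModes[log2Dim(width) - 2][log2Dim(height) - 2];
}
```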
[00022] Another aspect of the present invention provides a non-transitory computer readable medium having a computer program stored thereon to implement a method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising: determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one
secondary colour channel according to decoded split flags of the first coding tree and the second coding tree; decoding, for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and decoding the first plurality of coding units by applying the corresponding selected kernel and a DCT-2 kernel to the residual coefficients of each coding unit and decoding the second plurality of coding units by selecting either one of a set consisting of a DCT-2 transform or a bypass operation to apply to residual coefficients of the coding units for the at least one secondary colour channel.
[00023] Another aspect of the present invention provides a system, comprising: a memory; and a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising: determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree; decoding, for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and decoding the first plurality of coding units by applying the corresponding selected kernel and a DCT-2 kernel to the residual coefficients of each coding unit and decoding the second plurality of coding units by applying either a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.
[00024] Another aspect of the present invention provides a video decoder, configured to: receive a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, determine the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree; decode, for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and decode the first plurality of coding units by applying the corresponding selected kernel and a DCT-2
kernel to the residual coefficients of each coding unit and decode the second plurality of coding units by applying either a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.
[00025] Other aspects are also disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[00026] At least one embodiment of the present invention will now be described with reference to the following drawings and appendices, in which:
[00027] Fig. 1 is a schematic block diagram showing a video encoding and decoding system;
[00028] Figs. 2A and 2B form a schematic block diagram of a general purpose computer system upon which one or both of the video encoding and decoding system of Fig. 1 may be practiced;
[00029] Fig. 3 is a schematic block diagram showing functional modules of a video encoder;
[00030] Fig. 4 is a schematic block diagram showing functional modules of a video decoder;
[00031] Fig. 5 is a schematic block diagram showing the available divisions of a block into one or more blocks in the tree structure of versatile video coding;
[00032] Fig. 6 is a schematic illustration of a dataflow to achieve permitted divisions of a block into one or more blocks in a tree structure of versatile video coding;
[00033] Figs. 7A and 7B show an example division of a coding tree unit (CTU) into a number of coding units (CUs);
[00034] Figs. 8A, 8B, 8C, and 8D show forward and inverse non-separable secondary transforms performed according to different sizes of transform blocks;
[00035] Fig. 9 shows a set of regions of application of the secondary transform for transform blocks of various sizes;
[00036] Fig. 10 shows a syntax structure for a bitstream with multiple slices, each of which includes multiple coding units;
[00037] Fig. 11 shows a syntax structure for a bitstream with a shared tree for luma and chroma coding units of a coding tree unit;
[00038] Fig. 12 shows a syntax structure for a bitstream with a separate tree for luma and chroma coding units of a coding tree unit;
[00039] Fig. 13 shows a method for encoding a frame into a bitstream including one or more slices as sequences of coding units using a shared tree for luma and chroma;
[00040] Fig. 14 shows a method for encoding a coding unit into a bitstream using a shared tree for luma and chroma;
[00041] Fig. 15 shows a method for decoding a frame from a bitstream as sequences of coding units arranged into slices;
[00042] Fig. 16 shows a method for decoding a coding unit from a bitstream using a shared coding tree for luma and chroma;
[00043] Fig. 17 shows a method for encoding a frame into a bitstream including one or more slices as sequences of coding units using a separate tree for luma and chroma; and
[00044] Fig. 18 shows a method for decoding a frame from a bitstream including one or more slices as sequences of coding units using a separate tree for luma and chroma.
DETAILED DESCRIPTION INCLUDING BEST MODE
[00045] Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
[00046] The syntax of the bitstream format of a video compression standard is defined as a hierarchy of 'syntax structures'. Each syntax structure defines a set of syntax elements, some of which may be conditional on others. Compression efficiency is improved when the syntax only allows combinations of syntax elements that correspond to useful combinations of tools. Additionally, complexity is reduced by prohibiting combinations of syntax elements that,
although possible for implementation, are deemed to offer insufficient compression advantage for the resulting implementation cost.
[00047] Fig. 1 is a schematic block diagram showing functional modules of a video encoding and decoding system 100. The system 100 signals primary and secondary transform parameters such that compression efficiency gain is achieved.
[00048] The system 100 includes a source device 110 and a destination device 130. A communication channel 120 is used to communicate encoded video information from the source device 110 to the destination device 130. In some arrangements, the source device 110 and destination device 130 may either or both comprise respective mobile telephone handsets or "smartphones", in which case the communication channel 120 is a wireless channel. In other arrangements, the source device 110 and destination device 130 may comprise video conferencing equipment, in which case the communication channel 120 is typically a wired channel, such as an internet connection. Moreover, the source device 110 and the destination device 130 may comprise any of a wide range of devices, including devices supporting over-the-air television broadcasts, cable television applications, internet video applications (including streaming) and applications where encoded video data is captured on some computer-readable storage medium, such as hard disk drives in a file server.
[00049] As shown in Fig. 1, the source device 110 includes a video source 112, a video encoder 114 and a transmitter 116. The video source 112 typically comprises a source of captured video frame data (shown as 113), such as an image capture sensor, a previously captured video sequence stored on a non-transitory recording medium, or a video feed from a remote image capture sensor. The video source 112 may also be an output of a computer graphics card, for example displaying the video output of an operating system and various applications executing upon a computing device, for example a tablet computer. Examples of source devices 110 that may include an image capture sensor as the video source 112 include smart-phones, video camcorders, professional video cameras, and network video cameras.
[00050] The video encoder 114 converts (or 'encodes') the captured frame data (indicated by an arrow 113) from the video source 112 into a bitstream (indicated by an arrow 115) as described further with reference to Fig. 3. The bitstream 115 is transmitted by the transmitter 116 over the communication channel 120 as encoded video data (or "encoded video information"). It is also possible for the bitstream 115 to be stored in a non-transitory storage device 122, such as a "Flash" memory or a hard disk drive, until later being transmitted over the
communication channel 120, or in lieu of transmission over the communication channel 120. For example, encoded video data may be served upon demand to customers over a wide area network (WAN) for a video streaming application.
[00051] The destination device 130 includes a receiver 132, a video decoder 134 and a display device 136. The receiver 132 receives encoded video data from the communication channel 120 and passes received video data to the video decoder 134 as a bitstream (indicated by an arrow 133). The video decoder 134 then outputs decoded frame data (indicated by an arrow 135) to the display device 136 for reproduction. The decoded frame data 135 has the same chroma format as the frame data 113. Examples of the display device 136 include a cathode ray tube, a liquid crystal display, such as in smart-phones, tablet computers, computer monitors or in stand-alone television sets. It is also possible for the functionality of each of the source device 110 and the destination device 130 to be embodied in a single device, examples of which include mobile telephone handsets and tablet computers. Decoded frame data may be further transformed before presentation to a user. For example, a 'viewport' having a particular latitude and longitude may be rendered from decoded frame data using a projection format to represent a 360° view of a scene.
[00052] Notwithstanding the example devices mentioned above, each of the source device 110 and destination device 130 may be configured within a general purpose computing system, typically through a combination of hardware and software components. Fig. 2A illustrates such a computer system 200, which includes: a computer module 201; input devices such as a keyboard 202, a mouse pointer device 203, a scanner 226, a camera 227, which may be configured as the video source 112, and a microphone 280; and output devices including a printer 215, a display device 214, which may be configured as the display device 136, and loudspeakers 217. An external Modulator-Demodulator (Modem) transceiver device 216 may be used by the computer module 201 for communicating to and from a communications network 220 via a connection 221. The communications network 220, which may represent the communication channel 120, may be a wide area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 221 is a telephone line, the modem 216 may be a traditional "dial-up" modem. Alternatively, where the connection 221 is a high capacity (e.g., cable or optical) connection, the modem 216 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 220. The transceiver device 216 may provide the functionality of the transmitter 116
and the receiver 132 and the communication channel 120 may be embodied in the connection 221.
[00053] The computer module 201 typically includes at least one processor unit 205, and a memory unit 206. For example, the memory unit 206 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 201 also includes a number of input/output (I/O) interfaces including: an audio-video interface 207 that couples to the video display 214, loudspeakers 217 and microphone 280; an I/O interface 213 that couples to the keyboard 202, mouse 203, scanner 226, camera 227 and optionally a joystick or other human interface device (not illustrated); and an interface 208 for the external modem 216 and printer 215. The signal from the audio-video interface 207 to the computer monitor 214 is generally the output of a computer graphics card. In some implementations, the modem 216 may be incorporated within the computer module 201, for example within the interface 208. The computer module 201 also has a local network interface 211, which permits coupling of the computer system 200 via a connection 223 to a local-area communications network 222, known as a Local Area Network (LAN). As illustrated in Fig. 2A, the local communications network 222 may also couple to the wide network 220 via a connection 224, which would typically include a so-called "firewall" device or device of similar functionality. The local network interface 211 may comprise an Ethernet™ circuit card, a Bluetooth™ wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 211. The local network interface 211 may also provide the functionality of the transmitter 116 and the receiver 132 and the communication channel 120 may also be embodied in the local communications network 222.
[00054] The I/O interfaces 208 and 213 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 209 are provided and typically include a hard disk drive (HDD) 210. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 212 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g. CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable, external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the computer system 200. Typically, any of the HDD 210, optical drive 212, networks 220 and 222 may also be configured to operate as the video source 112, or as a destination for decoded video data to be stored for reproduction via the display 214. The source
device 110 and the destination device 130 of the system 100 may be embodied in the computer system 200.
[00055] The components 205 to 213 of the computer module 201 typically communicate via an interconnected bus 204 and in a manner that results in a conventional mode of operation of the computer system 200 known to those in the relevant art. For example, the processor 205 is coupled to the system bus 204 using a connection 218. Likewise, the memory 206 and optical disk drive 212 are coupled to the system bus 204 by connections 219. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun SPARCstations, Apple Mac™ or similar computer systems.
[00056] Where appropriate or desired, the video encoder 114 and the video decoder 134, as well as methods described below, may be implemented using the computer system 200. In particular, the video encoder 114, the video decoder 134 and methods to be described may be implemented as one or more software application programs 233 executable within the computer system 200. More particularly, the video encoder 114, the video decoder 134 and the steps of the described methods are effected by instructions 231 (see Fig. 2B) in the software 233 that are carried out within the computer system 200. The software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules perform the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.
[00057] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 200 from the computer readable medium, and then executed by the computer system 200. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 200 preferably effects an advantageous apparatus for implementing the video encoder 114, the video decoder 134 and the described methods.
[00058] The software 233 is typically stored in the HDD 210 or the memory 206. The software is loaded into the computer system 200 from a computer readable medium, and executed by the computer system 200. Thus, for example, the software 233 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 225 that is read by the optical disk drive 212.
[00059] In some instances, the application programs 233 may be supplied to the user encoded on one or more CD-ROMs 225 and read via the corresponding drive 212, or alternatively may be read by the user from the networks 220 or 222. Still further, the software can also be loaded into the computer system 200 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 200 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray Disc™, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 201. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of the software, application programs, instructions and/or video data or encoded video data to the computer module 201 include radio or infra-red transmission channels, as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
[00060] The second part of the application program 233 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 214. Through manipulation of typically the keyboard 202 and the mouse 203, a user of the computer system 200 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 217 and user voice commands input via the microphone 280.
[00061] Fig. 2B is a detailed schematic block diagram of the processor 205 and a "memory" 234. The memory 234 represents a logical aggregation of all the memory modules (including the HDD 210 and semiconductor memory 206) that can be accessed by the computer module 201 in Fig. 2A.
[00062] When the computer module 201 is initially powered up, a power-on self-test (POST) program 250 executes. The POST program 250 is typically stored in a ROM 249 of the semiconductor memory 206 of Fig. 2A. A hardware device such as the ROM 249 storing software is sometimes referred to as firmware. The POST program 250 examines hardware
within the computer module 201 to ensure proper functioning and typically checks the processor 205, the memory 234 (209, 206), and a basic input-output system software (BIOS) module 251, also typically stored in the ROM 249, for correct operation. Once the POST program 250 has run successfully, the BIOS 251 activates the hard disk drive 210 of Fig. 2A. Activation of the hard disk drive 210 causes a bootstrap loader program 252 that is resident on the hard disk drive 210 to execute via the processor 205. This loads an operating system 253 into the RAM 206, upon which the operating system 253 commences operation. The operating system 253 is a system level application, executable by the processor 205, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
[00063] The operating system 253 manages the memory 234 (209, 206) to ensure that each process or application running on the computer module 201 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the computer system 200 of Fig. 2A must be used properly so that each process can run effectively. Accordingly, the aggregated memory 234 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 200 and how such is used.
[00064] As shown in Fig. 2B, the processor 205 includes a number of functional modules including a control unit 239, an arithmetic logic unit (ALU) 240, and a local or internal memory 248, sometimes called a cache memory. The cache memory 248 typically includes a number of storage registers 244-246 in a register section. One or more internal busses 241 functionally interconnect these functional modules. The processor 205 typically also has one or more interfaces 242 for communicating with external devices via the system bus 204, using a connection 218. The memory 234 is coupled to the bus 204 using a connection 219.
[00065] The application program 233 includes a sequence of instructions 231 that may include conditional branch and loop instructions. The program 233 may also include data 232 which is used in execution of the program 233. The instructions 231 and the data 232 are stored in memory locations 228, 229, 230 and 235, 236, 237, respectively. Depending upon the relative size of the instructions 231 and the memory locations 228-230, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 230. Alternatively, an instruction may be segmented into a number of parts each of
which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 228 and 229.
[00066] In general, the processor 205 is given a set of instructions which are executed therein. The processor 205 waits for a subsequent input, to which the processor 205 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 202, 203, data received from an external source across one of the networks 220, 222, data retrieved from one of the storage devices 206, 209 or data retrieved from a storage medium 225 inserted into the corresponding reader 212, all depicted in Fig. 2A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 234.
[00067] The video encoder 114, the video decoder 134 and the described methods may use input variables 254, which are stored in the memory 234 in corresponding memory locations 255, 256, 257. The video encoder 114, the video decoder 134 and the described methods produce output variables 261, which are stored in the memory 234 in corresponding memory locations 262, 263, 264. Intermediate variables 258 may be stored in memory locations 259, 260, 266 and 267.
[00068] Referring to the processor 205 of Fig. 2B, the registers 244, 245, 246, the arithmetic logic unit (ALU) 240, and the control unit 239 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 233. Each fetch, decode, and execute cycle comprises:
a fetch operation, which fetches or reads an instruction 231 from a memory location 228, 229, 230;
a decode operation in which the control unit 239 determines which instruction has been fetched; and
an execute operation in which the control unit 239 and/or the ALU 240 execute the instruction.
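The fetch, decode, and execute operations listed above can be illustrated with a toy interpreter loop. The following Python sketch is for illustration only and is not part of the described arrangement; the opcodes and the single-accumulator machine state are invented for this example.

```python
# Toy fetch-decode-execute loop (illustrative assumptions: a program is
# a list of (opcode, operand) pairs, and the machine has one
# accumulator register and a program counter).
def run(program):
    """Execute a toy program and return the final accumulator value."""
    acc = 0   # accumulator register
    pc = 0    # program counter
    while pc < len(program):
        opcode, operand = program[pc]   # fetch operation
        pc += 1
        # decode operation followed by execute operation
        if opcode == "LOAD":
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "HALT":
            break
    return acc
```

For example, `run([("LOAD", 5), ("ADD", 3), ("HALT", 0)])` performs three such cycles before halting.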
[00069] Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 239 stores or writes a value to a memory location 232.
[00070] Each step or sub-process in the methods of Figs. 13 to 18, to be described, is associated with one or more segments of the program 233 and is typically performed by the registers 244, 245, 246, the ALU 240, and the control unit 239 in the processor 205 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 233.
[00071] Fig. 3 is a schematic block diagram showing functional modules of the video encoder 114. Fig. 4 is a schematic block diagram showing functional modules of the video decoder 134. Generally, data passes between functional modules within the video encoder 114 and the video decoder 134 in groups of samples or coefficients, such as divisions of blocks into sub-blocks of a fixed size, or as arrays. The video encoder 114 and video decoder 134 may be implemented using a general-purpose computer system 200, as shown in Figs. 2A and 2B, where the various functional modules may be implemented by dedicated hardware within the computer system 200, or by software executable within the computer system 200, such as one or more software code modules of the software application program 233 resident on the hard disk drive 210 and being controlled in its execution by the processor 205. Alternatively, the video encoder 114 and video decoder 134 may be implemented by a combination of dedicated hardware and software executable within the computer system 200. The video encoder 114, the video decoder 134 and the described methods may alternatively be implemented in dedicated hardware, such as one or more integrated circuits performing the functions or sub-functions of the described methods. Such dedicated hardware may include graphic processing units (GPUs), digital signal processors (DSPs), application-specific standard products (ASSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or one or more microprocessors and associated memories. In particular, the video encoder 114 comprises modules 310-386 and the video decoder 134 comprises modules 420-496 which may each be implemented as one or more software code modules of the software application program 233.
[00072] Although the video encoder 114 of Fig. 3 is an example of a versatile video coding (VVC) video encoding pipeline, other video codecs may also be used to perform the processing stages described herein. The video encoder 114 receives captured frame data 113, such as a series of frames, each frame including one or more colour channels. The frame data 113
includes two-dimensional arrays of luma ('luma channel') and chroma ('chroma channel') samples arranged in a 'chroma format', for example 4:0:0, 4:2:0, 4:2:2, or 4:4:4 chroma format. A block partitioner 310 firstly divides the frame data 113 into CTUs, generally square in shape and configured such that a particular size for the CTUs is used. The size of the CTUs may be 64x64, 128x128, or 256x256 luma samples for example.
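The chroma formats mentioned above determine how chroma samples are subsampled relative to luma. As an illustrative sketch only (the table and helper function below are not part of the described arrangement), the chroma block dimensions collocated with a given luma block can be derived as follows:

```python
# Horizontal and vertical chroma subsampling factors per chroma format
# (illustrative; 4:0:0 is monochrome, so it has no chroma channels).
SUBSAMPLING = {
    "4:0:0": None,      # monochrome: no chroma channels
    "4:2:0": (2, 2),    # chroma halved horizontally and vertically
    "4:2:2": (2, 1),    # chroma halved horizontally only
    "4:4:4": (1, 1),    # chroma at full resolution
}

def chroma_block_size(luma_w, luma_h, chroma_format):
    """Return (width, height) of the chroma block collocated with a
    luma block of luma_w x luma_h samples, or None for 4:0:0."""
    factors = SUBSAMPLING[chroma_format]
    if factors is None:
        return None
    sx, sy = factors
    return luma_w // sx, luma_h // sy
```

For example, a 128x128 luma CTU has a collocated 64x64 chroma block in the 4:2:0 chroma format.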
[00073] The block partitioner 310 further divides each CTU into one or more CUs according to either a shared coding tree or to a luma coding tree and a chroma coding tree at a point where a shared coding tree splits into luma and chroma branches. The luma channel may also be referred to as a primary colour channel. Each chroma channel may also be referred to as a secondary colour channel. The CUs have a variety of sizes, and may include both square and non-square aspect ratios. Operation of the block partitioner 310 is further described with reference to Figs. 13, 14 and 17. However, in the VVC standard, CUs/CBs, PUs/PBs, and TUs/TBs always have side lengths that are powers of two. Thus, a current CU, represented as 312, is output from the block partitioner 310, progressing in accordance with an iteration over the one or more blocks of the CTU, in accordance with the shared tree or the luma coding tree and the chroma coding tree of the CTU. Options for partitioning CTUs into CBs are further described below with reference to Figs. 5 and 6.
[00074] The CTUs resulting from the first division of the frame data 113 may be scanned in raster scan order and may be grouped into one or more 'slices'. A slice may be an 'intra' (or 'I') slice. An intra slice (I slice) contains no inter-predicted CUs, i.e. only intra prediction is used. Alternatively, a slice may be uni- or bi-predicted ('P' or 'B' slice, respectively), indicating additional availability of one or two reference blocks for predicting a CU, known as 'uni-prediction' and 'bi-prediction', respectively.
[00075] In an I slice, the coding tree of each CTU may diverge below the 64x64 level into two separate coding trees, one for luma and another for chroma. Use of separate trees allows different block structure to exist between luma and chroma within a luma 64x64 area of a CTU. For example, a large chroma CB may be collocated with numerous smaller luma CBs and vice versa. In a P or B slice, a single coding tree of a CTU defines a block structure common to luma and chroma. The resulting blocks of the single tree may be intra predicted or inter predicted.
[00076] For each CTU, the video encoder 114 operates in two stages. In the first stage (referred to as a 'search' stage), the block partitioner 310 tests various potential configurations
of a coding tree. Each potential configuration of a coding tree has associated 'candidate' CUs. The first stage involves testing various candidate CUs to select CUs providing relatively high compression efficiency with relatively low distortion. The testing generally involves a Lagrangian optimisation whereby a candidate CU is evaluated based on a weighted combination of the rate (coding cost) and the distortion (error with respect to the input frame data 113). The 'best' candidate CUs (the CUs with the lowest evaluated rate/distortion) are selected for subsequent encoding into the bitstream 115. Included in evaluation of candidate CUs is an option to use a CU for a given area or to further split the area according to various splitting options and code each of the smaller resulting areas with further CUs, or split the areas even further. As a consequence, both the coding tree and the CUs themselves are selected in the search stage.
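The Lagrangian optimisation described above weighs rate against distortion. A minimal sketch, not part of the described arrangement (the candidate tuple structure, function names, and the value of lambda are assumptions for illustration), is:

```python
# Lagrangian rate-distortion cost: J = D + lambda * R, where D is the
# distortion, R is the rate (coding cost in bits), and lambda weights
# rate against distortion.
def rd_cost(distortion, rate_bits, lmbda):
    """Weighted rate-distortion cost of one candidate."""
    return distortion + lmbda * rate_bits

def select_best_candidate(candidates, lmbda):
    """Return the candidate with the lowest evaluated RD cost.
    Each candidate is a (label, distortion, rate_bits) tuple."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lmbda))
```

Note that the choice of lambda steers the search: a larger lambda penalises rate more heavily, favouring candidates that cost fewer bits even at higher distortion.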
[00077] The video encoder 114 produces a prediction block (PU), indicated by an arrow 320, for each CU, for example the CU 312. The PU 320 is a prediction of the contents of the associated CU 312. A subtracter module 322 produces a difference, indicated as 324 (or 'residual', referring to the difference being in the spatial domain), between the PU 320 and the CU 312. The difference 324 is a block-sized array of differences between corresponding samples in the PU 320 and the CU 312 and is produced for each colour channel of the CU 312. When primary and (optionally) secondary transforms are to be performed, the difference 324 is transformed in modules 326 and 330 to be passed to a quantiser module 334 for quantisation via a multiplexer 333. When the transform is to be skipped, the difference 324 is passed directly to the quantiser module 334 for quantisation via the multiplexer 333. The selection between transform and transform skip is independently made for each TB associated with the CU 312. The resulting quantised residual coefficients are represented as a TB (for each colour channel of the CU 312), indicated by an arrow 336. The PU 320 and associated TB 336 are typically chosen from one of many possible candidate CUs, for example based on evaluated cost or distortion.
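The residual computation performed by the subtracter module 322 is an element-wise difference in the spatial domain. A minimal sketch (illustrative only; blocks are represented here as nested Python lists rather than the module interfaces of the described arrangement) is:

```python
def residual_block(cu_samples, pu_samples):
    """Spatial-domain residual for one colour channel: the element-wise
    difference between the original CU samples and the prediction (PU)
    samples, both given as equal-sized 2-D lists."""
    return [[c - p for c, p in zip(cu_row, pu_row)]
            for cu_row, pu_row in zip(cu_samples, pu_samples)]
```

The resulting block-sized array of differences is what is then (optionally) transformed and quantised into a TB.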
[00078] A candidate CU is a CU resulting from one of the prediction modes available to the video encoder 114 for the associated PB and the resulting residual. When combined with the predicted PB in the video decoder 134, addition of the TB 336 after conversion back to the spatial domain reduces the difference between a decoded CU and the original CU 312, at the expense of additional signalling in a bitstream.
[00079] Each candidate coding unit (CU), that is, a prediction block (PU) in combination with one transform block (TB) per colour channel of the CU, thus has an associated coding cost (or 'rate') and an associated difference (or 'distortion'). The distortion of the CU is typically estimated as a difference in sample values, such as a sum of absolute differences (SAD) or a sum of squared differences (SSD). The estimate resulting from each candidate PU may be determined by a mode selector 386 using the difference 324 to determine a prediction mode 387. The prediction mode 387 indicates the decision to use a particular prediction mode for the current CU, for example intra-frame prediction or inter-frame prediction. For intra-predicted CUs belonging to a shared coding tree, independent intra prediction modes are specified for the luma PB versus the chroma PBs. For intra-predicted CUs belonging to luma or chroma branches of a dual coding tree, one intra prediction mode applies to the luma PB or the chroma PBs, respectively. Estimation of the coding costs associated with each candidate prediction mode and corresponding residual coding can be performed at significantly lower cost than entropy coding of the residual. Accordingly, a number of candidate modes can be evaluated to determine an optimum mode in a rate-distortion sense even in a real-time video encoder.
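The SAD and SSD distortion measures mentioned above can be sketched directly (illustrative only; blocks are nested Python lists here, and these helpers are not part of the described arrangement):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two same-sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def ssd(block_a, block_b):
    """Sum of squared differences between two same-sized 2-D blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))
```

SSD penalises large individual errors more heavily than SAD, which is one reason an encoder may prefer one measure over the other at different stages of the search.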
[00080] Lagrangian or similar optimisation processing can be employed both to select an optimal partitioning of a CTU into CBs (by the block partitioner 310) and to select a best prediction mode from a plurality of possible prediction modes. Through application of a Lagrangian optimisation process of the candidate modes in the mode selector module 386, the intra prediction mode 387, a secondary transform index 388, a primary transform type 389, and transform skip flags 390 (one for each TB) with the lowest cost measurement are selected.
[00081] In the second stage of operation of the video encoder 114 (referred to as a 'coding' stage), an iteration over the determined coding tree(s) of each CTU is performed in the video encoder 114. For a CTU using separate trees, for each 64x64 luma region of the CTU, a luma coding tree is firstly encoded followed by a chroma coding tree. Within the luma coding tree only luma CBs are encoded and within the chroma coding tree only chroma CBs are encoded. For a CTU using a shared tree, a single tree describes the CUs, i.e., the luma CBs and the chroma CBs according to the common block structure of the shared tree.
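The coding-stage iteration order described above can be sketched as follows. This is an illustrative assumption only: the CTU representation (a dictionary) and the `encode_*` callables are invented names standing in for the encoding of a luma tree, a chroma tree, or a shared tree.

```python
def encode_ctu(ctu, encode_luma_tree, encode_chroma_tree,
               encode_shared_tree):
    """Illustrative iteration order for the coding stage.

    For a CTU using separate trees, each 64x64 luma region is coded
    with its luma coding tree first, followed by its chroma coding
    tree. For a shared tree, a single tree covers luma and chroma."""
    if ctu["separate_trees"]:
        for region in ctu["regions"]:
            encode_luma_tree(region)    # luma CBs only
            encode_chroma_tree(region)  # chroma CBs only
    else:
        encode_shared_tree(ctu)         # luma and chroma CBs together
```

A small driver that records the calls shows the luma-then-chroma ordering per region.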
[00082] The entropy encoder 338 supports both variable-length coding of syntax elements and arithmetic coding of syntax elements. Portions of the bitstream such as 'parameter sets', for example a sequence parameter set (SPS), a picture parameter set (PPS), and a picture header
(PH) use a combination of fixed-length codewords and variable-length codewords. Slices have a slice header that uses variable length coding followed by slice data, which uses arithmetic coding. The picture header defines parameters specific to the current picture, such as picture-level quantisation parameter offsets. The slice data includes the syntax elements of each CTU in the slice. Use of variable length coding and arithmetic coding requires sequential parsing within each portion of the bitstream. The portions may be delineated with a start code to form 'network abstraction layer units' or 'NAL units'. Arithmetic coding is supported using a context-adaptive binary arithmetic coding process. Arithmetically coded syntax elements consist of sequences of one or more 'bins'. Bins, like bits, have a value of '0' or '1'. However, bins are not encoded in the bitstream 115 as discrete bits. Bins have an associated predicted (or 'likely' or 'most probable') value and an associated probability, known as a 'context'. When the actual bin to be coded matches the predicted value, a 'most probable symbol' (MPS) is coded. Coding a most probable symbol is relatively inexpensive in terms of consumed bits in the bitstream 115, including costs that amount to less than one discrete bit. When the actual bin to be coded mismatches the likely value, a 'least probable symbol' (LPS) is coded. Coding a least probable symbol has a relatively high cost in terms of consumed bits. The bin coding techniques enable efficient coding of bins where the probability of a '0' versus a '1' is skewed. For a syntax element with two possible values (that is, a 'flag'), a single bin is adequate. For syntax elements with many possible values, a sequence of bins is needed.
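The context adaptation described above can be illustrated with a deliberately simplified model. The real arithmetic coding engine uses finite-state probability tables and interval subdivision; the class below, its names, and the exponential update constant are assumptions made purely to show how a context tracks its most probable symbol.

```python
# Simplified sketch of one bin context (illustrative only, not the
# actual context-adaptive binary arithmetic coding state machine).
class BinContext:
    def __init__(self, p_one=0.5, rate=0.05):
        self.p_one = p_one   # estimated probability that the bin is '1'
        self.rate = rate     # adaptation speed (invented constant)

    @property
    def mps(self):
        """Most probable symbol under the current estimate."""
        return 1 if self.p_one >= 0.5 else 0

    def update(self, bin_value):
        """Move the probability estimate towards the observed bin,
        so a run of equal bins makes that value the MPS."""
        target = 1.0 if bin_value else 0.0
        self.p_one += self.rate * (target - self.p_one)
```

After a long run of '0' bins, the estimate skews towards '0', so subsequent '0' bins are coded as inexpensive MPS occurrences.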
[00083] The presence of later bins in the sequence may be determined based on the value of earlier bins in the sequence. Additionally, each bin may be associated with more than one context. The selection of a particular context can be dependent on earlier bins in the syntax element, the bin values of neighbouring syntax elements (i.e. those from neighbouring blocks) and the like. Each time a context-coded bin is encoded, the context that was selected for that bin (if any) is updated in a manner reflective of the new bin value. As such, the binary arithmetic coding scheme is said to be adaptive.
[00084] Also supported by the video encoder 114 are bins that lack a context ('bypass bins'). Bypass bins are coded assuming an equiprobable distribution between a '0' and a '1'. Thus, each bin has a coding cost of one bit in the bitstream 115. The absence of a context saves memory and reduces complexity, and thus bypass bins are used where the distribution of values for the particular bin is not skewed. One example of an entropy coder employing context and adaption is known in the art as CABAC (context adaptive binary arithmetic coder) and many variants of this coder have been employed in video coding.
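The adaptive behaviour of context-coded bins, and the fixed one-bit cost of bypass bins, can be illustrated with a toy model. The class name `Context`, the probability representation and the update step size below are illustrative only; the actual CABAC engine uses a table-driven, quantised probability state machine rather than floating-point estimates.

```python
import math

class Context:
    """Toy adaptive context: tracks an estimated probability of the
    most probable symbol (MPS). A sketch of the idea only, not the
    real CABAC state machine."""
    def __init__(self):
        self.mps = 0          # predicted ('most probable') bin value
        self.p_mps = 0.5      # estimated probability of the MPS

    def cost_bits(self, bin_value):
        # Ideal arithmetic-coding cost: -log2(p) bits for a symbol of
        # probability p. An MPS costs less than one bit once p_mps > 0.5.
        p = self.p_mps if bin_value == self.mps else 1.0 - self.p_mps
        return -math.log2(p)

    def encode(self, bin_value):
        cost = self.cost_bits(bin_value)
        # Adapt: move the probability estimate toward the coded value.
        if bin_value == self.mps:
            self.p_mps = self.p_mps + 0.05 * (1.0 - self.p_mps)
        else:
            self.p_mps = self.p_mps - 0.05 * self.p_mps
            if self.p_mps < 0.5:          # swap MPS and LPS roles
                self.mps, self.p_mps = 1 - self.mps, 1.0 - self.p_mps
        return cost

ctx = Context()
total = sum(ctx.encode(0) for _ in range(20))   # a skewed, all-'0' source
print(total < 20)   # True: after adaptation each '0' costs under one bit

# Bypass bins assume an equiprobable distribution, i.e. exactly one bit.
bypass_cost = -math.log2(0.5)
print(bypass_cost)  # 1.0
```

The skewed source demonstrates why context coding pays off: as the context adapts, coding the most probable symbol costs progressively less than one discrete bit, while a bypass bin is always one bit.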
[00085] The entropy encoder 338 encodes the primary transform type 389, one transform skip flag (i.e., 390) for each TB of the current CU, the intra prediction mode 387 and, if applicable to the current CU, the secondary transform index 388, using a combination of context-coded and bypass-coded bins. The secondary transform index 388 is signalled when the residual associated with the transform block includes significant residual coefficients only in those coefficient positions subject to transforming into primary coefficients by application of a secondary transform.
[00086] A multiplexer module 384 outputs the PB 320 from an intra-frame prediction module 364 according to the determined best intra prediction mode, selected from the tested prediction mode of each candidate CB. The candidate prediction modes need not include every conceivable prediction mode supported by the video encoder 114. Intra prediction falls into three types. "DC intra prediction" involves populating a PB with a single value representing the average of nearby reconstructed samples. "Planar intra prediction" involves populating a PB with samples according to a plane, with a DC offset and a vertical and horizontal gradient being derived from the nearby reconstructed neighbouring samples. The nearby reconstructed samples typically include a row of reconstructed samples above the current PB, extending to the right of the PB to an extent, and a column of reconstructed samples to the left of the current PB, extending downwards beyond the PB to an extent. "Angular intra prediction" involves populating a PB with reconstructed neighbouring samples filtered and propagated across the PB in a particular direction (or 'angle'). In VVC, 65 angles are supported, with rectangular blocks able to utilise additional angles, not available to square blocks, to produce a total of 87 angles. A fourth type of intra prediction is available to chroma PBs, whereby the PB is generated from collocated luma reconstructed samples according to a 'cross-component linear model' (CCLM) mode. Three different CCLM modes are available, each mode using a different model derived from the neighbouring luma and chroma samples. The derived model is used to generate a block of samples for the chroma PB from the collocated luma samples.
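The DC and planar modes described above can be sketched as follows. The helper names (`dc_predict`, `planar_predict`) and the simplified planar blending are illustrative assumptions; the exact VVC derivation uses different weights, rounding and reference-sample substitution rules.

```python
import numpy as np

def dc_predict(above, left, w, h):
    """DC intra prediction sketch: fill the PB with the average of
    the neighbouring reconstructed samples above and to the left."""
    dc = int(round((above[:w].sum() + left[:h].sum()) / (w + h)))
    return np.full((h, w), dc, dtype=np.int32)

def planar_predict(above, left, w, h):
    """Simplified planar prediction: blend horizontal and vertical
    linear interpolations toward the top-right and bottom-left
    reference samples (a hypothetical simplification of VVC)."""
    pred = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            horiz = (w - 1 - x) * left[y] + (x + 1) * above[w]   # above[w]: top-right
            vert = (h - 1 - y) * above[x] + (y + 1) * left[h]    # left[h]: bottom-left
            pred[y, x] = (horiz / w + vert / h) / 2
    return np.round(pred).astype(np.int32)

# Row above (plus top-right sample) and column left (plus bottom-left sample).
above = np.array([100, 102, 104, 106, 108], dtype=np.int32)
left = np.array([100, 98, 96, 94, 92], dtype=np.int32)
dc_pb = dc_predict(above, left, 4, 4)
print(dc_pb[0, 0])      # 100: the average of the eight neighbours
planar_pb = planar_predict(above, left, 4, 4)
print(planar_pb.shape)  # (4, 4)
```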
[00087] Where previously reconstructed samples are unavailable, for example at the edge of the frame, a default half-tone value of one half the range of the samples is used. For example, for 10-bit video a value of 512 is used. As no previously reconstructed samples are available for a CB located at the top-left position of a frame, angular and planar intra-prediction modes produce the same output as the DC prediction mode, i.e. a flat plane of samples having the half-tone value as magnitude.
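The half-tone default follows directly from the bit depth, as a minimal sketch (the function name is an assumption for illustration):

```python
def default_reference_value(bit_depth):
    # Half the sample range: 1 << (bit_depth - 1).
    return 1 << (bit_depth - 1)

print(default_reference_value(10))  # 512, matching the 10-bit example above
print(default_reference_value(8))   # 128
```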
[00088] For inter-frame prediction a prediction block 382 is produced using samples from one or two frames preceding the current frame in the coding order of frames in the bitstream by a motion compensation module 380 and output as the PB 320 by the multiplexer module 384. Moreover, for inter-frame prediction, a single coding tree is typically used for both the luma channel and the chroma channels. The order of coding frames in the bitstream may differ from the order of the frames when captured or displayed. When one frame is used for prediction, the block is said to be 'uni-predicted' and has one associated motion vector. When two frames are used for prediction, the block is said to be 'bi-predicted' and has two associated motion vectors. For a P slice, each CU may be intra predicted or uni-predicted. For a B slice, each CU may be intra predicted, uni-predicted, or bi-predicted. Frames are typically coded using a 'group of pictures' structure, enabling a temporal hierarchy of frames. Frames may be divided into multiple slices, each of which encodes a portion of the frame. A temporal hierarchy of frames allows a frame to reference a preceding and a subsequent picture in the order of displaying the frames. The images are coded in the order necessary to ensure the dependencies for decoding each frame are met.
[00089] The samples are selected according to a motion vector 378 and reference picture index. The motion vector 378 and reference picture index apply to all colour channels and thus inter prediction is described primarily in terms of operation upon PUs rather than PBs, i.e. the decomposition of each CTU into one or more inter-predicted blocks is described with a single coding tree. Inter prediction methods may vary in the number of motion parameters and their precision. Motion parameters typically comprise a reference frame index, indicating which reference frame(s) from lists of reference frames are to be used plus a spatial translation for each of the reference frames, but may include more frames, special frames, or complex affine parameters such as scaling and rotation. In addition, a pre-determined motion refinement process may be applied to generate dense motion estimates based on referenced sample blocks.
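The uni- versus bi-prediction distinction can be sketched as follows, assuming integer-pel motion vectors and omitting the sub-pixel interpolation filters; `motion_compensate` and `predict_block` are illustrative names, not functions from any codec implementation.

```python
import numpy as np

def motion_compensate(ref_frame, mv_x, mv_y, x, y, w, h):
    """Fetch a prediction block at an integer-pel motion vector.
    Sub-pel interpolation filtering is omitted in this sketch."""
    return ref_frame[y + mv_y : y + mv_y + h, x + mv_x : x + mv_x + w]

def predict_block(ref0, mv0, ref1=None, mv1=None, x=0, y=0, w=4, h=4):
    """Uni-prediction uses one reference block; bi-prediction blends
    (here: rounded average) blocks from two reference frames."""
    p0 = motion_compensate(ref0, *mv0, x, y, w, h).astype(np.int32)
    if ref1 is None:
        return p0                       # uni-predicted: one motion vector
    p1 = motion_compensate(ref1, *mv1, x, y, w, h).astype(np.int32)
    return (p0 + p1 + 1) >> 1           # bi-predicted: two motion vectors

ref_a = np.full((16, 16), 60, dtype=np.int32)
ref_b = np.full((16, 16), 80, dtype=np.int32)
uni = predict_block(ref_a, (2, 1), x=4, y=4)
bi = predict_block(ref_a, (2, 1), ref_b, (0, 0), x=4, y=4)
print(uni[0, 0])  # 60
print(bi[0, 0])   # 70: rounded average of 60 and 80
```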
[00090] Having determined and selected the PU 320, and subtracted the PU 320 from the original sample block at the subtractor 322, a residual with lowest coding cost, represented as 324, is obtained and subjected to lossy compression. The lossy compression process comprises the steps of transformation, quantisation and entropy coding. The forward primary transform module 326 applies a forward transform to the difference 324, converting the difference 324 from the spatial domain to the frequency domain, and producing primary transform coefficients represented by an arrow 328 according to the primary transform type 389. The largest primary transform size in one dimension is either a 32-point DCT-2 or a
64-point DCT-2 transform. If the CB being encoded is larger than the largest supported primary transform size expressed as a block size, i.e. 64x64 or 32x32, the primary transform 326 is applied in a tiled manner to transform all samples of the difference 324. Where each application of the transform operates on a TB of the difference 324 larger than 32x32, e.g. 64x64, all resulting primary transform coefficients 328 outside of the upper-left 32x32 area of the TB are set to zero, i.e. discarded. For TBs of sizes up to 32x32 the primary transform type 389 may indicate application of a combination of DST-7 and DCT-8 transforms horizontally and vertically. The remaining primary transform coefficients 328 are passed to a forward secondary transform module 330.
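The zero-out behaviour for large TBs can be sketched as follows; the function name is illustrative, and only the retention of the upper-left 32x32 coefficient region is taken from the text above.

```python
import numpy as np

def zero_out_high_freq(coeffs, keep=32):
    """Discard primary transform coefficients outside the upper-left
    keep x keep region, as described for TBs larger than 32x32."""
    out = np.zeros_like(coeffs)
    out[:keep, :keep] = coeffs[:keep, :keep]
    return out

tb = np.ones((64, 64), dtype=np.int32)   # a 64x64 TB of coefficients
kept = zero_out_high_freq(tb)
print(int(kept.sum()))                   # 1024: only 32 * 32 coefficients survive
```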
[00091] The secondary transform module 330 produces secondary transform coefficients 332 in accordance with the secondary transform index 388. The secondary transform coefficients 332 are quantised by the module 334 according to a quantisation parameter associated with the CB to produce residual coefficients 336. When the transform skip flag 390 indicates transform skip is enabled for a TB, the difference 324 is passed to the quantiser 334 via the multiplexer 333.
[00092] The forward primary transform of the module 326 is typically separable, transforming a set of rows and then a set of columns of each TB. The forward primary transform module 326 uses either a type-II discrete cosine transform (DCT-2) in the horizontal and vertical directions, or, for luma TBs, combinations of a type-VII discrete sine transform (DST-7) and a type-VIII discrete cosine transform (DCT-8) in either horizontal or vertical directions, according to the primary transform type 389. Use of combinations of a DST-7 and DCT-8 is referred to as 'multi transform selection set' (MTS) in the VVC standard. When DCT-2 is used the largest TB size is either 32x32 or 64x64, configurable in the video encoder 114 and signalled in the bitstream 115. Regardless of the configured maximum DCT-2 transform size, only coefficients in the upper-left 32x32 region of a TB are encoded into the bitstream 115. Any significant coefficients outside of the upper-left 32x32 region of the TB are discarded (or 'zeroed out') and are not encoded in the bitstream 115. MTS is only available for CUs of size up to 32x32 and only coefficients in the upper-left 16x16 region of the associated luma TB are coded. Individual TBs of the CU are either transformed or bypassed according to corresponding transform skip flags 390.
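The horizontal/vertical transform combinations named above can be tabulated against an MTS index as a sketch. The exact index-to-transform assignments below are assumptions and should be checked against the VVC specification; only the set of combinations (DCT-2, and DST-7/DCT-8 pairs) is taken from the text.

```python
# Hypothetical MTS index -> (horizontal, vertical) transform mapping.
MTS_TABLE = {
    0: ("DCT-2", "DCT-2"),
    1: ("DST-7", "DST-7"),
    2: ("DCT-8", "DST-7"),
    3: ("DST-7", "DCT-8"),
    4: ("DCT-8", "DCT-8"),
}

def primary_transform_pair(mts_idx):
    """Look up the separable transform pair for a given MTS index."""
    return MTS_TABLE[mts_idx]

print(primary_transform_pair(0))  # ('DCT-2', 'DCT-2')
print(primary_transform_pair(3))  # ('DST-7', 'DCT-8')
```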
[00093] The forward secondary transform of the module 330 is generally a non-separable transform, which is only applied for the residual of intra-predicted CUs and may nonetheless also be bypassed. The forward secondary transform operates either on 16 samples (arranged as
the upper-left 4x4 sub-block of the primary transform coefficients 328) or 48 samples (arranged as three 4x4 sub-blocks in the upper-left 8x8 coefficients of the primary transform coefficients 328) to produce a set of secondary transform coefficients. The set of secondary transform coefficients may be fewer in number than the set of primary transform coefficients from which they are derived. Due to application of the secondary transform to only a set of coefficients adjacent to each other and including the DC coefficient, the secondary transform is referred to as a 'low frequency non-separable secondary transform' (LFNST).
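Gathering the 16 or 48 primary coefficients that the LFNST operates on can be sketched as follows; the function name and the ordering of the three 4x4 sub-blocks are illustrative assumptions.

```python
import numpy as np

def lfnst_input(primary_coeffs, size48):
    """Gather the primary coefficients the LFNST operates on: either
    the upper-left 4x4 sub-block (16 samples) or three 4x4 sub-blocks
    within the upper-left 8x8 region (48 samples)."""
    if not size48:
        return primary_coeffs[:4, :4].flatten()
    blocks = [primary_coeffs[:4, :4],    # top-left 4x4
              primary_coeffs[:4, 4:8],   # top-right 4x4 of the 8x8 region
              primary_coeffs[4:8, :4]]   # bottom-left 4x4 of the 8x8 region
    return np.concatenate([b.flatten() for b in blocks])

tb = np.arange(64).reshape(8, 8)
print(lfnst_input(tb, size48=False).size)  # 16
print(lfnst_input(tb, size48=True).size)   # 48
```

Note that all gathered coefficients are adjacent to the DC position, which is why the transform is described as low frequency.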
[00094] The residual coefficients 336 are supplied to the entropy encoder 338 for encoding in the bitstream 115. Typically, the residual coefficients of each TB with at least one significant residual coefficient of the TU are scanned to produce an ordered list of values, according to a scan pattern. The scan pattern generally scans the TB as a sequence of 4x4 'sub-blocks', providing a regular scanning operation at the granularity of 4x4 sets of residual coefficients, with the arrangement of sub-blocks dependent on the size of the TB. The scan within each sub-block and the progression from one sub-block to the next typically follow a backward diagonal scan pattern.
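A backward diagonal scan can be generated as in the sketch below. This shows the pattern for a single 4x4 sub-block only and is not the exact VVC scan-order table; the function name is an assumption.

```python
def diagonal_scan(n):
    """Up-right diagonal scan order of an n x n block, built from the
    DC position outward, then reversed to give the backward scan used
    for residual coding. A sketch of the pattern, not the VVC tables."""
    order = []
    for d in range(2 * n - 1):
        for y in range(d, -1, -1):       # walk each anti-diagonal
            x = d - y
            if x < n and y < n:
                order.append((y, x))
    return order[::-1]                   # backward: last coefficient first

scan = diagonal_scan(4)
print(scan[-1])   # (0, 0): the DC coefficient is visited last
print(len(scan))  # 16 positions, one per coefficient of the sub-block
```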
[00095] As described above, the video encoder 114 needs access to a frame representation corresponding to the decoded frame representation seen in the video decoder 134. Thus, the residual coefficients 336 are passed to a dequantiser 340 to produce dequantised residual coefficients 342. The dequantised residual coefficients 342 are passed through an inverse secondary transform module 344, operating in accordance with the secondary transform index 388 to produce intermediate inverse transform coefficients, represented by an arrow 346. The intermediate inverse transform coefficients 346 are passed to an inverse primary transform module 348 to produce residual samples, represented by an arrow 399, of the TU. The dequantised residual coefficients 342 are output by a multiplexer 349 as residual samples 350 if the transform skip flag 390 indicates transform bypass is to be performed. Otherwise, the multiplexer 349 outputs the residual samples 399 as the residual samples 350.
[00096] The types of inverse transform performed by the inverse secondary transform module 344 correspond with the types of forward transform performed by the forward secondary transform module 330. The types of inverse transform performed by the inverse primary transform module 348 correspond with the types of primary transform performed by the primary transform module 326. A summation module 352 adds the residual samples 350 and the PU 320 to produce reconstructed samples (indicated by an arrow 354) of the CU.
[00097] The reconstructed samples 354 are passed to a reference sample cache 356 and an in-loop filters module 368. The reference sample cache 356, typically implemented using static RAM on an ASIC (thus avoiding costly off-chip memory access), provides the minimal sample storage needed to satisfy the dependencies for generating intra-frame PBs for subsequent CUs in the frame. The minimal dependencies typically include a 'line buffer' of samples along the bottom of a row of CTUs, for use by the next row of CTUs, and column buffering, the extent of which is set by the height of the CTU. The reference sample cache 356 supplies reference samples (represented by an arrow 358) to a reference sample filter 360. The sample filter 360 applies a smoothing operation to produce filtered reference samples (indicated by an arrow 362). The filtered reference samples 362 are used by the intra-frame prediction module 364 to produce an intra-predicted block of samples, represented by an arrow 366. For each candidate intra prediction mode the intra-frame prediction module 364 produces a block of samples, that is, the block 366. The block of samples 366 is generated by the module 364 using techniques such as DC, planar or angular intra prediction according to the intra prediction mode 387.
[00098] The in-loop filters module 368 applies several filtering stages to the reconstructed samples 354. The filtering stages include a 'deblocking filter' (DBF) which applies smoothing aligned to the CU boundaries to reduce artefacts resulting from discontinuities. Another filtering stage present in the in-loop filters module 368 is an 'adaptive loop filter' (ALF), which applies a Wiener-based adaptive filter to further reduce distortion. A further available filtering stage in the in-loop filters module 368 is a 'sample adaptive offset' (SAO) filter. The SAO filter operates by firstly classifying reconstructed samples into one or multiple categories and, according to the allocated category, applying an offset at the sample level.
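The classify-then-offset operation of the SAO filter can be sketched for one of its classification modes, band offset, where samples are classified by their value range. The function name, the 32-band assumption and the offset table are illustrative; the actual SAO syntax signals offsets for a limited set of consecutive bands and also supports an edge-offset mode.

```python
import numpy as np

def sao_band_offset(samples, band_offsets, bit_depth=10):
    """Band-offset SAO sketch: classify each reconstructed sample into
    one of 32 equal bands of the sample range, then add the offset
    allocated to that band (bands absent from the table get 0)."""
    shift = bit_depth - 5                 # 32 bands cover the full range
    out = samples.copy()
    for idx in np.ndindex(samples.shape):
        band = int(samples[idx]) >> shift
        out[idx] = samples[idx] + band_offsets.get(band, 0)
    return out

block = np.array([[0, 512], [1023, 100]], dtype=np.int32)
offsets = {16: 3}                         # only the band containing 512
print(sao_band_offset(block, offsets))    # 512 becomes 515; others unchanged
```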
[00099] Filtered samples, represented by an arrow 370, are output from the in-loop filters module 368. The filtered samples 370 are stored in a frame buffer 372. The frame buffer 372 typically has the capacity to store several (for example up to 16) pictures and thus is stored in the memory 206. The frame buffer 372 is not typically stored using on-chip memory due to the large memory consumption required. As such, access to the frame buffer 372 is costly in terms of memory bandwidth. The frame buffer 372 provides reference frames (represented by an arrow 374) to a motion estimation module 376 and the motion compensation module 380.
[000100] The motion estimation module 376 estimates a number of 'motion vectors' (indicated as 378), each being a Cartesian spatial offset from the location of the present CB, referencing a
block in one of the reference frames in the frame buffer 372. A filtered block of reference samples (represented as 382) is produced for each motion vector. The filtered reference samples 382 form further candidate modes available for potential selection by the mode selector 386. Moreover, for a given CU, the PU 320 may be formed using one reference block ('uni-predicted') or may be formed using two reference blocks ('bi-predicted'). For the selected motion vector, the motion compensation module 380 produces the PB 320 in accordance with a filtering process supportive of sub-pixel accuracy in the motion vectors. As such, the motion estimation module 376 (which operates on many candidate motion vectors) may perform a simplified filtering process compared to that of the motion compensation module 380 (which operates on the selected candidate only) to achieve reduced computational complexity. When the video encoder 114 selects inter prediction for a CU the motion vector 378 is encoded into the bitstream 115.
[000101] Although the video encoder 114 of Fig. 3 is described with reference to versatile video coding (VVC), other video coding standards or implementations may also employ the processing stages of modules 310-386. The frame data 113 (and bitstream 115) may also be read from (or written to) memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other computer readable storage medium. Additionally, the frame data 113 (and bitstream 115) may be received from (or transmitted to) an external source, such as a server connected to the communications network 220 or a radio-frequency receiver.
[000102] The video decoder 134 is shown in Fig. 4. Although the video decoder 134 of Fig. 4 is an example of a versatile video coding (VVC) video decoding pipeline, other video codecs may also be used to perform the processing stages described herein. As shown in Fig. 4, the bitstream 133 is input to the video decoder 134. The bitstream 133 may be read from memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other non-transitory computer readable storage medium. Alternatively, the bitstream 133 may be received from an external source such as a server connected to the communications network 220 or a radio frequency receiver. The bitstream 133 contains encoded syntax elements representing the captured frame data to be decoded.
[000103] The bitstream 133 is input to an entropy decoder module 420. The entropy decoder module 420 extracts syntax elements from the bitstream 133 by decoding sequences of 'bins' and passes the values of the syntax elements to other modules in the video decoder 134. The entropy decoder module 420 uses variable-length and fixed-length decoding to decode the SPS,
PPS or slice header, and an arithmetic decoding engine to decode syntax elements of the slice data as a sequence of one or more bins. Each bin may use one or more 'contexts', with a context describing probability levels to be used for coding a 'one' and a 'zero' value for the bin. Where multiple contexts are available for a given bin, a 'context modelling' or 'context selection' step is performed to choose one of the available contexts for decoding the bin.
[000104] The entropy decoder module 420 applies an arithmetic coding algorithm, for example 'context adaptive binary arithmetic coding' (CABAC), to decode syntax elements from the bitstream 133. The decoded syntax elements are used to reconstruct parameters within the video decoder 134. Parameters include residual coefficients (represented by an arrow 424), a quantisation parameter, a secondary transform index 474, and mode selection information such as an intra prediction mode (represented by an arrow 458). The mode selection information also includes information such as motion vectors, and the partitioning of each CTU into one or more CUs. Parameters are used to generate PBs, typically in combination with sample data from previously decoded CBs.
[000105] The residual coefficients 424 are passed to a dequantiser module 428. The dequantiser module 428 performs inverse quantisation (or 'scaling') on the residual coefficients 424, that is, in the primary transform coefficient domain, to create reconstructed transform coefficients, represented by an arrow 432. The reconstructed transform coefficients 432 are passed to an inverse secondary transform module 436. The inverse secondary transform module 436 either applies a secondary transform or performs no operation (bypass) according to a secondary transform type 474, decoded from the bitstream 133 by the entropy decoder 420 in accordance with methods described with reference to Figs. 15, 16 and 18. The inverse secondary transform module 436 produces reconstructed transform coefficients 440, that is, primary transform domain coefficients.
[000106] The reconstructed transform coefficients 440 are passed to an inverse primary transform module 444. The module 444 transforms the coefficients 440 from the frequency domain back to the spatial domain according to a primary transform type 476 (or 'mts_idx'), decoded from the bitstream 133 by the entropy decoder 420. The result of operation of the module 444 is a block of residual samples, represented by an arrow 499. When a transform skip flag 478 for a given TB of the CU indicates bypassing of the transform, a multiplexer 449 outputs the reconstructed transform coefficients 432 as residual samples 448 to the summation module 450. Otherwise, the multiplexer 449 outputs the residual samples 499 as the residual samples 448. The block of residual samples 448 is equal in size to the corresponding CB. The residual samples 448 are supplied to a summation module 450. At the summation module 450 the residual samples 448 are added to a decoded PB (represented as 452) to produce a block of reconstructed samples, represented by an arrow 456. The reconstructed samples 456 are supplied to a reconstructed sample cache 460 and an in-loop filtering module 488. The in-loop filtering module 488 produces reconstructed blocks of frame samples, represented as 492. The frame samples 492 are written to a frame buffer 496, from which the frame data 135 is later output.
[000107] The reconstructed sample cache 460 operates similarly to the reconstructed sample cache 356 of the video encoder 114. The reconstructed sample cache 460 provides storage for reconstructed samples needed to intra predict subsequent CBs without resorting to accessing the memory 206 (for example by using the data 232 instead, which is typically on-chip memory). Reference samples, represented by an arrow 464, are obtained from the reconstructed sample cache 460 and supplied to a reference sample filter 468 to produce filtered reference samples indicated by arrow 472. The filtered reference samples 472 are supplied to an intra frame prediction module 476. The module 476 produces a block of intra-predicted samples, represented by an arrow 480, in accordance with the intra prediction mode parameter 458 signalled in the bitstream 133 and decoded by the entropy decoder 420. The block of samples 480 is generated using modes such as DC, planar or angular intra prediction, according to the intra prediction mode 458.
[000108] When the prediction mode of a CB is indicated to use intra prediction in the bitstream 133, the intra-predicted samples 480 form the decoded PB 452 via a multiplexer module 484. Intra prediction produces a prediction block (PB) of samples, that is, a block in one colour component, derived using 'neighbouring samples' in the same colour component. The neighbouring samples are samples adjacent to the current block and, by virtue of preceding the current block in the block decoding order, have already been reconstructed. Where luma and chroma blocks are collocated, the luma and chroma blocks may use different intra prediction modes. However, the two chroma CBs share the same intra prediction mode.
[000109] When the prediction mode of the CB is indicated to be inter prediction in the bitstream 133, a motion compensation module 434 produces a block of inter-predicted samples, represented as 438, using a motion vector (decoded from the bitstream 133 by the entropy decoder 420) and reference frame index to select and filter a block of samples 498 from a frame
buffer 496. The block of samples 498 is obtained from a previously decoded frame stored in the frame buffer 496. For bi-prediction, two blocks of samples are produced and blended together to produce samples for the decoded PB 452. The frame buffer 496 is populated with filtered block data 492 from an in-loop filtering module 488. As with the in-loop filtering module 368 of the video encoder 114, the in-loop filtering module 488 applies any of the DBF, the ALF and SAO filtering operations. Generally, the motion vector is applied to both the luma and chroma channels, although the filtering processes for sub-sample interpolation in the luma and chroma channel are different.
[000110] Fig. 5 is a schematic block diagram showing a collection 500 of available divisions or splits of a region into one or more sub-regions in each node of the coding tree structure of versatile video coding. The divisions shown in the collection 500 are available to the block partitioner 310 of the encoder 114 to divide each CTU into one or more CUs or CBs according to a coding tree, as determined by the Lagrangian optimisation, as described with reference to Fig. 3.
[000111] Although the collection 500 shows only square regions being divided into other, possibly non-square sub-regions, it should be understood that the collection 500 is showing the potential divisions of a parent node in a coding tree into child nodes in the coding tree and not requiring the parent node to correspond to a square region. If the containing region is non-square, the dimensions of the blocks resulting from the division are scaled according to the aspect ratio of the containing block. Once a region is not further split, that is, at a leaf node of the coding tree, a CU occupies that region.
[000112] The process of subdividing regions into sub-regions terminates when the resulting sub-regions reach a minimum CU size, generally 4x4 luma samples. In addition to constraining CUs to prohibit block areas smaller than a predetermined minimum size, for example 16 samples, CUs are constrained to have a minimum width or height of four. Other minimums, applying either to both width and height or to width or height individually, are also possible. The process of subdivision may also terminate prior to the deepest level of decomposition, resulting in CUs larger than the minimum CU size. It is possible for no splitting to occur, resulting in a single CU occupying the entirety of the CTU. A single CU occupying the entirety of the CTU is the largest available coding unit size. Due to use of subsampled chroma formats, such as 4:2:0, arrangements of the video encoder 114 and the video decoder 134 may terminate splitting of regions in the chroma channels earlier than in the luma channels, including in the case of a shared coding tree defining the block structure of the luma and chroma channels. When
separate coding trees are used for luma and chroma, constraints on available splitting operations ensure a minimum chroma CU area of 16 samples, even though such CUs are collocated with a larger luma area, e.g., 64 luma samples.
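The earlier termination of chroma splitting under 4:2:0 subsampling can be sketched as an area check. The function name, the single-quad-split check and the subsampling parameters are illustrative assumptions; only the 16-sample minimum chroma area and the halved chroma dimensions for 4:2:0 are taken from the text.

```python
def chroma_split_allowed(luma_w, luma_h, sub_x=1, sub_y=1):
    """For 4:2:0 (sub_x = sub_y = 1), a luma region maps to a chroma
    region of half the width and height. Splitting stops once a further
    quad-split would drop the chroma CU area below 16 samples."""
    chroma_area = (luma_w >> sub_x) * (luma_h >> sub_y)
    return chroma_area // 4 >= 16        # area of each child after a quad-split

print(chroma_split_allowed(16, 16))  # True:  8x8 chroma -> four 4x4 children
print(chroma_split_allowed(8, 8))    # False: 4x4 chroma -> four 2x2 children
```

Under this check, a 64-luma-sample region (8x8) may still be split in luma while the collocated 4x4 chroma region remains a single 16-sample chroma CU.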
[000113] At the leaf nodes of the coding tree exist CUs. For example, a leaf node 510 contains one CU. At the non-leaf nodes of the coding tree exists a split into two or more further nodes, each of which could be a leaf node that forms one CU, or a non-leaf node containing further splits into smaller regions. At each leaf node of the coding tree, one CB exists for each colour channel of the coding tree. Splitting terminating at the same depth for both luma and chroma in a shared tree results in one CU having three collocated CBs.
[000114] A quad-tree split 512 divides the containing region into four equal-size regions as shown in Fig. 5. Compared to HEVC, versatile video coding (VVC) achieves additional flexibility with additional splits, including a horizontal binary split 514 and a vertical binary split 516. Each of the splits 514 and 516 divides the containing region into two equal-size regions. The division is either along a horizontal boundary (514) or a vertical boundary (516) within the containing block.
[000115] Further flexibility is achieved in versatile video coding with addition of a ternary horizontal split 518 and a ternary vertical split 520. The ternary splits 518 and 520 divide the block into three regions, bounded either horizontally (518) or vertically (520) along 1/4 and 3/4 of the containing region width or height. The combination of the quad tree, binary tree, and ternary tree is referred to as 'QTBTTT'. The root of the tree includes zero or more quadtree splits (the 'QT' section of the tree). Once the QT section terminates, zero or more binary or ternary splits may occur (the 'multi-tree' or 'MT' section of the tree), finally ending in CBs or CUs at leaf nodes of the tree. Where the tree describes all colour channels, the tree leaf nodes are CUs. Where the tree describes the luma channel or the chroma channels, the tree leaf nodes are CBs.
[000116] Compared to HEVC, which supports only the quad tree and thus only supports square blocks, the QTBTTT results in many more possible CU sizes, particularly considering possible recursive application of binary tree and/or ternary tree splits. When only quad-tree splitting is available, each increase in coding tree depth corresponds to a reduction in CU size to one quarter the size of the parent area. In VVC, the availability of binary and ternary splits means that the coding tree depth no longer corresponds directly to CU area. The potential for unusual (non-square) block sizes can be reduced by constraining split options to eliminate splits that
would result in a block width or height either being less than four samples or in not being a multiple of four samples.
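The split constraint described above can be sketched by computing the child dimensions of each split type and rejecting any split that produces a dimension below four samples. The split names and the geometry encoding (quarter/half/quarter for ternary splits) follow the divisions of Fig. 5; the function name is an assumption.

```python
def split_allowed(width, height, split):
    """Sketch of the minimum-size constraint: reject any split that
    would produce a child CU narrower or shorter than 4 samples."""
    if split == "QT":
        children = [(width // 2, height // 2)] * 4
    elif split == "BT_H":                 # horizontal binary split 514
        children = [(width, height // 2)] * 2
    elif split == "BT_V":                 # vertical binary split 516
        children = [(width // 2, height)] * 2
    elif split == "TT_H":                 # ternary split 518: 1/4, 1/2, 1/4
        children = [(width, height // 4), (width, height // 2), (width, height // 4)]
    elif split == "TT_V":                 # ternary split 520
        children = [(width // 4, height), (width // 2, height), (width // 4, height)]
    else:
        raise ValueError(split)
    return all(w >= 4 and h >= 4 for w, h in children)

print(split_allowed(8, 8, "QT"))      # True: four 4x4 children
print(split_allowed(8, 8, "TT_V"))    # False: outer regions are 2 wide
```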
[000117] Fig. 6 is a schematic flow diagram illustrating a data flow 600 of a QTBTTT (or 'coding tree') structure used in versatile video coding. The QTBTTT structure is used for each CTU to define a division of the CTU into one or more CUs. The QTBTTT structure of each CTU is determined by the block partitioner 310 in the video encoder 114 and encoded into the bitstream 115 or decoded from the bitstream 133 by the entropy decoder 420 in the video decoder 134. The data flow 600 further characterises the permissible combinations available to the block partitioner 310 for dividing a CTU into one or more CUs, according to the divisions shown in Fig. 5.
[000118] Starting from the top level of the hierarchy, that is at the CTU, zero or more quad-tree divisions are first performed. Specifically, a quad-tree (QT) split decision 610 is made by the block partitioner 310. The decision at 610 returning a '1' symbol indicates a decision to split the current node into four sub-nodes according to the quad-tree split 512. The result is the generation of four new nodes, such as at 620, and for each new node, recursing back to the QT split decision 610. Each new node is considered in raster (or Z-scan) order. Alternatively, if the QT split decision 610 indicates that no further split is to be performed (returns a '0' symbol), quad-tree partitioning ceases and multi-tree (MT) splits are subsequently considered.
[000119] Firstly, an MT split decision 612 is made by the block partitioner 310. At 612, a decision to perform an MT split is indicated. Returning a '0' symbol at decision 612 indicates that no further splitting of the node into sub-nodes is to be performed. If no further splitting of a node is to be performed, then the node is a leaf node of the coding tree and corresponds to a CU. The leaf node is output at 622. Alternatively, if the MT split 612 indicates a decision to perform an MT split (returns a '1' symbol), the block partitioner 310 proceeds to a direction decision 614.
[000120] The direction decision 614 indicates the direction of the MT split as either horizontal ('H' or '0') or vertical ('V' or '1'). The block partitioner 310 proceeds to a decision 616 if the decision 614 returns a '0' indicating a horizontal direction. The block partitioner 310 proceeds to a decision 618 if the decision 614 returns a '1' indicating a vertical direction.
[000121] At each of the decisions 616 and 618, the number of partitions for the MT split is indicated as either two (binary split or 'BT' node) or three (ternary split or 'TT') at the BT/TT split. That is, a BT/TT split decision 616 is made by the block partitioner 310 when the
indicated direction from 614 is horizontal and a BT/TT split decision 618 is made by the block partitioner 310 when the indicated direction from 614 is vertical.
[000122] The BT/TT split decision 616 indicates whether the horizontal split is the binary split 514, indicated by returning a '0', or the ternary split 518, indicated by returning a '1'. When the BT/TT split decision 616 indicates a binary split, at a generate HBT CTU nodes step 625 two nodes are generated by the block partitioner 310, according to the binary horizontal split 514. When the BT/TT split 616 indicates a ternary split, at a generate HTT CTU nodes step 626 three nodes are generated by the block partitioner 310, according to the ternary horizontal split 518.
[000123] The BT/TT split decision 618 indicates whether the vertical split is the binary split 516, indicated by returning a '0', or the ternary split 520, indicated by returning a '1'. When the BT/TT split 618 indicates a binary split, at a generate VBT CTU nodes step 627 two nodes are generated by the block partitioner 310, according to the vertical binary split 516. When the BT/TT split 618 indicates a ternary split, at a generate VTT CTU nodes step 628 three nodes are generated by the block partitioner 310, according to the vertical ternary split 520. For each node resulting from steps 625-628 recursion of the data flow 600 back to the MT split decision 612 is applied, in a left-to-right or top-to-bottom order, depending on the direction 614. As a consequence, the binary tree and ternary tree splits may be applied to generate CUs having a variety of sizes.
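The decision sequence of the data flow 600 can be sketched as a short routine. The following Python sketch is illustrative only; the function name and the `decide` callback interface are assumptions, not part of the VVC specification:

```python
# Hedged sketch of the QTBTTT decision flow 600 (Fig. 6). `decide` is a
# callback returning the next '0'/'1' symbol as an integer.
def split_type(decide):
    if decide() == 1:                    # QT split decision 610
        return ('QT', 4)                 # quad-tree split 512: four sub-nodes
    if decide() == 0:                    # MT split decision 612
        return ('LEAF', 1)               # leaf node: output a CU (622)
    direction = 'H' if decide() == 0 else 'V'   # direction decision 614
    kind = 'BT' if decide() == 0 else 'TT'      # BT/TT decision 616 or 618
    return (direction + kind, 2 if kind == 'BT' else 3)
```

For example, the symbol sequence 0, 1, 1, 1 (no QT split, MT split, vertical, ternary) selects a vertical ternary split generating three sub-nodes, each of which then recurses back to the MT split decision 612.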
[000124] Figs. 7A and 7B provide an example division 700 of a CTU 710 into a number of CUs or CBs. An example CU 712 is shown in Fig. 7A. Fig. 7A shows a spatial arrangement of CUs in the CTU 710. The example division 700 is also shown as a coding tree 720 in Fig. 7B.
[000125] At each non-leaf node in the CTU 710 of Fig. 7A, for example nodes 714, 716 and 718, the contained nodes (which may be further divided or may be CUs) are scanned or traversed in a 'Z-order' to create lists of nodes, represented as columns in the coding tree 720. For a quad-tree split, the Z-order scanning results in top left to right followed by bottom left to right order. For horizontal and vertical splits, the Z-order scanning (traversal) simplifies to a top-to-bottom scan and a left-to-right scan, respectively. The coding tree 720 of Fig. 7B lists all nodes and CUs ordered according to the Z-order traversal of the coding tree. Each split generates a list of two, three or four new nodes at the next level of the tree until a leaf node (CU) is reached.
[000126] Having decomposed the image into CTUs and further into CUs by the block partitioner 310, and using the CUs to generate each residual block (324) as described with reference to Fig. 3, residual blocks are subject to forward transformation and quantisation by the video encoder 114. The resulting TBs 336 are subsequently scanned to form a sequential list of residual coefficients, as part of the operation of the entropy coding module 338. An equivalent process is performed in the video decoder 134 to obtain TBs from the bitstream 133.
[000127] Figs. 8A, 8B, 8C, and 8D show examples of forward and inverse non-separable secondary transforms that are performed according to different sizes of transform blocks (TBs). Fig. 8A shows a set of relationships 800 between primary transform coefficients 802 and secondary transform coefficients 804 for a 4x4 TB size. The primary transform coefficients 802 consist of 4x4 coefficients, while the secondary transform coefficients 804 consist of eight coefficients. The eight secondary transform coefficients are arranged in a pattern 806. The pattern 806 corresponds to the eight positions, adjacent in a backward diagonal scan of the TB and including the DC (top-left) position. The remaining eight positions shown in Fig. 8A in the backward diagonal scan are not populated by performing a forward secondary transform and thus remain zero-valued. A forward non-separable secondary transform 810 for 4x4 TBs therefore receives sixteen primary transform coefficients, and produces as output eight secondary transform coefficients. The forward secondary transform 810 for 4x4 TBs can therefore be represented by an 8x16 matrix of weights. Similarly, an inverse secondary transform 812 can be represented by a 16x8 matrix of weights.
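The matrix dimensions described for the 4x4 TB case of Fig. 8A can be illustrated numerically. In this Python/NumPy sketch the kernel is random placeholder data, since the actual VVC kernels are fixed trained constants; only the shapes, not the values, are meaningful:

```python
# Shape-only illustration of the Fig. 8A secondary transform for a 4x4 TB.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))    # forward secondary transform 810: 8x16
primary = rng.standard_normal(16)   # 16 primary coefficients of a 4x4 TB
secondary = A @ primary             # 8 secondary coefficients (pattern 806)
# The inverse transform 812 is a 16x8 matrix; with an orthonormal-row A it
# would be A.T. Only the shapes are demonstrated with this random A.
recovered = A.T @ secondary
```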
[000128] Fig. 8B shows a set of relationships 818 between primary transform coefficients and secondary transform coefficients for 4xN and Nx4 TB sizes, where N is greater than 4. In both cases, a top-left 4x4 sub-block of primary coefficients 820 is associated with a top-left 4x4 sub block of secondary transform coefficients 824. In the video encoder 114, the forward non separable secondary transform 830 takes sixteen primary transform coefficients and produces as output sixteen secondary transform coefficients. Remaining primary transform coefficients 822 are not populated by the forward secondary transform and thus remain zero-valued. After the forward non-separable secondary transform 830 is performed, coefficient positions 826, associated with the coefficients 822, are not populated and thus remain zero-valued.
[000129] The forward secondary transform 830 for 4xN or Nx4 TBs can be represented by a 16x16 matrix of weights. The matrix representing the forward secondary transform 830 is defined as A. Similarly, a corresponding inverse secondary transform 832 can be represented
by a 16x16 matrix of weights. The matrix representing the inverse secondary transform 832 is defined as B.
[000130] The storage requirement of the non-separable transform kernel is further reduced by reusing parts of A for the forward secondary transform 810 and the inverse secondary transform 812 for 4x4 TBs. The first eight rows of A are used for the forward secondary transform 810, and the transpose of the first eight rows of A is used for the inverse secondary transform 812.
[000131] Fig. 8C shows a relationship 855 between primary transform coefficients 840 and secondary transform coefficients 842 for TBs of size 8x8. The primary transform coefficients 840 consist of 8x8 coefficients, while the secondary transform coefficients 842 consist of eight transform coefficients. The eight secondary transform coefficients 842 are arranged in a pattern corresponding to eight consecutive positions in a backward diagonal scan of the TB, the eight consecutive positions including the DC (top-left) coefficient of the TB. The remaining secondary transform coefficients in the TB are all zeroes and thus do not need to be scanned. The forward non-separable secondary transform 850 for an 8x8 TB takes forty-eight primary transform coefficients as input, corresponding to three 4x4 sub-blocks, and produces eight secondary transform coefficients. The forward secondary transform 850 for an 8x8 TB can be represented by an 8x48 matrix of weights. A corresponding inverse secondary transform 852 for an 8x8 TB can be represented by a 48x8 matrix of weights.
[000132] Fig. 8D shows a relationship 875 between primary transform coefficients 860 and secondary transform coefficients 862 for TBs of size greater than 8x8. A top-left 8x8 block of primary coefficients 860 (arranged as four 4x4 sub-blocks) is associated with a top-left 4x4 sub-block of secondary transform coefficients 862. In the video encoder 114, a forward non-separable secondary transform 870 operates on forty-eight primary transform coefficients to produce sixteen secondary transform coefficients. Remaining primary transform coefficients 864 are zeroed out. Secondary transform coefficient positions 866 outside of the top-left 4x4 sub-block of secondary transform coefficients 862 are not populated and remain as zeroes.
[000133] The forward secondary transform 870 for TBs of size greater than 8x8 can be represented by a 16x48 matrix of weights. A matrix representing the forward secondary transform 870 is defined as F. Similarly, a corresponding inverse secondary transform 872 can be represented by a 48x16 matrix of weights. A matrix representing the inverse secondary
transform 872 is defined as G. As described above with reference to matrices A and B, F desirably has the property of orthogonality. The property of orthogonality means G = F^T, and only F needs to be stored in the video encoder 114 and video decoder 134. An orthogonal matrix can be described as a matrix in which the rows are mutually orthonormal.
[000134] The storage requirement of the non-separable transform kernel is further reduced by reusing parts of F for the forward secondary transform 850 and the inverse secondary transform 852 for 8x8 TBs. The first eight rows of F are used for the forward secondary transform 850, and the transpose of the first eight rows of F is used for the inverse secondary transform 852.
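The kernel-reuse property described in paragraphs [000133] and [000134] can be checked numerically. In the sketch below, F is a placeholder row-orthonormal 16x48 matrix built via a QR decomposition, not a VVC kernel; the point is that G = F^T inverts F, and the first eight rows of F form a self-consistent 8x48 kernel:

```python
# Numerical check of the kernel-reuse property with a placeholder F.
import numpy as np

rng = np.random.default_rng(1)
F = np.linalg.qr(rng.standard_normal((48, 16)))[0].T  # 16x48, orthonormal rows
G = F.T                                               # inverse transform: 48x16
assert np.allclose(F @ G, np.eye(16))                 # G = F^T inverts F
F8 = F[:8]                            # first eight rows: an 8x48 kernel
assert np.allclose(F8 @ F8.T, np.eye(8))              # its transpose inverts it
```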
[000135] Non-separable secondary transforms may achieve coding improvement over the use of separable primary transforms alone, because non-separable secondary transforms are able to sparsify two-dimensional features in the residual signal, such as angular features. As angular features in the residual signal may be dependent on the type of intra prediction mode 387 selected, it is advantageous for the non-separable secondary transform matrix to be adaptively selected depending on the intra prediction mode. As described above, intra prediction modes consist of "intra-DC", "intra-planar", "intra-angular" modes, and "matrix intra prediction" modes. The intra prediction mode parameter 458 takes the value of 0 when intra-DC prediction is used. The intra prediction mode parameter 458 takes the value of 1 when intra-planar prediction is used. The intra prediction mode parameter 458 takes a value between 2 and 66 inclusive when intra-angular prediction on square TBs is used.
[000136] Fig. 9 shows a set 900 of transform blocks available in the versatile video coding (VVC) standard. Fig. 9 also shows the application of the secondary transform to a subset of residual coefficients from transform blocks of the set 900. Fig. 9 shows TBs with widths and heights ranging from four to 32. However, TBs of width and/or height 64 are possible but are not shown, for ease of reference.
[000137] A 16-point secondary transform 952 (shown with darker shading) is applied to a 4x4 set of coefficients. The 16-point secondary transform 952 is applied to TBs with a width or a height of four, e.g., a 4x4 TB 910, an 8x4 TB 912, a 16x4 TB 914, a 32x4 TB 916, a 4x8 TB 920, a 4x16 TB 930, and a 4x32 TB 940. The 16-point secondary transform 952 is also applied to TBs of size 4x64 and 64x4 (not shown in Fig. 9). For TBs with a width or height of four but with more than 16 primary coefficients, the 16-point secondary transform is applied only to the upper-left 4x4 sub-block of the TB and other sub-blocks are required to have zero-valued coefficients in order for the secondary transform to be applied. Generally, application of a 16-point secondary transform results in 8 or 16 secondary transform coefficients, as described with reference to Figs. 8A to 8D. The secondary transform coefficients are packed into the top-left sub-block of the TB for encoding.
[000138] For transform sizes with a width and height greater than four, a 48-point secondary transform 950 (shown with lighter shading) is available for application to three 4x4 sub-blocks of residual coefficients in the upper-left 8x8 region of the transform block, as shown in Fig. 9. The 48-point secondary transform 950 is applied to an 8x8 transform block 922, a 16x8 transform block 924, a 32x8 transform block 926, an 8x16 transform block 932, a 16x16 transform block 934, a 32x16 transform block 936, an 8x32 transform block 942, a 16x32 transform block 944, and a 32x32 transform block 946, in each case in the region shown with light shading and a dashed outline. The 48-point secondary transform 950 is also applicable to TBs of size 8x64, 16x64, 32x64, 64x64, 64x32, 64x16 and 64x8 (not shown). Application of a 48-point secondary transform kernel generally results in the production of fewer than 48 secondary transform coefficients. For example, 8 or 16 secondary transform coefficients may be produced, as described with reference to Figs. 8B to 8D. Primary transform coefficients not subject to the secondary transform ('primary-only coefficients'), for example coefficients 966 of the TB 934, are required to be zero-valued in order for the secondary transform to be applied. After application of the 48-point secondary transform 950 in a forward direction, the region which may contain significant coefficients is reduced from 48 coefficients to 16 coefficients, further reducing the number of coefficient positions which may contain significant coefficients. For the inverse secondary transform, the decoded significant coefficients are transformed to produce a region of coefficients, any of which may be significant, which is then subject to the inverse primary transform. Only the upper-left 4x4 sub-block may contain significant coefficients when a secondary transform reduces one or more sub-blocks to a set of 16 secondary transform coefficients.
A last significant coefficient position located at any position at which secondary transform coefficients may be stored indicates that either a secondary transform was applied or only a primary transform was applied.
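The mapping from TB size to secondary transform kernel described with reference to Figs. 8A to 8D and Fig. 9 can be summarised as a small lookup. The function below is an illustrative sketch; the name and return convention are assumptions, not VVC syntax:

```python
def secondary_kernel(tb_w, tb_h):
    """Map a TB size to the (input, output) coefficient counts of the
    secondary transform, summarising Figs. 8A-8D and Fig. 9."""
    if tb_w == 4 and tb_h == 4:
        return (16, 8)    # 8x16 kernel (810/812)
    if tb_w == 4 or tb_h == 4:
        return (16, 16)   # 16x16 kernel on the top-left 4x4 sub-block (830/832)
    if tb_w == 8 and tb_h == 8:
        return (48, 8)    # 8x48 kernel over three 4x4 sub-blocks (850/852)
    return (48, 16)       # 16x48 kernel on the top-left 8x8 region (870/872)
```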
[000139] When the last significant coefficient position indicates a secondary transform coefficient position in a TB, a signalled secondary transform index (i.e., 388 or 474) is needed to distinguish between applying a secondary transform kernel or bypassing the secondary transform. Although application of secondary transforms to TBs of various sizes in Fig. 9 has been described from the perspective of the video encoder 114, a corresponding inverse process is performed in the video decoder 134. The video decoder 134 firstly decodes a last significant
coefficient position. If the decoded last significant coefficient position indicates potential application of a secondary transform, the secondary transform index 474 is decoded to determine whether to apply or bypass the inverse secondary transform.
[000140] Fig. 10 shows a syntax structure 1000 for a bitstream 1001 with multiple slices. Each of the slices includes multiple coding units. The bitstream 1001 may be produced by the video encoder 114, e.g. as the bitstream 115, or may be parsed by the video decoder 134, e.g. as the bitstream 133. The bitstream 1001 is divided into portions, for example network abstraction layer (NAL) units, with delineation achieved by preceding each NAL unit with a NAL unit header such as 1008. A sequence parameter set (SPS) 1010 defines sequence-level parameters, such as a profile (set of tools) used for encoding and decoding the bitstream, chroma format, sample bit depth, and frame resolution. Parameters are also included in the set 1010 that constrain the application of different types of split in the coding tree of each CTU.
[000141] A picture parameter set (PPS) 1012 defines sets of parameters applicable to zero or more frames. A picture header (PH) 1015 defines parameters applicable to the current frame. Parameters of the PH 1015 may include a list of CU chroma QP offsets, one of which may be applied at the CU level to derive a quantisation parameter for use by chroma blocks from the quantisation parameter of a collocated luma CB.
[000142] The picture header 1015 and a sequence of slices forming one picture are known as an access unit (AU), such as AU 0 1014. The AU 0 1014 includes three slices, such as slices 0 to 2. Slice 1 is marked as 1016. As with other slices, slice 1 (1016) includes a slice header 1018 and slice data 1020.
[000143] Fig. 11 shows a syntax structure 1100 for slice data (such as the slice data 1104 corresponding to 1020) of the bitstream 1001 (e.g. 115 or 133) with a shared coding tree for luma and chroma coding units of a coding tree unit, such as a CTU 1110. The CTU 1110 includes one or more CUs. An example is labelled as a CU 1114. The CU 1114 includes a signalled prediction mode 1116 followed by a transform tree 1118. When the size of the CU 1114 does not exceed the maximum transform size (either 32x32 or 64x64 in the luma channel), the transform tree 1118 includes one transform unit, shown as a TU 1124. The presence of each TB in the TU 1124 is dependent on a corresponding 'coded block flag' (CBF), shown in the example of Fig. 11 as one of coded block flags 1123. The coded block flags 1123 include a luma coded block flag 1123a, a Cb coded block flag 1123b, and a Cr coded block flag 1123c. When a TB is present, the corresponding CBF is equal to one and at least one
residual coefficient in the TB is nonzero. When a TB is absent, the corresponding CBF is equal to zero and all residual coefficients in the TB are zero. When a coded block flag indicates no significant residual coefficients are present for a TB, the corresponding transform skip flag is not coded. For example, a zero-valued luma coded block flag 1123a suppresses coding of transform skip flag 1126, a zero-valued Cb coded block flag 1123b suppresses coding of transform skip flag 1130, and a zero-valued Cr coded block flag 1123c suppresses coding of transform skip flag 1134. When a transform skip flag (such as one of 1126, 1130 and 1134) is not coded, the value is inferred to be zero in the video decoder 134. A transform skip flag value of zero normally indicates application of a transform (for example DCT-2). However, when the transform skip flag is inferred to be zero due to a corresponding zero-valued coded block flag, there is no need to apply a transform as there are no significant coefficients associated with the TB. Notwithstanding the absence of application of a transform, an inferred zero-valued transform skip flag due to a zero-valued coded block flag is different to a 'transform skip' operation, which involves bypassing the transform and propagating residual coefficients, for example from the dequantiser 428 to the multiplexor 449. Use of a shared coding tree results in the CU 1114 including a luma TB 1128, a first chroma TB 1132, and a second chroma TB 1136, each of which may use transform skip, as signalled by transform skip flags 1126, 1130, and 1134, respectively. When a 4:2:0 chroma format is in use, the corresponding maximum chroma transform sizes are half of the luma maximum transform size in each direction. That is, maximum luma transform sizes of 32x32 or 64x64 result in maximum chroma transform sizes of 16x16 or 32x32, respectively.
When a 4:4:4 chroma format is in use, the chroma maximum transform size is the same as the luma maximum transform size. When a 4:2:2 chroma format is in use, the chroma maximum transform size is half horizontally and the same vertically as the luma transform size, that is, for maximum luma transform sizes of 32x32 and 64x64 the maximum chroma transform sizes are 16x32 and 32x64, respectively.
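The relationship between the maximum luma and chroma transform sizes for each chroma format can be expressed as a short helper. This Python sketch is illustrative only; the function name and argument forms are assumptions:

```python
def max_chroma_tx_size(max_luma, chroma_format):
    """Derive the maximum chroma transform size from the luma maximum
    for each chroma format, per the text above."""
    w, h = max_luma
    if chroma_format == '4:2:0':      # half horizontally and vertically
        return (w // 2, h // 2)
    if chroma_format == '4:2:2':      # half horizontally, same vertically
        return (w // 2, h)
    return (w, h)                     # 4:4:4: same as luma
```

For example, a 64x64 luma maximum yields a 32x64 chroma maximum under the 4:2:2 chroma format, matching the derivation above.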
[000144] If the prediction mode 1116 indicates usage of intra prediction for the CU 1114, a luma intra prediction mode and a chroma intra prediction mode are specified.
[000145] A coding mode in which a single chroma TB is sent to specify the chroma residual both for Cb and Cr channels is available, known as a 'joint CbCr' coding mode. When the joint CbCr coding mode is enabled, a single chroma TB is encoded.
Irrespective of colour channel, each coded TB includes a last position followed by one or more residual coefficients. For example, the luma TB 1128 includes a last position 1140 and residual coefficients 1144. The last position 1140 indicates the last significant residual coefficient
position in the TB when considering coefficients in the diagonal scan pattern, used to serialise the array of coefficients of a TB, in a forward direction (i.e. from the DC coefficient onwards). A last significant residual coefficient position, for example 1140 for the luma TB 1128, is only present when the corresponding TB is not using transform skip. Application of a secondary transform (by the module 330) requires that the luma TB is not transform skipped and that significant coefficients are present only within the secondary transformed region of coefficients. That is, the last position must be within one of 806, 824, 842, or 862 as described with reference to Figs. 8A-D. Additionally, the luma residual must include at least one AC coefficient, resulting in a requirement that the luma TB 1128 last position is at a position other than (0, 0). When these conditions are met, a secondary transform index 1142 (corresponding to the index 388) is present, signalling either to bypass the secondary transform or to select one of two possible secondary transform kernels. Placement of the secondary transform index 1142 immediately following the luma last position 1140 enables latency in implementations for encoding or decoding a CU to be reduced, because the need to perform or bypass the secondary transform is known prior to decoding luma residual coefficients 1144.
Placement of the secondary transform index 1142 immediately following the luma last position 1140 is possible as there is no dependency on checking the last position of the first chroma TB 1132 or the second chroma TB 1136 in terms of at least one AC coefficient being present across all TBs of the CU 1114 or the respective last positions being within regions 806, 824, 842, or 862. When the secondary transform index 1142 indicates application of the secondary transform, i.e. to the luma TB 1128, the two chroma transform skip flags, i.e. 1130 and 1134, are not coded and instead use of the DCT-2 transform is inferred for each chroma channel. In effect the index for each one of the first plurality of coding units is determined if presence of a last residual coefficient of the transform block of the corresponding coding unit meets a threshold last position, for example is equal to or less than the threshold last position. As described in relation to Figs. 8A to 8D the threshold last position is 7 if the transform block has a size of 4x4 or 8x8. The threshold last position is 15 for transform blocks of other sizes (such as 4x8, 4x16, 4x32, 4x64, 8x4, 8x16, 8x32, 8x64, 16x4, 16x8, 16x16, 16x32, 16x64, 32x4, 32x8, 32x16, 32x32, 32x64, 64x4, 64x8, 64x16, 64x32, 64x64).
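The conditions described above for coding the secondary transform index 1142 for a luma TB can be collected into one predicate. The sketch below is a hedged summary, not VVC decoder source; the function name and interface are assumptions:

```python
def lfnst_index_signalled(tb_w, tb_h, last_pos_scan, transform_skip):
    """Summarise when the secondary transform index is coded for a luma TB.
    `last_pos_scan` is the last significant position in the forward
    diagonal scan (0 = DC)."""
    if transform_skip:                  # TB must not be transform skipped
        return False
    if last_pos_scan == 0:              # at least one AC coefficient required
        return False
    threshold = 7 if (tb_w, tb_h) in ((4, 4), (8, 8)) else 15
    return last_pos_scan <= threshold   # last position inside 806/824/842/862
```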
[000146] The two TBs 1132 and 1136 for the chroma channels each have a corresponding last position syntax element used in the same manner as described for the luma TB 1128. If the last positions of each of the TBs for the CU, that is 1128, 1132, and 1136, indicate that only
coefficients in the secondary transform domain are significant for each TB in the CU, that is all remaining coefficients that would only be subject to primary transformation are zero, the secondary transform index 1142 may be signalled to specify whether or not to apply a secondary transform. Conditioning on the signalling of the secondary transform index 1142 is described with reference to Figs. 14 and 16 to 18. A known placement of the secondary transform index 1120 after all CUs in the CTU 1110, as shown in dotted lines in Fig. 11, can also be used. The placement of the secondary transform index 1120 after all CUs in the CTU 1110 results in the video decoder 134 needing to buffer residual coefficients of the luma TB 1128, the first chroma TB 1132, and the second chroma TB 1136 before knowing whether a secondary transform is to be applied or not, increasing complexity and latency compared to the index 1142 being located immediately after a last position of transform blocks for the luma (primary) channel. Locating the secondary transform index 1142 within the luma TB 1128, such as adjacent the last position flag 1140 as shown in Fig. 11, can accordingly improve latency.
[000147] If a secondary transform is to be applied, the secondary transform index 1142 indicates which kernel is selected. Generally, two kernels are available in a 'candidate set' of kernels. Generally, there are four candidate sets, with one candidate set selected using the intra prediction mode of the block. The luma intra prediction mode is used to select the candidate set for the luma block and the chroma intra prediction mode is used to select the candidate set for the two chroma blocks. As described with reference to Figs. 8A-8D, the selected kernels also depend on the TB size, with different kernels for 4x4, 4xN/Nx4, and other size TBs. When the 4:2:0 chroma format is in use, the chroma TBs are generally half the width and height of the corresponding luma TBs, resulting in different selected kernels for chroma blocks when luma TBs of width or height of eight are used. For luma blocks of sizes 4x4, 4x8, 8x4, the one-to-one correspondence of luma to chroma blocks in the shared coding tree is altered to avoid the presence of small-sized chroma blocks, such as 2x2, 2x4, or 4x2.
[000148] The secondary transform index 1142 indicates, for example, one of the following values: zero (do not apply a secondary transform), one (apply the first kernel of the candidate set), or two (apply the second kernel of the candidate set). For the luma CB of the CU 1114, the primary transform type is also signalled as either (i) DCT-2 horizontally and vertically, (ii) transform skip horizontally and vertically, or (iii) combinations of DST-7 and DCT-8 horizontally and vertically, according to an MTS index 1122. The MTS index 1122 is only signalled when significant coefficients are present in the uppermost-leftmost 16x16 region of the luma TB 1128. Additionally, the MTS index is only signalled when the secondary transform index 1142 indicates no application of a secondary
transform, i.e. the value is zero. When the MTS index 1122 is not signalled, DCT-2 horizontally and vertically (option (i)) is typically used as the primary transform.
[000149] Fig. 12 shows a syntax structure 1200 for slice data 1204 (e.g., 1020) for a bitstream (e.g., 115, 133) with a separate coding tree for luma and chroma coding units of a coding tree unit. A separate coding tree is available for intra slices ('I-slices'). The slice data 1204 includes one or more CTUs, such as CTU 1210. CTU 1210 is generally 128x128 luma samples in size and begins with a shared tree including one quad-tree split common to luma and chroma. At each of the resulting 64x64 nodes, separate coding trees commence for luma and chroma. An example node 1214 is marked in Fig. 12. The node 1214 has a luma node 1214a and a chroma node 1214b. A luma tree commences from the luma node 1214a and a chroma tree commences from the chroma node 1214b. The coding trees continuing from the node 1214a and the node 1214b are independent between luma and chroma, so different split options are possible to produce the resulting luma CUs and chroma CUs. A luma CU 1220 belongs to the luma coding tree and includes a luma prediction mode 1221, a luma transform tree 1222 and an MTS index 1126. The luma transform tree 1222 includes a TU 1230. Since the luma coding tree encodes samples of the luma channel only, the TU 1230 includes a luma TB 1234, and a luma transform skip flag 1232 indicates whether the luma residual is to be transformed or not. The luma TB 1234 includes a last position 1236, a secondary transform index 1237, and residual coefficients 1238.
[000150] A chroma CU 1250 belongs to the chroma coding tree and includes a chroma prediction mode 1251 and a chroma transform tree 1252. The chroma transform tree 1252 includes a TU 1260. As the chroma tree includes chroma blocks, the TU 1260 includes a Cb TB 1264 and a Cr TB 1268. Bypassing of the DCT-2 transform for the Cb TB 1264 and the Cr TB 1268 is signalled with a Cb transform skip flag 1262 and a Cr transform skip flag 1266, respectively. Each TB includes a last position and residual coefficients, for example a last position 1270 (only present when the corresponding TB is using a DCT-2 transform) and residual coefficients 1272 are associated with the Cb TB 1264. Each of the chroma TBs 1264 and 1268 is independently encoded (and decoded). For example, one chroma channel TB (such as 1264) can have a primary transform applied and the other (such as 1268) can have bypass applied, as indicated by the skip flag 1266.
[000151] Fig. 13 shows a method 1300 for encoding the frame data 113 into the bitstream 115, the bitstream 115 including one or more slices as sequences of coding tree units. The method 1300 may be embodied by apparatus such as a configured FPGA, an ASIC, or an
ASSP. Additionally, the method 1300 may be performed by the video encoder 114 under execution of the processor 205. The method 1300 is applicable when the video encoder 114 is configured to use a shared coding tree, which is the case for producing encoded P slices and B slices and is one option for producing encoded I slices. The method 1300 confines signalling and application of the secondary transform to the luma channel only. As such, the method 1300 may be implemented as modules of the software 233 stored on a computer-readable storage medium and/or in the memory 206.
[000152] The method 1300 begins at an encode SPS/PPS step 1310. At step 1310 the video encoder 114 encodes the SPS 1010 and the PPS 1012 into the bitstream 115 as sequences of fixed and variable length encoded parameters. Parameters of the frame data 113, such as resolution and sample bit depth, are encoded. Parameters of the bitstream, such as flags indicating the usage of particular coding tools, are also encoded. The picture parameter set includes parameters specifying the frequency with which 'delta QP' syntax elements are present in the bitstream 115, offsets for chroma QP relative to luma QP, and the like.
[000153] The method 1300 continues from step 1310 to an encode picture header step 1320. In execution of step 1320 the processor 205 encodes the picture header (for example 1015) into the bitstream 115, the picture header 1015 being applicable to all slices in the current frame. The picture header 1015 may include partition constraints signalling the maximum allowed depths of binary, ternary, and quadtree splitting, overriding similar constraints included as part of the SPS 1010.
[000154] The method 1300 continues from step 1320 to an encode slice header step 1330. At step 1330 the entropy encoder 338 encodes the slice header 1018 into the bitstream 115.
[000155] The method 1300 continues from step 1330 to a divide slice into CTUs step 1340. In execution of step 1340 the video encoder 114 divides the slice 1016 into a sequence of CTUs (such as the CTU 1110). Slice boundaries are aligned to CTU boundaries and CTUs in a slice are ordered according to a CTU scan order, generally a raster scan order. The division of a slice into CTUs establishes an order in which portions of the frame data 113 are to be processed by the video encoder 114 in encoding each current slice.
[000156] The method 1300 continues from step 1340 to a determine coding tree step 1350. At step 1350 the video encoder 114 determines a coding tree for a current selected CTU in the slice. The method 1300 starts from the first CTU in the slice 1016 on the first invocation of the step 1350 and progresses to subsequent CTUs in the slice 1016 on subsequent invocations. In
determining the coding tree of a CTU, a variety of combinations of quadtree, binary, and ternary splits are generated by the block partitioner 310 and tested.
[000157] The method 1300 continues from step 1350 to a determine coding unit step 1360. At step 1360 the video encoder 114 executes to determine encodings for the CUs resulting from various coding trees under evaluation using known methods. A candidate encoding for a CU is determined using a Lagrangian optimisation considering rate and distortion. A 'best' encoding for the CU is updated to match a candidate encoding for the CU based on the cost of the candidate encoding for the CU being lower than the current selected 'best' encoding for the CU. Candidate encodings for CUs are generated, for example, by iterating over combinations of parameters. For example, a two-level nested iteration over primary transform type (including transform skip) and secondary transform may be performed. Combinations such as applying MTS with a secondary transform may be immediately discarded. Where further restriction for encoding applies, for example last position requirements for application of a secondary transform, the candidate encoding for a CU may be later discarded if the residual coefficients are found to be incompatible with requirements for application of the secondary transform. Determining encodings involves determining a prediction mode (e.g. intra prediction 387 with specific mode or inter prediction with motion vector) and the primary transform type 389. If the primary transform type 389 is determined to be DCT-2 and all quantised primary transform coefficients that are not subject to forward secondary transformation are not significant, the secondary transform index 388 is determined and may indicate application of the secondary transform (for example encoded as 1142). Otherwise the secondary transform index 388 indicates bypassing of the secondary transform. 
Additionally, a transform skip flag 390 is determined for each TB in the CU, indicating to apply the primary (and optionally the secondary) transform, or to bypass transforms altogether (for example 1126/1130/1134). For the luma channel, the primary transform type is determined to be DCT-2, transform skip, or one of the MTS options and for the chroma channels, DCT-2 or transform skip are the available transform types. Determining the encoding can also include determining a quantisation parameter where it is possible to change the QP, that is, where a 'delta QP' syntax element is to be encoded into the bitstream 115. In determining individual coding units the optimal coding tree is also determined, in a joint manner. When a coding unit in a shared coding tree is to be coded using intra prediction, a luma intra prediction mode and a chroma intra prediction mode are determined at step 1360. When a coding unit in a separate coding tree is to be coded using intra prediction, either a luma intra prediction mode or a chroma intra prediction mode is determined at step 1360, depending on the branch of the coding tree being luma or chroma, respectively.
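The Lagrangian candidate-selection loop described for step 1360 can be sketched as follows. This is an illustrative sketch only: the function names, the lambda value, and the candidate list are assumptions for demonstration, not the encoder's actual implementation.

```python
# Hypothetical sketch of rate-distortion candidate selection at step 1360.
# Candidate names, costs, and the lambda value are illustrative only.

def rd_cost(distortion, rate, lmbda):
    """Lagrangian cost J = D + lambda * R."""
    return distortion + lmbda * rate

def select_best_encoding(candidates, lmbda):
    """Return the candidate encoding with the lowest Lagrangian cost.

    Each candidate is a dict with 'distortion' and 'rate' entries, e.g.
    one per (primary transform, secondary transform) combination tested.
    """
    best, best_cost = None, float('inf')
    for cand in candidates:
        cost = rd_cost(cand['distortion'], cand['rate'], lmbda)
        if cost < best_cost:  # 'best' updated only on a strictly lower cost
            best, best_cost = cand, cost
    return best, best_cost

candidates = [
    {'name': 'DCT-2, LFNST off', 'distortion': 120.0, 'rate': 40.0},
    {'name': 'DCT-2, LFNST kernel 1', 'distortion': 100.0, 'rate': 52.0},
    {'name': 'transform skip', 'distortion': 150.0, 'rate': 30.0},
]
best, cost = select_best_encoding(candidates, lmbda=0.5)
```

With the illustrative costs above, the DCT-2 plus secondary-transform candidate wins because its Lagrangian cost (100 + 0.5 × 52 = 126) is the lowest.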
[000158] The determine coding unit step 1360 may inhibit testing application of the secondary transform when there are no 'AC' residual coefficients present in the primary domain residual resulting from application of the DCT-2 primary transform by the forward primary transform module 326. AC residual coefficients are residual coefficients in locations other than the top left position of the transform block. One approach to inhibit testing application of the secondary transform is to set a cost of the candidate encoding for the CU to a relatively large value, for example 'MAXDOUBLE' (equal to 1.7e+308) when cost is measured using double precision floating point values. Setting the cost of a candidate encoding for a CU to MAXDOUBLE ensures that the best encoding for the CU will not be updated to that particular candidate encoding for the CU. The inhibition of testing secondary transform when only a DC primary coefficient exists spans the blocks for which the secondary transform index 388 applies, that is, Y (luma, for example the block 1128) for shared tree. Accordingly, if a candidate encoding for a CU specifies application of a secondary transform and only a DC coefficient is present for the luma TB 1128, the cost is set to MAXDOUBLE (preventing selection of a secondary transform). In another example, if a candidate encoding for a CU specifies application of a secondary transform for the luma TB 1128 and the luma last position is (i) greater than 7 for 4x4 or 8x8 TBs or (ii) greater than 15 for other size TBs, the cost is set to MAXDOUBLE (preventing selection of a secondary transform).
[000159] Further, if a candidate encoding for a CU includes no significant residual coefficients in the luma TB 1128, that is, the coded block flag of the luma TB 1128 is equal to zero, and secondary transform is to be applied, the cost is set to MAXDOUBLE (preventing selection of a secondary transform). If a candidate encoding for a CU includes no significant residual coefficients in any TB (1128, 1132, or 1136) and transform skip is selected, the cost is set to MAXDOUBLE (preventing selection of transform skip). Selection of transform skip is prevented because the zero-valued coded block flag (1123a, 1123b, 1123c) suppresses or prevents coding of the corresponding transform skip flag (1126, 1130, 1134). As the transform skip flag is not encoded, encoding the selection of transform skip in the bitstream syntax defined in Fig. 11 is not possible. Provided at least one significant AC primary coefficient exists in the blocks for which the secondary transform index 388 applies, that is, Y (luma) for shared tree, the video encoder 114 tests for selection of non-zero secondary transform index values 388 (that is, for application of the secondary transform) at step 1360.
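The MAXDOUBLE-based inhibition described above can be sketched as a cost filter applied to each candidate. All names and the argument layout are assumptions for illustration; the last-position thresholds follow the values given in paragraph [000158].

```python
import sys

# 'Impossible' cost, analogous to MAXDOUBLE in the text (~1.7e+308).
MAX_DOUBLE = sys.float_info.max

def lfnst_candidate_cost(cost, last_pos, tb_size, cbf_luma, lfnst_index):
    """Return MAX_DOUBLE when a secondary-transform candidate violates the
    restrictions described for step 1360; otherwise return the cost unchanged.

    cost        -- Lagrangian cost of the candidate encoding
    last_pos    -- scan index of the last significant luma coefficient (0 == DC)
    tb_size     -- (width, height) of the luma TB
    cbf_luma    -- True if the luma TB has any significant coefficient
    lfnst_index -- 0 (not applied), 1 or 2 (secondary transform kernel)
    """
    if lfnst_index == 0:
        return cost                    # no secondary transform: no restriction
    if not cbf_luma or last_pos == 0:  # no residual, or a DC-only residual
        return MAX_DOUBLE
    limit = 7 if tb_size in ((4, 4), (8, 8)) else 15
    if last_pos > limit:               # coefficients outside the LFNST region
        return MAX_DOUBLE
    return cost
```

Returning MAX_DOUBLE ensures the candidate can never replace the current best encoding, which is exactly the inhibition mechanism the text describes.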
[000160] The method 1300 continues from step 1360 to an encode coding unit step 1370. At step 1370 the video encoder 114 encodes the determined coding unit of the step 1360 into the
bitstream 115. An example of how the coding unit is encoded is described in more detail with reference to Fig. 14.
[000161] The method 1300 continues from step 1370 to a last coding unit test step 1380. At step 1380 the processor 205 tests if the current coding unit is the last coding unit in the CTU. If not ("NO" at step 1380), control in the processor 205 returns to the determine coding unit step 1360. Otherwise, if the current coding unit is the last coding unit ("YES" at step 1380) control in the processor 205 progresses to a last CTU test step 1390.
[000162] At the last CTU test step 1390 the processor 205 tests if the current CTU is the last CTU in the slice 1016. If the current CTU is not the last CTU in the slice 1016 ("NO" at step 1390), control in the processor 205 returns to the determine coding tree step 1350. Otherwise, if the current CTU is the last ("YES" at step 1390), control in the processor 205 progresses to a last slice test step 13100.
[000163] At the last slice test step 13100 the processor 205 tests if the current slice being encoded is the last slice in the frame. If the current slice is not the last slice ("NO" at step 13100), control in the processor 205 returns to the encode slice header step 1330. Otherwise, if the current slice is the last slice and all slices have been encoded ("YES" at step 13100) the method 1300 terminates.
[000164] Fig. 14 shows a method 1400 for encoding a coding unit into the bitstream 115, corresponding to the step 1370 of Fig. 13. The method 1400 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1400 may be performed by the video encoder 114 under execution of the processor 205. As such, the method 1400 may be stored as modules of the software 233 on computer-readable storage medium and/or in the memory 206.
[000165] The method 1400 relates to encoding a shared coding tree. The method 1400 is invoked for each CU in the coding tree, for example the CU 1114 of Fig. 11, with Y, Cb, and Cr colour channels being encoded.
[000166] The method 1400 starts at a generate prediction block step 1410. At step 1410 the video encoder 114 generates the prediction block 320 according to a prediction mode for the CU as determined at the step 1360, e.g. the intra prediction mode 387. The entropy encoder 338 encodes the intra prediction mode 387 for the coding unit, as determined at the step 1360, into the bitstream 115. A 'pred mode' syntax element is encoded to distinguish
between use of intra prediction, inter prediction, or other prediction modes for the coding unit. If intra prediction is used for the coding unit then a luma intra prediction mode is encoded if a luma PB is applicable to the CU and a chroma intra prediction mode is encoded if chroma PBs are applicable to the CU. That is, for an intra-predicted CU belonging to a shared tree, such as the CU 1114, the prediction mode 1116 includes the luma intra prediction mode for the luma CB of the CU and the chroma intra prediction mode for the chroma CBs of the CU. The primary transform type 389 is encoded to select between use of DCT-2 horizontally and vertically, transform skip horizontally and vertically, or combinations of DCT-8 and DST-7 horizontally and vertically for the luma TB of the coding unit.
[000167] The method 1400 continues from step 1410 to a determine residuals step 1420. The prediction block 320 is subtracted from the corresponding block of frame data 312 by the difference module 322 to produce the difference 324 for each colour channel, that is Y, Cb, and Cr.
[000168] The method 1400 continues from step 1420 to a transform residuals step 1430. At the transform residuals step 1430 the video encoder 114, under execution of the processor 205, either bypasses primary and secondary transform on the residual for the luma TB of the step 1420 or performs transforms according to the primary transform type 389 and a secondary transform index 388 for the luma TB (such as the TB 1128). Transforming of the difference 324 may be performed or bypassed according to the transform skip flags 390, and if transformed, the secondary transform may also be applied for the luma TB, as determined at the step 1360, to produce the residual samples 350 for the luma channel, as described with reference to Fig. 3. The chroma channel TBs (such as the TBs 1132 and 1136) are either transformed according to a DCT-2 or transforming is bypassed. When a secondary transform is applied to the luma TB, each chroma TB is DCT-2 transformed, that is, transform bypass of the chroma TBs is prohibited. After operation of the quantisation module 334 residual coefficients 336 are available.
[000169] The method 1400 continues from step 1430 to an encode luma transform skip flag step 1440. At the step 1440 the entropy encoder 338 encodes a context-coded transform skip flag 390 into the bitstream 115, indicating either the residual for the luma TB (1128) of the shared coding tree CU is to be transformed according to a primary transform, and possibly a secondary transform, or primary and secondary transforms are to be bypassed.
[000170] The method 1400 continues from step 1440 to an encode luma residual last position step 1450. At the step 1450 the entropy encoder 338 encodes the position of the last significant coefficient position of the residual coefficients 336 of the luma TB when the luma TB (1128) is subject to primary, and optionally secondary, transformation and when the coded block flag 1123a indicates the presence of significant residual coefficients in the luma TB 1128. When the luma TB (1128) is subject to transform bypass, the step 1450 is omitted. The last significant coefficient position is encoded as cartesian co-ordinates relative to the DC (top-left) coefficient of the TB. The particular residual coefficient deemed 'last' is the last significant residual coefficient encountered when scanning the TB with a 4x4 sub-block-based diagonal scan in a forward direction.
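The forward 4x4 diagonal scan used at step 1450 to locate the 'last' significant coefficient can be sketched as follows. For brevity the sketch covers a single 4x4 block rather than a full sub-block-ordered TB scan, and all function names are illustrative assumptions.

```python
def diagonal_scan_4x4():
    """Coordinates (x, y) of a 4x4 up-right diagonal scan in forward order.

    Simplified sketch: a real TB scan also orders the 4x4 sub-blocks
    diagonally; here only a single 4x4 block is covered.
    """
    coords = []
    for d in range(7):                       # anti-diagonals d = x + y
        for y in range(min(d, 3), -1, -1):   # walk each diagonal bottom-left up
            x = d - y
            if x <= 3:
                coords.append((x, y))
    return coords

def last_significant_position(tb):
    """Cartesian (x, y) of the last significant coefficient when scanning
    the block in a forward direction, or None if all coefficients are zero."""
    last = None
    for (x, y) in diagonal_scan_4x4():
        if tb[y][x] != 0:
            last = (x, y)
    return last
```

The returned coordinate pair corresponds to the last-position syntax encoded at step 1450, expressed relative to the DC (top-left) coefficient.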
[000171] The method 1400 continues from step 1450 to an LFNST signalling test step 1452. At the step 1452 the video encoder 114 determines if a secondary transform is applicable to the luma TB 1128 or not. When the transform skip flag 1126 indicates that a transform was skipped the step 1452 returns "NO" and the method 1400 continues to an encode luma residual step 1456. When the coded block flag 1123a indicates that there were no significant residual coefficients in the luma TB 1128 the step 1452 returns "NO" and the method 1400 continues to the encode luma residual step 1456. When the last significant coefficient position of the luma TB is at a position other than the DC coefficient (0, 0) and within the range of coefficients which may be significant after application of the secondary transform, e.g. 806, 824, 842, or 862 depending on the TB size, then a secondary transform may have been applied when producing the luma TB residual coefficients. The step 1452 returns "YES" and the method 1400 continues to an encode luma LFNST index step 1454. When the last significant coefficient position of the luma TB is either at the DC position (0,0) or outside of 806, 824, 842, or 862 depending on the TB size, then the secondary transform was not applied and there is no need to encode a secondary transform index. The step 1452 returns "NO" and the method 1400 continues to the encode luma residual step 1456.
[000172] At the encode luma LFNST index step 1454 the entropy encoder 338 encodes a truncated unary codeword indicating three possible selections for application of the secondary transform to the luma TB. The selections are zero (not applied), one (first kernel of candidate set applied), and two (second kernel of candidate set applied). The codeword uses at most two bins, each of which is context coded. By virtue of the testing performed at the step 1452, the step 1454 is only performed when the secondary transform can be applied, i.e. for non-zero indices to be encoded. The step 1454 encodes the flag 1142 for example.
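The truncated unary binarisation described for step 1454 can be sketched as below. The function names are illustrative; the bin strings follow directly from the text (three values, at most two bins, the second bin truncated for the maximum value).

```python
def encode_lfnst_index_tu(index):
    """Truncated unary binarisation of the LFNST index (maximum value 2).

    0 -> [0], 1 -> [1, 0], 2 -> [1, 1]: at most two bins, each of which
    would be context coded by the arithmetic coder.
    """
    assert index in (0, 1, 2)
    if index == 0:
        return [0]
    return [1, 1] if index == 2 else [1, 0]

def decode_lfnst_index_tu(bins):
    """Inverse of the binarisation above (consumes one or two bins)."""
    if bins[0] == 0:
        return 0
    return 2 if bins[1] == 1 else 1
```

Note that value 2 needs no terminating zero bin because the codeword is truncated at the maximum index, which is what keeps the codeword to at most two bins.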
[000173] The method 1400 continues from step 1454 to an encode luma residual step 1456. At the step 1456 the entropy encoder 338 encodes the residual coefficients 336 for the luma TB into the bitstream 115 when the coded block flag 1123a indicates the presence of significant residual coefficients in the luma TB 1128. The step 1456 operates to scan the residual coefficients 336, encoding each one into the bitstream 115 according to a backward diagonal scan pattern, with 4x4 sub-blocks. Residual coefficients are scanned and encoded from the last significant coefficient position, as encoded at step 1450, to the DC coefficient position. The step 1456 is performed when the CU includes a luma TB, i.e., in a shared coding tree (encoding the TB 1128).
[000174] The method 1400 continues from step 1456 to an encode chroma transform skip flags test step 1460. At the step 1460 the video encoder 114 determines if the chroma transform skip flags for the CU 1114, i.e. 1130 and 1134, need to be encoded into the bitstream 115. When a secondary transform is applied to the luma TB 1128 of the CU 1114, the chroma TBs 1132 and 1136 are subject to DCT-2 transform and there is no need to encode 1130 and 1134. The step 1460 returns "NO" and the method 1400 continues to an encode chroma residual step 1480. When a secondary transform is not applied to the luma TB 1128 of the CU 1114, each chroma TB 1132 and 1136 is subject to DCT-2 transformation or transform bypass and the flags 1130 and 1134 need to be encoded in the bitstream 115. The step 1460 returns "YES" and the method 1400 continues to an encode chroma transform skip flag step 1470.
[000175] At the step 1470 the entropy encoder 338 encodes a context-coded transform skip flag 390 into the bitstream 115, e.g. 1130 or 1134, indicating whether the corresponding chroma TB is to be subject to DCT-2 transform or transforming is to be bypassed.
[000176] The steps 1460 and 1470 are performed for each secondary colour channel, i.e. for Cb and Cr chroma channels.
[000177] The method 1400 continues from step 1470 to the encode chroma residuals step 1480. At the step 1480 the entropy encoder 338 encodes residual coefficients for the chroma TBs 1132 and 1136 into the bitstream 115, as described with reference to the steps 1450 and 1456. The method 1400 progresses from step 1480 to an all chroma channels test step 1485. Test step 1485 determines if all chroma channels (for example Cb and Cr) have been encoded. If so ("YES" at step 1485) the method 1400 progresses to an MTS signalling test step 1490. If a chroma channel is still to be encoded ("NO" at step 1485) the method 1400 returns to step 1460.
[000178] At the MTS signalling step 1490 the video encoder 114 determines if the MTS index needs to be encoded into the bitstream 115 or not. If use of the DCT-2 transform was selected at the step 1360, the last significant coefficient position may be anywhere in the upper-left 32x32 region of the TB. If the last significant coefficient position is outside of the top-left 16x16 region of the TB or any significant residual coefficients exist outside of the top-left 16x16 region of the TB, it is not necessary to explicitly signal mts_idx in the bitstream. The mts_idx syntax element is not required in the bitstream in this event because usage of MTS would not produce a last significant coefficient outside the top-left 16x16 region. The step 1490 returns "NO" and the method 1400 terminates, with DCT-2 usage implied by the last significant coefficient position.
[000179] Non-DCT-2 selections for the primary transform type are only available when the TB width and height are less than or equal to 32. Accordingly, for TBs of width or height exceeding 32, the step 1490 returns "NO" and the method 1400 terminates. Non-DCT-2 selections are also only available if the secondary transform is not applied; accordingly, if the secondary transform index 388 was determined to be non-zero at the step 1360, the step 1490 returns "NO" and the method 1400 terminates at the step 1490.
[000180] Presence of a last significant coefficient position within the top-left 16x16 region of the TB and significant residual coefficients also only within the top-left 16x16 region of the TB may result either from application of a DCT-2 primary transform or an MTS combination of DST-7 and/or DCT-8, necessitating explicit signalling of mts_idx to encode the selection made at the step 1360. Accordingly, when the last significant coefficient position is within the top-left 16x16 region of the TB, the step 1490 returns "YES" and the method 1400 progresses to an encode MTS index step 14100.
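The signalling decision of steps 1490 and [000179] can be collected into a single predicate, sketched below. The function and argument names are illustrative assumptions; the conditions (TB size limit, no secondary transform, all significant coefficients inside the top-left 16x16 region) are taken from the text.

```python
def mts_index_signalled(tb_w, tb_h, lfnst_index, last_pos,
                        any_coeff_outside_16x16):
    """Sketch of the step-1490 test: return True when mts_idx must be
    explicitly coded in the bitstream.

    last_pos                -- (x, y) of the last significant coefficient
    any_coeff_outside_16x16 -- True if any significant coefficient lies
                               outside the top-left 16x16 region
    """
    if tb_w > 32 or tb_h > 32:   # non-DCT-2 transforms unavailable above 32x32
        return False
    if lfnst_index != 0:         # MTS unavailable with a secondary transform
        return False
    x, y = last_pos
    if x >= 16 or y >= 16 or any_coeff_outside_16x16:
        return False             # DCT-2 implied; nothing to signal
    return True
```

When the predicate returns False, the decoder can infer DCT-2 from the coefficient positions alone, which is why the encoder omits mts_idx in those cases.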
[000181] At the encode MTS index step 14100 the entropy encoder 338 encodes a truncated unary bin string representing the primary transform type 389. The step 14100 can encode 1122 for example. The method 1400 terminates upon execution of step 14100.
[000182] Fig. 15 shows a method 1500 for decoding the bitstream 133 to produce frame data 135, the bitstream 133 including one or more slices as sequences of coding tree units. The method 1500 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1500 may be performed by the video decoder 134 under execution of the processor 205. The method 1500 is applicable when the video decoder 134 is configured to use a shared coding tree, which is the case for decoding P slices and B slices and
is one option for decoding I slices. The method 1500 confines signalling and application of the secondary transform to the luma channel only. As such, the method 1500 may be stored as one or more modules of the software 233 on computer-readable storage medium and/or in the memory 206.
[000183] The method 1500 begins at a decode SPS/PPS step 1510. At step 1510 the video decoder 134 decodes, for example, the SPS 1010 and the PPS 1012 from the bitstream 133 as sequences of fixed and variable length encoded parameters. Parameters of the frame data 113, such as resolution and sample bit depth, are decoded. Parameters of the bitstream, such as flags indicating the usage of particular coding tools, are also decoded. Default partition constraints signal the maximum allowed depths of binary, ternary, and quadtree splitting and are also decoded as part of the SPS 1010 by the video decoder 134.
[000184] The method 1500 continues from step 1510 to a decode picture header step 1520. In execution of step 1520 the processor 205 decodes the picture header 1015 from the bitstream 133, applicable to all slices in the current frame. The picture header includes parameters specifying the frequency with which 'delta QP' syntax elements are present in the bitstream 133, offsets for chroma QP relative to luma QP, and the like. Optional overridden partition constraints signal the maximum allowed depths of binary, ternary, and quadtree splitting and may also be decoded as part of the picture header 1015 by the video decoder 134.
[000185] The method 1500 continues from step 1520 to a decode slice header step 1530. At step 1530 the entropy decoder 420 decodes the slice header 1018 from the bitstream 133.
[000186] The method 1500 continues from step 1530 to a divide slice into CTUs step 1540. In execution of step 1540 the video decoder 134 divides the slice 1016 into a sequence of CTUs (such as the CTU 1110 for example). Slice boundaries are aligned to CTU boundaries and CTUs in a slice are ordered according to a CTU scan order, generally a raster scan order. The division of a slice into CTUs establishes which portion of the frame data 135 is to be processed by the video decoder 134 in decoding the current slice.
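The raster-scan CTU ordering used at step 1540 can be sketched as follows. The function name is an illustrative assumption; the sketch simply enumerates CTU top-left positions row by row, with edge CTUs cropped by the frame boundary.

```python
def ctu_raster_order(frame_w, frame_h, ctu_size):
    """Top-left (x, y) coordinates of CTUs in raster scan order.

    CTUs at the right and bottom edges may extend past the frame and are
    cropped by the frame boundary; hence the ceiling division below.
    """
    cols = -(-frame_w // ctu_size)  # ceil(frame_w / ctu_size)
    rows = -(-frame_h // ctu_size)  # ceil(frame_h / ctu_size)
    return [(c * ctu_size, r * ctu_size)
            for r in range(rows) for c in range(cols)]
```

For example, a 300x200 frame with 128-sample CTUs yields a 3x2 grid of six CTUs, scanned left to right, top to bottom.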
[000187] The method 1500 continues from step 1540 to a decode coding tree step 1550. At step 1550 the video decoder 134 decodes a coding tree for a current selected CTU in the slice. The method 1500 starts from the first CTU in the slice 1016 on the first invocation of the step 1550 and progresses to subsequent CTUs in the slice 1016 on subsequent invocations. In decoding the coding tree of a CTU, flags are decoded that are indicative of the combination of quadtree, binary, and ternary splits as determined at the step 1350 in the video encoder 114. The
step 1550 allows CUs for the primary (luma) channel and at least one secondary (chroma) channel to be determined from decoded flags such as the split flags.
[000188] The method 1500 continues from step 1550 to a decode coding unit step 1570. At step 1570 the video decoder 134 decodes the determined coding unit (for example the CU 1114) of the step 1550 from the bitstream 133. An example of how the coding unit is decoded is described in more detail with reference to Fig. 16.
[000189] The method 1500 continues from step 1570 to a last coding unit test step 1580. At step 1580 the processor 205 tests if the current coding unit is the last coding unit in the CTU. If not ("NO" at step 1580), control in the processor 205 returns to the decode coding unit step 1570. Otherwise, if the current coding unit is the last coding unit ("YES" at step 1580) control in the processor 205 progresses to a last CTU test step 1590.
[000190] At the last CTU test step 1590, the processor 205 tests if the current CTU is the last CTU in the slice 1016. If not the last CTU in the slice 1016 ("NO" at step 1590), control in the processor 205 returns to the decode coding tree step 1550. Otherwise, if the current CTU is the last ("YES" at step 1590), control in the processor progresses to a last slice test step 15100.
[000191] At the last slice test step 15100 the processor 205 tests if the current slice being decoded is the last slice in the frame. If the current slice is not the last slice ("NO" at step 15100), control in the processor 205 returns to the decode slice header step 1530. Otherwise, if the current slice is the last slice and all slices have been decoded ("YES" at step 15100) the method 1500 terminates.
[000192] Fig. 16 shows a method 1600 for decoding a coding unit from the bitstream 133, corresponding to the step 1570 of Fig. 15. The method 1600 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1600 may be performed by the video decoder 134 under execution of the processor 205. As such, the method 1600 may be stored on computer-readable storage medium and/or as one or more modules of the software 233 in the memory 206.
[000193] When a shared coding tree is in use, the method 1600 is invoked for each CU in the coding tree, e.g. CU 1114 of Fig. 11, with Y, Cb, and Cr colour channels being decoded in a single invocation.
[000194] The method 1600 starts at a decode luma transform skip flag step 1610. At the step 1610 the entropy decoder 420 decodes a context-coded transform skip flag 478 (for example encoded in the bitstream as 1126 in Fig. 11) from the bitstream 133. The skip flag indicates whether transforms (primary transform and, optionally, secondary transform) are to be applied to the luma TB (for example 1128) or not. The transform skip flag 478 (1126) indicates that the residual for the luma TB is to be transformed according to (i) a primary transform or a primary transform and a secondary transform, or (ii) that primary and secondary transforms are to be bypassed. The step 1610 is performed when the CU includes a luma TB in a shared coding tree (decoding 1126 for example).
[000195] The method 1600 continues from step 1610 to a decode luma residual last position step 1630. At the step 1630 the entropy decoder 420 decodes the position of the last non-zero residual coefficient in the transform block from the bitstream 133 as a Cartesian co-ordinate relative to the top-left coefficient of the transform block, i.e. 1140, when the coded block flag 1123a indicates the presence of significant residual coefficients in the luma TB 1128. In the context of the arrangements described, 'last' is defined as the last scan position when the residual coefficients of the TB are scanned in a forward direction using a 4x4 sub-block based diagonal scan.
[000196] The method 1600 continues from step 1630 to an LFNST signalling test step 1632. At the step 1632 the processor 205 determines if the secondary transform may be applicable to the luma TB 1128 of the CU 1114 or not. If the luma TB transform skip flag 1126 indicates use of transform skip for the luma TB 1128, then the secondary transform is not applicable and there is no need to decode a secondary transform index ('NO' at step 1632). When the coded block flag 1123a indicates that there were no significant residual coefficients in the luma TB 1128 the step 1632 returns "NO". In either case the method 1600 progresses to a decode luma residual step 1636, with the secondary transform index (1142) inferred as zero (not applied). For the secondary transform to be applicable, the luma TB needs to include significant residual coefficients only in the positions of the TB that are subject to secondary transformation. That is, all other residual coefficients must be zero, a condition achieved when the last position of a TB is within 806, 824, 842, or 862 for the TB sizes shown in Figs. 8A-8D. If the last position of any TB in the CU is outside of 806, 824, 842, or 862 for the considered TB size, secondary transformation is not performed ('NO' at step 1632) and the method 1600 progresses to the decode luma residual step 1636, with the secondary transform index inferred as zero (not applied). An additional condition on performing the secondary transform is the presence of at least one AC residual coefficient in the residual of the luma TB 1128. That is, if the only significant residual coefficient is at the DC (top-left) position of each TB, i.e. the last position decoded at step 1630 is equal to (0, 0), then the secondary transform is not applicable ('NO' at step 1632) and the method 1600 progresses to the decode luma residual step 1636 with the secondary transform index inferred as zero (not applied). Provided the luma TB 1128 is transformed and the last position constraints are met (at least one AC residual coefficient, and the last significant coefficient within the region subject to inverse secondary transformation) ('YES' at step 1632), control in the processor 205 progresses to a decode luma LFNST index step 1634.
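The decoder-side applicability test of step 1632 can be collected into a single predicate, sketched below. Function and argument names are illustrative assumptions; the conditions (no transform skip, a significant luma residual, a non-DC last position, and a last position inside the secondary-transform region) are taken from the text.

```python
def lfnst_index_present(transform_skip, cbf_luma, last_pos, lfnst_region):
    """Sketch of the step-1632 test: True when an LFNST index is decoded
    from the bitstream, False when it is inferred as zero.

    last_pos     -- (x, y) of the last significant luma coefficient
    lfnst_region -- set of (x, y) positions that may be significant after
                    the secondary transform (e.g. region 806/824/842/862,
                    depending on TB size)
    """
    if transform_skip or not cbf_luma:
        return False                  # transform skipped, or no residual
    if last_pos == (0, 0):
        return False                  # DC-only residual: LFNST not applicable
    return last_pos in lfnst_region   # last coefficient inside LFNST region
```

A usage example with a hypothetical 4x2 region standing in for one of the regions 806/824/842/862 is included in the checks below.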
[000197] At the decode luma LFNST index step 1634, the entropy decoder 420 decodes a truncated unary codeword, i.e. 1142, as secondary transform index 474 indicating three possible selections for application of the secondary transform. The selections are zero (not applied), one (first kernel of candidate set applied), and two (second kernel of candidate set applied). The codeword uses at most two bins, each of which is context coded. By virtue of the testing performed at the step 1632, the step 1634 is only performed when it is possible for the secondary transform to be applied, i.e. for non-zero indices to be decoded. The steps 1632 and 1634 operate to determine the LFNST index (1142) of a coding unit belonging to a shared coding tree, that is, 1120 or 474. The LFNST index is decoded from the video bitstream if the luma transform skip flag 1126 indicates that a transform of the respective transform block, i.e. 1128, is not to be skipped ("YES" at step 1632 and performing step 1634).
[000198] The method 1600 continues from step 1632 or 1634 to the decode luma residual step 1636. At the step 1636 the entropy decoder 420 decodes the residual coefficients 424 (for example 1144) for the luma TB (1128) from the bitstream 133 when the coded block flag 1123a indicates the presence of significant residual coefficients in the luma TB 1128. The residual coefficients 424 are assembled into a TB by applying a scan to the sequentially decoded residual coefficients. The scan is typically a backward diagonal scan pattern, using 4x4 sub-blocks. The remaining residual coefficients are decoded in order from the coefficient at the last position, as determined at the step 1630, to the DC (top-left) residual coefficient, as luma residual coefficients 1144 of the luma TB 1128. For each sub-block other than the top-left sub-block of the TB and the sub-block containing the last significant residual coefficient, a 'coded sub-block flag' is decoded to indicate the presence of at least one significant residual coefficient in the respective sub-block. If the coded sub-block flag indicates the presence of at least one significant residual coefficient in a sub-block then a 'significance map', a set of flags, is decoded indicating the significance of each residual coefficient in the sub-block. If a sub-block is indicated to include at least one significant residual coefficient from a decoded coded sub-block flag and the scan reaches the last scan position of the sub-block without encountering a significant residual coefficient, then the residual coefficient at the last scan position in the sub-block is inferred to be significant. The coded sub-block flags and the significance map (each flag being named 'sig_coeff_flag') are coded using context-coded bins. For each significant residual coefficient in a sub-block an 'abs_level_gtx_flag' is decoded, indicating if the magnitude of the corresponding residual coefficient is greater than one or not. For each residual coefficient in a sub-block having a magnitude greater than one, a 'par_level_flag' and an 'abs_level_gtx_flag2' are decoded to further determine the magnitude of the residual coefficient, according to the formula: AbsLevelPass1 = sig_coeff_flag + par_level_flag + abs_level_gtx_flag + 2 × abs_level_gtx_flag2. The abs_level_gtx_flag and abs_level_gtx_flag2 syntax elements are coded using context-coded bins. For each residual coefficient having abs_level_gtx_flag2 equal to one, a bypass-coded syntax element 'abs_remainder' is decoded, using Golomb-Rice coding. The decoded magnitude for a residual coefficient is determined as: AbsLevel = AbsLevelPass1 + 2 × abs_remainder. A sign bit is decoded for each significant residual coefficient to derive the residual coefficient value from the residual coefficient magnitude. The Cartesian co-ordinates of each sub-block in a scan pattern may be derived from the scan pattern by adjusting (right shifting) the X and Y residual coefficient Cartesian co-ordinates by the log2 of the sub-block width and height, respectively. For luma TBs the sub-block size is always 4x4, resulting in right-shifts of two bits for X and Y. The step 1636 is performed when the CU includes a luma TB, i.e., in a shared coding tree (decoding 1128) or for an invocation for a luma branch of a dual tree (decoding 1234 for example).
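The magnitude reconstruction formulas above can be written as a small helper; the function name is an illustrative assumption, and the arithmetic mirrors the two formulas in the text directly.

```python
def abs_level(sig_coeff_flag, par_level_flag, abs_level_gtx_flag,
              abs_level_gtx_flag2, abs_remainder):
    """Reconstruct a residual coefficient magnitude from its decoded syntax
    elements: first AbsLevelPass1, then the bypass-coded remainder."""
    abs_level_pass1 = (sig_coeff_flag + par_level_flag
                       + abs_level_gtx_flag + 2 * abs_level_gtx_flag2)
    # AbsLevel = AbsLevelPass1 + 2 x abs_remainder
    return abs_level_pass1 + 2 * abs_remainder
```

For example, a coefficient that is significant, greater than one, with both gtx flags set and a remainder of 2 reconstructs to 1 + 0 + 1 + 2 + 2 × 2 = 8.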
[000199] The method 1600 continues from step 1636 to a decode chroma transform skip flag test step 1640. At the step 1640 the processor 205 determines if a chroma TB, e.g. 1132 or 1136, may be subject to transform skip. If the secondary transform index 1142 decoded at the step 1634 indicates application of a secondary transform to the luma TB 1128, the chroma TBs are required to use a DCT-2 transform and there is no need to decode chroma transform skip flags. The step 1640 returns "NO" and the method 1600 continues to a decode chroma residual step 1660.
[000200] If either the secondary transform index 1142 decoded at the step 1634 or operation of the step 1632 indicates no application of a secondary transform to the luma TB 1128, then the chroma TBs (1132 and 1136) may either use DCT-2 transform or transform skip. The step 1640 returns "YES" and the method 1600 continues to a decode chroma transform skip flag step 1650.
[000201] At the step 1650 the entropy decoder 420 decodes a context-coded flag from the bitstream 133 for a chroma TB. For example, the context-coded flag may have been encoded as 1130 and 1134 for 1132 and 1136 respectively in Fig. 11. The flags decoded at step 1650 indicate whether transforms are to be applied to a corresponding chroma TB, in particular whether the corresponding chroma TB is to be subject to DCT-2 transform, or whether all transforming for the corresponding chroma TB is to be bypassed. The step 1650 is performed when the CU includes chroma TBs, i.e., the CU belongs to a shared coding tree (decoding 1130 and 1134).
[000202] The method 1600 continues from step 1650 to the decode chroma residual step 1660. At the step 1660 the entropy decoder 420 decodes residual coefficients for a chroma TB from the bitstream 133. The step 1660 operates in a similar manner to that described with reference to steps 1630 and 1636 for a luma TB. The step 1660 is performed when the CU includes chroma TBs, i.e., when the CU belongs to a shared coding tree.
[000203] The steps 1640, 1650, and 1660 are performed for each chroma channel, i.e. for the Cb and Cr channels. Control in the processor 205 progresses from the step 1660 to an all chroma channels test step 1665. The step 1665 determines if all chroma channels have been decoded (for example both Cb and Cr channels). If all chroma channels have been decoded, the step 1665 returns "YES" and the method 1600 continues to an MTS signalling step 1670. If all chroma channels have not been decoded ("NO" at step 1665), the method 1600 returns to the step 1640 and selects a next chroma channel.
[000204] At the MTS signalling step 1670 the video decoder 134 determines whether the MTS index (for example, 1122) needs to be decoded from the bitstream 133 or not. If use of the DCT-2 transform for the luma TB was selected at the step 1360, when encoding the bitstream, then the last significant coefficient position may be anywhere in the upper-left 32x32 region of the TB. If the last significant coefficient position decoded at the step 1630 is outside of the top-left 16x16 region of the TB, it is not necessary to explicitly decode an mts_idx because usage of any non-DCT-2 primary transform would not produce a last significant coefficient outside this region. The step 1670 returns "NO" and the method 1600 progresses from the step 1670 to a determine MTS index step 1674. Non-DCT-2 primary transforms are only available when the TB width and height are less than or equal to 32. Accordingly, for TBs of width or height exceeding 32, the step 1670 returns "NO" and the method 1600 progresses to the determine MTS index step 1674.
[000205] Non-DCT-2 primary transforms are only available when the secondary transform type 474 indicates bypassing application of a secondary transform kernel. Accordingly, when the secondary transform type 474 or 1142 has a non-zero value, the step 1670 returns "NO" and the method 1600 progresses from the step 1670 to the step 1674. Presence of a last significant coefficient position within the top-left 16x16 region of the luma TB 1128, and only significant residual coefficients within the top-left 16x16 region of the luma TB 1128, may result either from application of a DCT-2 primary transform or an MTS combination of DST-7 and/or DCT-8, necessitating explicit signalling of mts_idx to encode the selection made at the step 1360. Accordingly, when the last significant coefficient position is within the top-left 16x16 region of the luma TB 1128 and significant residual coefficients are present only within the top-left 16x16 region of the luma TB 1128, the step 1670 returns "YES" and the method 1600 progresses to a decode MTS index step 1676.
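The conditions governing the step 1670, as described in the preceding two paragraphs, can be collected into a single predicate. The following is an illustrative sketch only; the function name and parameter names are assumptions, not syntax from this document.

```cpp
#include <cassert>

// Returns true when mts_idx needs to be explicitly decoded for a luma TB.
// lastX/lastY: last significant coefficient position within the TB;
// secondaryApplied: a non-zero secondary transform index (474 or 1142).
bool mtsIndexPresent(int tbWidth, int tbHeight, int lastX, int lastY,
                     bool secondaryApplied) {
    if (tbWidth > 32 || tbHeight > 32)
        return false;  // non-DCT-2 primary transforms unavailable
    if (secondaryApplied)
        return false;  // a secondary transform implies DCT-2 primary
    if (lastX >= 16 || lastY >= 16)
        return false;  // outside top-left 16x16: only DCT-2 possible
    return true;       // decode mts_idx from the bitstream
}
```

When the predicate is false the decoder proceeds to the determine MTS index step 1674 and infers DCT-2; when true, it proceeds to the decode MTS index step 1676.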
[000206] At the determine MTS index step 1674 the video decoder 134 determines that DCT-2 is to be used as the primary transform. The primary transform type 476 is set to zero. The method 1600 progresses from the step 1674 to a transform residuals step 1680.
[000207] At the decode MTS index step 1676 the entropy decoder 420 decodes a truncated unary bin string from the bitstream 133 to determine the primary transform type 476. The truncated unary bin string is present in the bitstream as 1122 in Fig. 11, for example. The method 1600 progresses from the step 1676 to the transform residuals step 1680.
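A truncated unary bin string codes a value as a run of '1' bins terminated by a '0', with the terminating '0' omitted for the maximum value. A minimal sketch follows; the bins are shown as a string for illustration (a real decoder consumes arithmetic-decoded bins), and the maximum value of 4 is an assumption for the example.

```cpp
#include <cassert>
#include <string>

// Decode a truncated unary value: count leading '1' bins, stopping at a
// '0' bin or when the maximum value is reached (the largest value is
// coded without a terminating '0').
int decodeTruncatedUnary(const std::string& bins, int maxVal) {
    int value = 0;
    for (char b : bins) {
        if (b == '0')
            break;
        ++value;
        if (value == maxVal)
            break;
    }
    return value;
}
```

For instance, with a maximum of 4, the strings "0", "10", "110", and "1111" decode to 0, 1, 2, and 4 respectively.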
[000208] At the transform residuals step 1680 the video decoder 134, under execution of the processor 205, either (i) bypasses inverse primary and inverse secondary transforms on the residual decoded at the steps 1630 and 1636 for the luma TB 1128 or (ii) performs inverse transforms according to the primary transform type 476 and the secondary transform index 474 for the luma TB 1128 of the CU 1114, in accordance with the decoded luma transform skip flag 1126, as described with reference to Fig. 4. The chroma TBs 1132 and 1136 are either DCT-2 inverse transformed or the transform is bypassed at step 1680, in accordance with the result of operation of the steps 1650 and 1660 for each chroma channel. The primary transform type 476 selects between use of DCT-2 horizontally and vertically, or combinations of DCT-8 and DST-7 horizontally and vertically, for the luma TB 1128 of the coding unit 1114. Effectively, step 1680 transforms the luma transform block (1128) of the CU according to the decoded luma transform skip flag (1126), the primary transform type 476, and the secondary transform index (1142) determined by operation of steps 1610, 1634, and 1670-1676 to decode the coding unit. The step 1680 also transforms the chroma transform blocks (1132, 1136) of the CU according to the respective
decoded chroma transform skip flags (1130, 1134) by operation of steps 1640 and 1650 to decode the coding unit.
[000209] The method 1600 continues from step 1680 to a generate prediction block step 1690. At step 1690 the video decoder 134 generates the prediction block 452 for each TB of the CU according to the luma and chroma prediction modes for the CU, as determined at the step 1360 and decoded from the bitstream 133 by the entropy decoder 420. The entropy decoder 420 decodes the luma and chroma prediction modes for the coding unit (for example 1116), as determined at the step 1360, from the bitstream 133. A 'pred mode' syntax element is decoded to distinguish between use of intra prediction, inter prediction, or other prediction modes for the coding unit. If intra prediction is used for the coding unit then a luma intra prediction mode is decoded if a luma PB is applicable to the CU and a chroma intra prediction mode is decoded. If intra prediction is used, a luma PB and a pair of chroma PBs are generated according to the decoded luma intra prediction mode and chroma intra prediction mode, respectively.
[000210] The method 1600 continues from step 1690 to a reconstruct coding unit step 16100. At the step 16100 the prediction blocks 452, that is the PBs resulting from the step 1690, are added to the residual samples 424 (the TBs resulting from the steps 1630, 1636, and 1660) for each colour channel of the CU to produce the reconstructed samples 456. Additional in-loop filtering steps, such as deblocking, may be applied to the reconstructed samples 456 before they are output as frame data 135. The method 1600 terminates on execution of the step 16100.
[000211] In an arrangement of the video encoder 114 and the video decoder 134, the secondary transform is applied to chroma channels when the chroma format is 4:2:2 or 4:4:4. Chroma formats of 4:2:2 and 4:4:4 are already more costly to implement due to the increase in the quantity of chroma samples and are reserved for applications requiring higher chroma fidelity. Professional applications, such as video editing and cinematic production, are examples of applications requiring higher chroma fidelity.
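The increase in chroma sample quantity for 4:2:2 and 4:4:4 noted above can be illustrated via the usual subsampling shifts. This is a sketch only; the per-format shift values are the conventional ones for these chroma formats, not figures quoted from this document.

```cpp
#include <cassert>

// Chroma block dimensions from luma dimensions and subsampling shifts:
// 4:2:0 -> ssX=1, ssY=1 (quarter the samples per chroma channel);
// 4:2:2 -> ssX=1, ssY=0 (half);
// 4:4:4 -> ssX=0, ssY=0 (same as luma).
void chromaDims(int lumaW, int lumaH, int ssX, int ssY, int& cw, int& ch) {
    cw = lumaW >> ssX;
    ch = lumaH >> ssY;
}
```

A 64x64 luma area thus corresponds to 32x32, 32x64, or 64x64 samples per chroma channel for 4:2:0, 4:2:2, and 4:4:4 respectively, which is why the latter two formats carry a higher implementation cost.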
[000212] Fig. 17 shows a method 1700 for encoding the frame data 113 into the bitstream 115, the bitstream 115 including one or more slices, e.g. 1204, as sequences of coding tree units. The method 1700 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1700 may be performed by the video encoder 114 under execution of the processor 205. The method 1700 is applicable when the video encoder 114 is configured to use a separate coding tree, which is one option for producing encoded I slices. The method 1700 confines signalling and application of the
secondary transform to the luma channel only. As such, the method 1700 may be implemented as modules of the software 233 stored on computer-readable storage medium and/or in the memory 206.
[000213] The method 1700 begins at an encode SPS/PPS step 1710. The step 1710 is similar to the step 1310. At step 1710 the video encoder 114 encodes the SPS 1010 and the PPS 1012 into the bitstream 115 as sequences of fixed and variable length encoded parameters. Parameters of the frame data 113, such as resolution and sample bit depth, are encoded. Parameters of the bitstream, such as flags indicating the usage of particular coding tools, are also encoded. The picture parameter set includes parameters specifying the frequency with which 'delta QP' syntax elements are present in the bitstream 115, offsets for chroma QP relative to luma QP, and the like.
[000214] The method 1700 continues from step 1710 to an encode picture header step 1720. The step 1720 is similar to the step 1320. In execution of step 1720 the processor 205 encodes the picture header (for example 1015) into the bitstream 115, the picture header 1015 being applicable to all slices in the current frame. The picture header 1015 may include partition constraints signalling the maximum allowed depths of binary, ternary, and quadtree splitting, overriding similar constraints included as part of the SPS 1010.
[000215] The method 1700 continues from step 1720 to an encode slice header step 1730. At step 1730 the entropy encoder 338 encodes the slice header 1018 into the bitstream 115.
[000216] The method 1700 continues from step 1730 to a divide slice into CTUs step 1740. In execution of step 1740 the video encoder 114 divides the slice 1016 into a sequence of CTUs (such as the sequence including the CTU 1210 for example). Slice boundaries are aligned to CTU boundaries and CTUs in a slice are ordered according to a CTU scan order, generally a raster scan order. The division of a slice into CTUs establishes an order in which portions of the frame data 113 are to be processed by the video encoder 114 in encoding each current slice.
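The division of a slice into CTUs in raster scan order, as described for step 1740, can be sketched as follows. The function name is an assumption for illustration only.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Enumerate the CTU origins of a slice region in raster scan order (the
// CTU scan order named in the text). Regions whose dimensions are not
// multiples of the CTU size yield partial CTUs at the right and bottom
// edges, so the count rounds up.
std::vector<std::pair<int, int>> ctuRasterScan(int width, int height,
                                               int ctuSize) {
    std::vector<std::pair<int, int>> order;
    for (int y = 0; y < height; y += ctuSize)
        for (int x = 0; x < width; x += ctuSize)
            order.emplace_back(x, y);
    return order;
}
```

For a 256x192 region with 128x128 CTUs this produces four CTUs, ordered left to right and then top to bottom, establishing the order in which portions of the frame data are processed.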
[000217] The method 1700 continues from step 1740 to a determine luma coding tree step 1750. At step 1750 the video encoder 114 determines a luma coding tree for a current selected CTU in the slice. The method 1700 starts from the first CTU in the slice 1016 on the first invocation of the step 1750 and progresses to subsequent CTUs in the slice 1016 on subsequent invocations. In determining the luma coding tree of a CTU, a variety of combinations of quadtree, binary, and ternary splits are generated by the block partitioner 310
and tested. The step 1750 operates to determine a CU for the primary (luma) channel from flags such as the split flag for the luma channel.
[000219] The method 1700 continues from step 1750 to a determine luma coding units and LFNST indices step 1760. At the step 1760, for each luma coding unit of the determined luma coding tree of the step 1750, an intra prediction mode, a luma transform skip flag, a primary transform type, and a secondary transform type is determined. Referring to the example of Fig. 12, at step 1760 the prediction mode 1221, the transform skip flag 1232, the primary transform index 1226 and the secondary transform index 1237 are determined. In the example of Fig. 12, the index 1237 is located immediately after the last position flag of the luma TB 1234 rather than at a CU level (for example, adjacent the MTS index 1226). Similarly to the location of the index 1142 in Fig. 11, location of the index 1237 in the luma TB 1234 reduces complexity and latency compared to location at the CU level. Locating the secondary transform index 1237 within the luma TB 1234, such as adjacent the last position flag 1236 as shown in Fig. 12, can accordingly improve latency. The step 1760 operates to determine, for each CU of the primary channel, an index (1237) to select the secondary transform kernel corresponding to each CU.
[000220] In determining a luma intra prediction mode (1221), a two-stage search is performed. In the first stage, an approximation of the distortion resulting from each candidate intra prediction mode is generated. A set of intra prediction modes offering the lowest-distortion approximated cost is produced. The produced set is fully searched with more precise cost and distortion accounting to select a final intra prediction mode for the luma CU ("full rate distortion optimisation" or "full RDO"). The size of the determined set is dependent upon the block size. In the absence of a need to select a chroma secondary transform index, a larger set size is possible than with selection of a chroma secondary transform index. Use of the larger set size enables improved compression efficiency with the existing intra-prediction modes and transforms available to the luma and chroma channels. The absence of secondary transforms in chroma can result in a larger bitrate (reduced compression efficiency) in some instances. However, the luma compression efficiency can be improved with an increased set size for full RDO to an extent to compensate fully or partially for the larger bitrate. An example of increased set size (each set increased by one with respect to operation of the "VTM-8.0" VVC software reference model) is shown in Table 1 as follows:
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] =
24493980_1
{
{4,4,4,4,3,3},// 4x4, 4x8, 4x16, 4x32, 4x64, 4x128,
{4,4,4,4,4,3},// 8x4, 8x8, 8x16, 8x32, 8x64, 8x128,
{4,4,4,4,4,3},// 16x4, 16x8, 16x16, 16x32, 16x64, 16x128,
{4,4,4,4,4,3},// 32x4, 32x8, 32x16, 32x32, 32x64, 32x128,
{3,4,4,4,4,3},// 64x4, 64x8, 64x16, 64x32, 64x64, 64x128,
{3,3,3,3,3,4}, //128x4,128x8,128x16,128x32,128x64,128x128,
};
Table 1 Increased set size of one mode.
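Table 1 is indexed by the log2 of the block width (rows) and height (columns), as the row and column comments indicate. A sketch of the lookup follows; the indexing convention (log2 sizes 2 through 7 mapping to indices 0 through 5) is inferred from those comments, and the function name is hypothetical.

```cpp
#include <cassert>

// Number of intra prediction modes retained for full RDO, per Table 1.
// Rows are indexed by log2(width) - 2, columns by log2(height) - 2,
// covering block dimensions 4 through 128.
const unsigned char kFullRdoSetSize[6][6] = {
    {4,4,4,4,3,3},  //   4x4 ..   4x128
    {4,4,4,4,4,3},  //   8x4 ..   8x128
    {4,4,4,4,4,3},  //  16x4 ..  16x128
    {4,4,4,4,4,3},  //  32x4 ..  32x128
    {3,4,4,4,4,3},  //  64x4 ..  64x128
    {3,3,3,3,3,4},  // 128x4 .. 128x128
};

int fullRdoSetSize(int log2Width, int log2Height) {
    return kFullRdoSetSize[log2Width - 2][log2Height - 2];
}
```

For example, a 4x4 block (log2 sizes 2, 2) retains four candidate modes for full RDO, while a highly elongated 4x64 block retains only three.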
[000221] Further increases beyond the specified reference set sizes for which full RDO is applied are also possible, such as an increase of two for each block size, as shown in Table 2 below:
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] =
{
{5,5,5,5,4,4},// 4x4, 4x8, 4x16, 4x32, 4x64, 4x128,
{5,5,5,5,5,4},// 8x4, 8x8, 8x16, 8x32, 8x64, 8x128,
{5,5,5,5,5,4},// 16x4, 16x8, 16x16, 16x32, 16x64, 16x128,
{5,5,5,5,5,4},// 32x4, 32x8, 32x16, 32x32, 32x64, 32x128,
{4,5,5,5,5,4},// 64x4, 64x8, 64x16, 64x32, 64x64, 64x128,
{3,3,3,3,3,4}, //128x4,128x8,128x16,128x32,128x64,128x128,
};
Table 2 Increased set size of two modes.
[000222] Experiments show that when Table 2 is used, the coding performance in luma is overall 0.0% under 'All Intra' configuration under the JVET 'common test conditions' compared with VTM-8.0 (using secondary transforms in chroma) with encoder runtime of approximately 87% compared to VTM-8.0, indicating that further increase in the set size is possible without exceeding the encoder complexity of VTM-8.0.
[000223] The set size for popular block sizes may be further increased compared to other, less popular, block sizes. Generally, blocks with a square, or close to square, aspect ratio are more popular than highly elongated blocks. Latency is saved by locating the secondary transform index within the luma TB rather than at the CU level, as described with reference to Figs. 11 and 12 for example. The saving in encoder time due to searching for secondary transform indices for the luma channel only (1237) in a separate coding tree (1214) allows the set of intra prediction modes used to select the mode 1221 to be extended by an integer number of modes compared to the "VTM-8.0" VVC software reference model.
[000224] For each luma CU (such as the CU 1220), either application of transform skip (1232) to the luma CU or application of a primary, and optionally a secondary, transform is determined (for example by determining the indices 1226 and 1237 respectively). If the luma CU is determined to use a DCT-2 as the primary transform, bypass of the secondary transform and application of each one of two possible secondary transform kernels is also tested.
[000225] The method 1700 continues from step 1760 to a determine chroma coding tree step 1762. At step 1762 the video encoder 114 determines a chroma coding tree for a current selected CTU in the slice. In determining the chroma coding tree of a CTU, a variety of combinations of quadtree, binary, and ternary splits are generated by the block partitioner 310 and tested. The step 1762 operates to determine a CU for at least one secondary (chroma) channel from flags such as the split flag for the chroma channels.
[000226] The method 1700 continues from step 1762 to a determine chroma coding units step 1764. At the step 1764, for each chroma coding unit of the determined chroma coding tree of the step 1762 (for example the CU 1250), an intra prediction mode (1251) and a chroma transform skip flag (1262) is determined. In particular, as there is no support for secondary transforms in the chroma CUs of the chroma coding tree, there is no need to select secondary transform indices for the chroma coding tree in execution of the method 1700.
[000227] The method 1700 continues from step 1764 to an encode luma coding units step 1770. At step 1770 the video encoder 114 encodes the determined luma coding tree of the step 1750
and the determined luma coding units (such as 1220) and LFNST indices (such as 1237) of the step 1760 into the bitstream 115, resulting in node 1214a of Fig. 12.
[000228] The method 1700 continues from step 1770 to an encode chroma coding units step 1780. At step 1780 the video encoder 114 encodes the determined chroma coding tree of the step 1762 and the determined chroma coding units (such as 1250) of the step 1764 into the bitstream 115, resulting in node 1214b of Fig. 12.
[000229] The method 1700 continues from step 1780 to a last CTU test step 1790. At the last CTU test step 1790 the processor 205 tests if the current CTU is the last CTU in the slice 1204. If the current CTU is not the last CTU in the slice 1204 ("NO" at step 1790), control in the processor 205 returns to the determine luma coding tree step 1750. Otherwise, if the current CTU is the last ("YES" at step 1790), control in the processor 205 progresses to a last slice test step 17100.
[000230] At the last slice test step 17100 the processor 205 tests if the current slice being encoded is the last slice in the frame. If the current slice is not the last slice ("NO" at step 17100), control in the processor 205 returns to the encode slice header step 1730. Otherwise, if the current slice is the last slice and all slices have been encoded ("YES" at step 17100) the method 1700 terminates.
[000231] Fig. 18 shows a method 1800 for decoding the frame data 135 from the bitstream 133, the bitstream 133 including one or more slices, e.g. 1204, as sequences of coding tree units. The method 1800 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1800 may be performed by the video decoder 134 under execution of the processor 205. The method 1800 is applicable when the video decoder 134 is configured to use a separate coding tree, which is one option for encoded I slices. The method 1800 confines signalling and application of the secondary transform to the luma channel only. As such, the method 1800 may be implemented as modules of the software 233 stored on computer-readable storage medium and/or in the memory 206.
[000232] The method 1800 begins at a decode SPS/PPS step 1810. The step 1810 is similar to the step 1510. At step 1810 the video decoder 134 decodes the SPS 1010 and the PPS 1012 from the bitstream 133 as sequences of fixed and variable length encoded parameters. Parameters of the frame data 135, such as resolution and sample bit depth, are decoded. Parameters of the bitstream, such as flags indicating the usage of particular coding tools, are also decoded. The picture parameter set includes parameters specifying the frequency with which 'delta QP'
syntax elements are present in the bitstream 133, offsets for chroma QP relative to luma QP, and the like.
[000233] The method 1800 continues from step 1810 to a decode picture header step 1820. The step 1820 is similar to the step 1520. In execution of step 1820 the processor 205 decodes the picture header (for example 1015) from the bitstream 133, the picture header 1015 being applicable to all slices in the current frame. The picture header 1015 may include partition constraints signalling the maximum allowed depths of binary, ternary, and quadtree splitting, overriding similar constraints included as part of the SPS 1010.
[000234] The method 1800 continues from step 1820 to a decode slice header step 1830. At step 1830 the entropy decoder 420 decodes the slice header 1018 from the bitstream 133. The step 1830 is similar to the step 1530.
[000235] The method 1800 continues from step 1830 to a divide slice into CTUs step 1840. In execution of step 1840 the video decoder 134 divides the slice 1016 into a sequence of CTUs. Slice boundaries are aligned to CTU boundaries and CTUs in a slice are ordered according to a CTU scan order, generally a raster scan order. The division of a slice into CTUs establishes an order in which portions of the frame data 133 are to be processed by the video decoder 134 in decoding each current slice. The step 1840 is similar to the step 1540.
[000236] The method 1800 continues from step 1840 to a decode luma coding tree step 1850. At step 1850 the video decoder 134 decodes a luma coding tree (such as 1214a) for a current selected CTU in the slice. The method 1800 starts from the first CTU in the slice 1016 on the first invocation of the step 1850 and progresses to subsequent CTUs in the slice 1016 on subsequent invocations. In decoding the luma coding tree of a CTU, a variety of split flags signalling quadtree, binary, and ternary splits are decoded. The step 1850 operates to determine a CU for the primary (luma) channel from decoded flags such as the split flag for the luma channel.
[000237] The method 1800 continues from step 1850 to a decode luma coding units and LFNST indices step 1860. At the step 1860, for each luma coding unit (such as 1220) of the determined luma coding tree of the step 1850, an intra prediction mode (1221), a luma transform skip flag (1232), a primary transform type (1226), and a secondary transform type (1237) is decoded. The step 1860 operates to determine, for each CU of the primary channel, an index (1237) to select the secondary transform kernel corresponding to each CU.
[000238] For each luma CU, either application of transform skip to the luma CU or application of a primary, and optionally a secondary, transform is determined by decoding the luma transform skip flag 1232. If the luma TB 1234 is determined to use either DCT-2 or an MTS transform as the primary transform type, the luma last position 1236 is decoded at the step 1860 from the bitstream, indicating the position of the last significant residual coefficient in the luma TB 1234. If the luma CU is determined to use DCT-2 as the primary transform and the last position 1236 indicates a non-DC residual coefficient and is within the range of secondary transform coefficients (i.e. either up to scan position 7 or 15, i.e. 806, 824, 842, or 862, depending on the TB size as described with reference to Figs. 8A-8D), then the secondary transform index 1237 is decoded to determine whether bypass of the secondary transform or application of one of two possible secondary transform kernels is to be performed.
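The condition for decoding the secondary transform index 1237 can be sketched as a predicate. This is an illustrative sketch; the function name is hypothetical, and the choice of the shorter range (scan positions up to 7) for 4x4 and 8x8 TBs is an assumption consistent with the two ranges named in the text.

```cpp
#include <cassert>

// Returns true when the secondary transform (LFNST) index is decoded for
// a luma TB: DCT-2 must be the primary transform and the last significant
// coefficient must be a non-DC position within the secondary transform
// coefficient range (scan position up to 7 for 4x4 and 8x8 TBs, assumed;
// up to 15 otherwise).
bool lfnstIndexPresent(bool dct2Primary, int lastScanPos, int tbW, int tbH) {
    if (!dct2Primary)
        return false;
    int maxScanPos = ((tbW == 4 && tbH == 4) || (tbW == 8 && tbH == 8)) ? 7 : 15;
    return lastScanPos > 0 && lastScanPos <= maxScanPos;
}
```

A DC-only residual (last scan position zero) or any residual produced by a non-DCT-2 primary transform thus implies bypass of the secondary transform without signalling.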
[000239] The method 1800 continues from step 1860 to a decode chroma coding tree step 1862. At step 1862 the video decoder 134 decodes a chroma coding tree (such as 1214b) for a current selected CTU in the slice. In decoding the chroma coding tree of a CTU, split flags indicating the combinations of quadtree, binary, and ternary splits in the chroma coding tree are decoded from the bitstream 133 by the entropy decoder 420. Each decoded chroma coding tree overlaps a corresponding luma coding tree. Each collocated luma and chroma coding tree typically covers a 64x64 pixel region of the frame, corresponding to firstly performing a quadtree split of a 128x128 CTU into four quadrants followed by subsequent separation into luma and chroma trees, as described with reference to Fig. 12. The step 1862 operates to determine CUs for at least one secondary (chroma) channel from decoded flags such as the split flag for the chroma channels.
[000240] The method 1800 continues from step 1862 to a decode chroma coding units step 1864. At the step 1864, for each chroma coding unit (such as 1250) of the decoded chroma coding tree of the step 1862, an intra prediction mode (1251) and a chroma transform skip flag (1262 or 1266) is decoded from the bitstream 133. In particular, as there is no support for secondary transforms in the chroma CUs of the chroma coding tree in the arrangements described, there is no need to decode secondary transform indices for coding units of the chroma coding tree.
[000241] The method 1800 continues from step 1864 to a transform luma residuals step 1870. At step 1870 the video decoder 134 transforms the decoded residual of each luma TB according to the primary and secondary transform type (1226 and 1237, respectively), or bypasses
24493980_1 transforming the residual for a luma TB, according to the corresponding luma transform skip flag (1232).
[000242] The method 1800 continues from step 1870 to a transform chroma residuals step 1880. At step 1880 the video decoder 134 either bypasses transforming of a chroma TB residual (such as 1264 or 1268) or performs a DCT-2 transform, according to a respective chroma transform skip flag (i.e. 1262 or 1266).
[000243] The method 1800 continues from step 1880 to a last CTU test step 1890. At the last CTU test step 1890 the processor 205 tests if the current CTU is the last CTU in the slice 1204. If the current CTU is not the last CTU in the slice 1204 ("NO" at step 1890), control in the processor 205 returns to the decode luma coding tree step 1850. Otherwise, if the current CTU is the last ("YES" at step 1890), control in the processor 205 progresses to a last slice test step 18100.
[000244] At the last slice test step 18100 the processor 205 tests if the current slice being decoded is the last slice in the frame. If the current slice is not the last slice ("NO" at step 18100), control in the processor 205 returns to the decode slice header step 1830. Otherwise, if the current slice is the last slice and all slices have been decoded ("YES" at step 18100) the method 1800 terminates.
[000245] The methods 1700 and 1800 operate to encode or decode CUs for a dual tree case in which a secondary transform is not used for the chroma (secondary) channels. A secondary transform index (1237) is encoded for the luma (primary) channel only. As a result, residual coefficients of CUs for the primary channel can be encoded using a primary transform, an optional secondary transform (the kernel indicated using the index 1237), or by bypass. In contrast, the chroma (secondary) channels are encoded based on a selection from a set consisting of a primary transform (such as a DCT-2 transform) or bypass. Either one of the DCT-2 transform and bypass is available for the secondary channels; a secondary transform is not available or encoded. Correspondingly, decoding of the CUs for the chroma channels is limited to application of a DCT-2 transform or bypass to residual coefficients of the CUs determined from the bitstream, and a secondary transform is not available or signalled in the bitstream.
[000246] Performance of the methods 1300, 1400, and 1700 in the video encoder 114 and the methods 1500, 1600, and 1800 in the video decoder 134 allows improved compression efficiency to be achieved without the need for secondary transform logic in the chroma channels.
When a 'random access' picture structure is used, the separate tree operation (for I slice) of the methods 1700 and 1800 typically applies only infrequently, for example to approximately one frame every second, whereas the shared tree operation of the methods 1300, 1400, 1500, and 1600 applies to the remaining frames (P or B slice). Consequently, absence of secondary transform logic from the infrequently coded I slices imposes negligible compression efficiency loss when such logic is already absent from the frequently coded P and B slices.
[000247] In each of the shared coding tree and separate coding tree cases, the secondary transform indices (i.e. 1142, 1237) are signalled immediately after the luma TB residual last position (i.e. 1140, 1236), permitting application of the secondary transform with reduced latency.
INDUSTRIAL APPLICABILITY
[000248] The arrangements described are applicable to the computer and data processing industries and particularly for the digital signal processing for the encoding and decoding of signals such as video and image signals, achieving high compression efficiency with reduced implementation complexity.
[000249] Some arrangements described herein reduce multiplier count and thus implementation cost by signalling and applying the secondary transform index only for coding units in the luma channel, independent of whether a shared tree or a separate tree is used. Complexity reduction is achieved in the video encoder due to the absence of a need to select a secondary transform index for chroma coding units in chroma coding trees when separate coding trees are in use. Application of secondary transforms to the luma channel only permits moving the secondary transform index earlier in the bitstream, allowing a latency reduction to be achieved, for example in video decoder implementations, as the decoded residual coefficients of the coding unit can be immediately processed according to the decoded secondary transform index instead of being buffered until the secondary transform index becomes known. The reduction in latency can in some arrangements allow the number of intra prediction modes tested for full RDO to be increased. Increasing the number of intra prediction modes can improve gain without losing the improvement on latency provided by moving the secondary transform index to earlier in the bitstream.
[000250] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
[000251] In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises" have correspondingly varied meanings.

Claims (15)

1. A method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising:
determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree;
decoding, for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and
decoding the first plurality of coding units by applying the corresponding selected kernel and a DCT-2 kernel to residual coefficients of each coding unit and decoding the second plurality of coding units by applying either one of a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.
2. The method according to claim 1, wherein the at least one secondary colour channel comprises two secondary colour channels, the DCT-2 transform is applied to one of the two secondary colour channels and the bypass is applied to the other of the two secondary colour channels.
3. The method according to claim 1, wherein the decoded index is located in the bitstream immediately after a last position of transform blocks of each of the coding units for the primary colour channel.
4. A method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising:
determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree;
decoding indices to select kernels for the first plurality of coding units only, wherein each index selects one of a plurality of kernels for a corresponding one of the first plurality of coding units; and
decoding the first plurality of coding units by applying the selected kernel and a DCT-2 transform to residual coefficients of the corresponding coding units, and decoding the second plurality of coding units by applying one of a DCT-2 transform or a bypass operation to residual coefficients of the coding units.
5. A method of decoding a coding tree unit of a bitstream of video data, each coding unit of the coding tree unit being for a primary colour channel and for at least one secondary colour channel, the method comprising:
determining the coding units according to decoded split flags of the primary colour channel and the at least one secondary colour channel;
decoding, for the coding units, an index to select a kernel for the primary colour channel, the decoded index being located immediately after a last position of transform blocks of the primary colour channel in the bitstream; and
decoding each coding unit by applying the corresponding selected kernel and a DCT-2 transform to residual coefficients of a transform block for the primary colour channel, and applying one of a DCT-2 transform or a bypass operation to residual coefficients of the transform blocks for the at least one secondary colour channel.
6. The method according to claim 5, wherein the index for one of the coding units is determined to be present if a position of a last residual coefficient of the transform block of the coding unit is equal to or less than a threshold last position.
7. The method according to claim 6, wherein the threshold last position is 7 if the transform block has a size of 4x4 or 8x8 residual coefficients.
8. The method according to claim 6, wherein the threshold last position is 15 if the transform block has a size other than 4x4 or 8x8 residual coefficients.
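Claims 6 to 8 describe a size-dependent threshold on the last significant coefficient position that gates the presence of the kernel-selection index. A minimal sketch of that test follows; the function names are hypothetical and only illustrate the stated conditions.

```cpp
#include <cassert>

// Hypothetical helpers mirroring claims 6-8: the kernel-selection index
// is only determined when the position of the last significant residual
// coefficient does not exceed a threshold that depends on block size.
int lastPositionThreshold(int width, int height)
{
    // Claim 7: threshold 7 for 4x4 or 8x8 blocks; claim 8: 15 otherwise.
    const bool smallBlock = (width == 4 && height == 4)
                         || (width == 8 && height == 8);
    return smallBlock ? 7 : 15;
}

bool kernelIndexPresent(int width, int height, int lastPosition)
{
    return lastPosition <= lastPositionThreshold(width, height);
}
```

Under this reading, a 4x4 transform block whose last coefficient sits at scan position 8 would carry no kernel-selection index, while any block with a last position of at most 7 always would.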
9. A method of encoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, into a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising:
determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel;
determining, for each of the first plurality of coding units only, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit;
encoding the first plurality of coding units into the bitstream by applying a DCT-2 transform followed by the corresponding selected kernel to residual coefficients for each coding unit; and
encoding the second plurality of coding units into the bitstream by applying one of a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.
10. The method according to claim 9, wherein the determined index is encoded into the bitstream immediately after a last position of transform blocks of each of the coding units for the primary colour channel.
11. The method according to claim 9, further comprising determining an intra prediction mode for each of the first plurality of coding units, the intra prediction mode selected from a set of prediction modes, the set being extended by an integer number of modes based on the VTM 8.0 Versatile Video Coding software reference model.
12. The method according to claim 9, further comprising determining an intra prediction mode for each of the first plurality of coding units, the intra prediction mode selected from a set of prediction modes of:
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {5, 5, 5, 5, 4, 4}, for transform blocks sized 4x4, 4x8, 4x16, 4x32, 4x64, 4x128;
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {5, 5, 5, 5, 5, 4}, for transform blocks sized 8x4, 8x8, 8x16, 8x32, 8x64, 8x128;
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {5, 5, 5, 5, 5, 4}, for transform blocks sized 16x4, 16x8, 16x16, 16x32, 16x64, 16x128;
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {5, 5, 5, 5, 5, 4}, for transform blocks sized 32x4, 32x8, 32x16, 32x32, 32x64, 32x128;
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {4, 5, 5, 5, 5, 4}, for transform blocks sized 64x4, 64x8, 64x16, 64x32, 64x64, 64x128; or
const uint8_t g_aucIntraModeNumFastUseMPM_2D[6][6] = {3, 3, 3, 3, 3, 4}, for transform blocks sized 128x4, 128x8, 128x16, 128x32, 128x64, 128x128.
13. A non-transitory computer readable medium having a computer program stored thereon to implement a method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising:
determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree;
decoding, for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and
decoding the first plurality of coding units by applying the corresponding selected kernel and a DCT-2 kernel to residual coefficients of each coding unit, and decoding the second plurality of coding units by selecting one of a set consisting of a DCT-2 transform and a bypass operation to apply to residual coefficients of the coding units for the at least one secondary colour channel.
14. A system, comprising:
a memory; and
a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of decoding a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel, the method comprising:
determining the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree;
decoding, for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and
decoding the first plurality of coding units by applying the corresponding selected kernel and a DCT-2 kernel to residual coefficients of each coding unit, and decoding the second plurality of coding units by applying either a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.
15. A video decoder, configured to:
receive a first plurality of coding units of a first coding tree of a coding tree unit, and a second plurality of coding units of the coding tree unit, from a video bitstream, the first plurality of coding units being for a primary colour channel and the second plurality of coding units being for at least one secondary colour channel,
determine the first plurality of coding units for the primary colour channel and the second plurality of coding units for the at least one secondary colour channel according to decoded split flags of the first coding tree and the second coding tree;
decode, for each of the first plurality of coding units, an index to select a kernel, wherein each index selects one of a plurality of kernels for the corresponding coding unit; and
decode the first plurality of coding units by applying the corresponding selected kernel and a DCT-2 kernel to residual coefficients of each coding unit, and decode the second plurality of coding units by applying either a DCT-2 transform or a bypass operation to residual coefficients of the coding units for the at least one secondary colour channel.
CANON KABUSHIKI KAISHA Patent Attorneys for the Applicant Spruson & Ferguson
AU2020201694A 2020-03-06 2020-03-06 Method, apparatus and system for encoding and decoding a coding unit tree Pending AU2020201694A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020201694A AU2020201694A1 (en) 2020-03-06 2020-03-06 Method, apparatus and system for encoding and decoding a coding unit tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020201694A AU2020201694A1 (en) 2020-03-06 2020-03-06 Method, apparatus and system for encoding and decoding a coding unit tree

Publications (1)

Publication Number Publication Date
AU2020201694A1 true AU2020201694A1 (en) 2021-09-23

Family

ID=77746037

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020201694A Pending AU2020201694A1 (en) 2020-03-06 2020-03-06 Method, apparatus and system for encoding and decoding a coding unit tree

Country Status (1)

Country Link
AU (1) AU2020201694A1 (en)

Similar Documents

Publication Publication Date Title
AU2019275553B2 (en) Method, apparatus and system for encoding and decoding a coding tree unit
EP3854091A1 (en) Method, apparatus and system for encoding and decoding a tree of blocks of video samples
AU2020201753B2 (en) Method, apparatus and system for encoding and decoding a block of video samples
US20220394311A1 (en) Method apparatus and system for encoding and decoding a coding tree unit
AU2022204353B2 (en) Method, apparatus and system for encoding and decoding a block of video samples
AU2022203416B2 (en) Method, apparatus and system for encoding and decoding a block of video samples
WO2020181317A1 (en) Method, apparatus and system for encoding and decoding a tree of blocks of video samples
WO2020181316A1 (en) Method, apparatus and system for encoding and decoding a tree of blocks of video samples
WO2021127723A1 (en) Method, apparatus and system for encoding and decoding a block of video samples
AU2022271385B2 (en) Method, apparatus and system for encoding and decoding a block of video samples
AU2020201694A1 (en) Method, apparatus and system for encoding and decoding a coding unit tree
JP2024046650A (en) Methods, apparatus, and systems for encoding and decoding blocks of video samples
AU2020202057A1 (en) Method, apparatus and system for encoding and decoding a block of video samples