CN117837146A - AI-based image encoding and decoding apparatus and method of performing the same - Google Patents

AI-based image encoding and decoding apparatus and method of performing the same

Info

Publication number
CN117837146A
Authority
CN
China
Prior art keywords
image
cross
feature data
luminance
channel prediction
Prior art date
Legal status
Pending
Application number
CN202280054487.4A
Other languages
Chinese (zh)
Inventor
趋可卡纳哈·迪娜
崔光杓
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority claimed from KR1020210188870A (KR20230022085A)
Application filed by Samsung Electronics Co Ltd
Priority claimed from PCT/KR2022/011070 (WO2023013966A1)
Publication of CN117837146A

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

According to an embodiment, an image decoding method is disclosed, the method comprising the steps of: acquiring cross-channel prediction information by applying the feature data for cross-channel prediction to a neural network-based cross-channel decoder, acquiring a predicted image of a chroma image by performing cross-channel prediction based on the reconstructed luma image and the cross-channel prediction information, acquiring a residual image of the chroma image by applying the feature data of the chroma image to a neural network-based chroma residual decoder, and reconstructing the chroma image based on the predicted image and the residual image.

Description

AI-based image encoding and decoding apparatus and method of performing the same
Technical Field
The present disclosure relates to image encoding and decoding. More particularly, the present disclosure relates to techniques for encoding and decoding feature data required for cross-channel prediction of an image by using Artificial Intelligence (AI) (e.g., neural networks), and techniques for encoding and decoding an image.
Background
Codecs such as H.264 Advanced Video Coding (AVC) and High Efficiency Video Coding (HEVC) may divide an image into blocks and predictively encode and decode each block through inter prediction or intra prediction.
Intra prediction is a method of compressing pictures by removing spatial redundancy in the pictures, and inter prediction is a method of compressing pictures by removing temporal redundancy between pictures.
As a representative example of intra prediction, there is a method of predicting a chrominance signal by using a cross-component linear model (CCLM). The CCLM-based prediction method for a chrominance signal improves coding prediction performance by removing redundancy existing between a luminance signal and a chrominance signal.
More specifically, a prediction method of a chrominance signal using a CCLM predicts the chrominance signal through a linear model that calculates a correlation between a sample of the chrominance signal and a sample of a luminance component reconstructed at the same position as the sample of the chrominance signal.
As a detailed example of a prediction method of a chrominance signal using the CCLM, there is a prediction method of a chrominance signal using a multidirectional linear model (MDLM). When deriving the coefficients of the linear model, the MDLM-based prediction method of a chrominance signal may support a mode using both top and left neighboring samples (LM_CCLM), only top neighboring samples (T_CCLM), or only left neighboring samples (L_CCLM).
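For illustration only, the linear mapping underlying CCLM-style prediction may be sketched as follows. This is a minimal sketch rather than the standardized derivation: the function and variable names are assumptions, and a least-squares fit stands in for the parameter derivation actually specified for CCLM; the mode (LM_CCLM, T_CCLM, or L_CCLM) only changes which neighboring samples are supplied.

```python
import numpy as np

def fit_cclm_params(neighbor_luma, neighbor_chroma):
    """Fit chroma ~ alpha * luma + beta over reconstructed neighboring samples.

    Least squares is used here purely for illustration; the chosen mode
    (LM_CCLM, T_CCLM, or L_CCLM) determines which neighbors are passed in.
    """
    alpha, beta = np.polyfit(neighbor_luma, neighbor_chroma, 1)
    return alpha, beta

def predict_chroma_block(reconstructed_luma_block, alpha, beta):
    """Predict the co-located chroma block from the reconstructed luma block."""
    return alpha * reconstructed_luma_block + beta

# Toy neighbors that roughly follow chroma = 0.5 * luma + 10
alpha, beta = fit_cclm_params(np.array([100.0, 120.0, 140.0, 160.0]),
                              np.array([60.0, 70.0, 80.0, 90.0]))
pred = predict_chroma_block(np.array([[110.0, 130.0], [150.0, 170.0]]), alpha, beta)
```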
Disclosure of Invention
Technical problem
An image encoding and decoding apparatus and method according to embodiments of the present disclosure enable signaling an image at a low bit rate through cross-channel prediction.
In addition, the image encoding and decoding apparatus and method according to the embodiments of the present disclosure accurately predict and reconstruct a chrominance image through cross-channel prediction.
Further, the image encoding and decoding apparatus and method according to the embodiments of the present disclosure accurately reconstruct an image from a bitstream having a low bit rate.
Technical solution
A method of decoding an image based on cross-channel prediction using Artificial Intelligence (AI) according to an embodiment of the present disclosure includes: obtaining feature data for cross-channel prediction from a bitstream, obtaining feature data of a luminance image in a current image and feature data of a chrominance image in the current image from the bitstream, reconstructing the luminance image by applying the feature data of the luminance image to a neural network-based luminance decoder, obtaining cross-channel prediction information by applying the feature data for cross-channel prediction to the neural network-based cross-channel decoder, obtaining a predicted image of the chrominance image by performing cross-channel prediction based on the reconstructed luminance image and the cross-channel prediction information, obtaining a residual image of the chrominance image by applying the feature data of the chrominance image to a neural network-based chrominance residual decoder, and reconstructing the chrominance image based on the predicted image and the residual image.
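For illustration only, the decoding flow described above may be sketched as follows. The callables luma_decoder, cross_channel_decoder, and chroma_residual_decoder are hypothetical placeholders for the trained neural networks (identity-style stubs so the sketch runs); they are not the disclosed architectures, and the tensor shapes are assumptions.

```python
import numpy as np

# Hypothetical stand-ins for the neural network-based decoders.
def luma_decoder(feat_luma):
    return feat_luma                      # feature data of luma image -> reconstructed luma image

def cross_channel_decoder(feat_cross):
    return feat_cross[0], feat_cross[1]   # feature data for cross-channel prediction -> (scale, bias) maps

def chroma_residual_decoder(feat_chroma):
    return feat_chroma                    # feature data of chroma image -> chroma residual image

def decode(feat_luma, feat_cross, feat_chroma):
    rec_luma = luma_decoder(feat_luma)                  # reconstruct the luminance image
    scale, bias = cross_channel_decoder(feat_cross)     # cross-channel prediction information
    pred_chroma = scale * rec_luma + bias               # cross-channel prediction
    res_chroma = chroma_residual_decoder(feat_chroma)   # residual image of the chrominance image
    return rec_luma, pred_chroma + res_chroma           # reconstructed luma and chroma

h, w = 4, 4
rec_y, rec_c = decode(np.full((h, w), 100.0),
                      np.stack([np.full((h, w), 0.5), np.full((h, w), 10.0)]),
                      np.zeros((h, w)))
```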
At least one of the feature data for cross-channel prediction, the feature data of the luminance image, or the feature data of the chrominance image may be obtained by entropy decoding and dequantizing the bitstream.
The neural network-based cross-channel decoder may be trained based on first loss information corresponding to a difference between a current training chrominance image and a current reconstructed training chrominance image corresponding to the current training chrominance image, and second loss information corresponding to entropy of the feature data for cross-channel prediction of the current training chrominance image.
The method may further comprise: when the chroma subsampling format of the current image is not YUV (YCbCr) 4:4:4, downsampling the reconstructed luma image, wherein obtaining the predicted image of the chroma image comprises obtaining the predicted image of the chroma image by performing cross-channel prediction based on the downsampled luma image and the cross-channel prediction information.
The method may further comprise: when the chroma subsampling format of the current image is not YCbCr 4:4:4, generating multi-channel luminance image data by performing a spatial-to-depth transform on the reconstructed luminance image, wherein obtaining a predicted image of the chrominance image comprises obtaining a predicted image of the chrominance image by performing cross-channel prediction based on the multi-channel luminance image and the cross-channel prediction information.
The luminance image may include an image of a Y component and the chrominance image may include at least one of an image of a Cb component or an image of a Cr component.
Obtaining cross-channel prediction information by applying the feature data for cross-channel prediction to a neural network-based cross-channel decoder may include: the cross-channel prediction information is obtained by applying the feature data for cross-channel prediction and the feature data of the luminance image to a neural network-based cross-channel decoder.
Obtaining a residual image of the chroma image by applying the feature data of the chroma image to a neural network-based chroma residual decoder may include: a residual image of the chroma image is obtained by further applying at least one of the feature data of the luma image or the feature data for cross-channel prediction to a neural network based chroma residual decoder.
The cross-channel prediction information may include information about scalar parameters and information about bias parameters.
A computer-readable recording medium according to an embodiment of the present disclosure has recorded thereon a program for executing a method of decoding an image on a computer.
An apparatus for decoding an image based on cross-channel prediction using Artificial Intelligence (AI) according to an embodiment of the present disclosure includes: an acquirer configured to acquire feature data for cross-channel prediction from a bitstream, and to acquire feature data of a luminance image in a current image and feature data of a chrominance image in the current image from the bitstream; and an image decoder configured to reconstruct a luminance image by applying the feature data of the luminance image to the neural network-based luminance decoder, obtain cross-channel prediction information by applying the feature data for cross-channel prediction to the neural network-based cross-channel decoder, and obtain a predicted image of a chrominance image by performing cross-channel prediction based on the reconstructed luminance image and the cross-channel prediction information, obtain a residual image of the chrominance image by applying the feature data of the chrominance image to the neural network-based chrominance residual decoder, and reconstruct a chrominance image based on the predicted image of the chrominance image and the residual image of the chrominance image.
A method of encoding an image based on cross-channel prediction using Artificial Intelligence (AI) according to an embodiment of the present disclosure includes: obtaining feature data of a luminance image in a current image by applying an original luminance image in the current original image to a neural network-based luminance encoder, reconstructing the luminance image by applying the feature data of the luminance image to a neural network-based luminance decoder, obtaining feature data for cross-channel prediction by applying the reconstructed luminance image and the original chrominance image in the current original image to a neural network-based cross-channel encoder, obtaining cross-channel prediction information by applying the obtained feature data for cross-channel prediction to the neural network-based cross-channel decoder, obtaining a predicted image of the chrominance image by performing cross-channel prediction based on the reconstructed luminance image and the cross-channel prediction information, obtaining feature data of the chrominance image by applying a residual image of the chrominance image obtained based on the original chrominance image and the predicted image to the neural network-based chrominance residual encoder, and generating a bitstream including the feature data of the luminance image, the feature data of the chrominance image, and the feature data for cross-channel prediction.
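A corresponding encoder-side sketch, under the same assumptions as the decoding sketch above (hypothetical placeholder callables, not the disclosed networks), illustrates the point of the method: the encoder reconstructs the luminance image itself, derives the feature data for cross-channel prediction from that reconstruction and the original chrominance image, and only the chrominance residual remaining after cross-channel prediction is encoded.

```python
import numpy as np

# Hypothetical stand-ins for the neural network-based encoders and decoders.
def luma_encoder(orig_luma):
    return orig_luma

def luma_decoder(feat_luma):
    return feat_luma

def cross_channel_encoder(rec_luma, orig_chroma):
    # Placeholder: constant per-sample scale and bias maps.
    return np.stack([np.full_like(rec_luma, 0.5), np.full_like(rec_luma, 10.0)])

def cross_channel_decoder(feat_cross):
    return feat_cross[0], feat_cross[1]

def chroma_residual_encoder(res_chroma):
    return res_chroma

def encode(orig_luma, orig_chroma):
    feat_luma = luma_encoder(orig_luma)                        # feature data of the luminance image
    rec_luma = luma_decoder(feat_luma)                         # encoder-side luma reconstruction
    feat_cross = cross_channel_encoder(rec_luma, orig_chroma)  # feature data for cross-channel prediction
    scale, bias = cross_channel_decoder(feat_cross)            # cross-channel prediction information
    pred_chroma = scale * rec_luma + bias                      # cross-channel prediction
    feat_chroma = chroma_residual_encoder(orig_chroma - pred_chroma)
    return feat_luma, feat_chroma, feat_cross                  # contents of the bitstream

feats = encode(np.full((4, 4), 100.0), np.full((4, 4), 60.0))
```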
At least one of the feature data for cross-channel prediction, the feature data of the luminance image, or the feature data of the chrominance image may be quantized and entropy encoded.
The neural network-based cross-channel encoder may be trained based on first loss information corresponding to a difference between a current training chrominance image and a current reconstructed training chrominance image corresponding to the current training chrominance image, and second loss information corresponding to entropy of the feature data for cross-channel prediction of the current training chrominance image.
The method may further comprise: when the chroma subsampling format of the current image is not YCbCr 4:4:4, performing downsampling on the reconstructed luma image, wherein obtaining the predicted image of the chroma image comprises obtaining the predicted image of the chroma image by performing cross-channel prediction based on the downsampled luma image and the cross-channel prediction information.
The method may further comprise: when the chroma subsampling format of the current image is not YCbCr 4:4:4, generating multi-channel luminance image data by performing a spatial-to-depth transform on the reconstructed luminance image, wherein obtaining a predicted image of the chrominance image comprises obtaining a predicted image of the chrominance image by performing cross-channel prediction based on the multi-channel luminance image and the cross-channel prediction information.
Obtaining cross-channel prediction information by applying the feature data for cross-channel prediction to a neural network-based cross-channel decoder may include: the cross-channel prediction information is obtained by applying the feature data for cross-channel prediction and the feature data of the luminance image to a neural network-based cross-channel decoder.
The method may further comprise: obtaining a residual image of the chroma image by applying the characteristic data of the chroma image to a neural network-based chroma residual decoder, wherein obtaining the residual image of the chroma image by applying the characteristic data of the chroma image to the neural network-based chroma residual decoder comprises obtaining the residual image of the chroma image by further applying at least one of the characteristic data of the luma image or the characteristic data for cross-channel prediction to the neural network-based chroma residual decoder.
An apparatus for encoding an image based on cross-channel prediction using Artificial Intelligence (AI) according to an embodiment of the present disclosure includes: an encoder configured to obtain feature data of a luminance image in a current image by applying an original luminance image in the current original image to a neural network-based luminance encoder, reconstruct the luminance image by applying the feature data of the luminance image to a neural network-based luminance decoder, obtain feature data for cross-channel prediction by applying the reconstructed luminance image and an original chrominance image in the current original image to a neural network-based cross-channel encoder, obtain cross-channel prediction information by applying the obtained feature data for cross-channel prediction to a neural network-based cross-channel decoder, obtain a predicted image of a chrominance image by performing cross-channel prediction based on the reconstructed luminance image and the cross-channel prediction information, and obtain feature data of a chrominance image by applying a residual image of a chrominance image obtained based on the original chrominance image and the predicted image of the chrominance image to a neural network-based chrominance residual encoder; and a bitstream generator configured to generate a bitstream including the feature data of the luminance image, the feature data of the chrominance image, and the feature data for cross-channel prediction.
A method of reconstructing an image based on cross-channel prediction using Artificial Intelligence (AI) according to an embodiment of the present disclosure includes: obtaining feature data for cross-channel prediction from a bitstream, obtaining feature data of a luminance residual image in a current image and feature data of a chrominance residual image in the current image from the bitstream, reconstructing the luminance residual image by applying the feature data of the luminance residual image to a neural network-based luminance residual decoder, obtaining cross-channel prediction information by applying the feature data for cross-channel prediction to a neural network-based cross-channel decoder, obtaining a predicted image of the chrominance residual image by performing cross-channel prediction based on the reconstructed luminance residual image and the cross-channel prediction information, obtaining a residual image of the chrominance residual image by applying the feature data of the chrominance residual image to the neural network-based chrominance residual decoder, and reconstructing the chrominance residual image based on the predicted image of the chrominance residual image and the residual image of the chrominance residual image.
A method of encoding an image based on cross-channel prediction using Artificial Intelligence (AI) according to an embodiment of the present disclosure includes: obtaining feature data of a luminance residual image by applying the luminance residual image in the current image to a neural network-based luminance residual encoder, and reconstructing the luminance residual image by applying the feature data of the luminance residual image to a neural network-based luminance residual decoder; obtaining feature data for cross-channel prediction by applying the reconstructed luminance residual image and the chrominance residual image in the current image to a neural network-based cross-channel encoder; obtaining cross-channel prediction information by applying the obtained feature data for cross-channel prediction to a neural network-based cross-channel decoder, obtaining a predicted image of the chrominance residual image by performing cross-channel prediction based on the reconstructed luminance residual image and the cross-channel prediction information, obtaining feature data of the chrominance residual image by applying, to a neural network-based chrominance residual encoder, a residual image of the chrominance residual image obtained based on the chrominance residual image and the predicted image of the chrominance residual image, and generating a bitstream including the feature data of the luminance residual image, the feature data of the chrominance residual image, and the feature data for cross-channel prediction.
Advantageous effects
An image encoding and decoding apparatus and method according to an embodiment of the present disclosure may enable signaling an image at a low bit rate through cross-channel prediction.
Further, the image encoding and decoding apparatus and method according to the embodiments of the present disclosure may accurately reconstruct an image from a bitstream having a low bit rate.
Drawings
Fig. 1 is a diagram illustrating an Artificial Intelligence (AI)-based inter-prediction process for an image, presented as a schematic analog of the AI-based cross-channel prediction process for an image.
Fig. 2a is a block diagram of an image decoding apparatus according to an embodiment of the present disclosure.
Fig. 2b is a block diagram of an image decoding apparatus according to an embodiment of the present disclosure.
Fig. 3 is a diagram of the acquirer shown in fig. 2 a.
Fig. 4a is a diagram of the image decoder shown in fig. 2 a.
Fig. 4b is a diagram of the image decoder shown in fig. 2 b.
Fig. 5a is a flowchart of an image decoding method according to an embodiment of the present disclosure.
Fig. 5b is a flowchart of an image decoding method according to another embodiment of the present disclosure.
Fig. 6a is a block diagram of an image encoding apparatus according to an embodiment of the present disclosure.
Fig. 6b is a block diagram of an image encoding apparatus according to an embodiment of the present disclosure.
Fig. 7 is a block diagram of the image encoder shown in fig. 6 a.
Fig. 8 is a block diagram of the bit stream generator shown in fig. 6.
Fig. 9a is a flowchart of an image encoding method according to an embodiment of the present disclosure.
Fig. 9b is a flowchart of an image encoding method according to an embodiment of the present disclosure.
Fig. 10a is a diagram for describing cross-channel prediction according to an embodiment of the present disclosure.
Fig. 10b is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus for cross-channel prediction according to an embodiment of the present disclosure.
Fig. 10c is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus including a pair of an image encoding apparatus and an image decoding apparatus for cross-channel prediction according to an embodiment of the present disclosure.
Fig. 11a is a diagram for describing cross-channel prediction according to an embodiment of the present disclosure.
Fig. 11b is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus for cross-channel prediction according to an embodiment of the present disclosure.
Fig. 12a is a diagram for describing cross-channel prediction according to an embodiment of the present disclosure.
Fig. 12b is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus for cross-channel prediction according to an embodiment of the present disclosure.
Fig. 13 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Fig. 14 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Fig. 15 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Fig. 16 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Fig. 17 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Fig. 18 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Fig. 19 is a diagram showing an example of an architecture of a neural network according to an embodiment of the present disclosure.
Fig. 20 is a diagram for describing convolution operations in a first convolution layer according to an embodiment of the present disclosure.
Fig. 21 is a diagram for describing a method of training a first decoder, a first encoder, a second decoder, a second encoder, a third decoder, and a third encoder.
Fig. 22 is a diagram for describing a process of training a neural network used in the first decoder, the second decoder, the third decoder, the first encoder, the second encoder, and the third encoder performed by the training apparatus.
Detailed Description
Throughout the disclosure, the expression "at least one of a, b or c" indicates only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or variants thereof.
While the embodiments of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit embodiments of the disclosure to the particular forms disclosed, but on the contrary, embodiments of the disclosure are to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
In the following description of the present disclosure, a detailed description of known functions and configurations incorporated herein will be omitted when it may obscure the subject matter of the present disclosure. Moreover, the numerals used in describing the present specification (e.g., first, second, etc.) are merely identification symbols that distinguish one component from another.
Furthermore, when an element is referred to herein as being "connected" or "coupled" to another element, it can be directly connected or directly coupled to the other element, but it should be understood that the element can also be connected or coupled to the other element via the other element therebetween, unless otherwise indicated.
In the present disclosure, two or more elements expressed as "units", "modules", or the like may be combined into one element, or one element may be divided into two or more elements for subdividing functions. Each element described herein may perform not only its primary function, but additionally some or all of the functions of other elements, and some of the primary functions of each element may be performed exclusively by another element.
As used herein, the term "image" may refer to a still image (or frame), a moving image or video that includes a plurality of consecutive still images.
The "neural network" in this specification is a representative example of an artificial neural network model that simulates a brain nerve, and is not limited to an artificial neural network model using a specific algorithm. Neural networks may also be referred to as deep neural networks.
In this specification, a "parameter" is a value used for calculation at each layer included in the neural network, and can be used, for example, to apply an input value to a specific operation formula. The parameters are values set as a result of training and may be updated based on separate training data if necessary.
In the present specification, "characteristic data" refers to data obtained by processing input data by a neural network-based encoder. The feature data may be 1-dimensional or 2-dimensional data including a plurality of samples. The feature data may be referred to as a potential vector or potential representation. The characteristic data represents potential characteristics of the data output by the decoder described below.
"feature data for cross-channel prediction" may indicate scalar parameter values and bias parameter values for cross-channel prediction. Here, the scalar parameter value may be a value obtained by performing a multiplication operation on each element and a sample value of the reconstructed luminance image, and the offset parameter value may be a value obtained by performing an addition operation on each element on a result obtained by performing a multiplication operation on each element on the scalar parameter and a sample value of the reconstructed luminance image. The "feature data for cross-channel prediction" may be referred to as a "cross-channel stream".
The "cross-channel flow" may be a concept corresponding to the optical flow of fig. 1, with a motion vector for each pixel.
The difference in position (motion vector) between a block or sample in the current image and a reference block or reference sample in the previously reconstructed image can be used to encode and decode the current image. These positional differences may be referred to as optical flow. Optical flow may be defined as a set of motion vectors corresponding to points or blocks in an image.
Similar to an optical flow, a cross-channel flow may be defined as a set of parameters of a linear model for transforming samples of a luminance component into samples of a chrominance component. In this case, the parameters of the linear model may include scalar parameters and bias parameters. Based on the cross-channel flow, chroma channels (samples of chrominance components) can be predicted from luma channels (samples of luminance components).
In this specification, a "sample point" corresponds to data assigned to a sampling position in image or feature data, and refers to data to be processed. For example, the samples may comprise samples in a 2-dimensional image.
Fig. 1 shows an Artificial Intelligence (AI)-based inter-prediction process for an image, presented as a schematic analog of the AI-based cross-channel prediction process for an image.
Fig. 1 shows processes of encoding and decoding a current image x_i, in which the first encoder 110, the second encoder 130, the first decoder 150, and the second decoder 170 are used for inter prediction. The first encoder 110, the second encoder 130, the first decoder 150, and the second decoder 170 are implemented as neural networks.
Inter prediction is a process of encoding and decoding the current image x_i by using temporal redundancy between the current image x_i and a previously reconstructed image y_{i-1}.
Position differences (motion vectors) between blocks or samples in the current image x_i and reference blocks or reference samples in the previously reconstructed image y_{i-1} may be used to encode and decode the current image x_i. These position differences may be referred to as an optical flow. The optical flow may be defined as a set of motion vectors corresponding to samples or blocks in an image.
The optical flow represents how the positions of samples in the previously reconstructed image y_{i-1} have changed in the current image x_i, or where the samples of the current image x_i are located in the previously reconstructed image y_{i-1}. For example, when a sample located at (1, 1) in the current image x_i is located at (2, 1) in the previously reconstructed image y_{i-1}, the optical flow or motion vector of that sample may be calculated as (1 (=2-1), 0 (=1-1)).
In the image encoding and decoding processes using AI, the first encoder 110 and the first decoder 150 are used to obtain a current optical flow g_i of the current image x_i.
Specifically, the previously reconstructed image y_{i-1} and the current image x_i are input to the first encoder 110. The first encoder 110 processes the current image x_i and the previously reconstructed image y_{i-1} based on parameters set as a result of training, and outputs feature data w_i of the current optical flow.
The feature data w_i of the current optical flow may represent latent features of the current optical flow.
The feature data w_i of the current optical flow is input to the first decoder 150. The first decoder 150 processes the input feature data w_i based on parameters set as a result of training, and outputs the current optical flow g_i.
The previously reconstructed image y_{i-1} is warped (see reference numeral 190) based on the current optical flow g_i, and a current predicted image x'_i is obtained as a result of the warping 190. The warping 190 is a geometric transformation that changes the positions of samples in an image. A current predicted image x'_i similar to the current image x_i is obtained by warping 190 the previously reconstructed image y_{i-1} based on the optical flow g_i, which represents the relative positions between samples in the previously reconstructed image y_{i-1} and samples in the current image x_i. For example, when the sample located at (1, 1) in the previously reconstructed image y_{i-1} is most similar to the sample located at (2, 1) in the current image x_i, the position of the sample at (1, 1) in the previously reconstructed image y_{i-1} may be changed to (2, 1) through the warping 190.
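For illustration only, the warping 190 may be sketched with integer motion vectors, following the sign convention of the example above (the vector at a current-image position points to the corresponding position in the previously reconstructed image); the (dy, dx) ordering and nearest-sample fetching are assumptions, and actual codecs also interpolate sub-pixel positions.

```python
import numpy as np

def warp(prev_image, flow):
    """Backward-warp prev_image: each current-image sample is fetched from the
    previously reconstructed image at (position + motion vector).

    prev_image: (H, W) array; flow: (H, W, 2) array of integer (dy, dx) vectors.
    """
    h, w = prev_image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys + flow[..., 0], 0, h - 1)   # clamp to the image border
    src_x = np.clip(xs + flow[..., 1], 0, w - 1)
    return prev_image[src_y, src_x]

prev = np.arange(16, dtype=float).reshape(4, 4)
flow = np.zeros((4, 4, 2), dtype=int)
flow[1, 1] = (0, 1)            # the sample at (1, 1) is fetched from (1, 2) in prev
pred = warp(prev, flow)
```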
Because the current predicted image x'_i generated by using the previously reconstructed image y_{i-1} is not the current image x_i itself, current differential image data r_i corresponding to the difference between the current predicted image x'_i and the current image x_i may be obtained.
For example, the current differential image data r_i may be obtained by subtracting the sample values of the current predicted image x'_i from the sample values of the current image x_i.
The current differential image data r_i is input to the second encoder 130. The second encoder 130 processes the current differential image data r_i based on parameters set as a result of training, and outputs feature data v_i of the current differential image data.
The feature data v_i of the current differential image data is input to the second decoder 170. The second decoder 170 processes the input feature data v_i based on parameters set as a result of training, and outputs current differential image data r'_i.
A current reconstructed image y_i is obtained by combining the current differential image data r'_i with the current predicted image x'_i generated by warping 190 the previously reconstructed image y_{i-1}.
In the inter-prediction process shown in fig. 1, the feature data w_i of the current optical flow obtained through the first encoder 110 is input to the first decoder 150.
From the perspective of the encoding apparatus, in the processes of encoding and decoding the current image x_i, the encoding apparatus has to generate a bitstream corresponding to the feature data w_i of the current optical flow in order to signal the feature data w_i of the current optical flow to the decoding apparatus. However, when the movement of an object included in the current image x_i and a previous image x_{i-1} is large, the magnitudes of the sample values included in the current optical flow are large, and thus the bit rate of the bitstream generated based on the feature data w_i, which represents latent features of the current optical flow, may also be high.
The embodiments of the present disclosure described below are embodiments of the present disclosure related to a method of performing cross-channel prediction by using a cross-channel flow corresponding to an optical flow of inter-prediction. The cross-channel flow and cross-channel prediction based on the cross-channel flow will be described in detail with reference to fig. 10a and 10b, etc.
Fig. 2a is a block diagram of an image decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 2a, the image decoding apparatus 200 according to the embodiment of the present disclosure includes an acquirer 210 and an image decoder 230.
The acquirer 210 and the image decoder 230 may be implemented as processors and operate based on instructions stored in a memory (not shown).
Although the acquirer 210 and the image decoder 230 are separately shown in fig. 2a, the acquirer 210 and the image decoder 230 may be implemented as one processor. In this case, the acquirer 210 and the image decoder 230 may be implemented as dedicated processors, or a combination of software and a general-purpose processor, such as an Application Processor (AP), a Central Processing Unit (CPU), or a Graphics Processing Unit (GPU). The special purpose processor may include a memory to implement embodiments of the present disclosure, or a memory processor to use an external memory.
The acquirer 210 and the image decoder 230 may be implemented as a plurality of processors. In this case, the acquirer 210 and the image decoder 230 may be implemented as a combination of dedicated processors, or a combination of software and a general-purpose processor (e.g., an AP, CPU, or GPU).
The acquirer 210 acquires a bit stream including a result of encoding a current image.
The acquirer 210 may receive a bitstream from the image encoding apparatus 600 described below through a network. In embodiments of the present disclosure, the acquirer 210 may acquire a bit stream from a data storage medium including a magnetic medium (e.g., a hard disk, a floppy disk, or a magnetic tape), an optical medium (e.g., a compact disk read only memory (CD-ROM) or a Digital Versatile Disk (DVD)), or a magneto-optical medium (e.g., a magneto-optical diskette).
The acquirer 210 may parse the bit stream to acquire feature data of the luminance image.
The acquirer 210 may parse the bitstream to acquire feature data of the chroma image. The characteristic data of the chrominance image may comprise characteristic data of a residual image of the chrominance image. The prediction image of the chroma image may be generated through cross-channel prediction such that the residual image of the chroma image may be obtained by decoding feature data of the residual image of the chroma image, which represents a difference between an original image of the chroma image and the prediction image of the chroma image through cross-channel prediction. A reconstructed image of the chroma image may be generated based on the residual image of the chroma image and the predicted image of the chroma image.
The acquirer 210 may parse the bitstream to obtain feature data for cross-channel prediction. The feature data for cross-channel prediction may include feature data of a cross-channel stream. The cross-channel flow may include scalar parameters and bias parameters. Scalar parameters and bias parameters may refer to parameters of a linear model for cross-channel prediction.
As a result of the processing performed by the neural network-based encoder, feature data of the luminance image, feature data of the chrominance image, and feature data for cross-channel prediction can be obtained.
In an embodiment of the present disclosure, the acquirer 210 may acquire a first bit stream corresponding to feature data of a luminance image, a second bit stream corresponding to feature data of a chrominance image, and a third bit stream corresponding to feature data for cross-channel prediction. The acquirer 210 may acquire feature data of a luminance image, feature data of a chrominance image, and feature data for cross-channel prediction by parsing the first, second, and third bitstreams, respectively.
The feature data of the luminance image, the feature data of the chrominance image, and the feature data for the cross-channel prediction are transmitted to the image decoder 230, and the image decoder 230 may obtain a reconstructed image of the current image by using the feature data of the luminance image, the feature data of the chrominance image, and the feature data for the cross-channel prediction.
The image decoder 230 may obtain a reconstructed image of the current luminance image by using the feature data of the luminance image.
The image decoder 230 may obtain a predicted image of the current chroma image by using the reconstructed image of the current luma image and the feature data for cross-channel prediction.
The image decoder 230 may obtain a residual image of the current chroma image by using feature data of the chroma image. The image decoder 230 may obtain a reconstructed image of the current chroma image by using the predicted image of the current chroma image and the residual image of the current chroma image.
The image decoder 230 may obtain a reconstructed image of the current image by using the reconstructed image of the current luminance image and the reconstructed image of the current chrominance image.
According to an implementation, the feature data of the residual image data of the current chroma image may not be included in the bitstream. The acquirer 210 may acquire feature data for cross-channel prediction from the bitstream, and the image decoder 230 may reconstruct the cross-channel stream. In this case, the image decoding apparatus 200 may be referred to as a cross-channel stream decoding apparatus.
The cross-channel stream reconstructed by the image decoder 230 may be transmitted to another device, where a reconstructed image of the current image may be generated by the other device based on the cross-channel stream.
More specifically, the other device may generate the reconstructed image of the current chroma image by combining residual image data of the chroma image obtained from the bitstream with a predicted image of the chroma image generated from the reconstructed image of the current luma image according to the cross-channel stream.
Fig. 2b is a block diagram of an image decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 2b, the image decoding apparatus 250 according to an embodiment of the present disclosure includes an acquirer 260 and a residual image decoder 270.
The acquirer 260 and the residual image decoder 270 may be implemented as a processor and operate based on instructions stored in a memory (not shown).
Although the acquirer 260 and the residual image decoder 270 are separately shown in fig. 2b, the acquirer 260 and the residual image decoder 270 may be implemented as one processor. In this case, the acquirer 260 and the residual image decoder 270 may be implemented as a dedicated processor, or a combination of software and a general-purpose processor (such as an AP, a CPU, or a GPU). The special purpose processor may include a memory to implement embodiments of the present disclosure, or a memory processor to use an external memory.
The acquirer 260 and the residual image decoder 270 may be implemented as a plurality of processors. In this case, the acquirer 260 and the residual image decoder 270 may be implemented as a combination of dedicated processors, or a combination of software and a general-purpose processor (e.g., an AP, a CPU, or a GPU).
The acquirer 260 acquires a bit stream including a result of encoding the current image.
The acquirer 260 may receive the bit stream from the image encoding apparatus 650 described below through a network. In an embodiment of the present disclosure, the acquirer 260 may acquire a bit stream from a data storage medium including a magnetic medium (e.g., a hard disk, a floppy disk, or a magnetic tape), an optical medium (e.g., a CD-ROM or DVD), or a magneto-optical medium (e.g., a magneto-optical diskette).
The acquirer 260 may parse the bitstream to acquire feature data of the luminance residual image.
The acquirer 260 may parse the bitstream to obtain feature data of the chroma residual image. The characteristic data of the chroma residual image may include characteristic data of a residual image of the chroma residual image.
The predicted image of the chroma residual image may be generated by cross-channel prediction such that the residual image of the chroma residual image may be obtained by decoding feature data of the residual image of the chroma residual image, which represents a difference between an original image of the chroma residual image and the predicted image of the chroma residual image predicted by cross-channel. A reconstructed image of the chroma residual image may be generated based on the residual image of the chroma residual image and the predicted image of the chroma residual image.
The acquirer 260 may parse the bitstream to acquire feature data for cross-channel prediction. The feature data for cross-channel prediction may include feature data of a cross-channel stream. The cross-channel flow may include scalar parameters and bias parameters.
As a result of the processing performed by the neural network-based encoder, feature data of the luminance residual image, feature data of the chrominance residual image, and feature data for cross-channel prediction may be obtained.
In an embodiment of the present disclosure, the acquirer 260 may acquire a first bit stream corresponding to the feature data of the luminance residual image, a second bit stream corresponding to the feature data of the chrominance residual image, and a third bit stream corresponding to the feature data for cross-channel prediction. The acquirer 260 may acquire feature data of a luminance residual image, feature data of a chrominance residual image, and feature data for cross-channel prediction by parsing the first, second, and third bitstreams, respectively.
The feature data of the luminance residual image, the feature data of the chrominance residual image, and the feature data for the cross-channel prediction are transmitted to the residual image decoder 270, and the residual image decoder 270 may obtain a reconstructed image of the current residual image by using the feature data of the luminance residual image, the feature data of the chrominance residual image, and the feature data for the cross-channel prediction.
The residual image decoder 270 may obtain a reconstructed image of the current luminance residual image by using the feature data of the luminance residual image.
The residual image decoder 270 may obtain a predicted image of the current chroma residual image by using the reconstructed image of the current luma residual image and the feature data for cross-channel prediction. The residual image decoder 270 may obtain a residual image of the current chroma residual image by using the feature data of the chroma residual image. The residual image decoder 270 may obtain a reconstructed image of the current chroma residual image by using the predicted image of the current chroma residual image and the residual image of the current chroma residual image.
The residual image decoder 270 may obtain a reconstructed residual image of the current image by using the reconstructed image of the current luminance residual image and the reconstructed image of the current chrominance residual image.
According to an implementation, the feature data of the residual image data of the current chroma residual image may not be included in the bitstream. The acquirer 260 may acquire feature data for cross-channel prediction from the bitstream, and the residual image decoder 270 may reconstruct the cross-channel stream. In this case, the image decoding apparatus 250 may be referred to as a cross-channel stream decoding apparatus.
The cross-channel stream reconstructed by the residual image decoder 270 may be transmitted to another device, where a reconstructed image of the current residual image may be generated by the other device based on the cross-channel stream.
More specifically, the other device may generate the reconstructed image of the current chroma residual image by combining residual image data of the chroma residual image obtained from the bitstream with a predicted image of the chroma residual image generated from the reconstructed image of the current luma residual image according to the cross-channel stream.
The image decoding apparatus 250 may generate a reconstructed image of the current chroma image based on the reconstructed image of the current chroma residual image. The image decoding apparatus 250 may obtain a predicted image of the current chroma image. The predicted image of the current chroma image may be obtained as described above with reference to the image decoding apparatus 200, but is not limited thereto. For example, when the frame type of the current chroma image is not an I frame, the image decoding apparatus 250 may obtain the predicted image of the current chroma image from a reconstructed image of a previous chroma image by using the optical flow described with reference to fig. 1.
Hereinafter, the operations of the acquirer 210, the image decoder 230, and the residual image decoder 270 will be described in detail with reference to figs. 3, 4a, and 4b.
Fig. 3 is a diagram of the acquirer 210 shown in fig. 2 a.
Referring to fig. 3, the acquirer 210 may include an entropy decoder 211 and an inverse quantizer 213.
The entropy decoder 211 obtains quantized feature data of the luminance image, quantized feature data for cross-channel prediction, and quantized feature data of the chrominance image by entropy-decoding binary bits included in the bitstream.
The inverse quantizer 213 obtains feature data of a luminance image, feature data for cross-channel prediction, and feature data of a chrominance image by inverse-quantizing the quantized feature data of the luminance image, the quantized feature data for cross-channel prediction, and the quantized feature data of the chrominance image, respectively.
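For illustration only, the inverse quantization step may be sketched as a uniform dequantization with a single step size; the disclosure does not fix a particular quantization scheme, and entropy decoding is omitted here.

```python
import numpy as np

def dequantize(quantized_feature_data, step_size):
    """Map integer quantization indices back to approximate feature values."""
    return quantized_feature_data.astype(np.float32) * step_size

# Quantized feature data of a hypothetical 2x2 feature map, step size 0.5.
feat = dequantize(np.array([[3, -1], [0, 2]]), step_size=0.5)   # [[1.5, -0.5], [0.0, 1.0]]
```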
Depending on the implementation, the acquirer 210 may also include an inverse transformer. The inverse transformer inversely transforms the characteristic data output from the inverse quantizer 213 from the first domain to the second domain. The first domain may be a frequency domain, but the frequency domain is merely an example of the first domain, and the present disclosure is not limited thereto. The second domain may be a spatial domain, but the spatial domain is only an example of the second domain, and the present disclosure is not limited thereto.
When the image encoding apparatus 600 described below transforms the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image from the second domain to the first domain, the inverse transformer may inverse-transform the feature data output from the inverse quantizer 213 from the first domain to the second domain.
Depending on the implementation, the acquirer 210 may not include the inverse quantizer 213. That is, the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image may be obtained through the processing performed by the entropy decoder 211.
According to an implementation, the acquirer 210 may acquire feature data of a luminance image, feature data for cross-channel prediction, and feature data of a chrominance image by performing inverse binarization only on binary bits included in a bitstream. This operation is used in the case where the image encoding apparatus 600 generates a bitstream by binarizing the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image, that is, the image encoding apparatus 600 does not apply entropy encoding, transformation, and quantization to the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image.
The operation of the acquirer 210 is described above with reference to fig. 3. The operation of the acquirer 260 may be similar to the operation of the acquirer 210 described above, except that the input data and the output data are different, and thus the acquirer 260 will not be described.
Fig. 4a is a block diagram of the image decoder 230 shown in fig. 2 a.
Referring to fig. 4a, the image decoder 230 may include a first decoder 231, a second decoder 232, a third decoder 234, a cross-channel predictor 233, and a combiner 235.
The first decoder 231, the second decoder 232, and the third decoder 234 may be stored in a memory. In an embodiment of the present disclosure, the first decoder 231, the second decoder 232, and the third decoder 234 may be implemented as at least one dedicated processor for AI.
According to an implementation, when the frame type of the current chroma image is an I frame, an operation of the image decoder 230 may be performed to reconstruct the current chroma image.
The feature data of the luminance image output by the acquirer 210 is input to the first decoder 231. The characteristic data for cross-channel prediction output by the acquirer 210 is input to the second decoder 232. The feature data of the current chroma image output by the acquirer 210 is input to the third decoder 234.
According to an implementation, in order to reconstruct the data for cross-channel prediction, the feature data of the luminance image may be concatenated with the feature data for cross-channel prediction and then may be input to the second decoder 232. In this context, concatenation may refer to a process of combining two or more pieces of characteristic data in a channel direction. For example, by cascading, data having the number of channels equal to the sum of the number of channels of two or more pieces of characteristic data can be obtained.
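For illustration only, concatenation in the channel direction may be sketched as follows; the channel counts and the (channels, height, width) layout are assumptions.

```python
import numpy as np

feat_cross = np.zeros((64, 16, 16))  # feature data for cross-channel prediction (64 channels)
feat_luma = np.zeros((32, 16, 16))   # feature data of the luminance image (32 channels)

# Concatenation in the channel direction: the result has 64 + 32 = 96 channels.
concatenated = np.concatenate([feat_cross, feat_luma], axis=0)
assert concatenated.shape == (96, 16, 16)
```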
According to an implementation, in order to reconstruct the data of the chrominance image, the feature data of the luminance image or the feature data for cross-channel prediction is concatenated with the feature data of the chrominance image and then input to the third decoder 234.
The first decoder 231 may obtain a reconstructed image of the luminance image by processing the feature data of the luminance image based on the parameters set through training.
When the frame type of the current luminance image is an I frame, feature data of the current luminance image may be input to the first neural network to obtain a reconstructed image of the current luminance image.
When the frame type of the current luminance image is not an I frame, the feature data of the current luminance image may include feature data of a residual image of the current luminance image. In this case, a predicted image of the current luminance image may be generated. The reconstructed image of the previous luma image may be used to obtain a predicted image of the current luma image as described above with reference to fig. 1.
The first decoder 231 may reconstruct a residual image of the current luminance image by processing feature data of the current luminance image based on parameters set through training. The first decoder 231 may generate a reconstructed image of the current luminance image by using the predicted image of the current luminance image and the residual image of the current luminance image.
The third decoder 234 may obtain a residual image (or residual image data) of the chroma image by processing the feature data of the chroma image based on parameters set through training. The feature data of the chroma image may include feature data of a residual image of the chroma image, and the residual image of the chroma image may be obtained by processing the feature data of the residual image of the chroma image; in the case of the residual image decoder 270 shown in fig. 4b described below, the residual image of the chroma residual image is obtained by using the feature data of the residual image of the chroma residual image. Residual image data of a chroma image, which is 1-dimensional or 2-dimensional data, may include a plurality of samples.
The second decoder 232 may obtain data for cross-channel prediction (cross-channel prediction information) by processing feature data for cross-channel prediction based on parameters set through training. The data for cross-channel prediction (which is 1-dimensional or 2-dimensional data) may include multiple samples. The data for cross-channel prediction may include parameters for cross-channel prediction. Parameters for cross-channel prediction may include scalar parameters and bias parameters. Parameters for cross-channel prediction may be obtained for each chrominance component. For example, the parameters of the chrominance component Cb and the parameters of the chrominance component Cr may be obtained separately. However, without being limited thereto, a common parameter for cross-channel prediction may be obtained for a plurality of chrominance components. For example, for chrominance components Cb and Cr, common parameters for cross-channel prediction may be obtained.
The image of the luminance component and the image of the chrominance component represent the same object, and thus a linear relationship may exist between the image samples of the luminance component and the image samples of the chrominance component.
The color expression scheme of an image may vary depending on the implementation. The Y component (luminance component) is sensitive to errors, and thus encoding may be performed on more Y-component samples than on samples of the chrominance components Cb (U) and Cr (V). A method of encoding an image by reducing the data of the chrominance components relative to the data of the luminance component is called a chroma subsampling method.
Examples of chroma sub-sampling methods include YUV 4:4:4, 4:2:2, 4:2:1, 4:1:1, 4:2:0, and the like.
In order to match the resolution of the luminance component with the resolution of the chrominance component, the feature data of the luminance image may be downsampled. The downsampling method may be one of various methods such as a bilinear method, a bicubic method, etc.
For example, when the chroma subsampling method is not YUV 4:4:4, downsampling may be performed on the feature data of the luminance image.
For example, when the chroma subsampling method is YUV 4:2:0, the feature data of the luminance image may be downsampled by a factor of 2. That is, when the chroma subsampling method is YUV 4:2:0, the height of the chrominance-component image may be 1/2 of the height of the luminance-component image, and the width of the chrominance-component image may be 1/2 of the width of the luminance-component image. Downsampling may be performed by a factor of 2 in the horizontal direction and by a factor of 2 in the vertical direction.
The downsampled feature data of the luminance image is concatenated with feature data for cross-channel prediction, and the concatenated data may be processed based on parameters set by training to obtain data for cross-channel prediction.
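For illustration only, a 2x downsampling of a single-channel map by average pooling is sketched below; as noted above, bilinear or bicubic filtering are equally possible choices, and this sketch is not presented as the disclosed method.

```python
import numpy as np

def downsample_2x(plane):
    """Average-pool an (H, W) plane by a factor of 2 in each direction (H and W assumed even)."""
    h, w = plane.shape
    return plane.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

luma_plane = np.arange(64, dtype=float).reshape(8, 8)
luma_small = downsample_2x(luma_plane)   # shape (4, 4), matching a YUV 4:2:0 chroma plane
```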
According to an implementation, in order to match the resolution of the luminance component with the resolution of the chrominance component, a spatial-to-depth transformation may be performed on the feature data of the luminance image.
For example, when the chroma subsampling method is not YUV 4:4:4, a spatial-to-depth transformation may be performed on the feature data of the luminance image. For example, when the chroma subsampling method is YUV 4:2:0, a level-2 spatial-to-depth transform may be performed on the feature data of the luminance image. That is, the spatial-to-depth transformation is a process of rearranging the feature data of the luminance image in the channel (or depth) direction according to the ratio of the luminance image to the chrominance image; due to the level-2 spatial-to-depth transformation, rearranged data whose per-channel size is downsampled by a factor of 2 may be generated from the feature data of the luminance image.
The transformed feature data of the luminance image is concatenated with feature data for cross-channel prediction, and the concatenated data may be processed based on parameters set by training to obtain data for cross-channel prediction.
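The level-2 spatial-to-depth rearrangement described above can be sketched as follows; the array layout (channels, height, width) and the function name are assumptions made for illustration.

```python
import numpy as np

def space_to_depth(x: np.ndarray, block: int = 2) -> np.ndarray:
    """Rearrange (C, H, W) data into (C*block*block, H/block, W/block).

    The total number of samples is unchanged: spatial resolution is traded
    for additional channels, e.g. 1x4x4 becomes 4x2x2 for block=2.
    """
    c, h, w = x.shape
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, block, block, H/block, W/block)
    return x.reshape(c * block * block, h // block, w // block)

print(space_to_depth(np.arange(16, dtype=np.float32).reshape(1, 4, 4)).shape)  # (4, 2, 2)
```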
The cross-channel predictor 233 may generate a predicted image of the chrominance image by using the cross-channel prediction information and the reconstructed image of the luminance image. The cross-channel predictor 233 may obtain a predicted image of the chrominance image by applying scalar parameters and offset parameters included in the cross-channel prediction information to a reconstructed image of the luminance image. Scalar parameters can be used for multiplication of samples in the reconstructed image of the luminance image. Scalar parameters may exist for each sample. The offset parameter may be used for an addition operation on a result obtained by a multiplication operation with a scalar parameter. A bias parameter may be present for each sample.
The cross-channel predictor 233 may obtain a predicted image of the current chroma image by performing an addition operation using offset parameters after performing a multiplication operation on a reconstructed image of the current luma image using scalar parameters.
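A minimal sketch of this per-sample linear prediction, assuming the reconstructed luminance image has already been brought to the chrominance resolution and that the scalar and offset maps output by the cross-channel decoder share that shape; all names are illustrative.

```python
import numpy as np

def cross_channel_predict(recon_luma: np.ndarray,
                          scalar: np.ndarray,
                          bias: np.ndarray) -> np.ndarray:
    """Per-sample linear model: predicted chroma = scalar * reconstructed luma + bias."""
    return scalar * recon_luma + bias

h, w = 32, 32
recon_luma = np.random.rand(h, w).astype(np.float32)
scalar = np.random.rand(h, w).astype(np.float32)  # one scalar parameter per sample
bias = np.random.rand(h, w).astype(np.float32)    # one offset parameter per sample
pred_cb = cross_channel_predict(recon_luma, scalar, bias)  # repeated per chroma component
```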
When the color expression scheme of the current image is not YUV 4:4:4, downsampling may be performed on the reconstructed image of the current luminance image. By downsampling, the resolution of the luminance image and the resolution of the chrominance image can be matched.
An image obtained after performing a multiplication operation using scalar parameters and an addition operation using offset parameters on the downsampled luminance image for each sample may be determined as a predicted image of the current chrominance image. The resolution of the downsampled luminance image may be the same as the resolution of the chrominance image, but the present disclosure is not limited thereto, such that the resolution of the downsampled luminance image may be greater than the resolution of the chrominance image and less than the resolution of the luminance image.
In this case, an image obtained after performing a multiplication operation using scalar parameters and an addition operation using offset parameters on the downsampled luminance image for each sample point may not be determined as a predicted image of the current chroma image. Downsampling may be further performed on an image obtained after performing the operation, and the downsampled image may be determined as a predicted image of the current chroma image.
According to an implementation, a multiplication operation and an addition operation are performed for each sample, but the present disclosure is not limited thereto, so that a multiplication operation and an addition operation may be performed for each sample group, as sketched below. The size of the sample group may be, but is not limited to, K x K (K is an integer greater than 1). K may be a multiple of 2 or a power of 2.
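Under the sample-group variant, one scalar and one offset could be shared by each K x K block; a sketch of expanding such block-level parameters to per-sample maps before the same multiply-add is shown below, with K and the array shapes chosen only for illustration.

```python
import numpy as np

def expand_block_params(block_params: np.ndarray, k: int) -> np.ndarray:
    """Repeat one parameter per KxK sample group into a full-resolution parameter map."""
    return np.kron(block_params, np.ones((k, k), dtype=block_params.dtype))

k = 4
scalar_blocks = np.random.rand(8, 8).astype(np.float32)  # one scalar per 4x4 group
scalar_map = expand_block_params(scalar_blocks, k)        # shape (32, 32)
print(scalar_map.shape)
```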
The predicted image of the chrominance image obtained by the cross-channel predictor 233 and the residual image of the chrominance image obtained by the third decoder 234 may be provided to the combiner 235.
The combiner 235 may generate a reconstructed image of the chroma image by combining the predicted image of the chroma image with the residual image of the chroma image. The combiner 235 may generate the sample value of the reconstructed image of the chroma image by summing the sample value of the predicted image of the chroma image with the sample value of the residual image of the chroma image.
The image decoder 230 may generate a reconstructed image of the current image by using the reconstructed image of the current luminance image and the reconstructed image of the current chrominance image. For output to a display device, the color expression method may be changed. For example, the display device may support a red/green/blue (RGB) scheme, and the current luminance image and the chrominance image are based on a YUV scheme, so that the color expression method may be changed.
According to an implementation, the image decoder 230 may obtain cross-channel prediction information from the feature data for cross-channel prediction and provide the cross-channel prediction information to another device. In this case, the first decoder 231, the cross-channel predictor 233, the third decoder 234, and the combiner 235 may not be included in the image decoder 230.
According to an implementation, when residual image data of a chroma image may be obtained from a bitstream, the third decoder 234 may not be included in the image decoder 230. That is, the image decoder 230 may generate a reconstructed image of the chroma image by combining residual image data of the chroma image obtained from the bitstream with a predicted image of the chroma image.
According to an embodiment of the present disclosure, a bitstream is generated based on cross-channel prediction, thereby achieving a lower bitrate than when the bitstream is predicted without cross-channel prediction.
Next, fig. 4b is a block diagram of the image decoder 270 shown in fig. 2b.
Referring to fig. 4b, the image decoder 270 may include a first decoder 271, a second decoder 272, a third decoder 274, a cross-channel predictor 273, and a combiner 275.
The first decoder 271, the second decoder 272, and the third decoder 274 may be stored in a memory. In an embodiment of the present disclosure, the first decoder 271, the second decoder 272, and the third decoder 274 may be implemented as at least one dedicated processor for AI.
According to an embodiment, when the frame type of the current chroma image is not an I frame, the operation of the image decoder 270 may be performed to reconstruct a residual image of the current chroma image. When the frame type of the current chroma image is an I frame, the operation of the image decoder 270 may not be performed.
However, the present disclosure is not limited thereto, and when the frame type of the current chroma image is an I frame, the operation of the image decoder 270 may be performed to reconstruct a residual image of the current chroma image.
The feature data of the current luminance residual image output by the acquirer 260 is input to the first decoder 271. The characteristic data for cross-channel prediction output by the acquirer 260 is input to the second decoder 272. The feature data of the current chroma residual image output by the acquirer 260 is input to the third decoder 274.
According to an implementation, in order to reconstruct the data for cross-channel prediction, the feature data of the current luminance residual image may be concatenated with the feature data for cross-channel prediction and then may be input to the second decoder 272.
According to an embodiment, in order to reconstruct data of the current chroma residual image, feature data of the current luma residual image or feature data for cross-channel prediction may be concatenated with feature data of the current chroma residual image and then may be input to the third decoder 274.
The first decoder 271 may obtain a reconstructed image of the current luminance residual image by processing the feature data of the current luminance residual image based on the parameters set through training.
When the frame type of the current luminance residual image is an I frame, feature data of the current luminance residual image may be input to the first neural network to obtain a reconstructed image of the current luminance residual image.
According to an implementation, when the frame type of the current luminance residual image is not an I-frame, the feature data of the current luminance residual image may include feature data of a residual image of the current luminance residual image. In this case, a predicted image of the current luminance residual image may be generated. The reconstructed image of the previous luminance residual image may be used to obtain a predicted image of the current luminance residual image, as described above with reference to fig. 1.
The first decoder 271 may reconstruct a residual image of the current luminance residual image by processing feature data of the current luminance residual image based on the parameters set through training. The first decoder 271 may generate a reconstructed image of the current luminance residual image by using the predicted image of the current luminance residual image and the residual image of the current luminance residual image.
The third decoder 274 may obtain a residual image of the current chroma residual image by processing the feature data of the current chroma residual image based on the parameters set through training. The residual image data (which is 1-dimensional or 2-dimensional data) of the current chroma residual image may include a plurality of samples.
The second decoder 272 may obtain data for cross-channel prediction by processing feature data for cross-channel prediction based on parameters set through training. The data for cross-channel prediction (which is 1-dimensional or 2-dimensional data) may include multiple samples. The data for cross-channel prediction may include parameters for cross-channel prediction. Parameters for cross-channel prediction may include scalar parameters and bias parameters. Parameters for cross-channel prediction may be obtained for each chrominance component. For example, the parameters of the chrominance component Cb and the parameters of the chrominance component Cr may be obtained separately. However, without being limited thereto, a common parameter for cross-channel prediction may be obtained for a plurality of chrominance components. For example, for chrominance components Cb and Cr, common parameters for cross-channel prediction may be obtained.
The residual image of the luminance component and the residual image of the chrominance component are for a common object such that a linear relationship may exist between the residual image samples of the luminance component and the residual image samples of the chrominance component. Such a linear relationship may be represented as a linear model, and the parameters of the linear model may include scalar parameters and bias parameters.
The color expression scheme of the image may vary depending on the implementation. Human vision is more sensitive to errors in the Y (luminance) component, and thus more Y component samples may be encoded than chroma component Cb (U) and Cr (V) samples.
The second decoder 272 may downsample the feature data of the luminance residual image to match the resolution of the luminance component with the resolution of the chrominance component.
The downsampled feature data of the luminance residual image is concatenated with feature data for cross-channel prediction, and the concatenated data may be processed based on parameters set by training to obtain data for cross-channel prediction.
The cross-channel predictor 273 may generate a predicted image of the chrominance image by using the cross-channel prediction information and the reconstructed image of the luminance residual image. The cross-channel predictor 273 may obtain a predicted image of the chroma residual image by applying scalar parameters and offset parameters included in the cross-channel prediction information to a reconstructed image of the luma residual image.
Scalar parameters can be used for multiplication of samples in the reconstructed image of the luminance residual image. In this case, there may be a scalar parameter for each sample. The offset parameter may be used for an addition operation on a result obtained by a multiplication operation with a scalar parameter. In this case, there may be a bias parameter for each sample.
The cross-channel predictor 273 may obtain a predicted image of the current chroma residual image by performing an addition operation using the offset parameter after performing a multiplication operation on the reconstructed image of the current luma residual image using the scalar parameter.
When the color representation scheme of the current residual image is not YUV 4:4:4, downsampling may be performed on the reconstructed image of the luminance residual image. The resolution of the luminance residual image and the resolution of the chrominance residual image may be matched with each other by downsampling, and an image obtained after performing a multiplication operation using scalar parameters and an addition operation using offset parameters on the downsampled luminance residual image for each sample point may be determined as a predicted image of the current chrominance residual image. The resolution of the downsampled luminance residual image may be the same as the resolution of the chrominance residual image, but the present disclosure is not limited thereto, such that the resolution of the downsampled luminance residual image may be greater than the resolution of the chrominance residual image and less than the resolution of the luminance residual image.
In this case, an image obtained after performing a multiplication operation using scalar parameters and an addition operation using offset parameters on the downsampled luminance residual image for each sample point may not be determined as a predicted image of the current chrominance residual image.
Downsampling may be further performed on an image obtained after performing the operation, and the downsampled image may be determined as a predicted image of the current chroma residual image.
According to an implementation, a multiplication operation and an addition operation are performed for each sample, but the present disclosure is not limited thereto, so that a multiplication operation and an addition operation may be performed for each sample group. The size of the sample group may be, but is not limited to, K x K (K is an integer greater than 1). K may be a multiple of 2 or a power of 2.
The predicted image of the chroma residual image obtained by the cross-channel predictor 273 and the residual image of the chroma residual image obtained by the third decoder 274 may be provided to the combiner 275.
Combiner 275 may generate a reconstructed image of the chroma residual image by combining the predicted image of the chroma residual image with the residual image of the chroma residual image. Combiner 275 may generate the sample values of the reconstructed image of the chroma residual image by summing the sample values of the predicted image of the chroma residual image with the sample values of the residual image of the chroma residual image.
The image decoder 270 may generate a predicted image of the current chroma image. The image decoder 270 may generate a predicted image of the current chroma image based on the reconstructed image of the previous chroma image. The method of predicting the current chroma image based on the reconstructed image of the previous chroma image may be similar to the prediction method described with reference to fig. 1.
According to an implementation, the image decoder 270 may obtain cross-channel prediction information from the feature data for cross-channel prediction and provide the cross-channel prediction information to another device. In this case, the first decoder 271, the cross-channel predictor 273, the third decoder 274, and the combiner 275 may not be included in the image decoder 270.
According to an embodiment of the present disclosure, a bitstream is generated based on cross-channel prediction, thereby achieving a lower bitrate than when the bitstream is predicted without cross-channel prediction.
Fig. 5a is a flowchart of an image decoding method according to an embodiment of the present disclosure.
Referring to fig. 5a, in operation S505, the image decoding apparatus 200 may obtain feature data for cross-channel prediction from a bitstream.
In operation S510, the image decoding apparatus 200 may obtain feature data of a luminance image in a current image and feature data of a chrominance image in the current image.
In operation S515, the image decoding apparatus 200 may reconstruct the luminance image by applying the feature data of the luminance image to the neural network-based luminance decoder.
In operation S520, the image decoding apparatus 200 may obtain cross-channel prediction information by applying feature data for cross-channel prediction to a cross-channel decoder.
In operation S525, the image decoding apparatus 200 may obtain a predicted image of the chroma image by performing cross-channel prediction based on the reconstructed luma image and the cross-channel prediction information.
In operation S530, the image decoding apparatus 200 may obtain a residual image of the chroma image by applying the feature data of the chroma image to a neural network-based chroma decoder.
In operation S535, the image decoding apparatus 200 may reconstruct a chroma image based on the predicted image of the chroma image and the residual image of the chroma image.
In operation S540, the image decoding apparatus 200 may obtain a current image by using the reconstructed luminance image and the reconstructed chrominance image.
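Operations S505 to S540 can be summarized with the following runnable sketch; the stub decoders are trivial stand-ins for the trained neural networks, and every name is a hypothetical placeholder rather than the actual interface of the image decoding apparatus 200.

```python
import numpy as np

# Trivial stand-ins for the neural-network-based decoders (illustration only).
def luma_decoder(feat):            return feat.mean(axis=0)   # S515
def cross_channel_decoder(feat):   return feat[0], feat[1]    # S520: scalar and bias maps
def chroma_residual_decoder(feat): return feat.mean(axis=0)   # S530

def decode_image(luma_feat, cc_feat, chroma_feat):
    recon_luma = luma_decoder(luma_feat)                       # S515
    scalar, bias = cross_channel_decoder(cc_feat)              # S520
    pred_chroma = scalar * recon_luma + bias                   # S525: cross-channel prediction
    res_chroma = chroma_residual_decoder(chroma_feat)          # S530
    recon_chroma = pred_chroma + res_chroma                    # S535
    return recon_luma, recon_chroma                            # S540: combined into the current image

h, w = 16, 16
recon_y, recon_c = decode_image(np.random.rand(8, h, w),
                                np.random.rand(2, h, w),
                                np.random.rand(8, h, w))
```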
Fig. 5b is a flowchart of an image decoding method according to another embodiment of the present disclosure.
Referring to fig. 5b, the image decoding apparatus 250 may obtain feature data for cross-channel prediction from a bitstream in operation S555.
In operation S560, the image decoding apparatus 250 may obtain feature data of a luminance residual image in the current image and feature data of a chrominance residual image in the current image.
In operation S565, the image decoding apparatus 250 may reconstruct the luminance residual image by applying the feature data of the luminance residual image to the neural network-based luminance decoder.
In operation S570, the image decoding apparatus 250 may obtain cross-channel prediction information by applying feature data for cross-channel prediction to a cross-channel decoder.
In operation S575, the image decoding apparatus 250 may obtain a predicted image of the chroma residual image by performing cross-channel prediction based on the reconstructed luma residual image and the cross-channel prediction information.
In operation S580, the image decoding apparatus 250 may obtain a residual image of the chroma residual image by applying the feature data of the chroma residual image to a neural network-based chroma decoder.
In operation S585, the image decoding apparatus 250 may reconstruct a chroma residual image based on the predicted image of the chroma residual image and the residual image of the chroma residual image.
Fig. 6a is a block diagram of an image encoding apparatus 600 according to an embodiment of the present disclosure.
Referring to fig. 6a, the image encoding apparatus 600 may include an image encoder 610, a bitstream generator 620, an acquirer 630, and an image decoder 640.
The image encoder 610, the bit stream generator 620, the acquirer 630, and the image decoder 640 may be implemented as a processor and operate based on instructions stored in a memory (not shown).
Although the image encoder 610, the bit stream generator 620, the acquirer 630, and the image decoder 640 are separately illustrated in fig. 6a, the image encoder 610, the bit stream generator 620, the acquirer 630, and the image decoder 640 may be implemented as one processor. In this case, the image encoder 610, the bit stream generator 620, the acquirer 630, and the image decoder 640 may be implemented as dedicated processors, or a combination of software and a general-purpose processor (such as an AP, a CPU, or a GPU). The special purpose processor may include a memory to implement embodiments of the present disclosure, or a memory processor to use an external memory.
The image encoder 610, the bit stream generator 620, the acquirer 630, and the image decoder 640 may be implemented as a plurality of processors. In this case, the image encoder 610, the bit stream generator 620, the acquirer 630, and the image decoder 640 may be implemented as a combination of dedicated processors, or a combination of software and a general-purpose processor (such as an AP, a CPU, or a GPU).
The image encoder 610 may obtain feature data of a luminance image, feature data for cross-channel prediction, and feature data of a chrominance image from a current image.
The image encoder 610 may obtain feature data of the luminance image using a first encoder based on a neural network. The image encoder 610 may use a neural network-based second encoder to obtain feature data for cross-channel prediction. The image encoder 610 may obtain the feature data of the chrominance image using a third encoder based on a neural network.
The feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image obtained by the image encoder 610 are transmitted to the bitstream generator 620.
The bit stream generator 620 may generate a bit stream from the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image. According to an implementation, the bitstream generator 620 may generate a first bitstream corresponding to characteristic data of a luminance image, a second bitstream corresponding to characteristic data for cross-channel prediction, and a third bitstream corresponding to characteristic data of a chrominance image.
The bitstream may be transmitted to the image decoding apparatus 200 through a network. In embodiments of the present disclosure, the bitstream may be recorded on a data storage medium including a magnetic medium (e.g., hard disk, floppy disk, or magnetic tape), an optical medium (e.g., CD-ROM or DVD), or a magneto-optical medium (e.g., floptical disk).
The acquirer 630 may acquire feature data of a luminance image, feature data for cross-channel prediction, and feature data of a chrominance image from the bitstream generated by the bitstream generator 620. According to an implementation, the acquirer 630 may receive the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image from the image encoder 610.
The feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image may be transmitted to the image decoder 640. The image decoder 640 may reconstruct the luminance image by using the feature data of the luminance image. The image decoder 640 may obtain cross-channel prediction information by using feature data for cross-channel prediction. The image decoder 640 may reconstruct a chrominance image by using the cross-channel prediction information and the reconstructed image of the luminance image. The image decoder 640 may generate a reconstructed image of the current image by using the reconstructed image of the luminance image and the reconstructed image of the chrominance image.
The structure and operation of the acquirer 630 and the image decoder 640 are the same as those of the acquirer 210 and the image decoder 230 of fig. 2a, 3, and 4a, and thus a detailed description thereof is not provided herein.
In an embodiment of the present disclosure, the image encoder 610 may obtain feature data for cross-channel prediction, and the bitstream generator 620 may generate a bitstream corresponding to the feature data for cross-channel prediction. The acquirer 630 may acquire feature data for cross-channel prediction from the bitstream. The acquirer 630 may acquire cross-channel prediction information based on the feature data for cross-channel prediction.
That is, the cross-channel prediction information is encoded by the image encoder 610, the bitstream generator 620, the acquirer 630, and the image decoder 640, and in this case, the image encoding apparatus 600 may be referred to as a cross-channel prediction encoding apparatus.
The cross-channel prediction information reconstructed by the image decoder 640 may be transmitted to another device that may encode the chroma image. More specifically, the other device may encode data of a residual image of the chroma image, the data corresponding to a difference between a predicted image of the chroma image obtained from the luma reconstructed image based on the cross-channel prediction information and an original image of the chroma image.
Fig. 6b is a block diagram of an image encoding apparatus 650 according to an embodiment of the present disclosure.
Referring to fig. 6b, the image encoding apparatus 650 may include an image encoder 660, a bitstream generator 670, an acquirer 680, and an image decoder 690.
The image encoder 660, the bitstream generator 670, the acquirer 680, and the image decoder 690 may be implemented as a processor and operate based on instructions stored in a memory (not shown).
Although the image encoder 660, the bit stream generator 670, the acquirer 680 and the image decoder 690 are separately shown in fig. 6b, the image encoder 660, the bit stream generator 670, the acquirer 680 and the image decoder 690 may be implemented as one processor. In this case, the image encoder 660, the bit stream generator 670, the acquirer 680, and the image decoder 690 may be implemented as dedicated processors, or a combination of software and a general-purpose processor (such as an AP, a CPU, or a GPU). The special purpose processor may include a memory to implement embodiments of the present disclosure, or a memory processor to use an external memory.
The image encoder 660, the bitstream generator 670, the acquirer 680, and the image decoder 690 may be implemented as a plurality of processors. In this case, the image encoder 660, the bit stream generator 670, the acquirer 680, and the image decoder 690 may be implemented as a combination of dedicated processors, or a combination of software and a general-purpose processor (such as an AP, a CPU, or a GPU).
The image encoder 660 may obtain feature data of a luminance residual image, feature data for cross-channel prediction, and feature data of a chrominance residual image from a residual image of a current image.
The image encoder 660 may obtain feature data of the luminance residual image using a first encoder based on a neural network. The image encoder 660 may use a neural network-based second encoder to obtain feature data for cross-channel prediction. The image encoder 660 may obtain the feature data of the chroma residual image using a third encoder based on a neural network.
The feature data of the luminance residual image, the feature data for cross-channel prediction, and the feature data of the chrominance residual image obtained by the image encoder 660 are transmitted to the bitstream generator 670.
The bitstream generator 670 may generate a bitstream from the feature data of the luminance residual image, the feature data for cross-channel prediction, and the feature data of the chrominance residual image. According to an implementation, the bitstream generator 670 may generate a first bitstream corresponding to characteristic data of a luminance residual image, a second bitstream corresponding to characteristic data for cross-channel prediction, and a third bitstream corresponding to characteristic data of a chrominance residual image.
The bitstream may be transmitted to the image decoding apparatus 250 through a network. In embodiments of the present disclosure, the bitstream may be recorded on a data storage medium including a magnetic medium (e.g., hard disk, floppy disk, or magnetic tape), an optical medium (e.g., CD-ROM or DVD), or a magneto-optical medium (e.g., floptical disk).
The acquirer 680 may acquire feature data of a luminance residual image, feature data for cross-channel prediction, and feature data of a chrominance residual image from a bitstream generated by the bitstream generator 670. According to an implementation, the acquirer 680 may receive the feature data of the luminance residual image, the feature data for cross-channel prediction, and the feature data of the chrominance residual image from the image encoder 660.
The feature data of the luminance residual image, the feature data for cross-channel prediction, and the feature data of the chrominance residual image may be transmitted to the image decoder 690. The image decoder 690 may reconstruct the luminance residual image by using the feature data of the luminance residual image. The image decoder 690 may obtain cross-channel prediction information by using feature data for cross-channel prediction. The image decoder 690 may reconstruct a chroma residual image by using the cross-channel prediction information and the reconstructed image of the luma residual image.
The structure and operation of the acquirer 680 and the image decoder 690 are the same as those of the acquirer 260 and the image decoder 270 of fig. 2b, 3 and 4b, and thus a detailed description thereof is not provided herein.
In embodiments of the present disclosure, the image encoder 660 may obtain feature data for cross-channel prediction, and the bitstream generator 670 may generate a bitstream corresponding to the feature data for cross-channel prediction. The acquirer 680 may acquire feature data for cross-channel prediction from the bitstream. The acquirer 680 may acquire cross-channel prediction information based on the feature data for cross-channel prediction.
That is, the cross-channel prediction information is encoded by the image encoder 660, the bitstream generator 670, the acquirer 680, and the image decoder 690, and in this case, the image encoding apparatus 650 may be referred to as a cross-channel prediction encoding apparatus.
The cross-channel prediction information reconstructed by the image decoder 690 may be transmitted to another device that may encode the chroma residual image. More specifically, the other device may encode data of a residual image of the chroma residual image, the data corresponding to a difference between a predicted image of the chroma residual image obtained from the luma residual image based on the cross-channel prediction information and an original image of the chroma residual image.
The structures of the image encoder 610 and the bitstream generator 620 will be described in more detail with reference to fig. 7 and 8.
Fig. 7 is a block diagram of the image encoder 610 shown in fig. 6 a.
Referring to fig. 7, the image encoder 610 may include a first encoder 611, a second encoder 612, a third encoder 614, and a subtractor 613.
The first encoder 611 and the second encoder 612 may be stored in a memory. In an embodiment of the present disclosure, the first encoder 611 and the second encoder 612 may be implemented as at least one dedicated processor for AI.
The image decoder 640 may include a cross-channel predictor, and the image encoding apparatus 600 may obtain a predicted image of the chroma image through the cross-channel predictor included in the image decoder 640 in the same manner as the cross-channel predictor 233 of the image decoding apparatus 200. The generated predicted image of the chrominance image is supplied to the subtractor 613.
The original luminance image is input to the first encoder 611. The first encoder 611 outputs feature data of the current luminance image according to the current original luminance image based on the parameter set as a training result. When the frame type of the current luminance image is not an I frame, the feature data of the current luminance image may include feature data of a residual image of the current luminance image. The first encoder 611 may generate a predicted image of the current luminance image from reconstructed images of the current original luminance image and the previous luminance image, and obtain a residual image of the current luminance image from the predicted images of the current original luminance image and the current luminance image. In this case, the method of generating the predicted image of the current luminance image and the method of generating the residual image of the current luminance image may be based on the method described above with reference to fig. 1.
The original chrominance image and the reconstructed luminance image are input to a second encoder 612. The second encoder 612 may output feature data for cross-channel prediction from reconstructed images of the original chrominance image and the luminance image based on parameters set as a result of training. The feature data for cross-channel prediction is supplied to the image decoder 640, and as described above, a predicted image of the chroma image may be obtained based on the feature data for cross-channel prediction and the reconstructed image of the luma image.
The subtractor 613 may obtain residual image data of a chrominance image between an original image of the chrominance image and a predicted image of the chrominance image. The subtractor 613 may obtain residual image data of the chroma image by subtracting a sample value of a predicted image of the chroma image from a sample value of an original image of the chroma image.
The residual image data of the chroma image is input to the third encoder 614, and the third encoder 614 outputs the feature data of the chroma image by processing the residual image data of the chroma image based on the parameter set as the training result. The characteristic data of the chrominance image may comprise characteristic data of a residual image of the chrominance image.
The bit stream generator 620 generates a bit stream based on the feature data of the luminance image, the feature data of the chrominance image, and the feature data for cross-channel prediction output from the image encoder 610.
Fig. 8 is a block diagram of the bit stream generator 620 shown in fig. 6a.
Referring to fig. 8, the bitstream generator 620 includes a quantizer 621 and an entropy encoder 623.
The quantizer 621 quantizes the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image.
The entropy encoder 623 generates a bitstream by entropy encoding the quantization characteristic data of the luminance image, the quantization characteristic data for cross-channel prediction, and the quantization characteristic data of the chrominance image.
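A minimal sketch of the quantize-then-entropy-encode path is given below, assuming uniform scalar quantization; zlib (a general-purpose DEFLATE coder from the Python standard library) stands in here for the entropy encoder 623, which in practice would typically be an arithmetic or range coder driven by learned statistics.

```python
import numpy as np
import zlib

def quantize(feature: np.ndarray, step: float = 1.0) -> np.ndarray:
    """Uniform scalar quantization: round each feature value to the nearest multiple of step."""
    return np.round(feature / step).astype(np.int32)

def to_bitstream(feature: np.ndarray, step: float = 1.0) -> bytes:
    q = quantize(feature, step)
    return zlib.compress(q.tobytes())  # stand-in for entropy encoding

luma_feat = np.random.randn(192, 16, 16).astype(np.float32)
print(len(to_bitstream(luma_feat)), "bytes")
```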
Depending on the implementation, bitstream generator 620 may also include a transformer. The transformer transforms the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image into another domain, and provides the transformed feature data to the quantizer 621.
Depending on the implementation, bitstream generator 620 may not include quantizer 621. That is, a bit stream corresponding to the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image may be obtained through the process performed by the entropy encoder 623.
Further, according to an implementation, the bitstream generator 620 may generate a bitstream by performing binarization on the feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image. That is, when the bitstream generator 620 performs only binarization, the quantizer 621 and the entropy encoder 623 may not be included in the bitstream generator 620.
The structures of the image encoder 610 and the bitstream generator 620 shown in fig. 6a have been described so far with reference to fig. 7 and 8. The structure of the image encoder 660 and the bit stream generator 670 shown in fig. 6b is also similar to the structure of the image encoder 610 and the bit stream generator 620 described above, and thus will not be described in detail.
Fig. 9a is a flowchart of an image encoding method according to an embodiment of the present disclosure.
Referring to fig. 9a, in operation S905, the image encoding apparatus 600 may obtain feature data of a luminance image in a current image by applying an original luminance image in the current original image to a neural network-based luminance encoder, and reconstruct the luminance image by applying the feature data of the luminance image to a neural network-based luminance decoder.
In operation S910, the image encoding apparatus 600 may obtain feature data for cross-channel prediction by applying the reconstructed luminance image and an original chrominance image in the current original image to a neural network-based cross-channel encoder.
In operation S915, the image encoding apparatus 600 may obtain the cross-channel prediction information by applying the obtained feature data for cross-channel prediction to a neural network-based cross-channel decoder.
In operation S920, the image encoding apparatus 600 may obtain a predicted image of the chroma image by performing cross-channel prediction based on the reconstructed luma image and the cross-channel prediction information.
In operation S925, the image encoding apparatus 600 may obtain feature data of the chroma image by applying a residual image of the chroma image obtained based on the original chroma image and the predicted image of the chroma image to a neural network-based chroma encoder.
In operation S930, the image encoding apparatus 600 may generate a bitstream including feature data of a luminance image, feature data of a chrominance image, and feature data for cross-channel prediction.
Fig. 9b is a flowchart of an image encoding method according to an embodiment of the present disclosure.
Referring to fig. 9b, the image encoding apparatus 650 may obtain feature data of a luminance residual image by applying the residual image of the current image to a neural network-based luminance encoder and reconstruct the luminance residual image by applying the feature data of the luminance residual image to a neural network-based luminance decoder in operation S955.
In operation S960, the image encoding apparatus 650 may obtain feature data for cross-channel prediction by applying the reconstructed luminance residual image and the chrominance residual image of the current image to a neural network-based cross-channel encoder.
In operation S965, the image encoding apparatus 650 may obtain the cross-channel prediction information by applying the obtained feature data for cross-channel prediction to a neural network-based cross-channel decoder.
In operation S970, the image encoding apparatus 650 may obtain a predicted image of the chroma residual image by performing cross-channel prediction based on the reconstructed luma residual image and the cross-channel prediction information.
In operation S975, the image encoding apparatus 650 may obtain feature data of the chroma residual image by applying the residual image of the chroma residual image, obtained based on the chroma residual image and the predicted image of the chroma residual image, to the neural network-based chroma encoder.
In operation S980, the image encoding apparatus 650 may generate a bitstream including the feature data of the luminance residual image, the feature data of the chrominance residual image, and the feature data for cross-channel prediction.
Fig. 10a is a diagram for describing cross-channel prediction according to an embodiment of the present disclosure.
Referring to fig. 10a, the reconstructed luminance image and the original chrominance image are concatenated with each other and then input to a cross-channel encoder 1005. The feature data for cross-channel prediction output from the cross-channel encoder 1005 is included in the bitstream. The feature data for cross-channel prediction included in the bitstream is concatenated with the feature data of the luminance image and then input to the cross-channel decoder 1010. Scalar parameters 1015 and bias parameters 1020 are output from the cross-channel decoder 1010 for each element (i.e., each sample). Cross-channel prediction 1012 may be performed using scalar parameters 1015, reconstructed luma image 1025, and bias parameters 1020. A multiplication operation may be performed on scalar parameters 1015 and reconstructed luminance image 1025 for each element. By performing an addition operation on the result value of the multiplication operation and the bias parameter 1020, a predicted chroma image 1030 may be generated. There may be a scalar parameter 1015 and a bias parameter 1020 for each chroma component. For example, scalar parameters 1015 and bias parameters 1020 may exist separately for Cb and Cr components. Thus, the predicted chroma image 1030 may also be generated separately for the Cb and Cr components.
Fig. 10b is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus for cross-channel prediction according to an embodiment of the present disclosure.
Referring to fig. 10b, an RGB input image x is input to an image transformer 1035, and the image transformer 1035 may output a luminance image y and a chrominance image c.
The luminance image y is input to the luminance encoder 1040, and feature data of the luminance image y may be output. The feature data of the luminance image y is input to the luminance decoder 1045. The luminance decoder 1045 may output a reconstructed luminance image.
The reconstructed luminance image and the chrominance image c are concatenated with each other and then input to the cross-channel encoder 1050. The cross-channel encoder 1050 may obtain feature data for cross-channel prediction. The feature data for cross-channel prediction may be input to a cross-channel decoder 1055. According to an embodiment of the present disclosure, the feature data for cross-channel prediction and the feature data of the luminance image are concatenated with each other and then input to the cross-channel decoder 1055.
The cross-channel decoder 1055 may output cross-channel prediction information. The cross-channel prediction information and the reconstructed luminance image may be input to the cross-channel predictor 1060. The cross-channel predictor 1060 may output the chroma prediction image c_p.
A chroma residual image, obtained by subtracting the chroma prediction image c_p from the chroma image c, is input to the chroma residual encoder 1065. The chroma residual encoder 1065 may output feature data of the chroma residual image.
The feature data of the chroma residual image may be input to the chroma residual decoder 1070. According to an embodiment of the present disclosure, the feature data of the chroma residual image and the feature data of the luma image y may be concatenated with each other and then input to the chroma residual decoder 1070. The chroma residual decoder 1070 may output a chroma residual image.
A chroma reconstructed image is generated by summing the chroma residual image and the chroma prediction image c_p.
The luminance reconstructed image and the chroma reconstructed image are input to the image transformer 1075. The image transformer 1075 may output an RGB output image.
It has been described that the RGB input image is transformed into a luminance image and a chrominance image (such as a YUV image) and then input to the luminance encoder 1040, the cross-channel encoder 1050, and the chrominance residual encoder 1065. In addition, it has been described that the luminance reconstructed image output from the luminance decoder 1045 and the chrominance reconstructed image output by the chrominance residual decoder 1070 and the cross-channel predictor 1060 are transformed into RGB output images and then output.
However, the present disclosure is not limited thereto, so that luminance images and chrominance images can be input and output without transformation in the image transformers 1035 and 1075.
Furthermore, the description has been made under the assumption that the input image and the output image represent the entire image, but the input image and the output image may represent residual images. That is, the RGB input image may be an RGB residual image, and the RGB output image may be a reconstructed RGB residual image. Luminance image y may be a luminance residual image and chrominance image c may be a chrominance residual image. The chroma prediction image may be a prediction image of a chroma residual image, and the chroma residual image may be a residual image of the chroma residual image.
Fig. 10c is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus including the pair of an image encoding apparatus and an image decoding apparatus for cross-channel prediction described with reference to fig. 10 b.
Referring to fig. 10c, when the current image is an I frame, the current image may be encoded and decoded to generate a current reconstructed image without reference to a previous reconstructed image. The I encoder 1082 and the I decoder 1084 included in the first block 1080 may be replaced with the pair of image encoding apparatus and image decoding apparatus described with reference to fig. 10b. In that case, the input images of the replacing pair of image encoding apparatus and image decoding apparatus may be YUV input images (or RGB input images), and the output images thereof may be YUV output images (or RGB output images).
When the current image is not an I-frame, the current image may be encoded and decoded with reference to a previously reconstructed image to generate the current reconstructed image.
The second encoder 1087 and the second decoder 1089 included in the second block 1085 may be replaced with the pair of image encoding apparatus and image decoding apparatus described with reference to fig. 10b. In this case, the input image of the replacing pair of image encoding apparatus and image decoding apparatus may be a residual image r_i of the YUV input image, and the output image therefrom may be a residual image r'_i of the YUV output image. Components other than the second decoder 1089 and the second encoder 1087 have been described with reference to fig. 1.
Similar to the pair of image encoding apparatus and image decoding apparatus described with reference to fig. 10b, the pair of image encoding apparatus and image decoding apparatus to be described with reference to fig. 11b, 12b, 13 to 18 may also be included in the first block 1080 or the second block 1085.
Fig. 11a is a diagram for describing cross-channel prediction according to an embodiment of the present disclosure.
Referring to fig. 11a, in contrast to fig. 10a, downsamplers 1105 and 1110 may be included. The feature data of the luminance image may be converted into downsampled feature data of the luminance image by the downsampler 1105. The reconstructed luminance image may be transformed into a downsampled reconstructed luminance image by a downsampler 1110.
Unlike fig. 10a, the downsampled reconstructed luma image instead of the reconstructed luma image may be concatenated with the original chroma image and then input to a cross-channel encoder 1115. Downsampled feature data of the luma image instead of feature data of the luma image may be concatenated with feature data for cross-channel prediction and then input to the cross-channel decoder 1120.
The cross-channel decoder 1120 may output scalar parameters 1125 and bias parameters 1130.
The cross-channel prediction 1122 may include element-specific multiplications of scalar parameters 1125 with downsampled reconstructed luminance image 1133 and element-specific additions by bias parameters 1130.
Unlike fig. 10a, by performing downsampling, the resolution of the luminance component and the resolution of the chrominance component may be matched to each other. When the color representation scheme is YUV 4:4:4, the resolution of the Y component and the resolution of the U/V component are the same as each other, so that separate downsampling may not be performed, as in fig. 10a. However, when the color representation scheme is not YUV 4:4:4 (e.g., 4:2:0), because the resolution of the Y component and the resolution of the U/V component are different from each other, downsampling may be performed as in fig. 11a.
Fig. 11b is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus for cross-channel prediction according to an embodiment of the present disclosure.
Referring to fig. 11b, in contrast to fig. 10b, downsamplers 1135 and 1140 may be included. The downsampler 1135 may downsample the luminance reconstructed image, and the downsampler 1140 may downsample the feature data of the luminance image. The downsampled luminance reconstructed image may be input to the cross-channel predictor 1145. The downsampled feature data of the luminance image may be concatenated with the chrominance image c and then input to the cross-channel encoder 1150.
Depending on the implementation, the downsampled feature data of the luminance image may be concatenated with feature data for cross-channel prediction and then input to cross-channel decoder 1155. Depending on the implementation, the downsampled feature data of the luma image may be concatenated with the feature data of the chroma image and then input to the chroma residual decoder 1160.
As a result, unlike fig. 10b, the pair of image encoding and decoding apparatuses of fig. 11b may include downsamplers 1135 and 1140 to perform downsampling on a luminance image to match the resolution of the luminance image with the resolution of a chrominance image, thereby generating an accurate prediction image of the chrominance image.
Fig. 12a is a diagram for describing cross-channel prediction according to an embodiment of the present disclosure.
Referring to fig. 12a, unlike fig. 10a, transformers 1205 and 1210 may be included. The feature data of the luminance image may be transformed into multi-channel feature data of the luminance image by the transformer 1205. The reconstructed luminance image may be transformed into a multi-channel reconstructed luminance image by the transformer 1210. The transforms of the transformers 1205 and 1210 may be spatial-to-depth transforms, which represent the process of transforming (or rearranging) spatial data having a plurality of channels into depth data having a plurality of channels, wherein the total size of the data does not change. For example, when the spatial-to-depth conversion is performed on 4×4×1 data, 2×2×4 data may be output, and the size of the input data is equal to the size of the output data.
Unlike fig. 10a, the multi-channel reconstructed luma image instead of the reconstructed luma image may be concatenated with the original chroma image and then input to a cross-channel encoder 1215. In addition, multi-channel feature data of a luminance image may be concatenated with feature data for cross-channel prediction instead of feature data of the luminance image and then input to the cross-channel decoder 1220.
The cross-channel decoder 1220 may output scalar parameters 1225 and offset parameters 1230.
The cross-channel prediction 1222 may include scalar parameters 1225 and element-specific multiplications of the multi-channel reconstructed luminance image 1233, and element-specific additions by bias parameters 1230. The multi-channel result values may be calculated by an element-specific multiplication of scalar parameter 1225 with the multi-channel reconstructed luminance image 1233, and the multi-channel result values may be summed for each element to calculate the result value for one channel. An element-specific addition operation of the result value of one channel and the bias parameter 1230 may be performed. The cross-channel prediction 1222 shown in fig. 12a is a prediction of one chroma component, and the cross-channel prediction 1222 may be performed for prediction of another chroma component.
Unlike fig. 10a, the resolution of the luminance component may be matched to the resolution of the chrominance component by performing a spatial-to-depth transform. When the color representation scheme is YUV 4:4:4, the resolution of the Y component and the resolution of the U/V component are the same as each other, so that a separate spatial-to-depth transformation may not be performed, as in fig. 10a. However, when the color representation scheme is not YUV 4:4:4 (e.g., 4:2:0, 4:2:2, 4:1:1), because the resolution of the Y component and the resolution of the U/V component are different from each other, a spatial-to-depth transform may be performed as in fig. 12a. For example, when the color representation scheme is YUV 4:2:0, a level-2 spatial-to-depth transform may be performed.
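A sketch of this multi-channel form of cross-channel prediction, where the space-to-depth luminance data and the scalar parameters share a (C, H, W) layout and the bias is a single (H, W) plane; the shapes and names are assumptions for illustration, and the same step would be repeated for the other chrominance component.

```python
import numpy as np

def multi_channel_cross_channel_predict(luma_s2d: np.ndarray,
                                        scalar: np.ndarray,
                                        bias: np.ndarray) -> np.ndarray:
    """Elementwise multiply per channel, sum over channels, then add the per-sample bias."""
    return (scalar * luma_s2d).sum(axis=0) + bias

c, h, w = 4, 16, 16
pred_cb = multi_channel_cross_channel_predict(np.random.rand(c, h, w),
                                              np.random.rand(c, h, w),
                                              np.random.rand(h, w))
print(pred_cb.shape)  # (16, 16)
```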
Fig. 12b is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus for cross-channel prediction according to an embodiment of the present disclosure.
Referring to fig. 12b, in contrast to fig. 10b, transformers 1235 and 1240 may be included. The transformer 1235 may transform the luminance reconstructed image, and the transformer 1240 may transform the feature data of the luminance image. The transformed luminance reconstructed image may be input to the cross-channel predictor 1245. The feature data of the transformed luminance image may be concatenated with the chrominance image c and then input to the cross-channel encoder 1250.
According to an implementation, the feature data of the transformed luminance image may be concatenated with feature data for cross-channel prediction and then input to the cross-channel decoder 1255. Depending on the implementation, the feature data of the transformed luminance image may be concatenated with the feature data of the chrominance image and then input to the chrominance residual decoder 1260.
As a result, unlike fig. 10b, the pair of image encoding and decoding apparatuses of fig. 12b may include transformers 1235 and 1240 to perform transformation on the luminance image, matching the resolution of the luminance image with the resolution of the chrominance image, thereby generating an accurate prediction image of the chrominance image.
The downsamplers 1135 and 1140 and the transformers 1235 and 1240 described with reference to fig. 11b and 12b may be included in a pair of the image encoding apparatus and the image decoding apparatus described below with reference to fig. 13 to 18.
Fig. 13 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 13, feature data of a luminance image may be concatenated with feature data for cross-channel prediction and then input to a cross-channel decoder 1305. The feature data of the luminance image may not be concatenated with the feature data of the chrominance image and may not be input to the chrominance residual decoder 1310. That is, only the feature data of the chroma image may be input to the chroma residual decoder 1310.
Fig. 14 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 14, feature data of a luminance image may be concatenated with feature data of a chrominance image and then input to a chrominance residual decoder 1405.
The feature data of the luminance image may not be concatenated with the feature data for cross-channel prediction. The feature data for cross-channel prediction may be input to the cross-channel decoder 1410.
Fig. 15 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 15, feature data of a luminance image may not be concatenated with feature data of a chrominance image and may not be input to the chrominance residual decoder 1505.
The feature data for cross-channel prediction may be concatenated with the feature data of the chroma image and then input to the chroma residual decoder 1505.
The feature data of the luminance image may not be concatenated with the feature data for cross-channel prediction and may not be input to the cross-channel decoder 1510. The feature data for cross-channel prediction may be input to the cross-channel decoder 1510.
Fig. 16 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 16, feature data of a luminance image may be concatenated with feature data for cross-channel prediction and then input to a cross-channel decoder 1605.
The feature data of the luminance image may not be concatenated with the feature data of the chrominance image and may not be input to the chrominance residual decoder 1610.
The feature data of the chroma image may be concatenated with feature data for cross-channel prediction and then input to the chroma residual decoder 1610.
Fig. 17 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 17, feature data of a luminance image may not be concatenated with feature data for cross-channel prediction and may not be input to the cross-channel decoder 1705. The feature data for cross-channel prediction may be input to the cross-channel decoder 1705.
The feature data for cross-channel prediction may be concatenated with the feature data of the chroma image and then input to the chroma residual decoder 1710.
Fig. 18 is a diagram for describing a pair of an image encoding apparatus and an image decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 18, feature data of a luminance image may be concatenated with feature data for cross-channel prediction and then input to a cross-channel decoder 1805.
The feature data of the luminance image, the feature data for cross-channel prediction, and the feature data of the chrominance image may all be concatenated with each other and then input to the chrominance residual decoder 1810.
As described above with reference to fig. 13 to 18, various combinations of feature data may be input to a cross-channel decoder or a chroma residual decoder, thereby improving the accuracy of a predicted image of a chroma image based on cross-channel prediction, and improving encoding/decoding efficiency by reducing the size of feature data of a residual image of the chroma image.
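The variants of fig. 13 to 18 differ only in which pieces of feature data are concatenated before being fed to the cross-channel decoder or the chroma residual decoder. A hedged sketch of such channel-wise concatenation is shown below; the tensor names, channel counts, and the assumption that all feature maps share one spatial resolution are illustrative only.

```python
# Illustrative sketch: channel-wise concatenation of feature data before a decoder,
# as in the input combinations of figs. 13-18. Shapes and names are assumptions.
import torch

feat_luma   = torch.rand(1, 32, 16, 16)   # feature data of the luminance image
feat_cross  = torch.rand(1, 32, 16, 16)   # feature data for cross-channel prediction
feat_chroma = torch.rand(1, 32, 16, 16)   # feature data of the chrominance image

# Fig. 13 style: luminance + cross-channel feature data -> cross-channel decoder.
cross_decoder_input = torch.cat([feat_luma, feat_cross], dim=1)                 # 1 x 64 x 16 x 16

# Fig. 18 style: all three pieces of feature data -> chroma residual decoder.
chroma_decoder_input = torch.cat([feat_luma, feat_cross, feat_chroma], dim=1)   # 1 x 96 x 16 x 16
```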
Fig. 19 is a diagram showing an example of an architecture of a neural network 1900 according to an embodiment of the disclosure.
As shown in fig. 19, input data 1905 is input to a first convolution layer 1910. Here, the input data 1905 varies depending on whether the neural network 1900 is used as the first decoder 231, the second decoder 232, the third decoder 234, the first decoder 271, the second decoder 272, the third decoder 274, the first encoder 611, the second encoder 612, or the third encoder 614.
For example, when the neural network 1900 is used as the first decoder 231, the input data 1905 may correspond to feature data of a luminance image, and when the neural network 1900 is used as the second decoder 232, the input data 1905 may correspond to feature data for cross-channel prediction.
The 3×3×4 marked on the first convolution layer 1910 shown in fig. 19 represents that convolution is performed on one piece of input data 1905 using four 3×3 filter kernels. Four feature maps are generated by the four filter kernels as a result of the convolution.
The feature map generated by the first convolution layer 1910 represents unique features of the input data 1905. For example, each feature map may represent a vertical feature, a horizontal feature, or an edge feature of the input data 1905.
The convolution operation performed by the first convolution layer 1910 will now be described in detail with reference to fig. 20.
A feature map 2050 may be generated by performing multiplication and addition operations between the parameters of the 3×3 filter kernel 2030 used by the first convolution layer 1910 and the corresponding sample values in the input data 1905. Because the first convolution layer 1910 uses four filter kernels 2030, four feature maps may be generated by performing the convolution operation with each of the four filter kernels 2030.
I1 to I49 marked on the input data 2005 in fig. 20 indicate the samples of the input data 1905, and F1 to F9 marked on the filter kernel 2030 indicate the samples (also referred to as parameters) of the filter kernel 2030. M1 to M9 marked on the feature map 2050 indicate the samples of the feature map 2050.
In the convolution operation, the sample values I1, I2, I3, I8, I9, I10, I15, I16, and I17 of the input data 1905 may be multiplied by F1, F2, F3, F4, F5, F6, F7, F8, and F9 of the filter kernel 2030, respectively, and a value obtained by combining (e.g., adding) the result values of the multiplication operation may be allocated as the value of M1 of the feature map 2050. When the stride of the convolution operation is 2, the sample values I3, I4, I5, I10, I11, I12, I17, I18, and I19 of the input data 1905 may be multiplied by F1, F2, F3, F4, F5, F6, F7, F8, and F9 of the filter kernel 2030, respectively, and the value obtained by combining the result values of the multiplication operation may be assigned as the value of M2 of the feature map 2050.
By performing the convolution operation between the sample values in the input data 1905 and the samples of the filter kernel 2030 while the filter kernel 2030 moves across the input data 1905 to its last sample according to the stride, a feature map 2050 having a specific size can be obtained.
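The multiply-and-add operation and the stride described above can be reproduced in a few lines of code. The sketch below is illustrative only; it uses the 7×7 input (I1 to I49), one 3×3 filter kernel, and a stride of 2 from fig. 20.

```python
# Illustrative sketch of the convolution of fig. 20: 7x7 input (I1..I49),
# one 3x3 filter kernel (F1..F9), stride 2, no padding -> 3x3 feature map (M1..M9).
import torch
import torch.nn.functional as F

inp    = torch.arange(1.0, 50.0).reshape(1, 1, 7, 7)   # I1..I49 in row-major order
kernel = torch.rand(1, 1, 3, 3)                         # F1..F9

# M1 by hand: multiply the top-left 3x3 patch (I1, I2, I3, I8, ..., I17) by F1..F9 and sum.
m1 = (inp[0, 0, 0:3, 0:3] * kernel[0, 0]).sum()

feature_map = F.conv2d(inp, kernel, stride=2)           # the full map M1..M9
assert torch.isclose(feature_map[0, 0, 0, 0], m1)
```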
According to the present disclosure, the parameters of the neural network 1900, e.g., the samples of the filter kernels used by the convolution layers of the neural network 1900 (such as F1, F2, F3, F4, F5, F6, F7, F8, and F9 of the filter kernel 2030), may be optimized by training the neural network 1900.
Although the convolution layers included in the neural network 1900 may perform the convolution operation described with reference to fig. 20, the convolution operation of fig. 20 is merely an example, and the convolution operation performed by the convolution layers is not limited thereto.
Referring back to fig. 19, the feature maps of the first convolution layer 1910 are input to the first activation layer 1920.
The first activation layer 1920 may impart a non-linear characteristic to each of the feature maps. The first activation layer 1920 may use, but is not limited to, a sigmoid function, a hyperbolic tangent (tanh) function, a rectified linear unit (ReLU) function, or the like.
When the first activation layer 1920 imparts the non-linear characteristic, this means that some sample values of the feature map are changed and then output, the change being performed by applying the non-linear characteristic.
The first activation layer 1920 determines whether to send the sample values of the feature map to the second convolution layer 1930. For example, some of the sample values of the feature map are activated by the first activation layer 1920 and sent to the second convolution layer 1930, and other sample values are deactivated by the first activation layer 1920 and not sent to the second convolution layer 1930. The unique features of the input data 1905 represented by the feature map are enhanced by the first activation layer 1920.
The feature maps 1925 output from the first activation layer 1920 are input to the second convolution layer 1930. One of the feature maps 1925 shown in fig. 19 is the result of processing the feature map 2050 of fig. 20 by the first activation layer 1920.
The 3×3×4 marked on the second convolution layer 1930 means that convolution is performed on the input feature maps 1925 using four 3×3 filter kernels. The output of the second convolution layer 1930 is input to the second activation layer 1940. The second activation layer 1940 may impart non-linear characteristics to the input feature maps.
The feature maps 1945 output from the second activation layer 1940 are input to the third convolution layer 1950. The 3×3×1 marked on the third convolution layer 1950 represents that one piece of output data 1955 is generated by performing convolution using one 3×3 filter kernel.
The output data 1955 is output from the third convolution layer 1950 of the neural network 1900. Here, the output data 1955 varies according to whether the neural network 1900 is used as the first decoder 231, the second decoder 232, the third decoder 234, the first decoder 271, the second decoder 272, the third decoder 274, the first encoder 611, the second encoder 612, or the third encoder 614.
For example, when the neural network 1900 is used as the first decoder 231, the output data 1955 may be a reconstructed image of a luminance image, and when the neural network 1900 is used as the second decoder 232, the output data 1955 may be cross-channel prediction information.
Although the neural network 1900 includes three convolution layers and two activation layers in fig. 19, fig. 19 is merely an example, and the number of convolution layers and activation layers included in the neural network 1900 may be varied in various ways according to the implementation.
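A hedged sketch of the three-convolution-layer architecture of fig. 19 follows; the single-channel input, 'same' padding, and ReLU activations are assumptions, since the disclosure leaves these choices open.

```python
# Illustrative sketch of the neural network 1900 of fig. 19 (assumed choices:
# 1-channel input, padding=1, ReLU activations).
import torch
import torch.nn as nn

neural_network_1900 = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),   # first convolution layer 1910 (3x3x4)
    nn.ReLU(),                                   # first activation layer 1920
    nn.Conv2d(4, 4, kernel_size=3, padding=1),   # second convolution layer 1930 (3x3x4)
    nn.ReLU(),                                   # second activation layer 1940
    nn.Conv2d(4, 1, kernel_size=3, padding=1),   # third convolution layer 1950 (3x3x1)
)

input_data = torch.rand(1, 1, 64, 64)            # e.g., feature data of a luminance image
output_data = neural_network_1900(input_data)    # one piece of output data 1955
```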
According to an implementation, the neural network 1900 may be implemented as a recurrent neural network (RNN). In this case, the convolutional neural network (CNN) architecture of the neural network 1900 according to an embodiment of the present disclosure is changed to an RNN architecture.
In an embodiment of the present disclosure, the image decoding apparatuses 200 and 250 and the image encoding apparatuses 600 and 650 may include at least one Arithmetic Logic Unit (ALU) for the above-described convolution and activation operations.
The ALU may be implemented as a processor. For convolution operations, the ALU may include a multiplier for multiplying the sample value of the filter kernel by the sample value of the input data 1905 or the feature map output from the previous layer, and an adder for adding the result values of the multiplication.
For the activation operation, the ALU may include a multiplier for multiplying an input sample value by a weight used in a predetermined sigmoid, tanh, or ReLU function, and a comparator for comparing the multiplication result with a specific value to determine whether to send the input sample value to the next layer.
A method of training a neural network used in image encoding and decoding processes will now be described with reference to fig. 21 and 22.
Fig. 21 is a diagram for describing a method of training the first decoder 231, the first encoder 611, the second decoder 232, the second encoder 612, the third decoder 234, and the third encoder 614.
In fig. 21, current training luma image 2105, current training chroma image 2110, current reconstructed training luma image 2120, and current reconstructed training chroma image 2140 correspond to the current luma image, the current chroma image, the current reconstructed luma image, and the current reconstructed chroma image.
In training the neural network used in the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614, it is necessary to consider the similarity between the current reconstructed training luminance image 2120 and the current training luminance image 2105, the similarity between the current reconstructed training chrominance image 2140 and the current training chrominance image 2110, the bit rate of the bit stream generated by encoding the current training luminance image 2105, the bit rate of the bit stream generated by encoding the current reconstructed training luminance image 2120 and the current training chrominance image 2110, and the bit rate of the bit stream generated by encoding the current training chrominance image 2110.
To this end, in an embodiment of the present disclosure, the neural network used in the first decoder 231, the first encoder 611, the second decoder 232, the second encoder 612, the third decoder 234, and the third encoder 614 may be trained based on the first loss information 2150 corresponding to the similarity between the current training chrominance image 2110 and the current reconstructed training chrominance image 2140, the fourth loss information 2180 corresponding to the similarity between the current training luminance image 2105 and the current reconstructed training luminance image 2120, and the second loss information 2160, the third loss information 2170, and the fifth loss information 2190 corresponding to the bit rate of the bit stream, respectively.
Referring to fig. 21, a current training luminance image 2105 is input to the first encoder 611. The first encoder 611 processes the current training luminance image 2105 to output feature data L_i of the luminance image.
The feature data L_i of the luminance image is input to the first decoder 231, and the first decoder 231 outputs a current reconstructed training luminance image 2120.
The current reconstructed training luminance image 2120 and the current training chrominance image 2110 are input to the second encoder 612. The second encoder 612 may output feature data w_i for cross-channel prediction by processing the current reconstructed training luminance image 2120 and the current training chrominance image 2110.
The feature data w_i for cross-channel prediction is input to the second decoder 232, and the second decoder 232 may output cross-channel prediction information g_i.
Cross-channel prediction 2130 is performed using the current reconstructed training luminance image 2120 and the cross-channel prediction information g_i, and a predicted training image x'_i of the current chrominance image may be generated.
Residual image data r_i of the current training chrominance image is obtained, which corresponds to the difference between the predicted training image x'_i of the current chrominance image and the current training chrominance image 2110.
The residual image data r_i of the current training chrominance image is input to the third encoder 614, and the third encoder 614 processes the residual image data r_i to output feature data v_i of the residual image data of the current training chrominance image.
The feature data v_i of the residual image data of the current training chrominance image is input to the third decoder 234.
The third decoder 234 processes the feature data v_i to output a residual training image r'_i of the current chrominance image, and the current reconstructed training chrominance image 2140 is obtained by summing the predicted training image x'_i and the residual training image r'_i.
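The forward pass just described can be summarized in code. The sketch below is an assumption-laden outline: each encoder/decoder stands for a neural network module, the layout of g_i is assumed, and cross-channel prediction is modelled as a per-sample scale and bias derived from g_i, consistent with the scalar and bias parameters mentioned for the cross-channel prediction information.

```python
# Illustrative forward-pass sketch for fig. 21. Module names, the layout of g_i,
# and the scale/bias form of cross-channel prediction are assumptions.
import torch

def forward_pass(luma, chroma, enc1, dec1, enc2, dec2, enc3, dec3):
    L_i = enc1(luma)                                    # feature data of the luminance image
    luma_rec = dec1(L_i)                                # current reconstructed training luminance image
    w_i = enc2(torch.cat([luma_rec, chroma], dim=1))    # feature data for cross-channel prediction
    g_i = dec2(w_i)                                     # cross-channel prediction information
    scale, bias = g_i.chunk(2, dim=1)                   # assumed: g_i carries scale and bias maps
    x_pred = scale * luma_rec + bias                    # cross-channel prediction 2130
    r_i = chroma - x_pred                               # residual image data r_i
    v_i = enc3(r_i)                                     # feature data of the residual image data
    r_rec = dec3(v_i)                                   # residual training image r'_i
    chroma_rec = x_pred + r_rec                         # current reconstructed training chrominance image
    return L_i, w_i, v_i, luma_rec, chroma_rec
```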
To train the neural network used in the first encoder 611, the second encoder 612, the third encoder 614, the first decoder 231, the second decoder 232, and the third decoder 234, at least one of the first loss information 2150, the second loss information 2160, the third loss information 2170, the fourth loss information 2180, or the fifth loss information 2190 may be obtained.
First loss information 2150 corresponds to the difference between the current training chrominance image 2110 and the current reconstructed training chrominance image 2140. The difference between the current training chrominance image 2110 and the current reconstructed training chrominance image 2140 may include at least one of an L1 norm value, an L2 norm value, a Structural Similarity (SSIM) value, a peak signal-to-noise ratio-human visual system (PSNR-HVS) value, a multi-scale SSIM (MS-SSIM) value, a visual information fidelity (VIF) value, or a video multi-method assessment fusion (VMAF) value between the current training chrominance image 2110 and the current reconstructed training chrominance image 2140.
The first loss information 2150 is related to the quality of the current training chrominance image 2110 such that the first loss information 2150 may also be referred to as quality loss information.
Similar to the first loss information 2150, the fourth loss information 2180 corresponds to the difference between the current training luminance image 2105 and the current reconstructed training luminance image 2120.
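A minimal sketch of such a quality loss is given below; the L1/L2 choice is only one of the candidate measures listed above and is an assumption.

```python
# Illustrative sketch: quality loss (first/fourth loss information) as a simple
# L1 or L2 difference between a training image and its reconstruction.
import torch

def quality_loss(original: torch.Tensor, reconstructed: torch.Tensor,
                 norm: str = "l1") -> torch.Tensor:
    diff = original - reconstructed
    if norm == "l1":
        return diff.abs().mean()        # L1 norm value
    return (diff ** 2).mean()           # L2 norm value (mean squared error)
```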
The second loss information 2160 corresponds to the entropy of the feature data w_i for cross-channel prediction, or to the bit rate of the bit stream corresponding to the feature data w_i for cross-channel prediction.
The third loss information 2170 corresponds to the entropy of the feature data v_i of the residual image data of the current training chrominance image, or to the bit rate of the bit stream corresponding to the feature data v_i of the residual image data of the current training chrominance image.
The fifth loss information 2190 corresponds to the entropy of the feature data L_i of the current training luminance image, or to the bit rate of the bit stream corresponding to the feature data L_i of the current training luminance image.
When the bit stream includes both the feature data w_i for cross-channel prediction and the feature data v_i of the residual image data of the current training chrominance image, sixth loss information corresponding to the bit rate of the corresponding bit stream may be calculated. In this case, training may not be performed using the second loss information 2160 and the third loss information 2170.
When the bit stream includes the feature data w_i for cross-channel prediction, the feature data v_i of the residual image data of the current training chrominance image, and the feature data L_i of the current training luminance image, seventh loss information corresponding to the bit rate of the corresponding bit stream may be calculated. In this case, training may be performed without using the second loss information 2160, the third loss information 2170, and the fifth loss information 2190.
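During training, the second, third, and fifth loss information can be approximated as the expected number of bits of the corresponding feature data under an entropy model. The sketch below assumes a fixed unit-variance Gaussian model purely for illustration; the disclosure does not specify the entropy model.

```python
# Illustrative sketch: rate loss as -log2 p(feature data) under an assumed
# standard-normal entropy model, averaged over all samples of the feature data.
import math
import torch

def rate_loss(feature_data: torch.Tensor) -> torch.Tensor:
    log_prob = -0.5 * feature_data ** 2 - 0.5 * math.log(2.0 * math.pi)  # log N(x; 0, 1)
    bits = -log_prob / math.log(2.0)                                     # nats -> bits
    return bits.mean()
```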
The second loss information 2160, the third loss information 2170, and the fifth loss information 2190 are related to the efficiency of encoding the current training luminance image 2105 and the current training chrominance image 2110, such that the second loss information 2160, the third loss information 2170, and the fifth loss information 2190 may be referred to as compression loss information.
The neural networks used in the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614 are trained to reduce or minimize final loss information calculated from at least one of the first loss information 2150, the second loss information 2160, the third loss information 2170, the fourth loss information 2180, or the fifth loss information 2190.
More specifically, the neural networks used in the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614 are trained to reduce or minimize the final loss information while changing the values of their preset parameters.
In embodiments of the present disclosure, final loss information may be calculated based on equation 1.
[ equation 1]
Final loss information = a × first loss information + b × second loss information + c × third loss information + d × fourth loss information + e × fifth loss information
In equation 1, a, b, c, d, and e represent the weights applied to the first loss information 2150, the second loss information 2160, the third loss information 2170, the fourth loss information 2180, and the fifth loss information 2190, respectively.
Equation 1 shows that the neural networks used in the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614 are trained in the following manner: the current reconstructed training luminance image 2120 becomes as similar as possible to the current training luminance image 2105, the current reconstructed training chrominance image 2140 becomes as similar as possible to the current training chrominance image 2110, and the size of the bit streams corresponding to the data output from the first encoder 611, the second encoder 612, and the third encoder 614 is minimized.
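Equation 1 amounts to a weighted sum; a sketch follows, with placeholder weight values that are not taken from the disclosure.

```python
# Illustrative sketch of equation 1. The weights a..e are hyperparameters;
# the defaults below are placeholders, not values given in the disclosure.
def final_loss(loss1, loss2, loss3, loss4, loss5,
               a=1.0, b=0.1, c=0.1, d=1.0, e=0.1):
    return a * loss1 + b * loss2 + c * loss3 + d * loss4 + e * loss5
```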
However, training is not limited to joint training based on the final loss information of equation 1. The neural networks used in the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614 may also be trained individually, according to a plurality of pieces of final loss information each calculated from at least some of the first loss information 2150, the second loss information 2160, the third loss information 2170, the fourth loss information 2180, and the fifth loss information 2190.
Fig. 22 is a diagram for describing a process, performed by the training apparatus 2200, of training the neural networks used in the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614.
The training process described above with respect to fig. 21 may be performed by training device 2200. Training device 2200 may be, for example, image encoding device 600 or 650 or a separate server. Parameters obtained as a result of training are stored in the image encoding apparatuses 600 and 650 and the image decoding apparatuses 200 and 250.
Referring to fig. 22, the training apparatus 2200 initially sets the parameters of the neural networks used in the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614. Accordingly, the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614 may operate based on the initially set parameters.
In operation S2210, the training device 2200 may input the current training luminance image 2105 to the first encoder 611.
In operation S2215, the first encoder 611 may process the input data to output the feature data L_i of the luminance image to the training device 2200 and the first decoder 231.
In operation S2220, the training device 2200 may calculate the fifth loss information 2190 from the feature data L_i of the luminance image.
In operation S2225, the first decoder 231 may process the feature data L_i of the luminance image to output the current reconstructed training luminance image 2120 to the training device 2200.
In operation S2230, the training device 2200 may calculate the fourth loss information 2180 from the current reconstructed training luminance image 2120 and the current training luminance image 2105.
In operation S2235, the training device 2200 may input the current reconstructed training luminance image 2120 and the current training chrominance image 2110 to the second encoder 612.
In operation S2240, the second encoder 612 may process the current reconstructed training luminance image 2120 and the current training chrominance image 2110 to output the feature data w_i for cross-channel prediction to the training device 2200 and the second decoder 232.
In operation S2245, the training device 2200 may calculate the second loss information 2160 based on the feature data w_i for cross-channel prediction.
In operation S2250, the second decoder 232 may process the feature data w_i for cross-channel prediction to output the cross-channel prediction information g_i to the training device 2200.
In operation S2255, the training apparatus 2200 may generate a predicted image x'_i of the training chrominance image by performing cross-channel prediction 2130 using the cross-channel prediction information g_i and the current reconstructed training luminance image 2120.
In operation S2260, the training device 2200 may generate a residual image r_i of the current training chrominance image by using the current training chrominance image 2110 and the predicted image x'_i of the training chrominance image.
In operation S2265, the training device 2200 may input the residual image r_i of the current training chrominance image to the third encoder 614.
In operation S2270, the third encoder 614 may process the residual image r_i of the current chrominance image to output the feature data v_i of the chrominance image to the training device 2200 and the third decoder 234.
In operation S2275, the training device 2200 may calculate the third loss information 2170 from the feature data v_i of the chrominance image.
In operation S2280, the third decoder 234 may process the feature data v_i of the chrominance image to output a reconstructed residual image r'_i of the training chrominance image to the training device 2200.
In operation S2285, the training device 2200 may generate the current reconstructed training chrominance image 2140 from the predicted image x'_i of the training chrominance image and the reconstructed residual image r'_i of the training chrominance image.
In operation S2290, the training device 2200 may calculate first loss information from the current reconstructed training chromaticity image 2140 and the current training chromaticity image 2110.
In operations S2291, S2292, S2293, S2294, S2295, and S2296, the training device 2200 calculates the final loss information by combining at least one of the first loss information 2150, the second loss information 2160, the third loss information 2170, the fourth loss information 2180, or the fifth loss information 2190, and the neural networks used in the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614 update the initially set parameters through a back propagation process based on the final loss information.
Thereafter, the training apparatus 2200, the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614 update the parameters while repeating operations S2210 to S2296 until the final loss information is minimized. In this case, the first decoder 231, the second decoder 232, the third decoder 234, the first encoder 611, the second encoder 612, and the third encoder 614 may operate based on the parameters updated in the previous operation.
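Putting the earlier sketches together, a minimal training loop in the spirit of fig. 22 might look as follows; the optimizer, learning rate, and data pipeline are assumptions, since the disclosure only states that the parameters are updated by back propagation until the final loss information is minimized.

```python
# Illustrative training-loop sketch for fig. 22, reusing forward_pass, quality_loss,
# rate_loss, and final_loss from the sketches above. All hyperparameters are assumptions.
import itertools
import torch

def train(models, data_loader, num_steps, lr=1e-4):
    enc1, dec1, enc2, dec2, enc3, dec3 = models
    params = itertools.chain(*(m.parameters() for m in models))
    optimizer = torch.optim.Adam(params, lr=lr)
    for _, (luma, chroma) in zip(range(num_steps), data_loader):
        L_i, w_i, v_i, luma_rec, chroma_rec = forward_pass(
            luma, chroma, enc1, dec1, enc2, dec2, enc3, dec3)
        loss1 = quality_loss(chroma, chroma_rec)    # first loss information
        loss4 = quality_loss(luma, luma_rec)        # fourth loss information
        loss2 = rate_loss(w_i)                      # second loss information
        loss3 = rate_loss(v_i)                      # third loss information
        loss5 = rate_loss(L_i)                      # fifth loss information
        loss = final_loss(loss1, loss2, loss3, loss4, loss5)
        optimizer.zero_grad()
        loss.backward()                             # back propagation
        optimizer.step()                            # update the initially set parameters
```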
Furthermore, the foregoing embodiments of the present disclosure may be written as computer-executable programs and the written programs may be stored in machine-readable storage media.
The machine-readable storage medium may be provided in the form of a non-transitory storage medium. When the storage medium is "non-transitory," this means that the storage medium is tangible and does not include signals (e.g., electromagnetic waves), and does not limit the data to be semi-permanently or temporarily stored in the storage medium. For example, a "non-transitory storage medium" may include a buffer that temporarily stores data.
According to embodiments of the present disclosure, methods according to various embodiments of the present disclosure may be included and provided in a computer program product. The computer program product may be used as an article of commerce for transactions between sellers and buyers. The computer program product may be distributed in the form of a machine-readable storage medium, e.g. a compact disc read only memory (CD-ROM), or electronically distributed (e.g. downloaded or uploaded) via an application store or directly between two user devices, e.g. smart phones. For electronic distribution, at least a portion of the computer program product (e.g., the downloadable app) may be temporarily generated or at least temporarily stored in a machine-readable storage medium (e.g., memory of a manufacturer's server, an application store's server, or a relay server).
While the present disclosure has been particularly shown and described with reference to embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the scope of the present disclosure as defined by the following claims.

Claims (15)

1. A method of decoding an image based on cross-channel prediction using artificial intelligence AI, the method comprising:
obtaining feature data for cross-channel prediction from the bitstream;
obtaining feature data of a luminance image in a current image and feature data of a chrominance image in the current image from the bitstream;
reconstructing the luminance image by applying the feature data of the luminance image to a neural network based luminance decoder;
obtaining cross-channel prediction information by applying the characteristic data for cross-channel prediction to a neural network-based cross-channel decoder;
obtaining a predicted image of the chrominance image by performing cross-channel prediction based on the reconstructed luminance image and the cross-channel prediction information;
obtaining a residual image of the chrominance image by applying the characteristic data of the chrominance image to a neural network-based chrominance residual decoder; and
reconstructing the chrominance image based on the prediction image and the residual image.
2. The method of claim 1, wherein at least one of the feature data for cross-channel prediction, the feature data of the luminance image, or the feature data of the chrominance image is obtained by entropy decoding and dequantizing the bitstream.
3. The method of claim 1, wherein the neural network-based cross-channel decoder is trained based on first and second loss information,
wherein the first loss information corresponds to a difference between a current training chromaticity image and a current reconstructed training chromaticity image corresponding to the current training chromaticity image; and
the second loss information corresponds to entropy of the feature data for cross-channel prediction of the current training chrominance image.
4. The method of claim 1, further comprising: when the chroma subsampling format of the current image is not YUV (YCbCr) 4:4:4, downsampling is performed on the reconstructed luma image,
wherein obtaining the predicted image of the chroma image comprises: a predicted image of the chroma image is obtained by performing cross-channel prediction based on the downsampled luma image and the cross-channel prediction information.
5. The method of claim 1, further comprising: when the chroma subsampling format of the current image is not YCbCr 4:4:4, generating multi-channel luminance image data by performing a spatial to depth transformation on the reconstructed luminance image,
wherein obtaining the predicted image of the chroma image comprises: a predicted image of the chroma image is obtained by performing cross-channel prediction based on the multi-channel luma image data and the cross-channel prediction information.
6. The method of claim 1, wherein the luminance image comprises an image of a Y component and the chrominance image comprises at least one of an image of a Cb component or an image of a Cr component.
7. The method of claim 1, wherein obtaining the cross-channel prediction information by applying the feature data for cross-channel prediction to the neural network-based cross-channel decoder comprises: the cross-channel prediction information is obtained by applying the feature data for cross-channel prediction and the feature data of the luminance image to the neural network-based cross-channel decoder.
8. The method of claim 1, wherein obtaining the residual image of the chroma image by applying the characteristic data of the chroma image to the neural network-based chroma residual decoder comprises obtaining the residual image of the chroma image by further applying at least one of the characteristic data of the luma image or the characteristic data for cross-channel prediction to the neural network-based chroma residual decoder.
9. The method of claim 1, wherein the cross-channel prediction information includes information about scalar parameters and information about bias parameters.
10. A computer-readable recording medium having recorded thereon a program for executing the method according to claim 1 on a computer.
11. An apparatus for decoding an image based on cross-channel prediction using artificial intelligence AI, the apparatus comprising:
an acquirer configured to:
obtaining feature data for cross-channel prediction from a bitstream, and
obtaining feature data of a luminance image in a current image and feature data of a chrominance image in the current image from the bitstream; and
an image decoder configured to:
reconstructing the luminance image by applying the characteristic data of the luminance image to a neural network based luminance decoder,
obtaining cross-channel prediction information by applying the characteristic data for cross-channel prediction to a neural network-based cross-channel decoder, and obtaining a predicted image of the chrominance image by performing cross-channel prediction based on the reconstructed luminance image and the cross-channel prediction information,
obtaining a residual image of the chrominance image by applying the characteristic data of the chrominance image to a neural network-based chrominance residual decoder, and
reconstructing the chroma image based on a predicted image of the chroma image and a residual image of the chroma image.
12. A method of encoding an image based on cross-channel prediction using artificial intelligence AI, the method comprising:
obtaining feature data of a luminance image in a current original image by applying the original luminance image in the current original image to a neural network-based luminance encoder, and reconstructing the luminance image by applying the feature data of the luminance image to a neural network-based luminance decoder;
obtaining feature data for cross-channel prediction by applying the reconstructed luma image and the original chroma image in the current original image to a neural network-based cross-channel encoder;
obtaining cross-channel prediction information by applying the obtained feature data for cross-channel prediction to a neural network-based cross-channel decoder;
obtaining a predicted image of the chrominance image by performing cross-channel prediction based on the reconstructed luminance image and the cross-channel prediction information;
obtaining feature data of the chroma image by applying a residual image of the chroma image obtained based on the original chroma image and a predicted image of the chroma image to a neural network-based chroma residual encoder; and
generating a bitstream comprising the feature data of the luminance image, the feature data of the chrominance image, and the feature data for cross-channel prediction.
13. The method of claim 12, wherein at least one of the feature data for cross-channel prediction, the feature data of the luma image, or the feature data of the chroma image is quantized and entropy encoded.
14. The method of claim 12, wherein the neural network-based cross-channel encoder is trained based on first loss information and second loss information,
wherein the first loss information corresponds to a difference between a current training chromaticity image and a current reconstructed training chromaticity image corresponding to the current training chromaticity image; and
the second loss information corresponds to entropy of the feature data for cross-channel prediction of the current training chrominance image.
15. The method of claim 12, further comprising: when the chroma subsampling format of the current image is not YCbCr 4:4:4, downsampling the reconstructed luminance image,
wherein obtaining the predicted image of the chroma image comprises: a predicted image of the chroma image is obtained by performing cross-channel prediction based on the downsampled luma image and the cross-channel prediction information.
CN202280054487.4A 2021-08-06 2022-07-27 AI-based image encoding and decoding apparatus and method of performing the same Pending CN117837146A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0104201 2021-08-06
KR10-2021-0188870 2021-12-27
KR1020210188870A KR20230022085A (en) 2021-08-06 2021-12-27 Artificial intelligence based encoding apparatus and decoding apparatus of image, and method thereby
PCT/KR2022/011070 WO2023013966A1 (en) 2021-08-06 2022-07-27 Ai-based image encoding and decoding device, and method performed thereby

Publications (1)

Publication Number Publication Date
CN117837146A true CN117837146A (en) 2024-04-05

Family

ID=90504436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280054487.4A Pending CN117837146A (en) 2021-08-06 2022-07-27 AI-based image encoding and decoding apparatus and method of performing the same

Country Status (1)

Country Link
CN (1) CN117837146A (en)

Similar Documents

Publication Publication Date Title
US11622112B2 (en) Decomposition of residual data during signal encoding, decoding and reconstruction in a tiered hierarchy
KR101901355B1 (en) Method and apparatus for performing graph-based prediction using optimazation function
TWI816439B (en) Block-based prediction
US11432012B2 (en) Method and apparatus for encoding and decoding digital images or video streams
JP6409516B2 (en) Picture coding program, picture coding method, and picture coding apparatus
EP3085089B1 (en) Optimised video coding involving transform and spatial domain weighting
US9641847B2 (en) Method and device for classifying samples of an image
WO2017052174A1 (en) Method and apparatus for processing video signals using coefficient derivation prediction
US11863756B2 (en) Image encoding and decoding apparatus and method using artificial intelligence
CN117837146A (en) AI-based image encoding and decoding apparatus and method of performing the same
EP4354871A1 (en) Ai-based image encoding and decoding device, and method performed thereby
US20230044603A1 (en) Apparatus and method for applying artificial intelligence-based filtering to image
JP6557483B2 (en) Encoding apparatus, encoding system, and program
KR20230022085A (en) Artificial intelligence based encoding apparatus and decoding apparatus of image, and method thereby
US10051268B2 (en) Method for encoding, decoding video signal and device therefor
KR20240115147A (en) Image decoding method and apparatus, and image encoding method and apparatus
CN116868566A (en) AI-based image encoding and decoding apparatus and method thereof
CN116888961A (en) Apparatus for image encoding and decoding using AI and method for image encoding and decoding using the same
CN117882372A (en) Apparatus and method for AI-based filtering of images
CN118318248A (en) Image encoding apparatus and image decoding apparatus using AI, and method of encoding and decoding image by the same

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination