CN111800629A - Video decoding method, video encoding method, video decoder and video encoder

Info

Publication number
CN111800629A
CN111800629A (Application number CN201910279326.5A)
Authority
CN
China
Prior art keywords
video image
video
image
neural network
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910279326.5A
Other languages
Chinese (zh)
Inventor
周川
金慕淳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Korea Advanced Institute of Science and Technology KAIST
Original Assignee
Huawei Technologies Co Ltd
Korea Advanced Institute of Science and Technology KAIST
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd and Korea Advanced Institute of Science and Technology (KAIST)
Priority to CN201910279326.5A
Publication of CN111800629A
Legal status: Pending (current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/186 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103 Selection of coding mode or of prediction mode
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/157 Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N19/159 Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44 Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/625 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96 Tree coding, e.g. quad-tree coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Discrete Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application provides a video decoding method, a video encoding method, a video decoder and a video encoder. The video decoding method comprises the following steps: parsing the main coding stream and the auxiliary coding stream respectively to obtain a lower-resolution first video image and a residual value; performing super-resolution processing on the first video image by using a preset convolutional neural network to obtain a higher-resolution second video image; and finally obtaining the target video image according to the second video image and the residual value. In this method, the preset neural network model performs super-resolution processing on the lower-resolution first video image to obtain the higher-resolution second video image, and the final target video image is then obtained from the second video image and the residual value through a single superposition. Compared with the conventional scheme, in which the final video image can only be obtained through layer-by-layer superposition, the decoding flow can be simplified.

Description

Video decoding method, video encoding method, video decoder and video encoder
Technical Field
The present application relates to the field of video encoding and decoding technology, and more particularly, to a video decoding method, an encoding method, a video decoder and an encoder.
Background
Reconstruction video coding (RVC) is a technique for compression coding of video content. RVC encodes a video image into two coding streams, namely a main coding stream and an auxiliary coding stream. The main coding stream is a lower-resolution video stream that a player can decode and play directly; the auxiliary coding stream helps the player raise the resolution of the main coding stream to obtain a higher-resolution video image (for example, a 1080p video image can be obtained by parsing the main coding stream, and a 4K video image can then be obtained with the help of the auxiliary coding stream).
In the conventional RVC technology, during encoding the encoding end generally determines a plurality of video images at different resolution levels from the original video image, determines the residual of the video image at each resolution level relative to the original video image, and then encodes these residuals together with the video image at the lowest resolution level (the video image with the lowest resolution among the plurality of video images at different resolution levels) to generate a code stream. During decoding, the decoding end gradually obtains the final video image (whose resolution is the same as that of the original video image) by decoding the video image at the lowest resolution level and the residuals of the video images at the different resolution levels relative to the original video image.
When the conventional RVC technology is used to encode and decode video images, the encoding end needs to obtain video images at multiple resolution levels and encode the residual of each of them relative to the original video image, which involves many video images at different levels. In addition, during decoding, the decoding end needs to first obtain the video image at the lowest resolution level and then superimpose, layer by layer, the residuals of the video images at the different resolution levels relative to the original video image to obtain a video image with the same resolution as the original video image. The encoding and decoding processes are therefore complex.
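For illustration only, a minimal sketch of the layer-by-layer superposition described above, assuming numpy-style image arrays and a hypothetical `upsample` helper (e.g. bicubic interpolation):

    import numpy as np

    def layered_rvc_decode(lowest_res_image, residuals, upsample):
        """Conventional layered RVC decode: starting from the lowest-resolution
        layer, each level is up-sampled and the residual of that level is added,
        layer by layer, until the original resolution is reached.

        residuals: list of per-level residual arrays, lowest level first
        upsample:  hypothetical up-sampling function (e.g. bicubic interpolation)
        """
        image = lowest_res_image.astype(np.int64)
        for residual in residuals:            # one up-sampling + superposition per level
            image = upsample(image) + residual
        return np.clip(image, 0, 255).astype(np.uint8)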
Disclosure of Invention
The application provides a video encoding method, a video decoding method, a video encoder and a video decoder, which are used for reducing the complexity of encoding and decoding and improving the efficiency of encoding and decoding.
In a first aspect, a video decoding method is provided, which includes: decoding the main coding stream to obtain a first video image; performing super-resolution processing on the first video image by adopting a neural network model to obtain a second video image; decoding the auxiliary encoded stream to obtain residual values; and obtaining the target video image according to the second video image and the residual value.
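For illustration, a minimal Python sketch of this decoding flow, assuming hypothetical helpers: `decode_stream` stands in for a conventional video decoder (used here for both streams) and `sr_model` for the pre-trained neural network model:

    import numpy as np

    def decode_video_frame(main_stream, aux_stream, decode_stream, sr_model):
        """Sketch of the decoding flow of the first aspect (hypothetical helpers)."""
        # 1. Decode the main coding stream to obtain the first video image
        first_image = decode_stream(main_stream)          # lower resolution

        # 2. Super-resolution processing with the neural network model
        second_image = sr_model(first_image)              # higher resolution

        # 3. Decode the auxiliary coding stream to obtain the residual values
        residual = decode_stream(aux_stream)              # one value per pixel of second_image

        # 4. A single superposition of the second video image and the residual values
        target_image = np.clip(second_image.astype(np.int32) + residual, 0, 255).astype(np.uint8)
        return target_image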
The neural network model may be a neural network model trained in advance.
Optionally, the neural network model is trained based on a training sample image and a training target image, the training target image is a video image with the same resolution as that of the target video image, and the training sample image is a video image obtained by downsampling, encoding and decoding the training target image.
It should be understood that the training sample image is the input of the neural network model, and the training target image is the target toward which the neural network model is trained (the goal is for the image output by the neural network model during training to differ from the training target image as little as possible).
When training the above neural network model, a training sample image may be input into the neural network model, and the model parameters of the neural network model (mainly including the weight values of the multiple input parameters of each activation function in the neural network model) may be determined by comparing the difference between the output image of the neural network model and the training target image.
Specifically, a large number of training sample images can be used to train the neural network model: the difference between the output image of the neural network model and the training target image is calculated, and the parameter values of the model parameters at the point where this difference meets a preset requirement are taken as the final values of the model parameters, thereby completing the training of the neural network model. The trained neural network model can then be used to perform super-resolution processing on the first video image.
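As an illustration of the training procedure described above, a short PyTorch-style sketch is given below; the loss function (mean squared error), optimizer and hyper-parameters are assumptions rather than values specified in this application:

    import torch
    import torch.nn as nn

    def train_sr_model(model, loader, epochs=10, lr=1e-4):
        """Illustrative training loop. `loader` yields (sample, target) pairs:
        `sample` is a training sample image (the target after down-sampling,
        encoding and decoding) and `target` is the full-resolution training
        target image."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.MSELoss()  # measures the difference from the training target image
        model.train()
        for _ in range(epochs):
            for sample, target in loader:
                output = model(sample)            # super-resolved output image
                loss = criterion(output, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model  # parameters kept once the difference meets the preset requirement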
Optionally, the obtaining the target video image according to the second video image and the residual value includes: and superposing the pixel value of the second video image and the residual value to obtain a target video image.
It should be understood that the above-mentioned residual values include pixel values corresponding to each pixel point in the second video image. The superposition of the pixel value and the residual value of the second video image means that the pixel value of each pixel point of the second video image is added with the corresponding residual value to obtain the pixel value of the corresponding pixel point in the target video image.
In the present application, super-resolution processing can be performed well on the lower-resolution first video image through the neural network model to obtain the higher-resolution second video image, and the final target video image can then be obtained through a single superposition of the second video image and the residual value obtained by decoding. Compared with the conventional scheme, in which a low-resolution video image and residual values at multiple levels are superimposed layer by layer to obtain the final video image, the decoding process can be simplified.
In addition, the neural network model can be flexibly optimized and adjusted according to needs, and the flexibility of the scheme is higher.
With reference to the first aspect, in some implementation manners of the first aspect, when the residual value only includes a residual corresponding to a pixel luminance component value, the performing super-resolution processing on the first video image by using a neural network model to obtain a second video image includes: performing super-resolution processing on the pixel point brightness component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value of the second video image; the above superimposing the pixel value and the residual value of the second video image to obtain the target video image includes: and superposing the pixel point brightness component value of the second video image with the residual value to obtain the pixel point brightness component value of the target video image.
In the application, the neural network model only processes the pixel point brightness component value of the first video image, so that the calculated amount of the neural network model during super-resolution processing can be reduced, and the video decoding efficiency is improved.
In particular, the human eye is more sensitive to the luminance of an image relative to the chrominance of the image. Therefore, the neural network model can be adopted to perform super-resolution processing on the brightness component values of the pixel points of the first video image, and the chromaticity component values of the pixel points of the second video image can be obtained through calculation of the traditional interpolation algorithm, so that the visual experience can be guaranteed, and the calculation complexity of video decoding can be reduced.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: and carrying out interpolation processing on the pixel point chromaticity component values of the first video image to obtain the pixel point chromaticity component values of the target video image.
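A sketch of this luminance-only processing, assuming OpenCV's bicubic resize as the "conventional interpolation algorithm" and a hypothetical single-channel `sr_model`:

    import cv2
    import numpy as np

    def decode_luma_only(first_yuv, luma_residual, sr_model):
        """Sketch: the neural network up-scales only the Y component; U and V are
        obtained with conventional (bicubic) interpolation.

        first_yuv:     (H, W, 3) YUV frame decoded from the main coding stream
        luma_residual: residual values for the luminance component only
        sr_model:      hypothetical callable doing super-resolution on one channel
        """
        y, u, v = cv2.split(first_yuv)

        # Luminance: neural-network super-resolution, then add the decoded residual
        y_high = sr_model(y)
        y_target = np.clip(y_high.astype(np.int32) + luma_residual, 0, 255).astype(np.uint8)

        # Chrominance: conventional interpolation is considered sufficient because
        # the human eye is less sensitive to chroma than to luma
        dsize = (y_target.shape[1], y_target.shape[0])    # (width, height) for OpenCV
        u_target = cv2.resize(u, dsize, interpolation=cv2.INTER_CUBIC)
        v_target = cv2.resize(v, dsize, interpolation=cv2.INTER_CUBIC)
        return cv2.merge([y_target, u_target, v_target])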
Optionally, when the residual value includes a residual corresponding to a pixel luminance component value and a pixel chrominance component value, performing super-resolution processing on the first video image by using the neural network model to obtain a second video image, including: and performing super-resolution processing on the pixel point brightness component value and the pixel point chromaticity component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value and the pixel point chromaticity component value of the second video image.
It should be understood that, in the present application, the neural network model may be used to calculate the pixel luminance component value and the pixel chrominance component value of the second video image, or only the neural network model may be used to calculate the pixel luminance component value of the second video image, and other methods (such as an interpolation method) are used to obtain the pixel chrominance component value of the second video image.
The pixel point luminance component value may also be referred to simply as the luminance component value, and the pixel point chrominance component value as the chrominance component value. For convenience of description, the present application uses the terms pixel point luminance component value and pixel point chrominance component value throughout.
In addition to being represented in RGB format, images may also be represented in YUV format, where Y represents luminance (luma), that is, the gray value, and U and V represent chrominance (chroma), which describes the color information of the image.
The pixel luminance component value may be represented as Y, and the pixel chrominance component value may include U and V.
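For reference, one common RGB-to-YUV conversion (the BT.601 full-range YCbCr convention is assumed here for illustration; this application does not mandate a particular conversion matrix):

    import numpy as np

    def rgb_to_yuv(rgb):
        """Convert an 8-bit RGB image to YUV (BT.601 full-range YCbCr convention).

        Y carries luminance (the gray value); U and V carry chrominance.
        """
        rgb = rgb.astype(np.float32)
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b
        u = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
        v = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
        return np.clip(np.stack([y, u, v], axis=-1), 0, 255).astype(np.uint8)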
With reference to the first aspect, in certain implementations of the first aspect, the first video image is a high dynamic range HDR image or a standard dynamic range SDR image.
The decoding method is not only suitable for Standard Dynamic Range (SDR) images, but also suitable for High Dynamic Range (HDR) images.
In a second aspect, a video encoding method is provided, the method comprising: performing down-sampling and coding processing on the initial video image to obtain a main coding stream; decoding the main coding stream to obtain a first video image; performing super-resolution processing on the first video image by adopting a neural network model to obtain a second video image with the same resolution as the initial video image; determining a residual value of the initial video image relative to the second video image; and coding the residual error value to obtain an auxiliary coding stream.
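A minimal sketch of this encoding flow, mirroring the decoding sketch above; `downsample`, `encode_stream` and `decode_stream` are hypothetical stand-ins for the conventional processing blocks:

    import numpy as np

    def encode_video_frame(initial_image, downsample, encode_stream, decode_stream, sr_model):
        """Sketch of the encoding flow of the second aspect (hypothetical helpers)."""
        # 1. Down-sample the initial video image and encode it into the main coding stream
        main_stream = encode_stream(downsample(initial_image))

        # 2. Decode the main coding stream to obtain the first video image
        #    (this reproduces exactly what the decoding end will see)
        first_image = decode_stream(main_stream)

        # 3. Super-resolution processing: second video image, same resolution as initial_image
        second_image = sr_model(first_image)

        # 4. Residual of the initial video image relative to the second video image
        residual = initial_image.astype(np.int32) - second_image.astype(np.int32)

        # 5. Encode the residual values into the auxiliary coding stream
        aux_stream = encode_stream(residual)
        return main_stream, aux_stream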
Optionally, the neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as the initial video image, and the training sample image is a video image obtained by down-sampling, encoding and decoding the training target image.
It is to be understood that the video encoding method in the second aspect may be a method corresponding to the video decoding method in the first aspect, and the primary coded stream and the secondary coded stream decoded by the video decoding method in the first aspect may be generated by the video encoding method in the second aspect.
The target video image in the first aspect described above may be of the same resolution as the initial video image in the second aspect described above.
It will be appreciated that the image content of the target video image in the above-described first aspect substantially corresponds to the image content of the initial video image in the second aspect (the target video image may deviate somewhat from the initial video image due to some loss in the encoding and decoding processes).
The model parameters of the neural network model in the second aspect described above may be identical to the model parameters of the neural network model in the first aspect. The definition and explanation of the neural network model in the first aspect described above are equally applicable to the neural network model in the second aspect.
Optionally, the determining a residual value of the initial video image relative to the second video image includes: residual values of pixel values of the initial video image relative to pixel values of the second video image are determined.
Specifically, the residual value may be a difference between a pixel value of a pixel point in the initial video image and a pixel value of a corresponding pixel point in the second video image.
In the present application, after the main coding stream is generated, super-resolution processing can be performed, through the neural network model, on the video image obtained by decoding the main coding stream, so as to obtain a video image with the same resolution as the initial video image; the residual value of the initial video image relative to the video image obtained by super-resolution processing can then be generated directly, and the auxiliary coding stream is generated according to this residual value. Compared with the conventional coding scheme, in which multiple video images at different resolution levels and the residual values of the corresponding levels need to be obtained from the initial video image, the coding process can be simplified. In addition, compared with the prior art, in which video images at multiple different resolution levels need to be transmitted to the decoding end, the amount of data to be transmitted can also be reduced.
with reference to the second aspect, in some implementations of the second aspect, performing super-resolution processing on the first video image by using a neural network model to obtain a second video image with the same resolution as the initial video image includes: performing super-resolution processing on the pixel point brightness component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value of the second video image; determining residual values of the initial video image relative to the second video image, comprising: and determining the difference value of the pixel point brightness component value of the initial video image relative to the pixel point brightness component value of the second video image as a residual value.
In the application, the neural network model only processes the pixel point brightness component value of the first video image, so that the calculated amount of the neural network model during super-resolution processing can be reduced, and the video decoding efficiency is improved.
It is understood that the human eye is more sensitive to the luminance of the image than to the chrominance. Therefore, the neural network model can be adopted to perform super-resolution processing on the brightness component values of the pixel points of the first video image, and the chromaticity component values of the pixel points of the second video image can be obtained through calculation of the traditional interpolation algorithm, so that the visual experience can be guaranteed, and the calculation complexity of video decoding can be reduced.
Optionally, the performing super-resolution processing on the first video image by using the neural network model to obtain a second video image includes: and performing super-resolution processing on the pixel point brightness component value and the pixel point chromaticity component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value and the pixel point chromaticity component value of the second video image.
Optionally, determining a residual value of the initial video image relative to the second video image comprises: and determining the difference value of the pixel point brightness component value and the pixel point chromaticity component value of the initial video image relative to the difference value of the pixel point brightness component value and the pixel point chromaticity component value of the second video image as the residual value.
In the present application, the neural network model may be used to calculate both the pixel point luminance component value and the pixel point chrominance component value of the second video image, or the neural network model may be used only to calculate the pixel point luminance component value of the second video image, with other methods (such as an interpolation method) being used to obtain the pixel point chrominance component value of the second video image.
In combination with the second aspect, in certain implementations of the second aspect, the initial video image is a high dynamic range HDR image or a standard dynamic range SDR image.
In a third aspect, a video decoding method is provided, which includes: decoding the main coding stream to obtain a first video image, wherein the first video image is a standard dynamic range SDR video image; processing the first video image by adopting a neural network model to obtain a second video image, wherein the second video image is a High Dynamic Range (HDR) video image, and the resolution of the second video image is greater than that of the first video image; decoding the auxiliary encoded stream to obtain residual values; and superposing the pixel value and the residual value of the second video image to obtain a target video image.
The neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as that of the target video image, and the training sample image is a video image obtained by performing down-sampling, encoding and decoding on the training target image.
The neural network model in the third aspect is similar to the neural network model in the first aspect, and the definitions and explanations regarding the neural network in the first aspect also apply to the neural network model in the third aspect.
In the present application, after the decoded SDR video image is obtained, it is processed with the neural network model, so that an HDR video image with higher resolution can be obtained; next, the final target video image can be obtained through a single superposition of the second video image and the residual value obtained by decoding. The present application is therefore also applicable to devices that only support encoding and decoding of SDR video images.
In addition, compared with the mode that the video image with low resolution and the residual values of multiple levels are overlaid layer by layer to obtain the final video image in the traditional scheme, the method and the device can simplify the decoding process. The neural network model can be flexibly optimized and adjusted according to needs, and the flexibility of the scheme is higher.
With reference to the third aspect, in some implementations of the third aspect, processing the first video image by using a neural network model to obtain a second video image includes: and performing super-resolution processing and reverse tone mapping processing on the first video image by adopting a neural network model to obtain a second video image.
The super-resolution processing is used for improving the resolution of the first video image and obtaining a video image with higher resolution, and the reverse tone mapping processing is used for improving the pixel precision of the video image with higher resolution to obtain a second video image.
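A toy PyTorch sketch of a network combining the two operations; the layer sizes and the PixelShuffle upsampling are assumptions for illustration only, not the network architecture of this application:

    import torch
    import torch.nn as nn

    class SRInverseToneMapNet(nn.Module):
        """Toy sketch: a first stage raises the spatial resolution, and the
        following layers act as learned inverse tone mapping toward the HDR range.
        Input is assumed to be a normalized luma tensor of shape (N, 1, H, W)."""

        def __init__(self, scale=2):
            super().__init__()
            self.super_resolution = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 64 * scale * scale, 3, padding=1),
                nn.PixelShuffle(scale),                 # raises spatial resolution
            )
            self.inverse_tone_mapping = nn.Sequential(
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 1, 3, padding=1),         # higher-precision luma output
            )

        def forward(self, sdr_luma):
            feats = self.super_resolution(sdr_luma)
            return self.inverse_tone_mapping(feats)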
With reference to the third aspect, in some implementation manners of the third aspect, when the residual value only includes a residual corresponding to a pixel luminance component value, processing the first video image by using a neural network model to obtain a second video image, includes: processing the pixel brightness component value of the first video image by adopting a neural network model to obtain the pixel brightness component value of the second video image; superposing the pixel value and the residual value of the second video image to obtain a target video image, comprising: and superposing the pixel point brightness component value of the second video image with the residual value to obtain the pixel point brightness component value of the target video image.
In the application, the neural network model only processes the pixel point brightness component value of the first video image, so that the calculated amount of the neural network model during super-resolution processing can be reduced, and the video decoding efficiency is improved.
It is understood that the human eye is more sensitive to the luminance of the image than to the chrominance. Therefore, the neural network model can be adopted to perform super-resolution processing on the brightness component values of the pixel points of the first video image, and the chromaticity component values of the pixel points of the second video image can be obtained through calculation of the traditional interpolation algorithm, so that the visual experience can be guaranteed, and the calculation complexity of video decoding can be reduced.
With reference to the third aspect, in certain implementations of the third aspect, the method further includes: and carrying out interpolation processing on the pixel point chromaticity component values of the first video image to obtain the pixel point chromaticity component values of the target video image.
Optionally, when the residual value includes a residual corresponding to a pixel luminance component value and a pixel chrominance component value, performing super-resolution processing on the first video image by using the neural network model to obtain a second video image, including: and performing super-resolution processing on the pixel point brightness component value and the pixel point chromaticity component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value and the pixel point chromaticity component value of the second video image.
In the present application, the neural network model may be used to calculate both the pixel point luminance component value and the pixel point chrominance component value of the second video image, or the neural network model may be used only to calculate the pixel point luminance component value of the second video image, with other methods (such as an interpolation method) being used to obtain the pixel point chrominance component value of the second video image.
In a fourth aspect, a video encoding method is provided, the method comprising: processing the initial video image to obtain a processed video image, wherein the initial video image is a high dynamic range HDR video image, and the processed video image is a standard dynamic range SDR video image; coding the processed video image to obtain a main coding stream; decoding the main coding stream to obtain a first video image; processing the first video image by adopting a neural network model to obtain a second video image, wherein the second video image is an HDR video image, and the resolution of the second video image is the same as that of the initial video image; determining a residual value of the initial video image relative to the second video image; and coding the residual error value to obtain an auxiliary coding stream.
The neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as the initial video image, and the training sample image is a video image obtained by performing down-sampling, encoding and decoding on the training target image.
It is to be understood that the video encoding method in the fourth aspect may be a method corresponding to the video decoding method in the third aspect, and the main coding stream and the auxiliary coding stream decoded by the video decoding method in the third aspect described above may be generated by the video encoding method in the fourth aspect described above.
The target video image in the third aspect described above may be of the same resolution as the initial video image in the fourth aspect described above.
The model parameters of the neural network model in the fourth aspect described above may be identical to the model parameters of the neural network model in the third aspect. The definition and explanation of the neural network model in the third aspect described above are equally applicable to the neural network model in the fourth aspect.
Optionally, the determining a residual value of the initial video image relative to the second video image includes: residual values of pixel values of the initial video image relative to pixel values of the second video image are determined.
Specifically, the residual value may be a difference between a pixel value of a pixel point in the initial video image and a pixel value of a corresponding pixel point in the second video image.
In the present application, an HDR image can first be converted into an SDR image, and the subsequent encoding processing is then carried out, so that the present application is also applicable to devices that only support encoding SDR images.
Furthermore, after the main coding stream is generated, super-resolution processing can be performed on a video image obtained by decoding the main coding stream through the neural network model, so that a video image with the same resolution as that of the initial video image is obtained, then, a residual value of the initial video image relative to the video image obtained by the super-resolution processing can be directly generated, and an auxiliary coding stream is generated according to the residual value.
With reference to the fourth aspect, in some implementations of the fourth aspect, the processing the initial video image by using a neural network model to obtain a processed video image includes: and performing downsampling and tone mapping processing on the initial video image by adopting a neural network model to obtain a processed video image.
The down-sampling is used to reduce the resolution of the initial video image to obtain a lower-resolution video image, and the tone mapping processing is used to map the lower-resolution video image from the high dynamic range to the standard dynamic range (adjusting the pixel precision), so as to obtain the processed video image.
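For illustration only, a stand-in sketch of the two operations using conventional processing (block averaging for down-sampling and a Reinhard-style global operator for tone mapping); in this application the processing is performed by a neural network model:

    import numpy as np

    def downsample_and_tonemap(hdr_image, scale=2):
        """Illustrative stand-in only. hdr_image holds linear HDR values;
        the output is an 8-bit SDR image at reduced resolution."""
        h, w = hdr_image.shape[:2]
        # Down-sampling by averaging scale x scale blocks (reduces resolution)
        low = hdr_image[:h - h % scale, :w - w % scale]
        low = low.reshape(h // scale, scale, w // scale, scale, -1).mean(axis=(1, 3))

        # Tone mapping: compress the high dynamic range into the SDR range
        sdr = low / (1.0 + low)                       # Reinhard: L / (1 + L)
        return np.clip(sdr * 255.0, 0, 255).astype(np.uint8)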
With reference to the fourth aspect, in some implementations of the fourth aspect, processing the first video image by using a neural network model to obtain a second video image includes: and performing super-resolution processing and reverse tone mapping processing on the first video image by adopting a neural network model to obtain a second video image.
The super-resolution processing is used for improving the resolution of the first video image and obtaining a video image with higher resolution, and the reverse tone mapping processing is used for improving the pixel precision of the video image with higher resolution to obtain a second video image.
With reference to the fourth aspect, in some implementations of the fourth aspect, processing the first video image by using a neural network model to obtain a second video image includes: processing the pixel brightness component value of the first video image by adopting a neural network model to obtain the pixel brightness component value of the second video image; determining residual values of the initial video image relative to the second video image, comprising: and determining the difference value of the pixel point brightness component value of the initial video image relative to the pixel point brightness component value of the second video image as the residual value.
Optionally, the processing the first video image by using the neural network model to obtain a second video image includes: and processing the pixel point brightness component value and the pixel point chroma component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value and the pixel point chroma component value of the second video image.
The processing of the pixel luminance component value and the pixel chrominance component value of the first video image by using the neural network model specifically includes: and performing super-resolution processing and reverse tone mapping processing on the pixel point brightness component values and the pixel point chroma component values of the first video image by adopting a neural network model.
Optionally, determining a residual value of the initial video image relative to the second video image comprises: and determining the difference value of the pixel point brightness component value and the pixel point chromaticity component value of the initial video image relative to the difference value of the pixel point brightness component value and the pixel point chromaticity component value of the second video image as the residual value.
In the present application, the neural network model may be used to calculate both the pixel point luminance component value and the pixel point chrominance component value of the second video image, or the neural network model may be used only to calculate the pixel point luminance component value of the second video image, with other methods (such as an interpolation method) being used to obtain the pixel point chrominance component value of the second video image.
In a fifth aspect, there is provided an apparatus for decoding video data, the apparatus comprising: the receiver is used for receiving the input of a main coding stream and an auxiliary coding stream of video data and inputting the main coding stream and the auxiliary coding stream into a video decoder for decoding; a video decoder for carrying out part or all of the steps of any one of the methods of the first or third aspects.
In a sixth aspect, there is provided an apparatus for encoding video data, the apparatus comprising: a receiver for receiving an initial video image and inputting the initial video image to a video encoder for encoding; a video encoder for carrying out part or all of the steps of any one of the methods of the second or fourth aspects.
In a seventh aspect, an embodiment of the present application provides an apparatus for decoding video data, including: a memory and a processor that invokes program code stored in the memory to perform some or all of the steps of any of the methods of the first or third aspects.
In an eighth aspect, an embodiment of the present application provides an apparatus for encoding video data, including: a memory and a processor that invokes program code stored in the memory to perform some or all of the steps of any of the methods of the second or fourth aspects.
Optionally, the memory is a non-volatile memory.
Optionally, the memory and the processor are coupled to each other.
In a ninth aspect, embodiments of the present application provide a computer-readable storage medium storing program code, where the program code includes instructions for performing some or all of the steps of the method in any one of the first, second, third and fourth aspects.
In a tenth aspect, embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to perform some or all of the steps of the method in any one of the first, second, third and fourth aspects.
Drawings
FIG. 1 is a schematic block diagram of an example of a video encoding and decoding system for implementing embodiments of the present application;
FIG. 2 is a block schematic diagram of an example video encoder for implementing embodiments of the present application;
FIG. 3 is a block schematic diagram of an example video decoder for implementing embodiments of the present application;
FIG. 4 is a block schematic diagram of an example video processing system for implementing embodiments of the present application;
FIG. 5 is a block schematic diagram of an example video processing device for implementing embodiments of the present application;
FIG. 6 is a schematic block diagram of an example of an encoding apparatus or a decoding apparatus for implementing embodiments of the present application;
fig. 7 is a schematic diagram of a flow of video encoding and video decoding of an embodiment of the present application;
fig. 8 is a schematic flow chart of a video decoding method of an embodiment of the present application;
fig. 9 is a flowchart of a video decoding method according to an embodiment of the present application;
FIG. 10 is a schematic flow chart diagram of training a neural network model according to an embodiment of the present application;
FIG. 11 is a schematic block diagram of a training system for training a neural network model in an embodiment of the present application;
FIG. 12 is a schematic diagram of processing a first video image using a neural network model according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
fig. 14 is a schematic flow chart of a video encoding method of an embodiment of the present application;
fig. 15 is a schematic flow chart of a video encoding method of an embodiment of the present application;
fig. 16 is a schematic diagram of a flow of video encoding and video decoding of an embodiment of the present application;
fig. 17 is a schematic flow chart of a video decoding method of an embodiment of the present application;
fig. 18 is a schematic flow chart of a video encoding method of an embodiment of the present application;
fig. 19 is a flowchart of a video decoding method according to an embodiment of the present application;
fig. 20 is a flowchart of a video encoding method according to an embodiment of the present application;
fig. 21 is a schematic diagram of a flow of video encoding and video decoding of an embodiment of the present application;
FIG. 22 is a schematic block diagram of a video decoder of an embodiment of the present application;
FIG. 23 is a schematic block diagram of a video encoder of an embodiment of the present application;
FIG. 24 is a schematic block diagram of a video decoder of an embodiment of the present application;
fig. 25 is a schematic block diagram of a video encoder of an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
The embodiments described herein may be applied not only in the context described herein, but may also be used in other respects, and may include structural or logical changes not depicted in the drawings. The detailed description herein is not to be taken in a limiting sense, and the scope of the present application is defined by the appended claims.
The disclosure of the methods described herein may equally apply to the corresponding apparatus or system performing the method, and vice versa. In particular, in the present application, if respective steps of a method are described, the corresponding device may comprise units or modules for performing the respective steps of the method, even if the corresponding device is not explicitly described or illustrated in the drawings as comprising the respective units or modules.
The video encoding and decoding related to the embodiments of the present application can be applied not only to existing video coding standards (e.g., H.264, High Efficiency Video Coding (HEVC), etc.), but also to future video coding standards (e.g., the H.266 standard). The terminology used in the description of the embodiments of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.
In order to better understand the scheme of the embodiments of the present application, a brief description of some basic concepts that may be involved in video encoding and video decoding is provided below.
Video coding generally refers to processing a sequence of pictures that form a video or video sequence. In the field of video coding, the terms "picture", "frame" or "image" may be used as synonyms. Video encoding is performed on the source side, typically including processing (e.g., by compressing) the original video picture to reduce the amount of data required to represent the video picture for more efficient storage and/or transmission. Video decoding is performed at the destination side, typically involving inverse processing with respect to the encoder, to reconstruct the video pictures. In the embodiment of the present application, the combination of the encoding portion and the decoding portion is also referred to as codec (encoding and decoding).
A video sequence comprises a series of images (pictures), each of which can be further divided into slices, and each slice can be further divided into blocks. In video coding, the coding process is generally performed in units of blocks, and some newer video coding standards further extend the concept of a block. For example, the H.264 standard has the macroblock (MB), which may be further divided into multiple prediction blocks (partitions) usable for predictive coding. The High Efficiency Video Coding (HEVC) standard uses basic concepts such as the coding unit (CU), the prediction unit (PU) and the transform unit (TU), and these basic units may be further divided based on a tree structure. For example, a CU may be divided into smaller CUs according to a quadtree, and the smaller CUs may be further divided, forming a quadtree structure; the CU is the basic unit for partitioning and encoding the image to be coded. A PU and a TU have similar tree structures. A PU may correspond to a prediction block and is the basic unit of predictive coding; a CU is further partitioned into PUs according to a partitioning pattern. A TU may correspond to a transform block and is the basic unit for transforming the prediction residual. In essence, however, CU, PU and TU are all concepts of blocks (or image blocks).
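A small sketch of the quad-tree partitioning idea; `needs_split` is a hypothetical decision function (in a real encoder the decision would come from rate-distortion optimisation):

    def quadtree_split(x, y, size, min_size, needs_split):
        """Recursive quad-tree partition of a coding unit (CU).
        Returns a list of (x, y, size) leaf CUs."""
        if size <= min_size or not needs_split(x, y, size):
            return [(x, y, size)]
        half = size // 2
        leaves = []
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(quadtree_split(x + dx, y + dy, half, min_size, needs_split))
        return leaves

    # Example: split a 64x64 block whenever it is larger than 32x32
    blocks = quadtree_split(0, 0, 64, 8, lambda x, y, s: s > 32)
    # -> four 32x32 leaf CUs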
In order to better understand the application scenarios of the video encoding method and the video decoding method of the embodiments of the present application, a system architecture of a possible application of the embodiments of the present application is described below with reference to fig. 1.
Fig. 1 is a schematic block diagram of an example of a video encoding and decoding system for implementing embodiments of the present application.
As shown in fig. 1, a video encoding and decoding system 10 includes a source device 12 and a destination device 14, the source device 12 may output encoded video data, and the source device 12 may be referred to as a video encoding apparatus. Destination device 14 may decode the encoded video data output by source device 12, and destination device 14 may be referred to as a video decoding apparatus.
Source device 12, destination device 14 may contain one or more processors and memory coupled to the one or more processors. The memory can include, but is not limited to, read-only memory (ROM), Random Access Memory (RAM), erasable programmable read-only memory (EPROM), flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures that can be accessed by a computer, as described herein.
Source apparatus 12 and destination apparatus 14 may comprise a variety of devices, including desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, televisions, cameras, display devices, digital media players, video game consoles, on-board computers, wireless communication devices, or the like.
As shown in fig. 1, the source device 12 and the destination device 14 may be separate devices (in this case, the video encoding and decoding system 10 is composed of two separate devices, the source device 12 and the destination device 14), or the source device 12 and the destination device 14 may be only part of the video encoding and decoding system 10 (in this case, the video encoding and decoding system 10 is a single device, and the source device 12 and the destination device 14 constitute different modules of the device).
The source device 12 and the destination device 14 may be communicatively coupled via a link 13, and the destination device 14 may receive encoded video data from the source device 12 via the link 13. Link 13 may include one or more media or devices capable of moving encoded video data from source device 12 to destination device 14. In one example, link 13 may include one or more communication media that enable source device 12 to transmit encoded video data directly to destination device 14 in real-time. In this example, source apparatus 12 may modulate the encoded video data according to a communication standard, such as a wireless communication protocol, and may transmit the modulated video data to destination apparatus 14.
The one or more communication media may include wireless and/or wired communication media such as a Radio Frequency (RF) spectrum or one or more physical transmission lines. The one or more communication media described above may form part of a packet-based network, such as a local area network, a wide area network, or a global network (e.g., the internet). The one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from source device 12 to destination device 14.
Source device 12 may include an encoder 20. Optionally, source device 12 may also include a picture source 16, a picture preprocessor 18, and a communication interface 22. In particular implementations, encoder 20, picture source 16, picture preprocessor 18, and communication interface 22 may be hardware components within source device 12 or may be software programs within source device 12. The respective modules or units in the source device 12 are described below.
The picture source 16 may include (or be) any type of picture capture device. The picture capture device here may be a device for capturing a real-world picture, and/or any kind of device for generating a picture or image (for screen content encoding, text on the screen is also considered part of the picture or image to be encoded), for example a computer graphics processor for generating a computer animation picture, or any kind of device for acquiring and/or providing a real-world picture, a computer animation picture (e.g., screen content, a virtual reality (VR) picture), and/or any combination thereof (e.g., an augmented reality (AR) picture).
Picture pre-processor 18 is configured to receive original picture data 17 and perform pre-processing on original picture data 17 to obtain pre-processed picture 19 or pre-processed picture data 19. For example, the pre-processing performed by picture pre-processor 18 may include trimming, color format conversion (e.g., from RGB format to YUV format), toning, or de-noising.
An encoder 20 (or referred to as a video encoder 20) is configured to receive the pre-processed picture data 19, and encode the pre-processed picture data 19 using an associated prediction mode (e.g., inter prediction mode, intra prediction mode), so as to obtain encoded picture data 21.
In some embodiments, the encoder 20 may be configured to perform the steps of the encoding method described hereinafter to implement the application of the video encoding method described herein on the encoding side.
A communication interface 22, which may be used to receive encoded picture data 21 and may transmit encoded picture data 21 over link 13 to destination device 14 or any other device (e.g., memory), which may be any device for decoding or storage. The communication interface 22 may, for example, be used to encapsulate the encoded picture data 21 into a suitable format, such as a data packet, for transmission over the link 13.
The destination device 14 includes a decoder 30. Optionally, destination device 14 may also include a communication interface 28, a picture post-processor 32, and a display device 34. The various modules or units in the destination device 14 are described below.
Communication interface 28 may be used to receive encoded picture data 21 from source device 12 or any other source, such as a storage device, such as an encoded picture data storage device. The communication interface 28 may be used to transmit or receive the encoded picture data 21 over a link 13 between the source device 12 and the destination device 14, or over any type of network, such as a direct wired or wireless connection, any type of network, such as a wired or wireless network or any combination thereof, or any type of private and public networks, or any combination thereof. The communication interface 28 may, for example, be used to decapsulate data packets transmitted by the communication interface 22 to obtain encoded picture data 21.
Both communication interface 28 and communication interface 22 may be configured as a one-way communication interface or a two-way communication interface, and may be used, for example, to send and receive messages to establish a connection, acknowledge and exchange any other information related to a communication link and/or data transfer, such as an encoded picture data transfer.
A decoder 30 (or referred to as video decoder 30) is configured to receive the encoded picture data 21 and provide decoded picture data 31 or a decoded picture 31 (structural details of the decoder 30 are described further below based on fig. 3, fig. 4, or fig. 5). In some embodiments, the decoder 30 may be configured to perform the various embodiments described hereinafter to implement the application of the video decoding method described herein on the decoding side.
A picture post-processor 32 is configured to perform post-processing on the decoded picture data 31 (also referred to as reconstructed picture data) to obtain post-processed picture data 33. The post-processing performed by the picture post-processor 32 may include color format conversion (e.g., from YUV format to RGB format), toning, trimming, resampling, or any other processing, for example to prepare the picture data 33 for transmission to the display device 34.
A display device 34 for receiving picture data 33 for displaying pictures to, for example, a user or viewer. Display device 34 may be or may include any type of display for presenting the reconstructed picture, such as an integrated or external display or monitor. In addition, the display may include a Liquid Crystal Display (LCD), an Organic Light Emitting Diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS), a Digital Light Processor (DLP), or any other display of any kind.
It will be apparent to those skilled in the art from this description that the existence and (exact) division of the functionality of the different units within source device 12 and/or destination device 14 shown in fig. 1 may vary depending on the actual device and application. Source device 12 and destination device 14 may comprise any of a wide variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, a mobile phone, a smartphone, a tablet computer, a desktop computer, a set-top box, a television, a camera, a vehicle-mounted device, a display device, a digital media player, a video game console, a video streaming device (e.g., a content service server or a content distribution server), a broadcast receiving device, or a broadcast transmitting device, and may use any type of operating system or none at all.
Both encoder 20 and decoder 30 may be implemented as any of a variety of suitable circuits, such as one or more microprocessors, Digital Signal Processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combinations thereof. If the techniques are implemented in part in software, the device may store instructions of the software in a suitable non-transitory computer-readable storage medium and may execute the instructions in hardware using one or more processors to perform the techniques of this application. Any of the foregoing, including hardware, software, a combination of hardware and software, etc., may be considered as one or more processors.
In some cases, the video encoding and decoding system 10 shown in fig. 1 is merely an example, and the techniques of this application may be applicable to video encoding settings (e.g., video encoding or video decoding) that do not necessarily involve any data communication between the encoding and decoding devices. In other examples, the data may be retrieved from local storage, streamed over a network, and so on. A video encoding device may encode and store data to a memory, and/or a video decoding device may retrieve and decode data from a memory. In some examples, the encoding and decoding are performed by devices that do not communicate with each other, but merely encode data to and/or retrieve data from memory and decode data.
Fig. 2 is a schematic block diagram of an example video encoder for implementing embodiments of the present application.
The encoder 20 shown in fig. 2 may be referred to as a hybrid video encoder or a video encoder according to a hybrid video codec. In the example of fig. 2, encoder 20 includes a residual calculation unit 204, a transform processing unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transform processing unit 212, a reconstruction unit 214, a buffer 216, a loop filter unit 220, a Decoded Picture Buffer (DPB) 230, a prediction processing unit 260, and an entropy encoding unit 270.
The prediction processing unit 260 may include an inter prediction unit 244, an intra prediction unit 254, and a mode selection unit 262. The inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown in fig. 2).
The encoder 20 receives, e.g., via an input 202, a picture 201 or an image block 203 of a picture 201, e.g., a picture in a sequence of pictures forming a video or a video sequence. The image blocks 203 may also be referred to as current picture blocks or picture blocks to be encoded, and the picture 201 may be referred to as current picture or picture to be encoded (especially when the current picture is distinguished from other pictures in video encoding).
An embodiment of the encoder 20 may comprise a partitioning unit (not shown in fig. 2) for partitioning the picture 201 into a plurality of blocks, e.g. image blocks 203, typically into a plurality of non-overlapping blocks. The partitioning unit may be used to use the same block size for all pictures in a video sequence and a corresponding grid defining the block size, or to alter the block size between pictures or subsets or groups of pictures and partition each picture into corresponding blocks.
In one example, prediction processing unit 260 of encoder 20 may be used to perform any combination of the above-described segmentation techniques.
The residual calculation unit 204 is configured to calculate a residual block 205 based on the picture image block 203 and the prediction block 265, e.g., by subtracting sample values of the prediction block 265 from sample values of the picture image block 203 sample by sample (pixel by pixel), to obtain the residual block 205 in the sample domain.
The transform processing unit 206 is configured to apply a transform, such as a Discrete Cosine Transform (DCT) or a Discrete Sine Transform (DST), on the sample values of the residual block 205 to obtain transform coefficients 207 in a transform domain. The transform coefficients 207 may also be referred to as transform residual coefficients and represent the residual block 205 in the transform domain.
The transform processing unit 206 may be used to apply integer approximations of the DCT/DST, such as the transforms specified for HEVC/H.265. Compared with an orthogonal DCT transform, such integer approximations are typically scaled by a certain factor. To preserve the norm of the residual block processed by the forward and inverse transforms, an additional scaling factor is applied as part of the transform process. The scaling factor is typically chosen based on certain constraints, for example, being a power of 2 to allow shift operations, the bit depth of the transform coefficients, and a trade-off between accuracy and implementation cost. For example, a specific scaling factor may be specified for the inverse transform on the decoder 30 side (and for the corresponding inverse transform on the encoder 20 side, e.g., by inverse transform processing unit 212), and correspondingly, a corresponding scaling factor may be specified for the forward transform on the encoder 20 side by transform processing unit 206.
Quantization unit 208 is used to quantize transform coefficients 207, e.g., by applying scalar quantization or vector quantization, to obtain quantized transform coefficients 209. Quantized transform coefficients 209 may also be referred to as quantized residual coefficients 209. The quantization process may reduce the bit depth associated with some or all of transform coefficients 207. For example, an n-bit transform coefficient may be rounded down to an m-bit transform coefficient during quantization, where n is greater than m. The quantization level may be modified by adjusting a Quantization Parameter (QP). For example, for scalar quantization, different scales may be applied to achieve finer or coarser quantization. Smaller quantization steps correspond to finer quantization and larger quantization steps correspond to coarser quantization. An appropriate quantization step size may be indicated by a Quantization Parameter (QP). For example, the quantization parameter may be an index of a predefined set of suitable quantization step sizes.
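By way of illustration (this is not the HEVC/H.265 QP-to-step-size mapping, only a hedged sketch of scalar quantization), the following lines show how a larger quantization step discards more precision:

```python
import numpy as np

def quantize(coeffs, step):
    # Scalar quantization: divide by the step size and round to the nearest level.
    return np.round(coeffs / step).astype(np.int32)

def dequantize(levels, step):
    # Inverse quantization: multiply the levels back by the step size.
    return levels.astype(np.float32) * step

coeffs = np.array([103.7, -12.4, 0.8, 45.0], dtype=np.float32)
for step in (2.0, 16.0):                  # a smaller step gives finer quantization
    recon = dequantize(quantize(coeffs, step), step)
    print(step, recon)                    # the larger step loses more precision
```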
The inverse quantization unit 210 is configured to apply the inverse quantization of the quantization unit 208 on the quantized coefficients to obtain dequantized coefficients 211, e.g., to apply the inverse of the quantization scheme applied by the quantization unit 208, based on or using the same quantization step as the quantization unit 208. The dequantized coefficients 211 may also be referred to as dequantized residual coefficients 211; they correspond to the transform coefficients 207, although they are typically not identical to the transform coefficients because of the loss introduced by quantization.
The inverse transform processing unit 212 is configured to apply an inverse transform of the transform applied by the transform processing unit 206, for example, an inverse Discrete Cosine Transform (DCT) or an inverse Discrete Sine Transform (DST), to obtain an inverse transform block 213 in the sample domain. The inverse transform block 213 may also be referred to as an inverse transform dequantized block 213 or an inverse transform residual block 213.
The reconstruction unit 214 (e.g., summer 214) is used to add the inverse transform block 213 (i.e., the reconstructed residual block 213) to the prediction block 265 to obtain the reconstructed block 215 in the sample domain, e.g., to add sample values of the reconstructed residual block 213 to sample values of the prediction block 265.
The loop filter unit 220 (or simply "loop filter" 220) is used to filter the reconstructed block 215 to obtain a filtered block 221, so as to facilitate pixel transition or improve video quality. Loop filter unit 220 is intended to represent one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or other filters, such as a bilateral filter, an Adaptive Loop Filter (ALF), or a sharpening or smoothing filter, or a collaborative filter. Although loop filter unit 220 is shown in fig. 2 as an in-loop filter, in other configurations, loop filter unit 220 may be implemented as a post-loop filter. The filtered block 221 may also be referred to as a filtered reconstructed block 221. The decoded picture buffer 230 may store the reconstructed encoded block after the loop filter unit 220 performs a filtering operation on the reconstructed encoded block.
Embodiments of encoder 20 (correspondingly, loop filter unit 220) may be configured to output loop filter parameters (e.g., sample adaptive offset information), e.g., directly or after entropy encoding by entropy encoding unit 270 or any other entropy encoding unit, e.g., such that decoder 30 may receive and apply the same loop filter parameters for decoding.
Decoded Picture Buffer (DPB) 230 may be a reference picture memory that stores reference picture data for use by encoder 20 in encoding video data. DPB 230 may be formed from any of a variety of memory devices, such as Dynamic Random Access Memory (DRAM) including Synchronous DRAM (SDRAM), Magnetoresistive RAM (MRAM), Resistive RAM (RRAM), or other types of memory devices. The DPB 230 and the buffer 216 may be provided by the same memory device or separate memory devices. In a certain example, a Decoded Picture Buffer (DPB) 230 is used to store filtered blocks 221. Decoded picture buffer 230 may further be used to store other previous filtered blocks, such as previous reconstructed and filtered blocks 221, of the same current picture or of a different picture, such as a previous reconstructed picture, and may provide the complete previous reconstructed, i.e., decoded picture (and corresponding reference blocks and samples) and/or the partially reconstructed current picture (and corresponding reference blocks and samples), e.g., for inter prediction. In a certain example, if reconstructed block 215 is reconstructed without in-loop filtering, Decoded Picture Buffer (DPB) 230 is used to store reconstructed block 215.
Prediction processing unit 260, also referred to as block prediction processing unit 260, is used to receive or obtain image block 203 (current image block 203 of current picture 201) and reconstructed picture data, e.g., reference samples of the same (current) picture from buffer 216 and/or reference picture data 231 of one or more previously decoded pictures from decoded picture buffer 230, and to process such data for prediction, i.e., to provide prediction block 265, which may be inter-predicted block 245 or intra-predicted block 255.
The mode selection unit 262 may be used to select a prediction mode (e.g., intra or inter prediction mode) and/or a corresponding prediction block 245 or 255 used as the prediction block 265 to calculate the residual block 205 and reconstruct the reconstructed block 215.
Embodiments of mode selection unit 262 may be used to select prediction modes (e.g., from those supported by prediction processing unit 260) that provide the best match or the smallest residual (smallest residual means better compression in transmission or storage), or that provide the smallest signaling overhead (smallest signaling overhead means better compression in transmission or storage), or both. The mode selection unit 262 may be configured to determine a prediction mode based on Rate Distortion Optimization (RDO), i.e., select a prediction mode that provides the minimum rate distortion optimization, or select a prediction mode in which the associated rate distortion at least meets the prediction mode selection criteria.
Fig. 3 is a schematic block diagram of an example video decoder for implementing embodiments of the present application.
As shown in fig. 3, video decoder 30 is to receive encoded picture data 21 (e.g., an encoded bitstream), e.g., encoded by encoder 20, to obtain a decoded picture 231. During the decoding process, video decoder 30 receives video data, such as an encoded video bitstream representing picture blocks of an encoded video slice and associated syntax elements, from video encoder 20.
In the example of fig. 3, decoder 30 includes entropy decoding unit 304, inverse quantization unit 310, inverse transform processing unit 312, reconstruction unit 314 (e.g., summer 314), buffer 316, loop filter 320, decoded picture buffer 330, and prediction processing unit 360. The prediction processing unit 360 may include an inter prediction unit 344, an intra prediction unit 354, and a mode selection unit 362. In some examples, video decoder 30 may perform a decoding pass that is substantially reciprocal to the encoding pass described with reference to video encoder 20 of fig. 2.
Entropy decoding unit 304 is to perform entropy decoding on encoded picture data 21 to obtain, for example, quantized coefficients 309 and/or decoded encoding parameters (not shown in fig. 3), e.g., any or all of inter prediction, intra prediction parameters, loop filter parameters, and/or other syntax elements (decoded). The entropy decoding unit 304 is further for forwarding the inter-prediction parameters, the intra-prediction parameters, and/or other syntax elements to the prediction processing unit 360. Video decoder 30 may receive syntax elements at the video slice level and/or the video block level.
Inverse quantization unit 310 may be functionally identical to inverse quantization unit 210, inverse transform processing unit 312 may be functionally identical to inverse transform processing unit 212, reconstruction unit 314 may be functionally identical to reconstruction unit 214, buffer 316 may be functionally identical to buffer 216, loop filter 320 may be functionally identical to loop filter 220, and decoded picture buffer 330 may be functionally identical to decoded picture buffer 230.
Prediction processing unit 360 may include inter prediction unit 344 and intra prediction unit 354, where inter prediction unit 344 may be functionally similar to inter prediction unit 244 and intra prediction unit 354 may be functionally similar to intra prediction unit 254. The prediction processing unit 360 is typically used to perform block prediction and/or to obtain a prediction block 365 from the encoded data 21, as well as to receive or obtain (explicitly or implicitly) prediction related parameters and/or information about the selected prediction mode from, for example, the entropy decoding unit 304.
Inverse quantization unit 310 may be used to inverse quantize (i.e., inverse quantize) the quantized transform coefficients provided in the bitstream and decoded by entropy decoding unit 304. The inverse quantization process may include using quantization parameters calculated by video encoder 20 for each video block in the video slice to determine the degree of quantization that should be applied and likewise the degree of inverse quantization that should be applied.
Inverse transform processing unit 312 is used to apply an inverse transform (e.g., an inverse DCT, an inverse integer transform, or a conceptually similar inverse transform process) to the transform coefficients in order to produce a block of residuals in the pixel domain.
The reconstruction unit 314 (e.g., summer 314) is used to add the inverse transform block 313 (i.e., reconstructed residual block 313) to the prediction block 365 to obtain the reconstructed block 315 in the sample domain, e.g., by adding sample values of the reconstructed residual block 313 to sample values of the prediction block 365.
Loop filter unit 320 (during or after the decoding loop) is used to filter reconstructed block 315 to obtain filtered block 321, so as to smooth pixel transitions or otherwise improve video quality. In one example, loop filter unit 320 may be used to perform any combination of the filtering techniques described below. Loop filter unit 320 is intended to represent one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or other filters, such as a bilateral filter, an adaptive loop filter (ALF), a sharpening or smoothing filter, or a collaborative filter. Although loop filter unit 320 is shown in fig. 3 as an in-loop filter, in other configurations it may be implemented as a post-loop filter.
Decoded video block 321 in a given frame or picture is then stored in decoded picture buffer 330, which stores reference pictures for subsequent motion compensation.
Decoder 30 is used to output decoded picture 31, e.g., via output 332, for presentation to or viewing by a user.
Other variations of video decoder 30 may be used to decode the compressed bitstream. For example, decoder 30 may generate an output video stream without loop filter unit 320. For example, the non-transform based decoder 30 may directly inverse quantize the residual signal without the inverse transform processing unit 312 for certain blocks or frames. In another embodiment, video decoder 30 may have inverse quantization unit 310 and inverse transform processing unit 312 combined into a single unit.
Specifically, in the embodiment of the present application, the decoder 30 is used to implement the video decoding method described in the following embodiments.
It should be understood that the video encoder in this application may include only some of the modules of video encoder 20; for example, the video encoder in this application may include a partitioning unit and an image encoding unit, where the image encoding unit may be composed of one or more of a prediction unit, a transform unit, a quantization unit, and an entropy encoding unit.
Fig. 4 is a schematic block diagram of an example of a video processing system for implementing an embodiment of the present application.
As shown in fig. 4, video processing system 40 may be a system that includes encoder 20 of fig. 2 and/or decoder 30 of fig. 3. The video processing system 40 may comprise an imaging device 41, a video processing device 46 (including the encoder 20 or decoder 30), an antenna 42, and a display device 45. Wherein encoder 20 and decoder 30 may be implemented using logic circuitry that may comprise Application Specific Integrated Circuit (ASIC) logic, a graphics processor, a general purpose processor, etc.
As shown in fig. 4, the imaging device 41, the antenna 42, the video processing device 46, and the display device 45 can communicate with each other.
In some instances, antenna 42 may be used to transmit or receive an encoded bitstream of video data. Additionally, in some instances, display device 45 may be used to present video data. The imaging device 41 is used to acquire original video images and may be, for example, a camera.
In some examples, encoder 20 and decoder 30 may include an image buffer and a graphics processing unit. The graphics processing unit may be communicatively coupled to the image buffer.
Fig. 5 is a schematic block diagram of an example of a video processing apparatus for implementing an embodiment of the present application.
The video processing device 400 in fig. 5 is suitable for implementing the embodiments described herein. In one embodiment, video processing device 400 may comprise a video decoder (e.g., decoder 30 of fig. 3) or a video encoder (e.g., encoder 20 of fig. 2). At this time, the video processing apparatus may be referred to as an apparatus for decoding video data or an apparatus for encoding video data.
In another embodiment, the video processing device 400 may be one or more components of the decoder 30 of fig. 3 or the encoder 20 of fig. 2 described above.
The video processing apparatus 400 includes: an ingress port 410 and a receiver 420 for receiving data; an encoder for video encoding or a decoder for video decoding; a transmitter 440 and an egress port 450 for transmitting data; and a memory 460 for storing data. The video processing apparatus 400 may further comprise an optical-to-electrical conversion component and an electrical-to-optical (EO) component coupled to the ingress port 410, the receiver 420, the transmitter 440, and the egress port 450, for the egress or ingress of optical or electrical signals.
When the video processing device is a device for decoding video data, the receiver 420 may be configured to receive a main encoded stream and an auxiliary encoded stream of the video data and input the main encoded stream and the auxiliary encoded stream to the decoder for processing. When the video processing device is a device for encoding video data, the receiver 420 may receive an initial video image and input the initial video image to the encoder for processing.
The memory 460, which may include one or more disks, tape drives, and solid-state drives, may be used as an overflow data storage device to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution.
The memory 460 may be volatile and/or nonvolatile, and may be a read-only memory (ROM), a random access memory (RAM), a ternary content-addressable memory (TCAM), and/or a static random-access memory (SRAM).
Further, the above-described video processing apparatus may include a processor for processing an instruction input from the ingress port to enable the above video encoder to perform video encoding or to enable the above video decoder to perform video decoding.
Fig. 6 is a schematic block diagram of an example of an encoding apparatus or a decoding apparatus for implementing an embodiment of the present application.
Fig. 6 is a simplified block diagram of an apparatus 500 that may be used as either or both of source device 12 and destination device 14 in fig. 1, according to an example embodiment. Apparatus 500 may implement the techniques of this application. In other words, fig. 6 is a schematic block diagram of an implementation of an encoding device or a decoding device (referred to as processing device 500 for short) according to an embodiment of this application. The processing device 500 may include a processor 510, a memory 530, and a bus system 550. The processor is connected to the memory through the bus system, the memory is used to store instructions, and the processor is used to execute the instructions stored in the memory. The memory of the processing device stores program code, and the processor may invoke the program code stored in the memory to perform the various video encoding or decoding methods described herein. To avoid repetition, details are not described here again.
In this embodiment of the application, the processor 510 may be a Central Processing Unit (CPU), or the processor 510 may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 530 may include a Read Only Memory (ROM) device or a Random Access Memory (RAM) device. Any other suitable type of memory device may also be used for memory 530. Memory 530 may include code and data 531 to be accessed by processor 510 using bus 550. Memory 530 may further include an operating system 533 and application programs 535, the application programs 535 including at least one program that allows processor 510 to perform the video encoding or decoding methods described herein. For example, the application programs 535 may include applications 1 through N, which further include a video encoding or decoding application (simply a video processing application) that performs the video encoding or decoding methods described herein.
The bus system 550 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of illustration, however, the various buses are designated in the figure as bus system 550.
Optionally, the processing device 500 may also include one or more output devices, such as a display 570. In one example, the display 570 may be a touch-sensitive display that incorporates a display with a touch-sensitive unit operable to sense touch input. A display 570 may be connected to the processor 510 via the bus 550.
In addition, in the application, the neural network model can be used for performing super-resolution processing on the low-resolution video image to obtain a video image with higher resolution. The neural network may be a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), etc., and the basic concept of these neural networks and other related contents of these neural networks are described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and its output may be:

h_{W,b}(x) = f(W^T x) = f(\sum_{s=1}^{n} W_s x_s + b)

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be, for example, a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of that local receptive field; a local receptive field may be a region composed of several neural units.
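For illustration only, such a neural unit can be sketched in a few lines of Python; the weights, bias, and sigmoid activation below are arbitrary assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    # Output of one unit: f(sum_s W_s * x_s + b), with f the activation function.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.2, -0.5, 1.0])   # inputs x_1 ... x_n
w = np.array([0.4, 0.3, -0.6])   # weights W_s, one per input
b = 0.1                          # bias b of the neural unit
print(neural_unit(x, w, b))
```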
(2) Deep neural network
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers, where "many" has no particular metric. From the division of DNNs by the location of different layers, neural networks inside DNNs can be divided into three categories: input layer, hidden layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are hidden layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer.
Although a DNN appears complex, the work of each layer is not complex in itself; it is simply the following linear relational expression:

\vec{y} = \alpha(W \vec{x} + \vec{b})

where \vec{x} is the input vector, \vec{y} is the output vector, \vec{b} is the offset vector, W is the weight matrix (also called the coefficients), and \alpha() is the activation function. Each layer simply performs this operation on the input vector \vec{x} to obtain the output vector \vec{y}. Because a DNN has many layers, there are correspondingly many coefficient matrices W and offset vectors \vec{b}.
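As an illustration of this layer-by-layer operation, the following sketch stacks a few such layers with random, untrained parameters; the layer sizes and ReLU activation are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Three layers, each with a weight matrix W and an offset vector b.
layers = [(rng.standard_normal((8, 4)) * 0.1, np.zeros(8)),
          (rng.standard_normal((8, 8)) * 0.1, np.zeros(8)),
          (rng.standard_normal((2, 8)) * 0.1, np.zeros(2))]

x = rng.standard_normal(4)        # input vector
for W, b in layers:
    x = relu(W @ x + b)           # y = alpha(W x + b), fed to the next layer
print(x)                          # output vector of the last layer
```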
These parameters are defined in the DNN as follows, taking the coefficient W as an example. Suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}. The superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the kth neuron at layer L-1 to the jth neuron at layer L is defined as W^L_{jk}.
Note that the input layer is without the W parameter. In deep neural networks, more hidden layers make the network more able to depict complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the larger the "capacity", which means that it can accomplish more complex learning tasks. The final purpose of training the deep neural network, that is, learning the weight matrix, is to obtain the weight matrix (the weight matrix formed by the vectors W of many layers) of all the layers of the trained deep neural network.
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving an input image or a convolved feature plane (feature map) with a trainable filter. A convolutional layer is a layer of neurons in the convolutional neural network that performs convolution on the input signal. In a convolutional layer, one neuron may be connected to only some of its neighboring neurons. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights may be understood to mean that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, so image information learned in one part can also be used in another part; the same learned image information can be used at all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
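The weight-sharing idea can be illustrated with a naive convolution sketch: one small kernel (a 3x3 example assumed here) is slid over every position of the image to produce one feature matrix; practical implementations rely on optimized library routines instead of explicit loops:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide one shared kernel over every spatial position (no padding, stride 1).
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out  # one feature matrix per convolution kernel

image = np.random.rand(8, 8).astype(np.float32)
vertical_edge_kernel = np.array([[1, 0, -1],
                                 [1, 0, -1],
                                 [1, 0, -1]], dtype=np.float32)
print(conv2d(image, vertical_edge_kernel).shape)   # (6, 6)
```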
(4) Loss function
In the process of training a deep neural network, the output of the network is expected to be as close as possible to the value that is really desired. Therefore, the predicted value of the current network can be compared with the really desired target value, and the weight vector of each layer can then be updated according to the difference between them (of course, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weights are adjusted to make it lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. It is therefore necessary to define in advance how the difference between the predicted value and the target value is measured; this is the role of the loss function or objective function, which is an important equation for measuring that difference. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
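The "compare the prediction with the target and reduce the loss" cycle can be illustrated with a toy example (this is not the model trained in this application): a mean squared error loss and a single trainable weight that is repeatedly adjusted to shrink the loss:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
target = np.array([2.0, 4.0, 6.0])    # the values the network should predict

w = 0.5                               # single trainable weight, prediction = w * x
lr = 0.05                             # learning rate
for _ in range(100):
    pred = w * x
    loss = np.mean((pred - target) ** 2)       # MSE: larger value = larger difference
    grad = np.mean(2.0 * (pred - target) * x)  # d(loss)/dw
    w -= lr * grad                             # adjust the weight to reduce the loss
print(w)                                       # w approaches 2.0 as the loss shrinks
```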
In order to better understand the overall process of the video encoding method and the video decoding method according to the embodiment of the present application, the overall process of the video encoding and the video decoding according to the embodiment of the present application will be described in general with reference to fig. 7.
Fig. 7 is a schematic flow chart of a video encoding and video decoding process of an embodiment of the present application.
As shown in fig. 7, the image to be encoded is encoded by the encoder to generate a main encoded stream and an auxiliary encoded stream, the decoder decodes the main encoded stream and the auxiliary encoded stream after acquiring the main encoded stream and the auxiliary encoded stream to obtain a video image to be displayed, and then the decoder can send the video image to be displayed to the display device for display. The encoder in fig. 7 may specifically be the encoder 20 in the above, and the decoder in fig. 7 may specifically be the decoder 30 in the above.
The video encoding method of the embodiment of the present application may be performed by the encoder shown in fig. 7, and the video decoding method of the embodiment of the present application may be performed by the decoder shown in fig. 7.
Fig. 8 is a schematic flow chart of a video decoding method according to an embodiment of the present application. The method shown in fig. 8 may be performed by a video decoder (e.g., the decoder 30 shown in fig. 1 or fig. 2), and the method shown in fig. 8 includes steps 1001 to 1004, which are described in detail below.
1001. Decode the main encoded stream to obtain a first video image.
The main encoded stream may be a stream generated by the encoding side by first performing resolution reduction processing on an original image with a higher resolution and then encoding the lower-resolution video image obtained from that resolution reduction.
For example, the resolution of an initial video image (which may also be referred to as an original video image) is 4K, and the encoding side performs downsampling processing on the initial video image to obtain a 2K video image, and then encodes the 2K video image to obtain a main encoding stream.
It should be understood that, in step 1001, the main encoded stream may be decoded according to a decoding manner specified in a video standard such as h.264/h.265/h.266 to obtain a first video image.
The first video image may be a single-frame image or a multi-frame (may be a continuous plurality of frames) image.
1002. Perform super-resolution processing on the first video image by using a neural network model to obtain a second video image.
Generally, super-resolution processing refers to a process of transforming (or referred to as super-resolution reconstruction) a low-resolution video image to obtain a high-resolution video image.
For example, super-resolution processing is performed on a video image with an image resolution of 2K (2K resolution), so as to obtain a video image with an image resolution of 4K (4K resolution).
The neural network model may be trained from a training sample image and a training target image, the training target image being a video image having the same resolution as the target video image, the training sample image being a video image obtained by down-sampling, encoding, and decoding the training target image.
The training sample image is input into the neural network model, the training target image is a target of the neural network model during training, and the difference between the video image output by the neural network model during training and the training target image is as small as possible.
1003. Decode the auxiliary encoded stream to obtain a residual value.
It should be understood that in step 1003, the auxiliary coded stream may be decoded according to a decoding manner specified in a video decoding standard such as h.264/h.265/h.266 to obtain residual values.
1004. Obtain the target video image according to the second video image and the residual value.
Optionally, the obtaining the target video image according to the second video image and the residual value specifically includes: and superposing the pixel value of the second video image and the residual value to obtain a target video image.
It should be understood that the above-mentioned residual values include a residual value corresponding to each pixel point in the second video image. Superposing the pixel values of the second video image with the residual values means that the pixel value of each pixel point of the second video image is added to the corresponding residual value to obtain the pixel value of the corresponding pixel point in the target video image.
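A minimal sketch of this superposition on small illustrative arrays (the actual bit depth and clipping behavior follow the codec configuration, which is assumed to be 8-bit here):

```python
import numpy as np

def superpose(second_image, residuals, bit_depth=8):
    # Add the residual value of each pixel point to the corresponding pixel of the
    # second (super-resolved) video image, then clip to the valid sample range.
    target = second_image.astype(np.int32) + residuals.astype(np.int32)
    return np.clip(target, 0, (1 << bit_depth) - 1).astype(second_image.dtype)

second_image = np.array([[120, 130], [140, 150]], dtype=np.uint8)
residuals = np.array([[3, -2], [0, 5]], dtype=np.int16)
print(superpose(second_image, residuals))   # pixel values of the target video image
```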
In this application, super-resolution processing can be performed effectively on the lower-resolution first video image through the neural network model to obtain the higher-resolution second video image. The final target video image can then be obtained by superposing the second video image with the decoded residual values in a single step. Compared with the conventional scheme, in which a low-resolution video image is superposed layer by layer with residual values of multiple levels to obtain the final video image, this can simplify the decoding process, improve decoding efficiency, and improve decoder performance.
In addition, the neural network model can be flexibly optimized and adjusted according to needs, and the flexibility of the scheme is higher.
Specifically, in the present application, the model parameters of the neural network model can be flexibly set according to the difference in the degree of resolution enhancement in the super-resolution processing.
For example, when the neural network model originally performs super-resolution processing, the resolution of a video image is raised from 1080p to 2K, and when the neural network model is required to perform super-resolution processing, the resolution is raised from 1080p to 4K. Then, some training data (which may include some video images with a resolution of 4K and 1080p video images obtained by down-sampling, encoding, and decoding these video images with a resolution of 4K) may be adopted to retrain the original neural network model, so as to adjust the model parameters of the original neural network model, and obtain a new neural network model.
Optionally, the first video image is a high dynamic range HDR image or a standard dynamic range SDR image.
The decoding method is not only suitable for SDR images, but also suitable for HDR images.
In order to better understand the decoding process of the video decoding method of the embodiment of the present application shown in fig. 8, the following describes a flow of the video decoding method of the embodiment of the present application with a specific example in conjunction with fig. 9.
Fig. 9 is a flowchart of a video decoding method according to an embodiment of the present application. The video decoding method shown in fig. 9 includes the steps of:
2001. decoding the auxiliary encoded stream to obtain residual values;
2002. decoding the main encoded stream to obtain a video image 3 (2K);
2003. performing super-resolution processing on video image 3 (2K) to obtain a video image 4 (4K);
2004. superposing video image 4 (4K) with the residual values to obtain the final target video image.
It is understood that the specific implementation details of steps 2001 to 2004 may be as described in relation to the method shown in fig. 8.
In the decoding flow shown in fig. 9, a video image with a resolution of 2K is obtained by decoding the main encoding stream, then the resolution of the image can be increased from 2K to 4K by super-resolution processing, and finally a final target video image can be obtained by superimposing the video image 4(4K) and the residual value, where the resolution of the target video image is also 4K, and the target video image may be an SDR video image or an HDR video image.
In the present application, a large number of training sample images and corresponding training target images may be used for training before performing super-resolution processing on a first video image according to a neural network model.
When the neural network model is trained, the difference between the output image of the neural network model and the training target image can be calculated. The parameter values of the model parameters at the point where this difference meets a preset requirement (for example, the difference is smaller than a preset value) are determined as the final values of the model parameters, thereby completing the training of the neural network model. The trained neural network model can then be used to perform super-resolution processing on the first video image.
The relevant contents of the training of the neural network model in the present application are described below with reference to fig. 10 to 12.
FIG. 10 is a schematic flow chart diagram of training a neural network model according to an embodiment of the present application.
As shown in fig. 10, the training process of the neural network model includes steps 3001 to 3003, which are described below.
3001. Down-sample and encode the original video image to obtain a main encoded stream.
After the original video image is downsampled, the resolution of the original video image can be reduced to obtain a video image with lower resolution, and then the video image with lower resolution can be encoded to obtain a code stream.
For example, if the resolution of the original video image is 4K and the resolution of the video image obtained by downsampling is 2K, the main encoding stream can be obtained by encoding the video image with the resolution of 2K.
3002. Decode the main encoded stream to obtain a low-resolution video image.
For example, assuming that the resolution of the video image obtained after downsampling in step 3001 is 2K, the resolution of the video image obtained by decoding the main encoding stream in step 3002 is also 2K.
3003. Train the neural network model according to the training sample image and the training target image.
The training sample image is the low-resolution video image obtained by decoding in step 3002, and serves as the input video image during training of the neural network model. The training target image is the original video image, and serves as the target during training. Through training, the video image output by the neural network model is made as close as possible to the training target image, and the model parameters at the point where the difference between the output video image and the training target image meets a preset requirement are taken as the final model parameters of the neural network model.
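A rough sketch of this training procedure follows. The codec round trip of steps 3001 and 3002 is replaced here by simple average pooling, and the small network, layer sizes, and learning rate are assumptions for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in super-resolution network: 1 luma channel in, 2x upscaled luma out.
sr_model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 4, 3, padding=1), nn.PixelShuffle(2))
optimizer = torch.optim.Adam(sr_model.parameters(), lr=1e-4)
mse = nn.MSELoss()

for _ in range(10):                               # illustrative training loop
    target = torch.rand(8, 1, 64, 64)             # training target images (original)
    sample = F.avg_pool2d(target, 2)              # stand-in for down-sampling + codec
    loss = mse(sr_model(sample), target)          # step 3003: minimize the difference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```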
FIG. 11 is a schematic block diagram of a training system for training a neural network model according to an embodiment of the present application.
As shown in FIG. 11, training system 100 includes an execution device 110, a training device 120, a database 130, and an image acquisition device 140. These devices will be briefly described below.
The image capturing device 140 is configured to capture a video image used in neural network training, the video image captured by the image capturing device 140 may include the training sample image and the training target image, and the image capturing device 140 may store the captured video image in the database 130.
The training device 120 is configured to obtain video images for training (specifically, a training sample image and a training target image) from a database, and train a target model/rule 101 according to the video images, where the target model/rule 101 may be a neural network model in the embodiment of the present application.
The execution device 110 is a device comprising a computation module 111, an I/O interface 112 and a pre-processing module 113, wherein the computation module 111 comprises the target model/rule 101 (equivalent to a neural network model implemented in the computation module). After the training device has completed training the target model/rule 101, the execution device 110 may process the input video image to obtain an output video image.
Before the computing module 111 of the execution device 110 processes an input video image, the input video image may be subjected to some preprocessing (for example, by the preprocessing module 113) and then processed by the computing module 111; alternatively, the computing module 111 may process the input video image directly, without preprocessing.
The execution device 110 may be a video encoder or a video decoder, and when the execution device 110 is a video decoder, the execution device 110 may perform super-resolution processing on a decoded video image with a lower resolution to obtain a video image with a higher resolution. When the execution device 110 is a video encoder, the execution device 110 may perform super-resolution processing on a video image obtained by down-sampling an original video image to obtain a video image with higher resolution.
The residual values decoded in step 1003 may include only the residuals corresponding to the pixel point luminance component values, or may include both the residuals corresponding to the pixel point luminance component values and the residuals corresponding to the pixel point chrominance component values. The specific implementation of the super-resolution processing and of obtaining the final target video image differs between these two cases, which are described in detail below.
Case one: the residual values obtained by decoding in step 1003 include both the residuals corresponding to the pixel point luminance component values and the residuals corresponding to the pixel point chrominance component values.
In a first case, the performing super-resolution processing on the first video image by using the neural network model in step 1002 to obtain the second video image specifically includes: performing super-resolution processing on the pixel point brightness component value and the pixel point chromaticity component value of the first video image by adopting a neural network model to obtain a pixel point brightness component value and a pixel point chromaticity component value of a second video image; in the step 1004, the superimposing the pixel value of the second video image and the residual value to obtain the target video image specifically includes: and superposing the pixel point brightness component value and the pixel point chromaticity component value of the second video image with the residual value to obtain the target video image.
It should be understood that, in the first case, the residual value includes both the residual corresponding to the luminance component value of the pixel point and the residual corresponding to the chrominance component value of the pixel point.
Superposing the pixel point luminance component values and pixel point chrominance component values of the second video image with the residual values specifically means: superposing the pixel point luminance component values of the second video image with the residuals corresponding to the luminance component values to obtain the pixel point luminance component values of the target video image, and superposing the pixel point chrominance component values of the second video image with the residuals corresponding to the chrominance component values to obtain the pixel point chrominance component values of the target video image. In this way, the pixel point luminance component values and pixel point chrominance component values of the target video image are obtained (and with them, the pixel values of the target video image).
Case two: the residual error value obtained by decoding in step 1003 only contains the residual error corresponding to the pixel point brightness component value.
In a second case, the performing super-resolution processing on the first video image by using the neural network model in step 1002 to obtain a second video image specifically includes: performing super-resolution processing on the pixel point brightness component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value of the second video image; in the step 1004, the superimposing the pixel value of the second video image and the residual value to obtain the target video image includes: and superposing the pixel point brightness component value of the second video image with the residual value to obtain the pixel point brightness component value of the target video image.
In the application, the neural network model only processes the pixel point brightness component value of the first video image, so that the calculated amount of the neural network model during super-resolution processing can be reduced, and the video decoding efficiency is improved.
In particular, the human eye is more sensitive to the luminance of the image than to the chrominance. Therefore, the neural network model can be adopted to perform super-resolution processing on the brightness component values of the pixel points of the first video image, and the chromaticity component values of the pixel points of the second video image can be obtained through calculation of the traditional interpolation algorithm, so that the visual experience can be guaranteed, and the calculation complexity of video decoding can be reduced.
Optionally, the method shown in fig. 8 further includes: and carrying out interpolation processing on the pixel point chromaticity component values of the first video image to obtain the pixel point chromaticity component values of the target video image.
It should be understood that, in the present application, the neural network model may be used to calculate the pixel luminance component value and the pixel chrominance component value of the second video image, or only the neural network model may be used to calculate the pixel luminance component value of the second video image, and other methods (such as an interpolation method) are used to obtain the pixel chrominance component value of the second video image.
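A sketch of this split is shown below: the luminance plane goes through the super-resolution network (stubbed here with bicubic upsampling), while the chrominance planes are upscaled with ordinary bicubic interpolation. The 2x factor and 4:2:0 plane sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the trained luma super-resolution network (here just bicubic 2x).
sr_model = nn.Upsample(scale_factor=2, mode='bicubic', align_corners=False)

def super_resolve_yuv420(y, u, v):
    # Luminance: processed by the neural network model (the computationally heavy path).
    y_hr = sr_model(y[None, None]).squeeze()
    # Chrominance: conventional interpolation suffices, since the human eye is
    # less sensitive to chroma than to luma.
    u_hr = F.interpolate(u[None, None], scale_factor=2, mode='bicubic',
                         align_corners=False).squeeze()
    v_hr = F.interpolate(v[None, None], scale_factor=2, mode='bicubic',
                         align_corners=False).squeeze()
    return y_hr, u_hr, v_hr

y, u, v = torch.rand(64, 64), torch.rand(32, 32), torch.rand(32, 32)  # 4:2:0 planes
print([t.shape for t in super_resolve_yuv420(y, u, v)])
```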
For brevity, the pixel point luminance component value may also be referred to simply as the pixel luminance component value, and the pixel point chrominance component value as the pixel chrominance component value; for convenience of description, this application uses these terms interchangeably.
The video image in this application may be represented in RGB format or in YUV format, where Y represents luminance (Luma), i.e., the gray-scale value, and U and V represent the chrominance (Chroma) components.
The pixel luminance component value may be represented as Y, and the pixel chrominance component value may include U and V.
Fig. 12 is a schematic diagram of processing a first video image by using a convolutional neural network according to an embodiment of the present application.
As shown in fig. 12, a convolutional neural network may be used to perform super-resolution processing on the luminance component value of the pixel point of the first video image, so as to obtain the luminance component value of the pixel point of the second video image. In addition, in order to improve the accuracy of the luminance component value of the pixel point of the finally obtained second video image, a convolutional neural network and an interpolation algorithm (for example, Bicubic interpolation) may be comprehensively used to obtain the luminance component value of the pixel point of the second video image.
As shown in fig. 12, the pixel point luminance component values of the first video image may be processed with a convolutional neural network to obtain one set of luminance component values, and processed with an interpolation algorithm to obtain another set of luminance component values; the luminance component values obtained in these two ways are then summed to obtain the pixel point luminance component values of the second video image.
It should be understood that, in fig. 12, the first video image may be a lower-resolution video image obtained after down-sampling the original video image, and the second video image may be a video image having the same resolution as the original video image.
When processing the pixel point luminance component values of the first video image, to facilitate processing by the neural network, the value range of the luminance component values may first be linearly mapped from [0, 255] to [0, 1] and then shifted by subtracting 0.5, so that the value range becomes [-0.5, 0.5]; the convolutional neural network is then used to perform super-resolution processing on these normalized luminance component values.
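The normalization to [-0.5, 0.5] and the summation of the two branches in fig. 12 can be sketched as follows. The convolutional branch is stubbed with a zero output, and the final de-normalization step is an assumption, since the exact output mapping is not specified above:

```python
import torch
import torch.nn.functional as F

def super_resolve_luma(y_low, cnn_branch):
    # Map the luma range [0, 255] to [0, 1], then subtract 0.5 to reach [-0.5, 0.5].
    x = (y_low.float() / 255.0 - 0.5)[None, None]
    cnn_out = cnn_branch(x)                                    # learned branch (fig. 12)
    bicubic = F.interpolate(x, scale_factor=2, mode='bicubic',
                            align_corners=False)               # interpolation branch
    # Sum the two branches, undo the 0.5 shift, and map back to [0, 255].
    return ((cnn_out + bicubic + 0.5).clamp(0.0, 1.0) * 255.0).squeeze()

# Zero-output stand-in for the (untrained) convolutional branch, for illustration.
zero_branch = lambda t: torch.zeros(1, 1, 2 * t.shape[-2], 2 * t.shape[-1])
y_low = torch.randint(0, 256, (32, 32))
print(super_resolve_luma(y_low, zero_branch).shape)            # torch.Size([64, 64])
```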
In fig. 12, the convolutional neural network only needs to perform super-resolution processing on the pixel luminance component values of the first video image, so that the complexity of the convolutional neural network processing can be reduced.
It should be understood that the flow shown in fig. 12 calculates the luminance component values of the pixel points of the second video image; the chrominance component values of the pixel points of the second video image can be obtained by interpolating the chrominance component values of the pixel points of the first video image. With both the luminance and chrominance component values available, the second video image is obtained.
In addition, in fig. 12, when the convolutional neural network is trained, the loss of the second video image relative to the original video image may be calculated and used to correct the parameters of the convolutional neural network, so as to reduce that loss. The loss of the second video image relative to the original video image can be determined by means of a loss function, for example the mean squared error (MSE).
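A parameter-correction step of the kind described above could be sketched as follows; the optimizer, data shapes, and the name sr_net are assumptions, and MSE is used as the loss function as suggested in the text.

```python
import torch
import torch.nn.functional as F

def train_step(sr_net, optimizer, y_sample, y_target):
    """One parameter-correction step for the convolutional neural network.

    y_sample: luminance plane of a training sample image (downsampled, encoded, decoded).
    y_target: luminance plane of the corresponding training target (original) image.
    """
    optimizer.zero_grad()
    y_pred = sr_net(y_sample)            # luminance plane of the second video image
    loss = F.mse_loss(y_pred, y_target)  # loss relative to the original video image
    loss.backward()                      # back-propagate the loss
    optimizer.step()                     # correct the parameters of the network
    return loss.item()
```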
Fig. 13 is a schematic diagram of the structure of a convolutional neural network according to an embodiment of the present application.
It is to be understood that the specific structure of the convolutional neural network in fig. 12 may be as shown in fig. 13. In fig. 13, the convolutional neural network includes a plurality of convolutional layers, a plurality of activation function layers, and a pixel reconstruction layer, and part of the outputs of the intermediate layers is passed to later layers through skip connections.
The convolutional layer is the basic structural unit of a convolutional neural network; it extracts features of an image and outputs feature matrices. In fig. 13, the number of channels in the last convolutional layer is 4 and the number of channels in the remaining convolutional layers is 64, each channel corresponding to a convolution kernel that outputs one feature matrix. The last convolutional layer therefore outputs 4 feature matrices, and the pixel reconstruction layer rearranges these 4 feature matrices into one high-resolution feature matrix, so that the number of pixels becomes 4 times the original number.
In addition, the coefficient of the activation function layer in fig. 13 may be 0.1.
It should be understood that the structure of the convolutional neural network in fig. 13, the number of channels of its convolutional layers, and the coefficients of its activation function layers are only examples; the convolutional neural network in the present application may have other structures, other numbers of channels, and other activation coefficients.
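One possible realization of a network of this kind is sketched below: 64-channel convolutional layers with activation layers whose coefficient is 0.1, a final 4-channel convolutional layer followed by a pixel reconstruction (pixel shuffle) layer, and one skip connection. The depth, kernel size, and placement of the skip connection are illustrative assumptions rather than values taken from the present application.

```python
import torch
import torch.nn as nn

class SRNet(nn.Module):
    """Luma super-resolution network in the spirit of fig. 13 (2x upscaling)."""

    def __init__(self, num_mid_layers: int = 6):
        super().__init__()
        # First convolutional layer: 1 input channel (luma), 64 output channels.
        self.head = nn.Sequential(nn.Conv2d(1, 64, kernel_size=3, padding=1),
                                  nn.LeakyReLU(0.1))
        # Intermediate 64-channel convolutional layers, each followed by an
        # activation function layer whose coefficient is 0.1.
        body = []
        for _ in range(num_mid_layers):
            body += [nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.LeakyReLU(0.1)]
        self.body = nn.Sequential(*body)
        # Last convolutional layer has 4 output channels; the pixel reconstruction
        # layer rearranges them into one plane with 4x as many pixels (2x per axis).
        self.last_conv = nn.Conv2d(64, 4, kernel_size=3, padding=1)
        self.reconstruct = nn.PixelShuffle(2)

    def forward(self, x):
        head = self.head(x)
        # Skip connection: the head features jump over the intermediate layers
        # and are added to their output.
        feat = self.body(head) + head
        return self.reconstruct(self.last_conv(feat))
```

With this sketch, an input plane of shape (1, 1, 540, 960) produces an output of shape (1, 1, 1080, 1920), i.e., four times as many pixels.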
The video decoding method according to the embodiment of the present application is described in detail above with reference to fig. 8 from the perspective of the decoding end; the video encoding method according to the embodiment of the present application is described below with reference to fig. 14 from the perspective of the encoding end. It should be understood that the encoding method shown in fig. 14 corresponds to the decoding method shown in fig. 8: the decoding method shown in fig. 8 can decode the main encoded stream and the auxiliary encoded stream generated by the encoding method shown in fig. 14 to obtain the final target video image. For simplicity, repeated descriptions are appropriately omitted below when describing the encoding method shown in fig. 14.
Fig. 14 is a schematic flow chart of a video encoding method of an embodiment of the present application. The method shown in fig. 14 may be performed by a video encoder (e.g., encoder 20 shown in fig. 1 above or encoder 20 shown in fig. 2). It should be understood that encoding and/or decoding in the method shown in fig. 14 may be performed in a manner specified in a video standard such as H.264/H.265/H.266. The method shown in fig. 14 includes steps 4001 to 4005, which are described below.
4001. And performing downsampling and coding processing on the initial video image to obtain a main coding stream.
In step 4001, a video image with a lower resolution (the resolution of the video image with the lower resolution is lower than that of the initial video image) can be obtained by performing downsampling on the initial video image, and then the video image with the lower resolution is encoded to obtain a main encoding stream.
For example, the resolution of the initial video image is 4K, the encoding end performs downsampling processing on the initial video image to obtain a 2K video image, and then encodes the 2K video image to obtain a main encoding stream. The initial video image may be from an imaging device, such as a camera.
In addition, the encoding process in step 4001 may specifically include processes such as prediction, transformation, quantization, and entropy encoding, and the primary encoded stream can be finally obtained through these processes.
Optionally, the initial video image is an HDR image or an SDR image.
The video encoding method shown in fig. 14 according to the embodiment of the present application may be applied to both SDR images and HDR images.
4002. And decoding the main coding stream to obtain a first video image.
The specific process implemented in step 4002 is the same as step 1001, and the specific process implemented in step 4002 may be referred to in the related description of step 1001.
4003. And performing super-resolution processing on the first video image by adopting a neural network model to obtain a second video image with the same resolution as the initial video image.
The neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as the initial video image, and the training sample image is a video image obtained by performing down-sampling, encoding and decoding on the training target image.
The specific process implemented in step 4003 is the same as step 1002, and the specific process implemented in step 4003 may be referred to in the related description of step 1002.
4004. Residual values of the initial video image relative to the second video image are determined.
In the present application, the main encoded stream is decoded in step 4002 and the first video image obtained by decoding is then super-resolved in step 4003, so the residual value finally determined in step 4004 reflects the loss introduced by encoding and decoding, and the decoding end can recover the initial video image from the residual value it decodes.
4005. And coding the residual error value to obtain an auxiliary coding stream.
In step 4005, the residual value may be encoded using the RVC technique to generate the auxiliary encoded stream.
It should be understood that the model parameters of the neural network model in the method illustrated in fig. 14 (step 4003) may be consistent with the model parameters of the neural network model in the method illustrated in fig. 8 (step 1002), and the definition and explanation of the neural network model in the method illustrated in fig. 8 are equally applicable to the neural network model in the method illustrated in fig. 14.
According to the present application, after the main encoded stream is generated, super-resolution processing can be performed, through the neural network model, on the video image obtained by decoding the main encoded stream, so as to obtain a video image with the same resolution as the initial video image. A residual value of the initial video image relative to the super-resolved video image can then be generated directly, and the auxiliary encoded stream is generated from that residual value. Compared with a traditional coding scheme in which multiple layers of video images at different resolution levels, and residual values for the corresponding levels, must be derived from the initial video image, the present application can simplify the encoding process.
Optionally, as an embodiment, performing super-resolution processing on the first video image by using a neural network model in step 4003 to obtain a second video image with the same resolution as the initial video image includes: performing super-resolution processing on the pixel point brightness component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value of the second video image; 4004, determining a residual value of the initial video image relative to the second video image includes: and determining the difference value of the pixel point brightness component value of the initial video image relative to the pixel point brightness component value of the second video image as a residual value.
It should be understood that, when the residual value obtained by the encoding in the method shown in fig. 14 is the difference between the pixel luminance component values of the initial video image and those of the second video image, the decoding end, after decoding the auxiliary encoded stream to obtain the residual, superimposes the residual value on the pixel luminance component values of the second video image it has obtained, thereby obtaining the pixel luminance component values of the final target video image; the pixel chrominance component values of the target video image may be obtained by other means such as interpolation.
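At the decoding end, this luminance-only variant can be assembled roughly as in the sketch below; sr_net and the bicubic interpolation call are assumptions carried over from the earlier sketches, and the plane shapes are illustrative.

```python
import torch.nn.functional as F

def reconstruct_target(y_first, u_first, v_first, luma_residual, sr_net):
    """Decoder-side reconstruction for the luminance-only residual case."""
    # Luminance: super-resolve with the neural network model, then add the residual.
    y_second = sr_net(y_first)            # pixel luminance of the second video image
    y_target = y_second + luma_residual   # superpose the residual -> target luminance
    # Chrominance: plain interpolation of the first image's chrominance planes.
    u_target = F.interpolate(u_first, scale_factor=2, mode="bicubic", align_corners=False)
    v_target = F.interpolate(v_first, scale_factor=2, mode="bicubic", align_corners=False)
    return y_target, u_target, v_target
```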
Optionally, as an embodiment, performing super-resolution processing on the first video image by using a neural network model in step 4003 to obtain a second video image with the same resolution as the initial video image includes: performing super-resolution processing on the pixel point brightness component value and the pixel point chromaticity component value of the first video image by adopting a neural network model to obtain a pixel point brightness component value and a pixel point chromaticity component value of a second video image; 4004, determining a residual value of the initial video image relative to the second video image includes: and determining the pixel point brightness component value and the pixel point chromaticity component value of the initial video image as residual values relative to the difference value of the pixel point brightness component value and the pixel point chromaticity component value of the second video image respectively.
It should be understood that, when the residual value obtained by the encoding in the method shown in fig. 14 is the difference between the pixel luminance component values and the pixel chrominance component values of the initial video image and those of the second video image respectively, the decoding end, after decoding the auxiliary encoded stream to obtain the residual, superimposes the residual value on the pixel luminance component values and pixel chrominance component values of the second video image it has obtained, thereby directly obtaining the pixel luminance component values and pixel chrominance component values of the final target video image. (The target video image here corresponds to the initial video image in the method shown in fig. 14, although there may be slight distortion relative to the initial video image.)
In the application, the neural network model only processes the pixel point brightness component value of the first video image, so that the calculated amount of the neural network model during super-resolution processing can be reduced, and the video decoding efficiency is improved.
The human eye is more sensitive to the luminance of the image than to the chrominance. Therefore, the neural network model can be adopted to perform super-resolution processing on the brightness component values of the pixel points of the first video image, and the chromaticity component values of the pixel points of the second video image can be obtained through calculation of the traditional interpolation algorithm, so that the visual experience can be guaranteed, and the calculation complexity of video decoding can be reduced.
In order to better understand the encoding process of the video encoding method of the embodiment of the present application shown in fig. 14, the following describes a flow of the video encoding method of the embodiment of the present application with a specific example in conjunction with fig. 15.
Fig. 15 is a flowchart of a video encoding method according to an embodiment of the present application. The video encoding method shown in fig. 15 includes the steps of:
5001. and carrying out downsampling operation on the video image 1 to obtain a video image 2.
Here, video image 1 is a video image with a resolution of 4K and video image 2 is a video image with a resolution of 2K; the downsampling operation in step 5001 thus halves the resolution of the input video image (from 4K to 2K).
5002. And coding the video image 2 to obtain a main coding stream.
5003. And decoding the main coding stream to obtain a video image 3.
The video image 3 is a video image with a resolution of 2K, the resolution of video image 3 is the same as that of video image 2, and the image content of video image 3 is substantially the same as that of video image 2 (encoding and decoding may introduce some distortion, so the image content of video image 3 can differ slightly from that of video image 2).
The video image 3 with the resolution of 2K is obtained again by encoding and decoding the video image 2 through the steps 5002 and 5003.
5004. And performing super-resolution processing on the video image 3 to obtain a video image 4.
The resolution of the video image 4 is 4K, and the resolution of the video image can be doubled by performing super-resolution processing on the video image 3. In step 5004, the super-resolution processing may be performed on the video image 3 by using a neural network model, so as to obtain a video image 4, which is similar to the above step 4003 and will not be described in detail here.
5005. Residual values are determined from video image 1 and video image 4.
Both video image 1 and video image 4 are video images with a resolution of 4K, and the residual value (the difference between the pixel values of video image 1 and the pixel values of video image 4) can be obtained by subtracting the pixel values of video image 4 from the pixel values of video image 1.
In addition, the residual value may be only the difference between the luminance component value of the pixel point of the video image 1 and the luminance component value of the pixel point of the video image 4, or the residual value may be the difference between the luminance component value and the chrominance component value of the pixel point of the video image 1 and the luminance component value and the chrominance component value of the pixel point of the video image 4, respectively.
5006. And coding the residual error value to obtain an auxiliary coding stream.
The encoding process for the video image 1 is completed through the above steps 5001 to 5006, and a main encoding stream and an auxiliary encoding stream are obtained.
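The six steps above can be lined up as a single encoder-side routine, sketched below; encode, decode and sr_net are placeholders standing for the conventional codec and the neural network model rather than APIs defined by the present application, and the bicubic downsampling is likewise only one possible choice.

```python
import torch.nn.functional as F

def encode_with_auxiliary_stream(image_4k, encode, decode, sr_net):
    """Sketch of steps 5001-5006: produce a main stream and an auxiliary stream."""
    # 5001: downsample the 4K input to 2K.
    image_2k = F.interpolate(image_4k, scale_factor=0.5, mode="bicubic",
                             align_corners=False)
    # 5002: encode the 2K image into the main encoded stream.
    main_stream = encode(image_2k)
    # 5003: decode the main encoded stream (this is what the decoder will see).
    image_decoded = decode(main_stream)
    # 5004: super-resolve back to 4K with the neural network model.
    image_sr = sr_net(image_decoded)
    # 5005: residual of the original image relative to the super-resolved image.
    residual = image_4k - image_sr
    # 5006: encode the residual into the auxiliary encoded stream.
    aux_stream = encode(residual)
    return main_stream, aux_stream
```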
It should be understood that the encoding process shown in fig. 15 may correspond to the decoding process shown in fig. 9, and the decoding process shown in fig. 9 can decode the primary encoded stream and the secondary encoded stream encoded by the encoding process shown in fig. 15.
The video decoding method and the video encoding method of the embodiment of the present application are described in detail above from the perspective of the decoding end and the encoding end in conjunction with fig. 8 and 14, respectively, and in order to better understand the whole process of video encoding and video decoding of the embodiment of the present application, the video encoding process and the video decoding process of the embodiment of the present application are described below in conjunction with fig. 16.
Fig. 16 is a schematic flow chart of a video decoding method and a video encoding process of an embodiment of the present application. The method shown in fig. 16 may be performed jointly by a video decoder (e.g., decoder 30 shown in fig. 1 or 2) and a video encoder (e.g., encoder 20 shown in fig. 1 or encoder 20 shown in fig. 2 above),
the video encoding and video decoding process shown in fig. 16 includes steps 6001 to 6010, which are described below.
6001. And carrying out down-sampling on the initial video image to obtain a video image obtained after the down-sampling.
6002. And coding the video image obtained after down-sampling to obtain a main coding stream.
6003. And decoding the main coding stream to obtain a first video image.
6004. And performing super-resolution processing on the first video image to obtain a second video image.
Specifically, super-resolution processing may be performed on the first video image by using a neural network model to obtain a second video image, and specific contents may be as described above in relation to step 4003.
6005. Residual values are determined from the initial video image and the second video image.
6006. And coding the residual error value to obtain an auxiliary coding stream.
Steps 6001 to 6006 may be performed by a video encoder, and steps 6001 to 6006 correspond to steps 4001 to 4005 in the method shown in fig. 14 (all are processing an input initial video image to obtain a main encoded stream and an auxiliary encoded stream). Step 6001 and step 6002 are equivalent to step 4001, and are configured to perform downsampling processing and encoding processing on the initial video image to obtain a main encoding stream. Step 6003 corresponds to step 4002, step 6004 corresponds to step 4003, step 6005 corresponds to step 4004, step 6006 corresponds to step 4005, and the description of steps 6001 to 6006 can be referred to the relevant contents of steps 4001 to 4005. The specific implementation process refers to the description of the above embodiments.
The main coding stream and the auxiliary coding stream are output to a decoding end, and the decoding end executes the following decoding process.
6007. And decoding the auxiliary coding stream to obtain a residual value.
6008. And decoding the main coding stream to obtain a first video image.
6009. And performing super-resolution processing on the first video image to obtain a second video image.
6010. And superposing the second video image and the residual error value to obtain a target video image.
The steps 6007 to 6010 may be performed by a video decoder, and the steps 6007 to 6010 correspond to steps 1001 to 1004 of the method shown in fig. 8 (decoding the main encoded stream and the auxiliary encoded stream to obtain the final target video image). Step 6007 corresponds to step 1003 in the method shown in fig. 8, step 6008 corresponds to step 1001 in the method shown in fig. 8, step 6009 may specifically use a neural network model to perform super-resolution processing on the first video image and corresponds to step 1002 in the method shown in fig. 8, and step 6010 corresponds to step 1004 in the method shown in fig. 8. The specific implementation process refers to the description of the above embodiments.
For some video encoding devices or video decoding devices, only encoding or decoding of SDR video images may be supported, and HDR video images cannot be directly encoded or decoded, in which case the HDR video images may be first converted into SDR video images and then processed, and the video encoding and video decoding in this case are described in detail below with reference to fig. 17 and 18.
Fig. 17 is a schematic flow chart of a video decoding method according to an embodiment of the present application. The method shown in fig. 17 may be performed by a video decoder (e.g., decoder 30 shown in fig. 1 or fig. 2), and the method shown in fig. 17 includes steps 7001 to 7004, which are described in detail below.
7001. Decoding a main coding stream to obtain a first video image, wherein the first video image is an SDR video image;
7002. processing the first video image by adopting a neural network model to obtain a second video image;
the second video image is a high dynamic range HDR video image, and the resolution of the second video image is greater than that of the first video image;
7003. decoding the auxiliary encoded stream to obtain residual values;
7004. and superposing the pixel value of the second video image and the residual value to obtain a target video image.
The neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as the target video image, and the training sample image is a video image obtained by performing down-sampling, encoding and decoding on the training target image.
The neural network model in the method shown in fig. 17 is similar to the neural network model in the method shown in fig. 8, and the definition and explanation regarding the neural network in the method shown in fig. 8 also apply to the neural network model in the method shown in fig. 17.
In this method, once the decoded SDR video image is obtained, it is processed by the neural network model to obtain an HDR video image with a higher resolution; the final target video image is then obtained by a single superposition of the second video image and the residual value obtained by decoding. The method can therefore be applied to devices that only support encoding and decoding of SDR video images.
In addition, compared with the mode that the video image with low resolution and the residual values of multiple levels are overlaid layer by layer to obtain the final video image in the traditional scheme, the method and the device can simplify the decoding process. The neural network model can be flexibly optimized and adjusted according to needs, and the flexibility of the scheme is higher.
Optionally, as an embodiment, the processing the first video image by using the neural network model in the step 7002 to obtain the second video image includes: and performing super-resolution processing and reverse tone mapping processing on the first video image by adopting a neural network model to obtain a second video image.
The super-resolution processing is used for improving the resolution of the first video image and obtaining a video image with higher resolution, and the reverse tone mapping processing is used for improving the pixel precision of the video image with higher resolution to obtain a second video image.
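The two operations may be chained in a single forward pass, as in the sketch below; sr_itm_net is an assumed joint model performing both super-resolution and inverse tone mapping, and the 8-bit/10-bit value ranges follow the example given later in the text.

```python
import torch

def sdr_to_hdr_luma(y_sdr_8bit: torch.Tensor, sr_itm_net: torch.nn.Module) -> torch.Tensor:
    """Joint super-resolution and inverse tone mapping on the first video image.

    Input:  8-bit SDR luminance plane, values in [0, 255], shape (1, 1, H, W).
    Output: 10-bit HDR luminance plane at 2x resolution, values in [0, 1023].
    """
    x = y_sdr_8bit / 255.0 - 0.5                     # normalize as in the earlier sketch
    y = sr_itm_net(x)                                # raises resolution and dynamic range
    return ((y + 0.5) * 1023.0).clamp(0.0, 1023.0)   # expand to the 10-bit HDR value range
```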
Optionally, as an embodiment, when the residual value only includes a residual corresponding to a pixel luminance component value, the processing the first video image by using the neural network model in the step 7002 to obtain the second video image includes: processing the pixel brightness component value of the first video image by adopting a neural network model to obtain the pixel brightness component value of the second video image; superposing the pixel value and the residual value of the second video image to obtain a target video image, comprising: and superposing the pixel point brightness component value of the second video image with the residual value to obtain the pixel point brightness component value of the target video image.
Optionally, as an embodiment, the method shown in fig. 17 further includes: and carrying out interpolation processing on the pixel point chromaticity component values of the first video image to obtain the pixel point chromaticity component values of the target video image.
In the application, the neural network model only processes the pixel point brightness component value of the first video image, so that the calculated amount of the neural network model during super-resolution processing can be reduced, and the video decoding efficiency is improved.
The human eye is more sensitive to the luminance of the image than to the chrominance. Therefore, the neural network model can be adopted to perform super-resolution processing on the brightness component values of the pixel points of the first video image, and the chromaticity component values of the pixel points of the second video image can be obtained through calculation of the traditional interpolation algorithm, so that the visual experience can be guaranteed, and the calculation complexity of video decoding can be reduced.
Optionally, when the residual value includes a residual corresponding to a pixel luminance component value and a pixel chrominance component value, performing super-resolution processing on the first video image by using the neural network model to obtain a second video image, including: and performing super-resolution processing on the pixel point brightness component value and the pixel point chromaticity component value of the first video image by adopting a neural network model to obtain the pixel point brightness component value and the pixel point chromaticity component value of the second video image.
In the present application, the neural network model may be used to calculate both the pixel luminance component values and the pixel chrominance component values of the second video image, or the neural network model may be used to calculate only the pixel luminance component values of the second video image, with other methods (such as interpolation) used to obtain the pixel chrominance component values of the second video image.
Fig. 18 is a schematic flow chart of a video encoding method of an embodiment of the present application. The method shown in fig. 18 may be performed by a video encoder (e.g., encoder 20 shown in fig. 1 above or encoder 20 shown in fig. 2), it being understood that the method shown in fig. 18 may be encoded and/or decoded in a manner specified in a video standard such as h.264/h.265/h.266. The method shown in fig. 18 comprises steps 8001 to 8006, which are described below.
8001. And processing the initial video image to obtain a processed video image.
The initial video image is a high dynamic range HDR video image, and the processed video image is a standard dynamic range SDR video image.
8002. And coding the processed video image to obtain a main coding stream.
8003. And decoding the main coding stream to obtain a first video image.
8004. And processing the first video image by adopting a neural network model to obtain a second video image, wherein the second video image is an HDR video image, and the resolution of the second video image is the same as that of the initial video image.
The neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as the initial video image, and the training sample image is a video image obtained by performing down-sampling, encoding and decoding on the training target image.
8005. Residual values of the initial video image relative to the second video image are determined.
8006. And coding the residual error value to obtain an auxiliary coding stream.
It should be understood that the encoding process shown in fig. 18 may correspond to the decoding process shown in fig. 17, and the decoding process shown in fig. 17 can decode the main encoded stream and the auxiliary encoded stream generated by the encoding process shown in fig. 18.
The neural network model in the method shown in fig. 18 is similar to the neural network model in the method shown in fig. 14, and the definition and explanation regarding the neural network model in the method shown in fig. 14 also apply to the neural network model in the method shown in fig. 18.
Optionally, as an embodiment, the determining, in step 8005, a residual value of the initial video image relative to the second video image includes: residual values of pixel values of the initial video image relative to pixel values of the second video image are determined.
Specifically, the residual value may be a difference between a pixel value of a pixel point in the initial video image and a pixel value of a corresponding pixel point in the second video image.
In the application, for the HDR image, the HDR image can be firstly converted into the SDR image, and then the subsequent coding processing is carried out, so that the method and the device can be suitable for the device which only supports coding the SDR image.
Furthermore, after the main coding stream is generated, super-resolution processing can be performed on a video image obtained by decoding the main coding stream through the neural network model, so that a video image with the same resolution as that of the initial video image is obtained, then, a residual value of the initial video image relative to the video image obtained by the super-resolution processing can be directly generated, and an auxiliary coding stream is generated according to the residual value.
Optionally, as an embodiment, the processing performed in step 8001 on the initial video image by using a neural network model to obtain a processed video image includes: performing downsampling and tone mapping processing on the initial video image by using a neural network model to obtain the processed video image.
The downsampling is used to reduce the resolution of the initial video image to obtain a lower-resolution video image, and the tone mapping processing converts this lower-resolution HDR video image into an SDR video image (adjusting its pixel precision) to obtain the processed video image.
Optionally, as an embodiment, the processing, in the step 8004, the first video image by using a neural network model to obtain a second video image includes: and performing super-resolution processing and reverse tone mapping processing on the first video image by adopting a neural network model to obtain a second video image.
The super-resolution processing is used for improving the resolution of the first video image and obtaining a video image with higher resolution, and the reverse tone mapping processing is used for improving the pixel precision of the video image with higher resolution to obtain a second video image.
Optionally, as an embodiment, processing the first video image by using a neural network model in step 8004 to obtain a second video image includes: processing the pixel luminance component values of the first video image by using the neural network model to obtain the pixel luminance component values of the second video image; and determining the residual value of the initial video image relative to the second video image in step 8005 includes: determining, as the residual value, the difference between the pixel luminance component values of the initial video image and the pixel luminance component values of the second video image.
Optionally, the processing (which may be super-resolution processing and inverse tone mapping processing) performed on the first video image by using the neural network model in step 8004 above to obtain a second video image includes: performing super-resolution processing on the pixel point brightness component value and the pixel point chromaticity component value of the first video image by adopting a neural network model to obtain a pixel point brightness component value and a pixel point chromaticity component value of a second video image; the determining the residual value of the initial video image relative to the second video image in step 8005 includes: and determining the difference value of the pixel point brightness component value and the pixel point chromaticity component value of the initial video image relative to the difference value of the pixel point brightness component value and the pixel point chromaticity component value of the second video image as the residual value.
In the present application, the neural network model may be used to calculate both the pixel luminance component values and the pixel chrominance component values of the second video image, or the neural network model may be used to calculate only the pixel luminance component values of the second video image, with other methods (such as interpolation) used to obtain the pixel chrominance component values of the second video image.
In order to better understand the decoding process of the video decoding method according to the embodiment of the present application shown in fig. 17 and the encoding process of the video encoding method according to the embodiment of the present application shown in fig. 18, the following describes the flow of the video decoding method and the flow of the video encoding method according to the embodiment of the present application with reference to fig. 19 and 20 by way of specific examples.
Fig. 19 is a flowchart of a video decoding method according to an embodiment of the present application. The video decoding method shown in fig. 19 includes the steps of:
9001. decoding the auxiliary encoded stream to obtain residual values;
9002. decoding the main coding stream to obtain a video image 4 (2K, SDR);
9003. performing super-resolution processing and reverse tone mapping on the video image 4 (2K, SDR) to obtain a video image 5 (4K, HDR);
9004. and superposing the video image 5 (4K, HDR) and the residual value to obtain a final target video image.
Where the video image 4 is an SDR image, the resolution of the video image 4 is 2K, the resolution can be raised from 2K to 4K after the super-resolution processing is performed on the video image 4, and the SDR image can be converted into an HDR image by inverse tone mapping, so that the resulting video image 5 is an HDR image with a resolution of 4K.
Fig. 20 is a flowchart of a video encoding method according to an embodiment of the present application. The video encoding method shown in fig. 20 includes the steps of:
10001. and carrying out downsampling operation on the video image 1 to obtain a video image 2.
The video image 1 is an HDR image with a resolution of 4K, and the video image 2 obtained through downsampling is an HDR image with a resolution of 2K; the downsampling operation in step 10001 thus halves the resolution of the input video image (from 4K to 2K).
10002. The video image 2 is tone-mapped to obtain a video image 3.
In the present application, tone mapping processing converts the video image from an HDR image into an SDR image, so that a device that cannot directly encode/decode HDR video images can still process them; this improves the compatibility of the codec device, which can then handle both HDR and SDR images.
For example, the video image 2 is an HDR image with a pixel precision of 10 bits, and an SDR image with a pixel precision of 8 bits can be obtained by tone mapping, and then, the encoding apparatus can encode the SDR image.
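A deliberately simple tone-mapping operator matching this 10-bit-to-8-bit example is sketched below; practical tone mapping applies a non-linear curve, so this linear scaling only illustrates the change in pixel precision.

```python
import numpy as np

def tone_map_linear(hdr_10bit: np.ndarray) -> np.ndarray:
    """Map a 10-bit HDR plane (0..1023) to an 8-bit SDR plane (0..255).

    A linear scaling is used only to illustrate the reduction in pixel
    precision; practical tone mapping applies a non-linear curve.
    """
    sdr = np.round(hdr_10bit.astype(np.float32) * (255.0 / 1023.0))
    return np.clip(sdr, 0, 255).astype(np.uint8)
```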
10003. And coding the video image 3 to obtain a main coding stream.
10004. And decoding the main coding stream to obtain a video image 4.
The video image 4 is an SDR video image with a resolution of 2K, the resolution of video image 4 is the same as that of video image 3, and the image content of video image 4 is substantially the same as that of video image 3 (encoding and decoding may introduce some distortion, so the image content of video image 4 can differ slightly from that of video image 3).
The video image 4 with the resolution of 2K is obtained by encoding and decoding the video image 3 through the step 10003 and the step 10004.
10005. The video image 4 is subjected to super-resolution processing and reverse tone mapping processing to obtain a video image 5.
The video image 5 is an HDR image having a resolution of 4K, and the resolution of the video image can be doubled by performing super-resolution processing on the video image 4. In step 10005, the super-resolution processing and inverse tone mapping may be performed on the video image 4 by using a neural network model to obtain the video image 5, which is similar to step 8004 in the method shown in fig. 18 and will not be described in detail here.
10006. Residual values are determined from video image 1 and video image 5.
Both video image 1 and video image 5 are HDR images with a resolution of 4K, and the residual values (the differences between the pixel values of video image 1 and the pixel values of video image 5) can be obtained by subtracting the pixel values of video image 5 from the pixel values of video image 1.
Of course, the residual value may be only the difference between the luminance component value of the pixel point of the video image 1 and the luminance component value of the pixel point of the video image 5, or the residual value may be the difference between the luminance component value and the chrominance component value of the pixel point of the video image 1 and the luminance component value and the chrominance component value of the pixel point of the video image 5, respectively.
10007. And coding the residual error value to obtain an auxiliary coding stream.
The encoding process for the video image 1 is completed through the above steps 10001 to 10007, and a main encoding stream and an auxiliary encoding stream are obtained.
It should be understood that the encoding process shown in fig. 20 may correspond to the decoding process shown in fig. 19, and the decoding process shown in fig. 19 can decode the primary encoded stream and the secondary encoded stream encoded by the encoding process shown in fig. 20.
The video decoding method and the video encoding method according to the embodiment of the present application are described in detail above from the perspective of the decoding end and the encoding end in conjunction with fig. 17 and fig. 18, respectively, and in order to better understand the overall processes of video encoding and video decoding according to the embodiment of the present application, the video encoding process and the video decoding process according to the embodiment of the present application are described below in conjunction with fig. 21.
Fig. 21 is a schematic flow chart of a video decoding method and a video encoding process of an embodiment of the present application. The method shown in fig. 21 may be performed jointly by a video decoder (e.g., decoder 30 shown in fig. 1 or 2) and a video encoder (e.g., encoder 20 shown in fig. 1 or encoder 20 shown in fig. 2 above),
the video encoding and video decoding process shown in fig. 21 includes steps 20001 to 20011, which are described below.
20001. And carrying out downsampling processing on the video image 1 to obtain a video image obtained after downsampling.
20002. And carrying out tone mapping processing on the video image obtained after the down sampling to obtain the video image obtained after tone mapping.
20003. And coding the video image obtained after tone mapping to obtain a main coding stream.
20004. And decoding the main coding stream to obtain a first video image.
20005. And performing super-resolution processing and reverse tone mapping on the first video image by adopting a neural network model to obtain a second video image.
The second video image is an HDR video image, and the resolution of the second video image is the same as the resolution of the initial video image.
The neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as the initial video image, and the training sample image is a video image obtained by performing down-sampling, encoding and decoding on the training target image.
20006. Residual values of the initial video image relative to the second video image are determined.
20007. And coding the residual error value to obtain an auxiliary coding stream.
The above steps 20001 to 20007 can be executed by a video encoder, and the above steps 20001 to 20007 are equivalent to steps 8001 to 8006 in the method shown in fig. 18 (all process an input initial video image to obtain a main encoded stream and an auxiliary encoded stream).
20008. And decoding the auxiliary coded stream to obtain a residual value.
20009. Decoding the main coding stream to obtain a first video image, wherein the first video image is an SDR video image;
20010. and performing super-resolution processing and reverse tone mapping processing on the first video image by adopting a neural network model to obtain a second video image.
The second video image is a high dynamic range HDR video image, and the resolution of the second video image is greater than that of the first video image;
the neural network model is obtained by training according to a training sample image and a training target image, wherein the training target image is a video image with the same resolution as that of a target video image, and the training sample image is a video image obtained by performing down-sampling, encoding and decoding on the training target image.
20011. And superposing the pixel value and the residual value of the second video image to obtain a target video image.
The above-mentioned steps 20008 to 20011 can be executed by a video decoder, and the above-mentioned steps 20008 to 20011 correspond to the steps 7001 to 7004 in the method shown in fig. 17 (all by decoding the main encoded stream and the auxiliary encoded stream to obtain the final target video image).
The video decoding method and the video encoding method according to the embodiment of the present application are described in detail above with reference to the drawings, and the video decoder and the video encoder according to the embodiment of the present application are described below with reference to fig. 22 to 25, respectively, it is to be understood that the video decoder and the video encoder shown in fig. 22 to 25 can perform each step in the video decoding method and the video encoding method according to the embodiment of the present application, respectively. In order to avoid unnecessary repetition, the following description will appropriately omit repeated description when introducing the video decoder and the video encoder of the embodiments of the present application.
Fig. 22 is a schematic block diagram of a video decoder of an embodiment of the present application. The video decoder 30000 shown in fig. 22 includes: a decoding module 30010 and a processing module 30020.
The decoding module 30010 and the processing module 30020 in the video decoder 30000 may perform the steps of the video decoding method of the embodiment of the present application, for example, the decoding module 30010 and the processing module 30020 can perform both the steps 1001 to 1004 in the method shown in fig. 8 and the steps 7001 to 7004 in the method shown in fig. 17.
When the decoding module 30010 and the processing module 30020 execute steps 1001 to 1004 in the method illustrated in fig. 8, the specific functions of the decoding module 30010 and the processing module 30020 are as follows:
a decoding module 30010, configured to decode the main encoding stream to obtain a first video image;
the processing module 30020 is configured to perform super-resolution processing on the first video image by using a neural network model to obtain a second video image;
the decoding module 30010 is further configured to decode the auxiliary encoded stream to obtain a residual value;
the processing module 30020 is further configured to superimpose the pixel value of the second video image and the residual value to obtain a target video image;
optionally, the neural network model is trained according to a training sample image and a training target image, the training target image is a video image with the same resolution as the target video image, and the training sample image is a video image obtained by performing downsampling, encoding and decoding on the training target image.
When the decoding module 30010 and the processing module 30020 perform steps 7001 to 7004 in the method illustrated in fig. 17, the decoding module 30010 and the processing module 30020 function specifically as follows:
a decoding module 30010, configured to decode the main encoding stream to obtain a first video image, where the first video image is a standard dynamic range SDR video image;
the processing module 30020 is configured to process the first video image by using a neural network model to obtain a second video image, where the second video image is a high dynamic range HDR video image, and a resolution of the second video image is greater than a resolution of the first video image;
a decoding module 30010, configured to decode the auxiliary encoded stream to obtain a residual value;
the processing module 30020 is further configured to superimpose the pixel value of the second video image and the residual value to obtain a target video image.
Optionally, the neural network model is trained according to a training sample image and a training target image, the training target image is a video image with the same resolution as the target video image, and the training sample image is a video image obtained by performing downsampling, encoding and decoding on the training target image.
Fig. 23 is a schematic block diagram of a video encoder of an embodiment of the present application. The video encoder 40000 shown in fig. 23 includes: an encoding module 40010, a decoding module 40020 and a processing module 40030.
The encoding module 40010, the decoding module 40020, and the processing module 40030 in the video encoder 40000 may perform the steps of the video encoding method of the embodiment of the present application; for example, these modules can perform both steps 4001 to 4005 in the method shown in fig. 14 and steps 8001 to 8005 in the method shown in fig. 18.
When the video encoder 40000 performs steps 4001 to 4005 in the method illustrated in fig. 14, the respective modules in the video encoder 40000 function specifically as follows.
The encoding module 40010 is configured to perform downsampling and encoding on the initial video image to obtain a main encoding stream;
a decoding module 40020, configured to decode the main encoding stream to obtain a first video image;
a processing module 40030, configured to perform super-resolution processing on the first video image by using a neural network model to obtain a second video image with the same resolution as the initial video image;
the processing module 40030 is also for determining residual values of the initial video image relative to the second video image;
the encoding module 40010 is further configured to encode the residual value to obtain an auxiliary encoded stream.
Optionally, the neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as the initial video image, and the training sample image is a video image obtained by downsampling, encoding and decoding the training target image.
When the video encoder 40000 performs steps 8001 to 8005 in the method illustrated in fig. 18, the respective blocks in the video encoder 40000 function specifically as follows.
The processing module 40030 is configured to process an initial video image to obtain a processed video image, where the initial video image is an HDR video image with a high dynamic range, and the processed video image is an SDR video image with a standard dynamic range;
the encoding module 40010 is configured to encode the processed video image to obtain a main encoding stream;
the decoding module 40020 is configured to decode the main encoding stream to obtain a first video image; processing the first video image by adopting a neural network model to obtain a second video image, wherein the second video image is an HDR video image, and the resolution of the second video image is the same as that of the initial video image;
the processing module 40030 is further configured to determine residual values of the initial video image relative to the second video image;
the processing module 40030 is further configured to encode the residual value to obtain an auxiliary encoded stream.
Optionally, the neural network model is obtained by training according to a training sample image and a training target image, the training target image is a video image with the same resolution as the initial video image, and the training sample image is a video image obtained by downsampling, encoding and decoding the training target image.
Fig. 24 is a schematic block diagram of a video decoder of an embodiment of the present application. The video decoder 50000 shown in fig. 24 includes:
a memory 50010 for storing programs;
a processor 50020 for executing the programs stored in the memory 50010, wherein when the programs stored in the memory 50010 are executed, the processor 50020 is configured to perform the steps of the video decoding method according to the embodiment of the present application.
Fig. 25 is a schematic block diagram of a video encoder of an embodiment of the present application. The video encoder 60000 shown in fig. 25 includes:
a memory 60010 for storing programs;
a processor 60020 configured to execute the programs stored by the memory 60010, when the programs stored by the memory 60010 are executed, the processor 60020 is configured to perform the steps of the video encoding method according to the embodiments of the present application.
A specific structure of the video decoder 30000 in fig. 22 and the video decoder 50000 in fig. 24 may be as shown by the decoder 30 in fig. 2, and a specific structure of the video encoder 40000 in fig. 23 and the video encoder 60000 in fig. 25 may be as shown by the encoder 20 in fig. 2.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A video decoding method, comprising:
decoding the main coding stream to obtain a first video image;
performing super-resolution processing on the first video image by adopting a neural network model to obtain a second video image;
decoding the auxiliary encoded stream to obtain residual values;
and superposing the pixel value of the second video image and the residual value to obtain a target video image.
2. The method of claim 1, wherein the neural network model is trained from training sample images and training target images, the training target images being video images having the same resolution as the target video images, the training sample images being video images resulting from downsampling, encoding, and decoding the training target images.
3. The method of claim 1 or 2, wherein the super-resolution processing of the first video image by using the neural network model to obtain the second video image comprises:
performing super-resolution processing on the pixel point brightness component values of the first video image by adopting the neural network model to obtain pixel point brightness component values of the second video image;
the superimposing the pixel value of the second video image and the residual value to obtain the target video image includes:
and superposing the pixel point brightness component value of the second video image with the residual value to obtain the pixel point brightness component value of the target video image.
4. The method of claim 3, wherein the method further comprises:
and carrying out interpolation processing on the pixel point chromaticity component values of the first video image to obtain the pixel point chromaticity component values of the target video image.
5. The method of any of claims 1-4, wherein the first video image is a High Dynamic Range (HDR) image or a Standard Dynamic Range (SDR) image.
6. A video encoding method, comprising:
performing down-sampling and coding processing on the initial video image to obtain a main coding stream;
decoding the main coding stream to obtain a first video image;
performing super-resolution processing on the first video image by adopting a neural network model to obtain a second video image with the same resolution as the initial video image;
determining a residual value of the initial video image relative to the second video image;
and coding the residual value to obtain an auxiliary coding stream.
7. The method of claim 6, wherein the neural network model is trained based on training sample images and training target images, the training target images being video images of the same resolution as the initial video images, the training sample images being video images obtained by downsampling, encoding, and decoding the training target images.
8. The method of claim 6 or 7, wherein the performing super-resolution processing on the first video image using the neural network model to obtain the second video image having the same resolution as the initial video image comprises:
performing super-resolution processing on pixel luminance component values of the first video image using the neural network model to obtain pixel luminance component values of the second video image; and
wherein the determining residual values of the initial video image relative to the second video image comprises:
determining differences between the pixel luminance component values of the initial video image and the pixel luminance component values of the second video image as the residual values.
9. The method of any one of claims 6 to 8, wherein the initial video image is a high dynamic range (HDR) image or a standard dynamic range (SDR) image.
10. A video decoder, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, wherein, when executing the program stored in the memory, the processor is configured to:
decode a main encoded stream to obtain a first video image;
perform super-resolution processing on the first video image using a neural network model to obtain a second video image;
decode an auxiliary encoded stream to obtain residual values; and
superimpose pixel values of the second video image with the residual values to obtain a target video image.
11. The video decoder of claim 10, wherein the neural network model is trained from training sample images and training target images, the training target images being video images having the same resolution as the target video image, and the training sample images being video images obtained by downsampling, encoding, and decoding the training target images.
12. The video decoder of claim 10 or 11, wherein the processor is configured to:
perform super-resolution processing on pixel luminance component values of the first video image using the neural network model to obtain pixel luminance component values of the second video image; and
superimpose the pixel luminance component values of the second video image with the residual values to obtain pixel luminance component values of the target video image.
13. The video decoder of claim 12, wherein the processor is further configured to:
perform interpolation on pixel chrominance component values of the first video image to obtain pixel chrominance component values of the target video image.
14. The video decoder of any one of claims 10 to 13, wherein the first video image is a high dynamic range (HDR) image or a standard dynamic range (SDR) image.
15. A video encoder, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, wherein, when executing the program stored in the memory, the processor is configured to:
downsample and encode an initial video image to obtain a main encoded stream;
decode the main encoded stream to obtain a first video image;
perform super-resolution processing on the first video image using a neural network model to obtain a second video image having the same resolution as the initial video image;
determine residual values of the initial video image relative to the second video image; and
encode the residual values to obtain an auxiliary encoded stream.
16. The video encoder of claim 15, wherein the neural network model is trained from training sample images and training target images, the training target images being video images having the same resolution as the initial video image, and the training sample images being video images obtained by downsampling, encoding, and decoding the training target images.
17. The video encoder of claim 15 or 16, wherein the processor is configured to:
perform super-resolution processing on pixel luminance component values of the first video image using the neural network model to obtain pixel luminance component values of the second video image; and
determine differences between the pixel luminance component values of the initial video image and the pixel luminance component values of the second video image as the residual values.
18. The video encoder of any one of claims 15 to 17, wherein the initial video image is a high dynamic range (HDR) image or a standard dynamic range (SDR) image.
19. An apparatus for decoding video data, comprising:
a receiver configured to receive a main encoded stream of video data and an auxiliary encoded stream of the video data, and to input the main encoded stream and the auxiliary encoded stream to a video decoder for decoding; and
the video decoder of any one of claims 10 to 14.
20. An apparatus for encoding video data, comprising:
a receiver configured to receive an initial video image and to input the initial video image to a video encoder for encoding; and
the video encoder of any one of claims 15 to 18.
CN201910279326.5A 2019-04-09 2019-04-09 Video decoding method, video encoding method, video decoder and video encoder Pending CN111800629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910279326.5A CN111800629A (en) 2019-04-09 2019-04-09 Video decoding method, video encoding method, video decoder and video encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910279326.5A CN111800629A (en) 2019-04-09 2019-04-09 Video decoding method, video encoding method, video decoder and video encoder

Publications (1)

Publication Number Publication Date
CN111800629A true CN111800629A (en) 2020-10-20

Family

ID=72804848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910279326.5A Pending CN111800629A (en) 2019-04-09 2019-04-09 Video decoding method, video encoding method, video decoder and video encoder

Country Status (1)

Country Link
CN (1) CN111800629A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112565887A (en) * 2020-11-27 2021-03-26 紫光展锐(重庆)科技有限公司 Video processing method, device, terminal and storage medium
CN113727050A (en) * 2021-11-04 2021-11-30 山东德普检测技术有限公司 Video super-resolution processing method and device for mobile equipment and storage medium
CN114363702A (en) * 2021-12-28 2022-04-15 上海网达软件股份有限公司 Method, device and equipment for converting SDR video into HDR video and storage medium
CN114554225A (en) * 2020-11-26 2022-05-27 珠海格力电器股份有限公司 Image coding method, device, equipment and computer readable medium
CN114745556A (en) * 2022-02-07 2022-07-12 浙江智慧视频安防创新中心有限公司 Encoding method, encoding device, digital video film system, electronic device, and storage medium
CN116862769A (en) * 2023-07-04 2023-10-10 深圳市晶帆光电科技有限公司 Image resolution improving method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807291A (en) * 2010-04-27 2010-08-18 浙江大学 Biquadratic B-spline local interpolation-based image scaling method
CN103607591A (en) * 2013-10-28 2014-02-26 四川大学 Image compression method combining super-resolution reconstruction
KR101394164B1 (en) * 2013-01-17 2014-05-16 국방과학연구소 High-accuracy video tracking method with super-resolution for video tracker, and high-accuracy video tracking apparatus by using the same
CN105678700A (en) * 2016-01-11 2016-06-15 苏州大学 Image interpolation method and system based on prediction gradient
CN107123089A (en) * 2017-04-24 2017-09-01 中国科学院遥感与数字地球研究所 Remote sensing images super-resolution reconstruction method and system based on depth convolutional network
CN107181949A (en) * 2017-06-23 2017-09-19 四川大学 An image compression framework combining super-resolution and residual coding technology
WO2018212599A1 (en) * 2017-05-17 2018-11-22 Samsung Electronics Co., Ltd. Super-resolution processing method for moving image and image processing apparatus therefor

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807291A (en) * 2010-04-27 2010-08-18 浙江大学 Biquadratic B-spline local interpolation-based image scaling method
KR101394164B1 (en) * 2013-01-17 2014-05-16 국방과학연구소 High-accuracy video tracking method with super-resolution for video tracker, and high-accuracy video tracking apparatus by using the same
CN103607591A (en) * 2013-10-28 2014-02-26 四川大学 Image compression method combining super-resolution reconstruction
CN105678700A (en) * 2016-01-11 2016-06-15 苏州大学 Image interpolation method and system based on prediction gradient
CN107123089A (en) * 2017-04-24 2017-09-01 中国科学院遥感与数字地球研究所 Remote sensing images super-resolution reconstruction method and system based on depth convolutional network
WO2018212599A1 (en) * 2017-05-17 2018-11-22 Samsung Electronics Co., Ltd. Super-resolution processing method for moving image and image processing apparatus therefor
CN107181949A (en) * 2017-06-23 2017-09-19 四川大学 An image compression framework combining super-resolution and residual coding technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHAO DONG: "Image Super-Resolution Using Deep Convolutional Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114554225A (en) * 2020-11-26 2022-05-27 珠海格力电器股份有限公司 Image coding method, device, equipment and computer readable medium
CN114554225B (en) * 2020-11-26 2023-05-12 珠海格力电器股份有限公司 Image encoding method, apparatus, device and computer readable medium
CN112565887A (en) * 2020-11-27 2021-03-26 紫光展锐(重庆)科技有限公司 Video processing method, device, terminal and storage medium
CN113727050A (en) * 2021-11-04 2021-11-30 山东德普检测技术有限公司 Video super-resolution processing method and device for mobile equipment and storage medium
CN114363702A (en) * 2021-12-28 2022-04-15 上海网达软件股份有限公司 Method, device and equipment for converting SDR video into HDR video and storage medium
CN114363702B (en) * 2021-12-28 2023-09-08 上海网达软件股份有限公司 Method, device, equipment and storage medium for converting SDR video into HDR video
CN114745556A (en) * 2022-02-07 2022-07-12 浙江智慧视频安防创新中心有限公司 Encoding method, encoding device, digital video film system, electronic device, and storage medium
CN114745556B (en) * 2022-02-07 2024-04-02 浙江智慧视频安防创新中心有限公司 Encoding method, encoding device, digital retina system, electronic device, and storage medium
CN116862769A (en) * 2023-07-04 2023-10-10 深圳市晶帆光电科技有限公司 Image resolution improving method and device
CN116862769B (en) * 2023-07-04 2024-05-10 深圳市晶帆光电科技有限公司 Image resolution improving method and device

Similar Documents

Publication Publication Date Title
CN111800629A (en) Video decoding method, video encoding method, video decoder and video encoder
CN111327903B (en) Method and device for predicting chroma block
US20230069953A1 (en) Learned downsampling based cnn filter for image and video coding using learned downsampling feature
CN111491168A (en) Video coding and decoding method, decoder, encoder and related equipment
CN114339262B (en) Entropy encoding/decoding method and device
WO2020103800A1 (en) Video decoding method and video decoder
CN116320454A (en) Method and device for predicting chroma block
CN112995663B (en) Video coding method, video decoding method and corresponding devices
US20230076920A1 (en) Global skip connection based convolutional neural network (cnn) filter for image and video coding
CN112465698A (en) Image processing method and device
US20230209096A1 (en) Loop filtering method and apparatus
WO2022063265A1 (en) Inter-frame prediction method and apparatus
CN114915783A (en) Encoding method and apparatus
JP2022535859A (en) Method for constructing MPM list, method for obtaining intra-prediction mode of chroma block, and apparatus
WO2021008535A1 (en) Image encoding method, image decoding method, device, and storage medium
CN114125446A (en) Image encoding method, decoding method and device
WO2022266955A1 (en) Image decoding method and apparatus, image processing method and apparatus, and device
WO2021008524A1 (en) Image encoding method and device, image decoding method and device, and storage medium
US20230239500A1 (en) Intra Prediction Method and Apparatus
CN114584776A (en) Method and device for decoding intra-frame prediction mode
JP2024503712A (en) Scalable encoding and decoding method and apparatus
CN115883831A (en) Encoding and decoding method and device
KR20230145096A (en) Independent localization of auxiliary information in neural network-based picture processing.
CN115706798A (en) Entropy encoding and decoding method and device
WO2022211657A9 (en) Configurable positions for auxiliary information input into a picture data processing neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201020