CN118216149A - Decoding method, encoding method, decoder, encoder, and encoding/decoding system - Google Patents

Decoding method, encoding method, decoder, encoder, and encoding/decoding system

Info

Publication number
CN118216149A
Authority
CN
China
Prior art keywords
scale
feature
features
flow information
image
Prior art date
Legal status
Pending
Application number
CN202180104061.0A
Other languages
Chinese (zh)
Inventor
马展 (Ma Zhan)
刘浩杰 (Liu Haojie)
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Publication of CN118216149A


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present application provide a decoding method, an encoding method, a decoder, an encoder, and a codec system. According to the embodiments of the application, a first decoding feature of a first image and optical flow information of the first decoding feature at different scales are obtained from a code stream, at least two reference features of different scales of a reconstructed image of a second image are obtained, and a first predicted image is then determined according to the optical flow information of the at least two different scales and the reference features of the different scales. Because optical flow information and reference features at least two different scales can better capture motion at different speeds, the scheme of the embodiments of the present application can adapt to videos of different resolutions, which helps to improve the stability of video prediction and further helps to reduce or eliminate blurring, ghosting, and similar phenomena at motion edges or occlusion regions of the video.

Description

Decoding method, encoding method, decoder, encoder, and encoding/decoding system

Technical Field
Embodiments of the present application relate to the field of video compression, and more particularly, to a decoding method, an encoding method, a decoder, an encoder, and a codec system.
Background
Video compression technology compresses massive digital video data to facilitate transmission, storage, and the like. Media data such as video images occupy about 80% of network transmission and storage resources and will continue to grow rapidly in the coming decades, which poses great challenges to future video compression and transmission techniques. Since its inception, video compression technology has undergone several generations of change and has been successfully deployed in video services worldwide. Meanwhile, improvements in video coding efficiency have mainly come from increasingly complex technical iterations within the hybrid coding framework, where large amounts of coding complexity are gradually traded for performance gains, which in turn places ever higher demands on hardware architecture design.
As deep learning has developed and matured, learning-based video image processing and coding have been widely studied. Some existing deep-learning techniques start from the individual modules of a traditional hybrid codec and replace them with neural networks trained to improve performance, such as block partitioning, mode selection, loop filtering, intra prediction, and inter prediction. Such methods mix in neural-network algorithms while keeping the traditional coding framework unchanged as a whole. In contrast, end-to-end codec technology is data-driven: it adopts an all-neural-network design with end-to-end training and optimization, combined with rate-distortion optimization of the whole framework, to obtain a network-based encoder and decoder. However, the performance of end-to-end video codecs still needs to be further improved.
Disclosure of Invention
Embodiments of the present application provide a decoding method, an encoding method, a decoder, an encoder, and a codec system, which can further improve the performance of end-to-end video encoding and decoding.
In a first aspect, there is provided an end-to-end based decoding method, the method comprising:
acquiring a code stream;
decoding the code stream to obtain a first decoding feature of a first image;
determining optical flow information of at least two different scales according to the first decoding feature;
extracting features of the reconstructed image of a second image to obtain at least two reference features of different scales; and
determining a first predicted image according to the optical flow information of the at least two different scales and the reference features of the at least two different scales.
In a second aspect, there is provided an end-to-end based encoding method, comprising:
performing feature extraction on the first image and a reconstructed image of the second image to obtain a coding feature of the first image;
quantizing the coding feature to obtain a quantized coding feature; and
encoding the quantized coding feature to obtain a code stream.
In a third aspect, there is provided an end-to-end based decoder, comprising:
an acquisition unit, configured to acquire a code stream;
a first decoding unit, configured to decode the code stream to obtain a first decoding feature of a first image;
a second decoding unit, configured to determine optical flow information of at least two different scales according to the first decoding feature;
a feature extraction unit, configured to perform feature extraction on the reconstructed image of a second image to obtain at least two reference features of different scales; and
a determining unit, configured to determine a first predicted image according to the optical flow information of the at least two different scales and the reference features of the at least two different scales.
In a fourth aspect, there is provided an end-to-end based encoder comprising:
a feature extraction unit, configured to perform feature extraction on the first image and a reconstructed image of the second image to obtain a coding feature of the first image;
a quantization unit, configured to quantize the coding feature to obtain a quantized coding feature; and
an encoding unit, configured to encode the quantized coding feature to obtain a code stream.
In a fifth aspect, there is provided an end-to-end based codec system comprising the decoder of the third aspect and the encoder of the fourth aspect.
In a sixth aspect, an electronic device is provided that includes a processor and a memory. The memory is for storing a computer program and the processor is for calling and running the computer program stored in the memory for performing the method of the first aspect or the method of the second aspect described above.
In a seventh aspect, a chip is provided, including: a processor for calling and running a computer program from a memory, causing a device on which the chip is mounted to perform the method of the first aspect or the method of the second aspect as described above.
In an eighth aspect, a computer-readable storage medium is provided for storing a computer program that causes a computer to execute the method in the first aspect or the method in the second aspect.
In a ninth aspect, there is provided a computer program product comprising computer program instructions for causing a computer to perform the method of the first aspect, or the method of the second aspect, as described above.
In a tenth aspect, there is provided a computer program which, when run on a computer, causes the computer to perform the method of the first or second aspect described above.
In the embodiments of the present application, a first decoding feature of a first image and optical flow information of the first decoding feature at different scales are obtained from the code stream, at least two reference features of different scales of a reconstructed image of a second image are obtained, and a first predicted image is then determined according to the optical flow information of the at least two different scales and the reference features of the different scales. The scheme of the embodiments of the present application can better adapt to videos of different resolutions, which helps to improve the stability of video prediction and further helps to reduce or eliminate blurring, ghosting, and similar phenomena at motion edges or occlusion regions of the video.
Drawings
FIG. 1 is a schematic diagram of an end-to-end codec framework provided by an embodiment of the present application;
FIG. 2A is a schematic flow chart of an encoding method provided by an embodiment of the present application;
FIG. 2B is a schematic flow chart of a decoding method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of another end-to-end based codec framework provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a multi-scale optical-flow decoder according to an embodiment of the present application;
FIG. 5 is a schematic illustration of a multi-scale reference feature provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of multi-scale motion compensation provided by an embodiment of the present application;
FIG. 7 is another schematic diagram of multi-scale motion compensation provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of a decoder provided by an embodiment of the present application;
FIG. 9 is a schematic block diagram of an encoder provided by an embodiment of the present application;
fig. 10 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
First, an applicable codec framework according to an embodiment of the present application will be described with reference to FIG. 1. FIG. 1 shows a schematic diagram of an end-to-end codec framework 100 according to an embodiment of the present application. The codec framework 100 is a learning-based end-to-end video codec framework. As shown in FIG. 1, the codec framework 100 includes an encoder 110 and a decoder 120. Illustratively, the encoding side may include the encoder 110 and the decoding side may include the decoder 120, which is not limited in this application.
The encoder 110 may be configured to encode a current frame (also referred to as a current encoded frame) in the video sequence based on machine learning, for example, perform processing such as feature transformation, probability estimation, quantization, and arithmetic encoding on the current frame, to obtain a code stream. The decoder 120 may decode the obtained code stream based on machine learning, for example, perform processing such as feature transformation, probability estimation, quantization, and arithmetic decoding on the code stream, to obtain a predicted image (may also be referred to as a predicted frame) of the current frame.
The end-to-end video codec framework may be deployed in an electronic device, for example and without limitation, an intelligent video storage and playback device, such as an electronic product (mobile phone, television, computer, etc.) having a camera function, a video playback function, or a video storage function.
Machine learning, by way of example, includes deep learning, which may be implemented in particular by convolutional neural networks. As an example, the convolutional neural network may employ a learnable feature transformation method, a differentiable quantization method, and a dynamic prior and joint probability distribution to more efficiently remove information redundancy between video images, resulting in a more compact spatial representation of video image features, thereby facilitating reconstruction of higher quality video images.
In the embodiment of the present application, the encoder 110 and the decoder 120 may be optimized simultaneously through the overall rate-distortion optimization of the codec framework 100. Illustratively, if the feature transformation, probability estimation, quantization, and other processes in the encoder 110 or the decoder 120 are continuously optimized, this helps achieve better rate-distortion optimization for the entire end-to-end codec framework.
Existing end-to-end codec techniques typically employ single-scale optical flow for motion compensation in inter-frame coding. However, single-scale motion information often has difficulty adapting to videos of different resolutions; especially at small resolutions, video prediction is often unstable, for example showing poor robustness to small displacements, which may cause blurring or ghosting at the motion edges or occlusion regions of the video.
In view of this, an embodiment of the present application provides an end-to-end decoding method in which optical flow information of different scales is obtained from a first decoding feature of a first image in the code stream, at least two reference features of different scales of a reconstructed image of a second image are obtained, and a first predicted image is then determined according to the optical flow information of the at least two different scales and the reference features of the different scales. Here, one scale corresponds to one resolution. The scheme of the embodiment of the present application can better adapt to videos of different resolutions, which helps to improve the stability of video prediction and further helps to reduce or eliminate blurring, ghosting, and similar phenomena at motion edges or occlusion regions of the video.
In some embodiments, a first compensation feature may be obtained according to the optical flow information of the at least two different scales and the at least two reference features of different scales, and a first predicted image may then be obtained according to the first compensation feature. By applying optical flow information of at least two different scales and reference features of at least two different scales, the embodiment of the present application can better compensate occlusion regions and fast or irregular motion, which helps generate better compensation textures at the motion edges or occlusion regions of the video, so as to achieve higher-precision inter-frame video prediction.
As a possible implementation, a first compensation pixel may be obtained according to the optical flow information of the at least two different scales and the reconstructed image of the second image, and the first predicted image may then be obtained according to the first compensation feature and the first compensation pixel. The embodiment of the present application can thus combine the pixel domain and the feature domain for motion compensation, which helps to better predict optical flow information in the feature domain and further helps to reduce or eliminate blurring or ghosting at the motion edges or occlusion regions of the video, so as to achieve higher-precision inter-frame video prediction.
The end-to-end encoding method and decoding method provided by the embodiment of the application are described in detail below in conjunction with an end-to-end encoding/decoding framework.
Fig. 2A shows a schematic flow chart of an encoding method 2001 provided by an embodiment of the present application. The method 2001 may be applied in an end-to-end video codec system, such as an encoder in the codec framework 100 in fig. 1. As shown in fig. 2A, method 2001 includes steps 210 through 230.
Fig. 2B shows a schematic flow chart of a decoding method 2002 provided by an embodiment of the present application. The method 2002 may be applied in an end-to-end video codec system, such as a decoder in the codec framework 100 of fig. 1. As shown in fig. 2B, method 2002 includes steps 240 through 290.
Fig. 3 shows a schematic diagram of another end-to-end based codec framework 300 according to an embodiment of the present application, where the codec framework 300 may be a specific refinement of the codec framework 100. For example, the feature extraction 301, the quantization 302 and the encoding unit 303 may be modules or units included in the encoder 110, i.e., the feature extraction 301, the quantization 302 and the encoding unit 303 may be provided at the encoding end. The decoding unit 304, the optical flow decoder 305, and the motion compensation 306 may be modules or units included in the decoder 120, i.e., the decoding unit 304, the optical flow decoder 305, and the motion compensation 306 may be provided at the decoding end. Optionally, the codec framework 300 may further include a rate-distortion optimization 307 for simultaneously optimizing each module through the overall rate-distortion optimization of the codec framework 300.
The encoding method 2001 will be described in detail below with reference to fig. 3.
210, optionally, performing feature extraction on the first image and a reconstructed frame of the second image to obtain a coding feature of the first image.
Here, the second image is, for example, the frame preceding the first image, i.e., the image at the previous time, but this is not limiting: the second image may also be an image at any time before the first image, an image at any time after the first image, or the like. Illustratively, the first image and a reconstructed image (which may also be referred to as a reconstructed frame) of the second image may be concatenated in the channel dimension to obtain a high-dimensional input, which may then be inter-frame feature encoded (e.g., by an inter-frame feature encoder) to obtain the coding feature of the first image.
Illustratively, taking the first image as the current frame (at time t) and the second image as the image at the previous time (time t-1), the reconstructed frame at the previous time (which may be expressed as X̂_(t-1)) and the current frame (which may be denoted as X_t) may be concatenated, e.g., X_cat = Cat(X̂_(t-1), X_t), to obtain a concatenated map X_cat. As an example, when the reconstructed frame X̂_(t-1) and the current frame X_t are each 3-channel video frames in the Red Green Blue (RGB) domain, X_cat is a 6-channel high-dimensional input after channel concatenation.
The concatenated high-dimensional input X_cat may then be input to an inter-frame feature encoder E_F (one example of the feature extraction 301), resulting in a time-domain coding feature X_F as the coding feature of the first image. Here, X_F = E_F(X_cat). In some embodiments, the inter-frame feature encoder E_F may include non-local self-attention feature extraction modules (e.g., multi-level) and downsampling modules. As a specific example, the inter-frame feature encoder E_F may include 4 non-local self-attention feature extraction modules and 4 downsampling modules. In other embodiments, the inter-frame feature encoder E_F may use a feature extraction module such as a convolution module, a residual module, or a dense connection module instead of the non-local self-attention feature extraction module, which is not limited by the present application.
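For illustration only, the following is a minimal PyTorch-style sketch of the channel concatenation and the inter-frame feature encoder E_F described above. The channel counts and the use of plain convolution blocks in place of the non-local self-attention feature extraction modules are assumptions of this sketch, not the actual network of the present application.

```python
import torch
import torch.nn as nn

class InterFrameFeatureEncoder(nn.Module):
    """Sketch of E_F: 4 feature-extraction stages, each followed by 2x downsampling.
    Plain conv blocks stand in for the non-local self-attention modules."""
    def __init__(self, in_ch=6, ch=128):
        super().__init__()
        stages = []
        c = in_ch
        for _ in range(4):
            stages += [
                nn.Conv2d(c, ch, 3, padding=1), nn.ReLU(inplace=True),  # feature extraction (placeholder)
                nn.Conv2d(ch, ch, 3, stride=2, padding=1),               # 2x downsampling
            ]
            c = ch
        self.net = nn.Sequential(*stages)

    def forward(self, x_prev_rec, x_cur):
        # Channel-wise concatenation of the reconstructed previous frame and the current frame.
        x_cat = torch.cat([x_prev_rec, x_cur], dim=1)    # (N, 6, H, W) for two RGB frames
        return self.net(x_cat)                            # time-domain coding feature X_F at 1/16 resolution

# Example: two 3-channel frames -> X_F
x_prev = torch.rand(1, 3, 256, 256)
x_cur = torch.rand(1, 3, 256, 256)
x_f = InterFrameFeatureEncoder()(x_prev, x_cur)           # shape (1, 128, 16, 16)
```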
220, Optionally, quantizing the encoded features to obtain quantized encoded features.
For example, the extracted time-domain coding feature X_F may be input to the quantization 302 to perform differentiable quantization, so as to obtain a quantized integer time-domain coding feature. For example, the float32 floating-point time-domain coding feature X_F may be converted to a quantized integer time-domain coding feature X̂_F (which may also be referred to as the quantized time-domain coding feature). As a specific example, X_F may be quantized with a forward propagation calculated during training as X̂_F = X_F + u, where u is drawn from U(-0.5, 0.5), and at test time as X̂_F = Round(X_F), where Round(·) is a rounding function and U(-0.5, 0.5) is a uniform noise distribution of plus or minus 0.5.
The quantization function can then be approximated as a linear function during training, i.e., the derivative of X̂_F with respect to X_F is approximated so that the corresponding backpropagation gradient is 1, and the backpropagated parameters are updated using 1 as the approximate gradient. That is, during training, the time-domain coding feature can be quantized in the forward pass using the procedure described here, and the approximate derivative can be used to generate approximate gradients for optimizing the network parameters during the optimization and backpropagation of the neural network as a whole. It should be noted that the noise distribution is not limited to a uniform distribution; for example, a mixed noise distribution may be used instead during training to increase the randomness and robustness of the training process. In addition, the noise distribution in forward propagation is limited to the training process, the overall expectation of the noise is 0, and it can be replaced by the actual rounding function Round(·) during actual testing.
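A minimal sketch of the differentiable quantization described above, assuming additive uniform noise in the training forward pass and hard rounding with a straight-through (identity) gradient at test time; the exact formulation used in the present application may differ.

```python
import torch

def quantize(x_f: torch.Tensor, training: bool) -> torch.Tensor:
    """Differentiable quantization of the time-domain coding feature X_F.
    Training: add zero-mean uniform noise U(-0.5, 0.5) as a proxy for rounding.
    Test: hard rounding; the gradient is passed through as 1 (straight-through)."""
    if training:
        noise = torch.empty_like(x_f).uniform_(-0.5, 0.5)
        return x_f + noise
    # x_f + (round(x_f) - x_f).detach() rounds in the forward pass
    # while keeping d(output)/d(x_f) = 1 in the backward pass.
    return x_f + (torch.round(x_f) - x_f).detach()
```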
230, Optionally, encoding the quantized encoding features to obtain a code stream.
Illustratively, step 230 may be performed by the encoding unit 303. As a possible implementation, the code stream may be acquired through the following steps 3-1 to 3-7.
Step 3-1: transforming the quantized coding feature to obtain a more compact coding feature. Illustratively, the time-domain coding feature X_F extracted above may be further feature-transformed and downsampled to yield a more compact feature X_h, which may be obtained, for example, by 2 non-local self-attention transforms and 2 downsampling operations. It should be noted that feature extraction modules such as a convolution module, a residual module, or a dense connection module may be used instead of the non-local self-attention transform module for feature extraction, which is not limited by the present application.
Step 3-2: the more compact feature X_h may be quantized to obtain a quantized compact coding feature. Specifically, the quantization process may refer to the description of step 220, which is not repeated here.
Step 3-3: inputting the quantized compact coding feature into a probability model to obtain the original probability distribution of the compact coding feature. Based on the original probability distribution, the compact coding feature can be encoded using arithmetic coding and written to a file, yielding the binary code stream corresponding to the compact coding feature.
Step 3-4: performing arithmetic decoding on the binary code stream of the compact coding feature to obtain the quantized compact decoding feature. Optionally, the arithmetic decoder is a generic reversible lossless coder corresponding to the above arithmetic encoding, which can losslessly write or restore the code stream or coding features.
Step 3-5, inverse transforming the quantized compact decoding features, e.g. by 2 non-local self-attention transforms and 2 upsampling, to obtain reconstructed prediction features.
Step 3-6: based on the reconstructed prediction features, generating mean and variance matrices of the same size as the quantized coding feature; for each value in the matrices, generating a corresponding independent Gaussian distribution; and, based on each Gaussian distribution, predicting the probability of each pixel of the quantized time-domain coding feature X̂_F, yielding a probability matrix of the same size as the quantized time-domain coding feature X̂_F.
Step 3-7: generating the binary code stream of the quantized time-domain coding feature using arithmetic coding, based on the predicted probability matrix obtained in step 3-6.
It should be noted that, steps 3-1 to 3-7 above show steps or operations for acquiring a code stream, but these steps or operations are only examples, and other operations or variations of each of steps 3-1 to 3-7 may also be performed in the embodiments of the present application. Furthermore, the various steps of steps 3-1 through 3-7 may be performed in a different order than presented above, and it is possible that not all of the operations therein are to be performed.
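The per-element probabilities used by the arithmetic coder in steps 3-6 and 3-7 can be illustrated as follows. This sketch assumes a discretized Gaussian model (the probability mass of each quantization bin under the predicted mean and variance), which is a common choice but only one possible realization of the probability model mentioned above.

```python
import torch

def gaussian_likelihood(x_hat: torch.Tensor, mean: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Probability of each quantized feature value under an independent Gaussian,
    integrated over the quantization bin [x_hat - 0.5, x_hat + 0.5]."""
    scale = scale.clamp(min=1e-6)
    dist = torch.distributions.Normal(mean, scale)
    upper = dist.cdf(x_hat + 0.5)
    lower = dist.cdf(x_hat - 0.5)
    return (upper - lower).clamp(min=1e-9)

# The estimated bit cost of the quantized feature (as used in rate-distortion optimization)
# is -log2 of these probabilities summed over all elements:
# bits = -torch.log2(gaussian_likelihood(x_hat, mean, scale)).sum()
```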
The decoding method 2002 is described in detail below in conjunction with FIG. 3.
240, decoding the code stream to obtain a first decoding feature of the first image.
Illustratively, the decoding unit 304 may perform step 240. In some embodiments, the decoding unit 304 may receive the code stream from an external device (e.g., another electronic device). When the decoding unit 304 receives the code stream from an external device, that external device may include, for example, the encoder 110 (e.g., the feature extraction 301, the quantization 302, and the encoding unit 303) of the video codec framework 100 shown in FIG. 1 or FIG. 3.
In some embodiments, the decoding unit 304 may receive the code stream from the encoding unit 303 when the codec framework 300 is trained to optimize the parameters of the codec framework 300.
As a possible implementation, the code stream may be losslessly decoded point by point using arithmetic decoding with the probability matrix obtained in step 3-6 above, restoring the quantized time-domain decoding feature as the first decoding feature. As an example, a generic arithmetic decoder may be used to recover the quantized time-domain decoding feature, where the probability matrix may be predicted from the compact coding feature as in step 230 described above. It should be noted that the accuracy with which the compact coding feature is restored directly affects the accuracy of the predicted probability matrix, and the higher the accuracy of the probability matrix, the smaller the resulting binary code stream.
In some embodiments, the restored quantized time-domain decoding feature and the quantized time-domain coding feature are the same matrix, which may also be denoted X̂_F. The difference between the two is that the quantized time-domain decoding feature is generated at the decoding end of the codec framework, while the quantized time-domain coding feature is generated at the encoding end of the codec framework.
250, determining optical flow information of at least two different scales according to the first decoding feature. For example, the first decoding feature may be input to an optical flow decoder, resulting in the optical flow information of the at least two different scales. That is, step 250 may be performed by the optical flow decoder 305.
In some alternative embodiments, the first decoding feature may be up-sampled at different scales to obtain optical flow information of the at least two different scales, where up-sampling at one scale results in optical flow information of a corresponding scale (e.g., the same scale).
As one possible implementation, the optical-flow decoder may comprise a multi-scale optical-flow decoder, which may comprise at least two different-scale optical-flow decoding modules in cascade, wherein the optical-flow decoding modules may comprise an upsampling unit. Specifically, the up-sampling unit may obtain optical flow information of different scales by up-sampling the decoded features. Therefore, the embodiment of the application can obtain the corresponding optical flow information of at least two different scales by cascading the optical flow decoding modules of at least two different scales and up-sampling the decoding characteristics by the optical flow decoding modules.
In the embodiment of the application, the first decoding feature can be input into the multi-scale optical flow decoder to obtain at least two optical flow information of different scales of the first decoding feature, wherein the optical flow decoding module of one scale corresponds to the optical flow information of the same scale. That is, the optical flow decoding module of one scale can obtain optical flow information of the same scale.
FIG. 4 shows a schematic diagram of a multi-scale optical-flow decoder 305 provided by an embodiment of the application. As shown in FIG. 4, the multi-scale optical-flow decoder 305 may include 5 cascaded optical-flow decoding modules of different scales, and may output optical-flow information of 5 different scales. Each optical flow decoding module includes a non-local self-attention module and an upsampling module. The non-local self-attention module can provide better feature transformation capability for the input, and the feature upsampling module, which can combine non-local additional information to obtain spatially adaptive activations, can upsample the features output by the non-local self-attention module to obtain time-domain decoding features of the corresponding scale.
Optionally, a convolution and non-linear transform (ReLU) may also be provided after each optical flow decoding module for converting the multi-channel time domain decoding feature into a 2-channel decoded optical flow.
It should be understood that the multi-scale optical-flow decoder 305 in FIG. 4 is only one possible schematic, and should not be construed as limiting in any way on the embodiments of the present application. In a particular implementation, one or more layers of convolution or nonlinear transformation may be provided. And, by providing a multi-layer convolution stack and better nonlinear transformation, the accuracy of generating optical flow information can be improved. In addition, alternatively, multiple optical flow decoding modules may share the same convolution and nonlinear transforms, which are not limited by embodiments of the present application.
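For illustration only, a minimal sketch of a 5-scale optical flow decoder along the lines of FIG. 4 is given below. The channel counts are assumptions, plain convolution plus transposed-convolution blocks stand in for the non-local self-attention and upsampling modules, and each scale is mapped to a 2-channel flow by a small conv + ReLU + conv head as described above.

```python
import torch
import torch.nn as nn

class MultiScaleFlowDecoder(nn.Module):
    """Sketch of D_F: cascaded decoding modules, each doubling resolution,
    with a conv + ReLU + conv head producing a 2-channel optical flow per scale."""
    def __init__(self, in_ch=128, ch=64, num_scales=5):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.flow_heads = nn.ModuleList()
        c = in_ch
        for _ in range(num_scales):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(c, ch, 3, padding=1), nn.ReLU(inplace=True),   # placeholder for non-local attention
                nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1),       # 2x upsampling
            ))
            self.flow_heads.append(nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, 2, 3, padding=1),                           # 2-channel decoded flow
            ))
            c = ch

    def forward(self, x_hat_f):
        flows, feat = [], x_hat_f
        for block, head in zip(self.blocks, self.flow_heads):
            feat = block(feat)             # time-domain decoding feature of the next scale
            flows.append(head(feat))       # optical flow information of that scale
        return flows                       # flows[0] is the first (coarsest) scale
```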
In some optional embodiments, the first decoding feature may be upsampled to obtain a first scale decoding feature; obtaining optical flow information of the first scale according to the decoding characteristics of the first scale, wherein the optical flow information of the at least two different scales comprises the optical flow information of the first scale; upsampling the first scale decoding feature to obtain a second scale decoding feature; and obtaining optical flow information of the second scale according to the decoding characteristics of the second scale, wherein the optical flow information of the at least two different scales comprises the optical flow information of the second scale.
As a possible implementation manner, the multi-scale optical-flow decoder may include a first-scale optical-flow decoding module and a second-scale optical-flow decoding module that are cascaded. At this time, as a specific implementation manner, the first decoding feature may be input to the optical flow decoding module of the first scale to obtain the decoding feature of the first scale, and then optical flow information of the first scale is obtained according to the decoding feature of the first scale, where the optical flow information of at least two different scales includes the optical flow information of the first scale. And then, inputting the decoding features of the first scale into a second-scale optical flow decoding module to obtain decoding features of the second scale, and obtaining optical flow information of the second scale according to the decoding features of the second scale, wherein the optical flow information of at least two different scales comprises the optical flow information of the second scale.
In some alternative embodiments, the resolution of the second scale is 2 times the resolution of the first scale.
As a specific example, the quantized time-domain decoding feature X̂_F obtained in step 240 may be input to a multi-scale optical-flow decoder (denoted as D_F) to generate optical-flow information (denoted as f̂_s), i.e., f̂_s = D_F(X̂_F), where s is a natural number representing the different resolutions of the optical flow; e.g., for s = 1, f̂_1 represents the decoded optical flow (i.e., optical flow information) of the first scale.
Taking the multi-scale optical flow decoder in fig. 4 as an example, an optical flow decoder having 5 scales may be provided, and optical flow information corresponding to 5 scales may be obtained. In the following, a specific example of generating optical flow information of 5 different scales is described by means of steps 5-1 to 5-4.
Step 5-1: inputting the quantized time-domain decoding feature X̂_F into the first-scale optical flow decoding module of the multi-scale optical flow decoder D_F to obtain the time-domain decoding feature of the first scale. Specifically, the time-domain decoding feature of the first scale may be obtained by the non-local self-attention module and the upsampling module in the first-scale optical flow decoding module. In some embodiments, the first-scale time-domain decoding feature is a multi-channel first-scale time-domain decoding feature. As a specific example, the resolution corresponding to the first scale may be 1/16 of the original video frame.
Step 5-2: converting the multi-channel first-scale time-domain decoding feature into a 2-channel first-scale decoded optical flow through convolution and nonlinear transformation, obtaining the optical flow information of the first scale.
Step 5-3: inputting the first-scale time-domain decoding feature into the second-scale optical flow decoding module to obtain the second-scale time-domain decoding feature, and converting the second-scale time-domain decoding feature into the 2-channel second-scale decoded optical flow through the same convolution and nonlinear transformation as in step 5-2. As a specific example, the resolution corresponding to the second scale may be 1/8 of the original video frame.
Step 5-4: similarly, the second-scale time-domain decoding feature can be input into the third-scale optical flow decoding module to obtain the third-scale time-domain decoding feature; the third-scale time-domain decoding feature is input into the fourth-scale optical flow decoding module to obtain the fourth-scale time-domain decoding feature; and the fourth-scale time-domain decoding feature is input into the fifth-scale optical flow decoding module to obtain the fifth-scale time-domain decoding feature. Optionally, the multi-channel time-domain decoding features of each scale can be converted into decoded optical flows of the corresponding scale through convolution and nonlinear transformation, respectively. As a specific example, the resolution corresponding to the third scale may be 1/4 of the original video frame, the resolution corresponding to the fourth scale may be 1/2 of the original video frame, and the resolution corresponding to the fifth scale may be the same as that of the original video frame.
In addition, in the embodiment of the present application, the optical flow information used for motion compensation at the decoding end can be obtained directly by decoding from the quantized decoding feature. This way of acquiring optical flow information may be referred to as a "one-step method". By contrast, there is an end-to-end video codec technique that relies on a pre-trained optical flow network to perform optical flow estimation at the encoding end and compresses the optical flow a second time to obtain the optical flow information at the decoding end; such a network is relatively redundant, and this way of obtaining optical flow information can be called a "two-step method". Therefore, since the embodiment of the present application does not need to perform explicit optical flow estimation at the encoding end, it can help simplify the codec framework and the codec computation process, thereby improving the efficiency of video encoding and decoding.
260, extracting features of the reconstructed image of the second image to obtain at least two reference features of different scales. For example, the reconstructed image of the second image may be input to a feature extraction module, resulting in at least two reference features of different scales.
In some alternative embodiments, the reconstructed image of the second image may be subjected to multi-scale feature extraction and downsampling to obtain at least two different-scale reference features of the reconstructed image, wherein feature extraction and downsampling of one scale results in reference features of a corresponding scale (e.g., the same scale).
As one possible implementation, the feature extraction module may comprise a multi-scale feature extraction network, where the multi-scale feature extraction network may comprise a cascade of at least two feature extraction modules of different scales, each comprising a feature extraction unit and a downsampling unit. Specifically, the downsampling unit may obtain reference features of different scales of the reconstructed image by downsampling the reconstructed image of the second image. Therefore, according to the embodiment of the present application, the corresponding at least two reference features of different scales can be obtained by cascading the at least two feature extraction modules of different scales and downsampling the reconstructed image through the feature extraction modules.
In the embodiment of the application, the reconstructed image of the second image can be input into the multi-scale feature extraction network to obtain at least two reference features of different scales of the reconstructed image, wherein the feature extraction module of one scale corresponds to the reference features of the same scale. That is, a reference feature of the same scale may be obtained by a feature extraction module of one scale. In the embodiment of the application, the reference features of each scale can correspond to the same scale time domain decoding optical flow, namely the optical flow information of the same scale.
As an example, the reconstructed frame X̂_(t-1) of the previous time (which may also be referred to as a reference frame) may be input into the multi-scale feature extraction network to extract time-domain reference features of the reconstructed frame X̂_(t-1) at the corresponding scales. FIG. 5 illustrates a schematic diagram of multi-scale reference features provided by an embodiment of the present application. As shown in FIG. 5, the multi-scale feature network may obtain time-domain reference features F_s of different scales through 4 feature extraction operations and 4 downsampling operations, where, for s = 1, F_1 represents the time-domain reference feature of the first scale.
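Under the same assumptions as the sketches above, the multi-scale feature extraction network that produces the reference features F_s from the reconstructed reference frame might look as follows; plain convolution blocks again stand in for the feature extraction modules.

```python
import torch
import torch.nn as nn

class MultiScaleRefFeatures(nn.Module):
    """Sketch of the multi-scale feature extraction network: 4 feature-extraction
    stages with 2x downsampling, collecting the reference feature F_s of each scale."""
    def __init__(self, in_ch=3, ch=64, num_scales=4):
        super().__init__()
        self.stages = nn.ModuleList()
        c = in_ch
        for _ in range(num_scales):
            self.stages.append(nn.Sequential(
                nn.Conv2d(c, ch, 3, padding=1), nn.ReLU(inplace=True),   # feature extraction (placeholder)
                nn.Conv2d(ch, ch, 3, stride=2, padding=1),                # 2x downsampling
            ))
            c = ch

    def forward(self, x_prev_rec):
        feats, f = [], x_prev_rec
        for stage in self.stages:
            f = stage(f)
            feats.append(f)             # feats[-1] is the coarsest reference feature
        return feats[::-1]              # ordered coarse-to-fine to match the flow scales
```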
In the embodiment of the application, after the optical flow information of at least two different scales and the reference features of at least two different scales are acquired, the first predicted image can be determined according to the optical flow information of at least two different scales and the reference features of at least two different scales. Alternatively, the first predicted image may be a predicted image of the first image, which is not limited.
In some alternative embodiments, a first compensation feature (e.g., of the first image) may be derived from the optical flow information of the at least two different scales and the reference features of the at least two different scales, and the first predicted image may then be determined from the compensation feature. In particular, see steps 270 to 290 below. Illustratively, steps 270 through 290 may be performed by the motion compensation 306 of FIG. 3.
270, obtaining a first compensation feature according to the optical flow information of the at least two different scales and the reference features of the at least two different scales.
Fig. 6 shows a schematic diagram of multi-scale motion compensation provided by an embodiment of the present application. As shown in fig. 6, optical flow information of multiple scales can be used to perform motion compensation on reference features of different scales, so as to obtain compensation features of corresponding reference features of each scale. For example, from the lowest scale, optical flow-based motion compensation may be performed on the reference features step by step, and the compensated reference features are fused step by step, so as to obtain the compensation feature of the highest scale, which is used as the first compensation feature.
In some alternative embodiments, the resolution corresponding to the largest scale in the optical flow information of at least two scales is the same as the resolution of the first image.
In some alternative embodiments, when the optical flow information of the at least two different scales includes optical flow information f̂_1 of a first scale and optical flow information f̂_2 of a second scale, and the reference features of the at least two different scales include a reference feature F_1 of the first scale and a reference feature F_2 of the second scale, the reference feature F_1 of the first scale can be compensated according to the optical flow information f̂_1 of the first scale to obtain a compensation feature of the first scale; the compensation feature of the first scale is then upsampled to obtain an upsampled feature of the second scale.
Then, the reference feature F_2 of the second scale can be compensated according to the optical flow information f̂_2 of the second scale to obtain a compensation feature of the second scale, and the first compensation feature is obtained according to the compensation feature of the second scale and the upsampled feature of the second scale.
In some alternative embodiments, the compensation feature of the second scale and the upsampled feature of the second scale may be fused to obtain a fused feature of the second scale.
As a possible implementation, the second scale fusion feature may be used as the first compensation feature. Optionally, the resolution of the second scale is here the same resolution as the original video frame of the first image.
As another possible implementation, the fused feature of the second scale may be further upsampled to obtain an upsampled feature of the third scale, and the first compensation feature is determined based on the upsampled feature of the third scale. Illustratively, the upsampled feature of the third scale may be used as the first compensation feature. Optionally, the resolution of the third scale here is the same as the resolution of the original video frame of the first image.
When the optical flow information of the at least two different scales includes optical flow information f̂_3 of a third scale and the reference features of the at least two different scales include a reference feature F_3 of the third scale, the reference feature F_3 of the third scale can further be compensated according to the optical flow information f̂_3 of the third scale to obtain a compensation feature of the third scale, and the first compensation feature is then obtained according to the compensation feature of the third scale and the upsampled feature of the third scale.
In some alternative embodiments, the compensation feature of the third scale and the upsampled feature of the third scale may be fused to obtain a fused feature of the third scale.
As a possible implementation, the third scale fusion feature may be used as the first compensation feature. Optionally, the resolution of the third scale is here the same resolution as the original video frame of the first image.
As another possible implementation, the fused feature of the third scale may be further upsampled to obtain an upsampled feature of the fourth scale, and the first compensation feature is determined according to the upsampled feature of the fourth scale. Illustratively, the upsampled feature of the fourth scale may be used as the first compensation feature. Optionally, the resolution of the fourth scale here is the same as the resolution of the original video frame of the first image.
In some alternative embodiments, when the at least two scales include 4 scales, or more, the first compensation feature may be determined in the manner described above, which is not described in detail.
Continuing with the example above, when 5 scales of optical-flow information and 5 scales of reference features are obtained, a first compensation feature may be obtained in the manner shown in FIG. 7. Hereinafter, referring to fig. 7, a process of acquiring the first compensation characteristic will be described through steps 7-1 to 7-5.
In fig. 7, from the lowest scale (for example, the first scale), optical flow-based motion compensation is performed on the reference feature step by step, and the compensated reference feature is fused step by step to obtain the compensation feature of the highest scale (for example, the fifth scale), which is used as the first compensation feature.
Step 7-1: referring to block 701 in FIG. 7, the first-scale time-domain reference feature F_1 and the first-scale time-domain decoded optical flow f̂_1 are obtained, and motion compensation is performed on the first-scale time-domain reference feature F_1, e.g., by warping(F_1, f̂_1), to obtain the compensation feature of the first scale. Here, warping(·) is a bilinear-interpolation backward warping, which compensates the pixel information at corresponding positions in the first-scale time-domain reference feature F_1 based on the first-scale time-domain decoded optical flow f̂_1, generating the motion-compensated compensation feature of the first scale. Optionally, all channels in the first-scale time-domain reference feature correspond to the same motion displacement; however, it is not limited to all channels adopting the same displacement, and different channels may be compensated by multiple pieces of optical flow information of the same scale.
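The warping(·) operation of step 7-1 can be illustrated with the common grid-sample formulation below. This is a standard bilinear backward warp and is shown only as an assumption of what warping(·) may look like; it is not necessarily the exact operator of the present application.

```python
import torch
import torch.nn.functional as F

def warping(feature: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Bilinear backward warping: samples `feature` at positions displaced by `flow`.
    feature: (N, C, H, W); flow: (N, 2, H, W) in pixel units (dx, dy)."""
    n, _, h, w = feature.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feature.device),
                            torch.arange(w, device=feature.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]            # x + dx
    grid_y = ys.unsqueeze(0) + flow[:, 1]            # y + dy
    # Normalize to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid_x / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)     # (N, H, W, 2)
    return F.grid_sample(feature, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```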
Step 7-2: with continued reference to block 701, the generated motion-compensated compensation feature of the first scale is input to an upsampling module (e.g., an upsampling layer) to obtain an upsampled feature. Illustratively, the upsampling module may be composed of a layer of transposed convolution, a nonlinear transform, and a layer of convolution. The resolution of the upsampled feature produced by the upsampling module is twice that of the previous feature, i.e., of the compensation feature of the first scale, so this upsampled feature may be referred to as the upsampled feature of the second scale. In addition, the number of channels of the upsampled feature is typically 1/2 of the number of channels of the previous feature.
Illustratively, the upsampling layer may include: conv 5×5 with stride 2 (2× upsampling), ReLU, and conv 3×3.
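As a sketch of the upsampling layer just described (a 5×5 transposed convolution with stride 2, a ReLU, and a 3×3 convolution); the kernel sizes follow the text, while the channel halving follows step 7-2 and the remaining details are assumptions.

```python
import torch.nn as nn

def upsample_module(in_ch: int) -> nn.Sequential:
    """Doubles spatial resolution and halves the channel count."""
    out_ch = in_ch // 2
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                           padding=2, output_padding=1),      # conv 5x5, 2x upsampling
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),  # conv 3x3
    )
```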
Step 7-3: referring to block 702 of FIG. 7, step 7-1 is repeated to obtain the compensation feature of the second scale, which has the same resolution as the upsampled feature obtained in step 7-2.
Step 7-4: via module 705, the compensation feature of the second scale obtained in step 7-3 and the upsampled feature obtained in step 7-2 may be concatenated along the channel dimension, and the concatenated features are fused through multilayer convolution and nonlinear transformation to obtain the fused feature. The fused feature may then be input to an upsampling module to obtain a new upsampled feature. Here, the resolution of the upsampled feature obtained by the upsampling module is twice that of the previous features, e.g., of the compensation feature or the upsampled feature of the second scale, so the new upsampled feature may be referred to as the upsampled feature of the third scale.
Step 7-5: repeating steps 7-3 and 7-4, the generated compensation features are fused across scales and enlarged in scale, finally obtaining the compensation feature of the fifth scale. For example, step 7-1 may be repeated via block 703 to obtain the compensation feature of the third scale; step 7-4 is repeated to fuse the compensation feature of the third scale with the upsampled feature of the third scale and to further upsample, obtaining the upsampled feature of the fourth scale. Thereafter, step 7-1 may be repeated via block 704 to obtain the compensation feature of the fourth scale; step 7-4 is repeated to fuse the compensation feature of the fourth scale with the upsampled feature of the fourth scale and to further upsample, obtaining the upsampled feature of the fifth scale, i.e., the compensation feature F_w of the fifth scale, which is used as the first compensation feature. Here, the resulting compensation feature of the fifth scale has the same resolution as the original frame being encoded.
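Putting steps 7-1 to 7-5 together, a coarse-to-fine compensation loop might look like the sketch below, reusing the warping and upsample_module helpers from the earlier sketches. The fusion network, the channel counts, and the exact number of scales carried through the loop are assumptions; the finest-scale optical flow (used for pixel-domain compensation in step 280) is handled separately.

```python
import torch
import torch.nn as nn

def fuse_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Fuses channel-concatenated features via multilayer convolution + nonlinearity (placeholder)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
    )

def multi_scale_compensation(ref_feats, flows, fuse_blocks, up_modules):
    """ref_feats, flows: lists ordered coarse-to-fine.
    Step 7-1: warp the reference feature of each scale with its decoded flow.
    Steps 7-2/7-4: fuse with the incoming upsampled feature, then upsample to the next scale.
    Returns the highest-scale compensation feature F_w."""
    up = None
    for s, (ref, flow) in enumerate(zip(ref_feats, flows)):
        comp = warping(ref, flow)                                  # compensation feature of scale s
        if up is not None:
            comp = fuse_blocks[s](torch.cat([comp, up], dim=1))    # channel concat + fusion
        up = up_modules[s](comp)                                   # upsampled feature of the next scale
    return up                                                      # highest-scale compensation feature F_w
```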
Therefore, in the embodiment of the present application, the reference features of different scales are motion-compensated using optical flow information of multiple scales: for example, starting from the lowest scale, optical-flow-based motion compensation is performed on the reference features step by step, and the compensated reference features are fused step by step to obtain the compensation feature of the highest scale as the first compensation feature. As a result, the first compensation feature can contain motion information at different resolutions, which helps capture motion at different speeds.
280, optionally, obtaining a first compensation pixel (e.g., the compensation pixel of the first image) according to the optical flow information of the at least two different scales and the reconstructed image of the second image.
In some alternative embodiments, the resolution corresponding to the largest scale in the at least two scale reference features is the same as the resolution of the first image.
As a possible implementation manner, the motion compensation may be performed on the reconstructed image of the second image according to the optical flow information of the maximum scale of the optical flow information of at least two different scales, so as to obtain the compensation pixel.
As an example, continuing with the above example, when the largest-scale optical flow information among the optical flow information of the at least two different scales is the fifth-scale optical flow information, the reconstructed image of the second image may be compensated according to the fifth-scale optical flow information f̂_5 to obtain the first compensation pixel. That is, the motion-compensated compensation frame X_w can be obtained by motion compensation in the pixel domain. Specifically, the motion compensation method is similar to that in step 7-1 above; reference may be made to the description of step 7-1, which is not repeated here.
Here, the reconstructed image may also be referred to as a reference image. In addition, the reconstructed image of the second image has the same resolution as the fifth-scale compensation feature in step 270, and motion compensation is performed on it using the fifth-scale optical flow information obtained in step 250.
In some alternative embodiments, as a possible implementation of determining the first predicted image according to the first compensation feature, the first predicted image may be obtained through step 290, i.e. according to the first compensation feature and the first compensation pixel.
290, optionally, obtaining the first predicted image according to the compensation feature of the first image and the compensation pixel of the first image.
In some possible implementations, the compensation feature and the compensation pixel may be cascaded in channel dimensions to obtain a mixed input of the feature channel and the pixel channel. The blended input is then transformed to the pixel domain to yield a first predicted image.
Continuing with the above example, the process of acquiring the first predicted image is described by steps 9-1 to 9-2.
Step 9-1: concatenating the fifth-scale compensation feature and the fifth-scale compensation pixel in the channel dimension, e.g., via X_fuse = Cat(F_w, X_w), yields a concatenated map X_fuse, where X_fuse represents the mixed input formed by concatenating the feature channels and the pixel RGB channels.
Step 9-2: sending the mixed input into a multi-stage convolutional neural network for post-processing to obtain the first predicted image. Here, the multi-stage convolutional neural network may employ a multi-layer stack of non-local self-attention transform modules. However, in the embodiment of the present application, the convolutional neural network is not limited to non-local self-attention transform modules; for example, a network structure such as a common convolution module, a residual module, a dense connection module, or U-Net may alternatively be adopted. Through post-processing by the multi-stage convolutional neural network, the mixed input is re-converted to the pixel domain to obtain the final predicted frame X_p, i.e., the first predicted image.
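A sketch of steps 9-1 and 9-2, assuming a simple convolutional post-processing network in place of the non-local self-attention stack; the channel counts are illustrative only.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Concatenates the compensation feature F_w with the compensation frame X_w
    and maps the mixed input back to the pixel domain to obtain the predicted frame X_p."""
    def __init__(self, feat_ch=64, ch=64):
        super().__init__()
        self.post = nn.Sequential(
            nn.Conv2d(feat_ch + 3, ch, 3, padding=1), nn.ReLU(inplace=True),  # placeholder for attention stack
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1),                                    # back to RGB pixel domain
        )

    def forward(self, f_w: torch.Tensor, x_w: torch.Tensor) -> torch.Tensor:
        x_fuse = torch.cat([f_w, x_w], dim=1)    # step 9-1: X_fuse = Cat(F_w, X_w)
        return self.post(x_fuse)                  # step 9-2: predicted frame X_p
```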
Therefore, in the embodiment of the present application, the first decoding feature of the first image and optical flow information of the first decoding feature at different scales are obtained from the code stream, at least two reference features of different scales of the reconstructed image of the second image are obtained, and the first predicted image is then determined according to the optical flow information of the at least two different scales and the reference features of the different scales. The scheme of the embodiment of the present application can better adapt to videos of different resolutions, which helps to improve the stability of video prediction and further helps to reduce or eliminate blurring, ghosting, and similar phenomena at motion edges or occlusion regions of the video.
Further, according to the embodiment of the application, the first compensation characteristic can be obtained according to the optical flow information of at least two different scales and the reference characteristic of at least two different scales, and then the first predicted image can be obtained according to the first compensation characteristic. The embodiment of the application applies at least two optical flow information with different scales and at least two reference features with different scales, so that better compensation can be conducted on an occlusion region or rapid or irregular motion, and better compensation textures can be generated on the motion edge or the occlusion region of a video, so that higher-precision inter-frame video prediction performance can be realized.
Furthermore, the embodiment of the application can also obtain a first compensation pixel according to at least two optical flow information with different scales and the reconstructed image of the second image, and then obtain a first predicted image according to the first compensation characteristic and the first compensation pixel. Therefore, the embodiment of the application can combine the pixel domain and the feature domain to perform motion compensation, is beneficial to better predicting optical flow information in the feature domain, and further can be beneficial to improving or eliminating phenomena such as blurring or ghosting of a motion edge or a shielding region of a video, so as to realize higher-precision inter-frame video prediction performance.
It should be noted that fig. 2A or fig. 2B illustrates the steps or operations of an end-to-end encoding method or decoding method, but these steps or operations are only examples, and embodiments of the present application may also perform other operations or variations of the operations in the drawings. Furthermore, the steps in fig. 2A or fig. 2B may be performed in an order different from that presented in the figures, and it is possible that not all of the operations in fig. 2A need to be performed.
The specific embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described further. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be regarded as the disclosure of the present application.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application. It is to be understood that the numbers may be interchanged where appropriate such that the described embodiments of the application may be practiced otherwise than as shown or described.
The method embodiments of the present application are described above in detail with reference to fig. 1 to 7, and the apparatus embodiments of the present application are described below in detail with reference to fig. 8 to 10.
Fig. 8 is a schematic block diagram of a decoder 600 of an embodiment of the present application. The decoder 600 is not limited, and may be, for example, the codec framework 100 in fig. 1 or the codec framework 300 in fig. 3. As shown in fig. 8, the decoder 600 may include an acquisition unit 610, a first decoding unit 620, a second decoding unit 630, a feature extraction unit 640, and a determining unit 650.
An acquisition unit 610, configured to acquire a code stream.
The first decoding unit 620 is configured to decode the code stream to obtain a first decoding characteristic of the first image.
The second decoding unit 630 is configured to determine optical flow information of at least two different scales according to the first decoding feature.
The feature extraction unit 640 is configured to perform feature extraction on the reconstructed image of the second image, so as to obtain at least two reference features with different scales.
A determining unit 650, configured to determine the first predicted image according to the optical flow information of the at least two different scales and the reference features of the at least two different scales.
Optionally, the determining unit 650 is specifically configured to:
Obtaining a first compensation feature according to the optical flow information of at least two different scales and the reference features of at least two different scales;
And obtaining the first predicted image according to the first compensation characteristic.
Optionally, the determining unit 650 is further configured to:
obtaining a first compensation pixel according to the optical flow information of at least two different scales and the reconstructed image of the second image; and
And obtaining the first predicted image according to the first compensation characteristic and the first compensation pixel.
Optionally, the determining unit 650 is specifically configured to:
Compensating the first-scale reference feature F1 according to the first-scale optical flow information to obtain a first-scale compensation feature, wherein the optical flow information of the at least two different scales includes the first-scale optical flow information, and the reference features of the at least two different scales include the first-scale reference feature F1;
Upsampling the first-scale compensation feature to obtain an upsampled feature of the second scale;
Compensating the second-scale reference feature F2 according to the second-scale optical flow information to obtain a second-scale compensation feature, wherein the optical flow information of the at least two different scales includes the second-scale optical flow information, and the reference features of the at least two different scales include the second-scale reference feature F2;
And obtaining the first compensation feature according to the second-scale compensation feature and the upsampled feature of the second scale.
Optionally, the determining unit 650 is specifically configured to:
Fusing the second-scale compensation feature and the upsampled feature of the second scale to obtain a fused feature of the second scale;
Upsampling the fused feature of the second scale to obtain an upsampled feature of the third scale;
And determining the first compensation feature according to the upsampled feature of the third scale.
Optionally, the determining unit 650 is specifically configured to:
Compensating the third-scale reference feature F3 according to the third-scale optical flow information to obtain a third-scale compensation feature, wherein the optical flow information of the at least two different scales includes the third-scale optical flow information, and the reference features of the at least two different scales include the third-scale reference feature F3;
And obtaining the first compensation feature according to the third-scale compensation feature and the upsampled feature of the third scale.
Optionally, the resolution of the third scale is 2 times the resolution of the second scale.
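The progressive compensation described above can be illustrated with the following sketch: a backward-warping helper compensates each reference feature with the optical flow of the same scale, the result is upsampled by a factor of 2, and a fusion convolution merges it with the compensation feature of the next scale. The helper names, the bilinear warping, and the list of fusion convolutions are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warps a feature map with a dense optical flow field.

    feat: (N, C, H, W) reference feature; flow: (N, 2, H, W) displacements in
    pixels, with channel 0 assumed horizontal and channel 1 vertical.
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize the sampling positions to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat, grid, align_corners=True)

def coarse_to_fine_compensation(flows, refs, fuse_convs):
    """flows/refs are ordered from the smallest to the largest scale;
    fuse_convs[i] is a conv layer fusing two concatenated feature maps."""
    comp = warp(refs[0], flows[0])  # first-scale compensation feature
    for i in range(1, len(refs)):
        # Upsample the previous compensation (or fused) feature by a factor of 2.
        up = F.interpolate(comp, scale_factor=2, mode="bilinear", align_corners=False)
        comp_i = warp(refs[i], flows[i])  # compensation feature at this scale
        # Fuse the current-scale compensation feature with the upsampled feature.
        comp = fuse_convs[i - 1](torch.cat([comp_i, up], dim=1))
    return comp  # the first compensation feature, at the largest scale
```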
Optionally, the second decoding unit 630 is specifically configured to:
And up-sampling the first decoding features in different scales to obtain optical flow information in at least two different scales, wherein up-sampling in one scale obtains optical flow information in a corresponding scale.
Optionally, the second decoding unit 630 is specifically configured to:
upsampling the first decoding feature in a first scale to obtain a decoding feature in the first scale;
Obtaining optical flow information of the first scale according to the decoding characteristics of the first scale, wherein the optical flow information of the at least two different scales comprises the optical flow information of the first scale;
upsampling the first scale decoding feature to obtain a second scale decoding feature;
And obtaining optical flow information of the second scale according to the decoding characteristics of the second scale, wherein the optical flow information of the at least two different scales comprises the optical flow information of the second scale.
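A minimal sketch of such a flow-derivation path is given below, assuming transposed convolutions for the scale-wise upsampling and a 2-channel convolutional head per scale; the channel counts and the number of scales are illustrative assumptions.

```python
import torch.nn as nn

class FlowDecoder(nn.Module):
    """Derives optical flow fields of several scales from the first decoding feature.

    Transposed convolutions perform the scale-wise upsampling and a small head
    predicts a 2-channel flow per scale; channel counts and the number of
    scales are illustrative assumptions.
    """
    def __init__(self, in_channels=128, mid_channels=64, num_scales=3):
        super().__init__()
        self.ups = nn.ModuleList()
        self.flow_heads = nn.ModuleList()
        ch = in_channels
        for _ in range(num_scales):
            self.ups.append(nn.ConvTranspose2d(ch, mid_channels, 4, stride=2, padding=1))
            self.flow_heads.append(nn.Conv2d(mid_channels, 2, 3, padding=1))
            ch = mid_channels

    def forward(self, y_hat):
        flows = []
        feat = y_hat
        for up, head in zip(self.ups, self.flow_heads):
            feat = up(feat)           # decoding feature at the next (larger) scale
            flows.append(head(feat))  # optical flow information of this scale
        return flows                  # ordered from the smallest to the largest scale
```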
Optionally, the feature extraction unit 640 is specifically configured to:
And carrying out multi-scale feature extraction and downsampling on the reconstructed image to obtain at least two reference features of different scales of the reconstructed image, wherein the feature extraction and downsampling of one scale obtain the reference features of the corresponding scale.
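One plausible realization of this multi-scale extraction is sketched below, where the first stage keeps the image resolution (so that the largest-scale reference feature matches the resolution of the reconstructed image) and each subsequent stage downsamples with a stride-2 convolution; the channel counts are assumptions.

```python
import torch.nn as nn

class ReferencePyramid(nn.Module):
    """Extracts reference features of several scales from the reconstructed image.

    The first stage keeps the input resolution; later stages downsample with
    stride-2 convolutions. Channel counts are illustrative assumptions.
    """
    def __init__(self, channels=(32, 64, 96)):
        super().__init__()
        blocks, in_ch = [], 3
        for i, out_ch in enumerate(channels):
            stride = 1 if i == 0 else 2
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
                nn.ReLU(inplace=True),
            ))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x_ref):
        feats, f = [], x_ref
        for block in self.blocks:
            f = block(f)
            feats.append(f)  # reference feature of this scale
        return feats         # ordered from the largest to the smallest scale
```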
Optionally, the determining unit 650 is specifically configured to:
And performing motion compensation on the reconstructed image according to the maximum-scale optical flow information in the at least two different-scale optical flow information to obtain the first compensation pixel.
Optionally, the determining unit 650 is specifically configured to:
cascading the channel dimension of the first compensation feature and the first compensation pixel to obtain mixed input of a feature channel and a pixel channel;
And transforming the mixed input to a pixel domain to obtain the first predicted image.
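The following sketch ties the pixel-domain and feature-domain branches together, assuming the warp() helper and the post-processing network from the earlier sketches; the function name and its arguments are illustrative.

```python
import torch

def predict_frame(f_comp, x_ref_hat, flow_full, post_process):
    """Combines feature-domain and pixel-domain compensation into the predicted image.

    f_comp: the first compensation feature; x_ref_hat: the reconstructed second
    image; flow_full: the largest-scale (image-resolution) optical flow;
    post_process: a network such as PostProcessNet above. warp() is the
    backward-warping helper from the earlier sketch.
    """
    x_w = warp(x_ref_hat, flow_full)          # first compensation pixel
    x_fuse = torch.cat([f_comp, x_w], dim=1)  # cascade feature and pixel channels
    return post_process(x_fuse)               # transform the mixed input to the pixel domain
```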
Optionally, the resolution of the second scale is 2 times the resolution of the first scale.
Optionally, the second image is a previous frame image of the first image.
Optionally, the optical flow information of one scale corresponds to the reference feature of the same scale.
Optionally, the resolution corresponding to the largest scale in the optical flow information of at least two scales is the same as the resolution of the first image.
Optionally, the resolution corresponding to the largest scale in the reference features of the at least two scales is the same as the resolution of the first image.
It should be understood that the apparatus embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, in this embodiment, the decoder 600 may correspond to the body performing the method 2002 of the embodiment of the present application, and the foregoing and other operations and/or functions of each module in the decoder 600 are respectively for implementing the corresponding flow of the method in fig. 2B, which is not described herein for brevity.
Fig. 9 is a schematic block diagram of an encoder 700 of an embodiment of the present application. The encoder 700 is not limited, and may be, for example, the codec framework 100 in fig. 1 or the codec framework 300 in fig. 3. As shown in fig. 9, the encoder 700 may include a feature extraction unit 710, a quantization unit 720, and an encoding unit 730.
A feature extraction unit 710, configured to perform feature extraction on reconstructed images of a first image and a second image, so as to obtain coding features of the first image;
A quantization unit 720, configured to quantize the coding feature to obtain a quantized coding feature;
And a coding unit 730, configured to code the quantized coding feature to obtain a code stream.
Optionally, the encoding unit 730 is specifically configured to:
Carrying out channel dimension cascading on the reconstructed images of the first image and the second image to obtain high-dimensional input;
and carrying out inter-frame feature coding on the high-dimensional input to obtain coding features of the first image.
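A hedged sketch of this encoder path is given below: the current image and the reconstructed reference are concatenated along the channel dimension, an analysis network produces the coding feature, and rounding stands in for the quantization step. The layer sizes are assumptions, and the entropy coding that writes the quantized feature into the code stream is omitted.

```python
import torch
import torch.nn as nn

class InterFrameEncoder(nn.Module):
    """Sketch of the encoder path: channel-wise concatenation of the current
    image and the reconstructed reference, inter-frame feature encoding, and a
    rounding-based stand-in for quantization. Layer sizes are assumptions, and
    the entropy coding that writes the code stream is omitted.
    """
    def __init__(self, out_channels=128):
        super().__init__()
        self.analysis = nn.Sequential(
            nn.Conv2d(6, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 96, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(96, out_channels, 5, stride=2, padding=2),
        )

    def forward(self, x_cur, x_ref_hat):
        x = torch.cat([x_cur, x_ref_hat], dim=1)  # high-dimensional (6-channel) input
        y = self.analysis(x)                      # coding feature of the first image
        y_hat = torch.round(y)                    # quantized coding feature
        return y_hat                              # to be entropy-coded into the code stream
```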
It should be understood that the apparatus embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, in this embodiment, the encoder 700 may correspond to the body performing the method 2001 of the embodiment of the present application, and the foregoing and other operations and/or functions of each module in the encoder 700 are respectively for implementing the corresponding flow of the method in fig. 2A, which is not described herein for brevity.
The apparatus and system of embodiments of the present application are described above in terms of functional modules in connection with the accompanying drawings. It should be understood that the functional module may be implemented in hardware, or may be implemented by instructions in software, or may be implemented by a combination of hardware and software modules. Specifically, each step of the method embodiment in the embodiment of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in a software form, and the steps of the method disclosed in connection with the embodiment of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Fig. 10 is a schematic block diagram of an electronic device 800 provided by an embodiment of the application.
As shown in fig. 10, the electronic device 800 may include:
A memory 810 and a processor 820, the memory 810 being for storing a computer program and transmitting the program code to the processor 820. In other words, the processor 820 may call and run a computer program from the memory 810 to implement the decoding method, the encoding method in the embodiment of the present application.
For example, the processor 820 may be configured to perform the steps of the method 2001 or method 2002 described above according to instructions in the computer program.
In some embodiments of the application, the processor 820 may include, but is not limited to:
A general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the application, the memory 810 includes, but is not limited to:
Volatile memory and/or non-volatile memory. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (Programmable ROM, PROM), an erasable PROM (Erasable PROM, EPROM), an electrically erasable PROM (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (Synchlink DRAM, SLDRAM), and direct memory bus random access memory (DR RAM).
In some embodiments of the application, the computer program may be partitioned into one or more modules that are stored in the memory 810 and executed by the processor 820 to perform the decoding method or the encoding method provided by the present application. The one or more modules may be a series of computer program instruction segments capable of performing specified functions, and the instruction segments are used to describe the execution of the computer program in the electronic device 800.
Optionally, as shown in fig. 10, the electronic device 800 may further include:
A transceiver 830, the transceiver 830 being connectable to the processor 820 or the memory 810.
Processor 820 may control transceiver 830 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. Transceiver 830 may include a transmitter and a receiver. Transceiver 830 may further include antennas, the number of which may be one or more.
It should be appreciated that the various components in the electronic device 800 are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
According to an aspect of the present application, there is provided a codec device comprising a processor and a memory for storing a computer program, the processor being adapted to invoke and run the computer program stored in the memory, to cause the encoder to perform the method of the above-described method embodiment.
According to an aspect of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the method of the above-described method embodiments.
In other words, when implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a solid state drive (Solid State Drive, SSD)), or the like.
It should be understood that in embodiments of the present application, "B corresponding to a" means that B is associated with a. In one implementation, B may be determined from a. It should also be understood that determining B from a does not mean determining B from a alone, but may also determine B from a and/or other information.
In the description of the present application, unless otherwise indicated, "at least one" means one or more, and "a plurality" means two or more. In addition, "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be single or plural.
It should be further understood that the description of the first, second, etc. in the embodiments of the present application is for illustration and distinction of descriptive objects, and is not intended to represent any limitation on the number of devices in the embodiments of the present application, nor is it intended to constitute any limitation on the embodiments of the present application.
It should also be appreciated that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily appreciate variations or alternatives within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (42)

  1. A decoding method, comprising:
    Acquiring a code stream;
    Decoding the code stream to obtain a first decoding characteristic of a first image;
    determining optical flow information of at least two different scales according to the first decoding characteristics;
    Extracting features of the reconstructed image of the second image to obtain at least two reference features with different scales;
    And determining a first predicted image according to the optical flow information of at least two different scales and the reference features of at least two different scales.
  2. The method of claim 1, wherein the determining the first predicted image from the at least two different scales of optical flow information and the at least two different scales of reference features comprises:
    Obtaining a first compensation feature according to the optical flow information of at least two different scales and the reference features of at least two different scales;
    And obtaining the first predicted image according to the first compensation characteristic.
  3. The method according to claim 2, wherein the method further comprises:
    Obtaining a first compensation pixel according to the optical flow information of at least two different scales and the reconstructed image of the second image;
    Wherein said obtaining said first predicted image based on said first compensation characteristic comprises:
    and obtaining the first predicted image according to the first compensation characteristic and the first compensation pixel.
  4. The method of claim 2, wherein the deriving the first compensation feature from the at least two different scales of optical flow information and the at least two different scales of reference features comprises:
    Compensating the first-scale reference feature F1 according to the first-scale optical flow information to obtain a first-scale compensation feature, wherein the optical flow information of the at least two different scales includes the first-scale optical flow information, and the reference features of the at least two different scales include the first-scale reference feature F1;
    Upsampling the first-scale compensation feature to obtain an upsampled feature of the second scale;
    Compensating the second-scale reference feature F2 according to the second-scale optical flow information to obtain a second-scale compensation feature, wherein the optical flow information of the at least two different scales includes the second-scale optical flow information, and the reference features of the at least two different scales include the second-scale reference feature F2;
    And obtaining the first compensation feature according to the second-scale compensation feature and the upsampled feature of the second scale.
  5. The method of claim 4, wherein the obtaining the first compensation feature according to the second-scale compensation feature and the upsampled feature of the second scale comprises:
    Fusing the second-scale compensation feature and the upsampled feature of the second scale to obtain a fused feature of the second scale;
    Upsampling the fused feature of the second scale to obtain an upsampled feature of the third scale;
    And determining the first compensation feature according to the upsampled feature of the third scale.
  6. The method of claim 5, wherein the determining the first compensation feature according to the upsampled feature of the third scale comprises:
    Compensating the third-scale reference feature F3 according to the third-scale optical flow information to obtain a third-scale compensation feature, wherein the optical flow information of the at least two different scales includes the third-scale optical flow information, and the reference features of the at least two different scales include the third-scale reference feature F3;
    And obtaining the first compensation feature according to the third-scale compensation feature and the upsampled feature of the third scale.
  7. The method of claim 5 or 6, wherein the resolution of the third scale is 2 times the resolution of the second scale.
  8. The method of any of claims 1-7, wherein determining optical flow information for at least two different scales from the first decoding feature comprises:
    And up-sampling the first decoding features in different scales to obtain optical flow information in at least two different scales, wherein up-sampling in one scale obtains optical flow information in a corresponding scale.
  9. The method of claim 8, wherein the determining optical flow information for at least two different scales of the first decoding feature from the first decoding feature comprises:
    upsampling the first decoding feature in a first scale to obtain a decoding feature in the first scale;
    Obtaining optical flow information of the first scale according to the decoding characteristics of the first scale, wherein the optical flow information of the at least two different scales comprises the optical flow information of the first scale;
    upsampling the first scale decoding feature to obtain a second scale decoding feature;
    And obtaining optical flow information of the second scale according to the decoding characteristics of the second scale, wherein the optical flow information of the at least two different scales comprises the optical flow information of the second scale.
  10. The method according to any one of claims 1-9, wherein the feature extraction of the reconstructed image of the second image to obtain at least two reference features of different scales comprises:
    And carrying out multi-scale feature extraction and downsampling on the reconstructed image to obtain at least two reference features of different scales of the reconstructed image, wherein the feature extraction and downsampling of one scale obtain the reference features of the corresponding scale.
  11. The method of claim 3, wherein the deriving the first compensated pixel from the at least two different scales of optical flow information and the reconstructed image of the second image comprises:
    And performing motion compensation on the reconstructed image according to the maximum-scale optical flow information in the at least two different-scale optical flow information to obtain the first compensation pixel.
  12. A method according to claim 3, wherein said deriving said first predicted image from said first compensation feature and said first compensation pixel comprises:
    cascading the channel dimension of the first compensation feature and the first compensation pixel to obtain mixed input of a feature channel and a pixel channel;
    And transforming the mixed input to a pixel domain to obtain the first predicted image.
  13. The method of claim 4 or 9, wherein the resolution of the second scale is 2 times the resolution of the first scale.
  14. The method of any one of claims 1-13, wherein the second image is a previous frame image of the first image.
  15. The method of any of claims 1-14, wherein optical flow information of one scale corresponds to reference features of the same scale.
  16. The method of any of claims 1-15, wherein a resolution corresponding to a largest scale in the at least two scales of optical-flow information is the same as a resolution of the first image.
  17. The method according to any of claims 1-16, wherein a resolution corresponding to a largest scale of the at least two scale reference features is the same as a resolution of the first image.
  18. A method of encoding, comprising:
    Performing feature extraction on the reconstructed images of the first image and the second image to obtain coding features of the first image;
    quantizing the coding feature to obtain the quantized coding feature;
    And coding the quantized coding features to obtain a code stream.
  19. The method of claim 18, wherein the feature extraction of the reconstructed images of the first image and the second image to obtain the encoded features of the first image comprises:
    Carrying out channel dimension cascading on the reconstructed images of the first image and the second image to obtain high-dimensional input;
    And carrying out inter-frame feature coding on the high-dimensional input to obtain coding features of the first image.
  20. A decoder, comprising:
    an acquisition unit for acquiring a code stream;
    the first decoding unit is used for decoding the code stream to obtain a first decoding characteristic of a first image;
    The second decoding unit is used for determining optical flow information of at least two different scales according to the first decoding characteristics;
    the feature extraction unit is used for extracting features of the reconstructed image of the second image to obtain at least two reference features with different scales;
    and the determining unit is used for determining a first predicted image according to the optical flow information of the at least two different scales and the reference characteristics of the at least two different scales.
  21. Decoder according to claim 20, characterized in that the determining unit is specifically adapted to:
    Obtaining a first compensation feature according to the optical flow information of at least two different scales and the reference features of at least two different scales;
    And obtaining the first predicted image according to the first compensation characteristic.
  22. The decoder of claim 21, wherein the determining unit is further configured to:
    Obtaining a first compensation pixel according to the optical flow information of at least two different scales and the reconstructed image of the second image;
    and obtaining the first predicted image according to the first compensation characteristic and the first compensation pixel.
  23. Decoder according to claim 21, characterized in that the determining unit is specifically adapted to:
    Compensating the first-scale reference feature F1 according to the first-scale optical flow information to obtain a first-scale compensation feature, wherein the optical flow information of the at least two different scales includes the first-scale optical flow information, and the reference features of the at least two different scales include the first-scale reference feature F1;
    Upsampling the first-scale compensation feature to obtain an upsampled feature of the second scale;
    Compensating the second-scale reference feature F2 according to the second-scale optical flow information to obtain a second-scale compensation feature, wherein the optical flow information of the at least two different scales includes the second-scale optical flow information, and the reference features of the at least two different scales include the second-scale reference feature F2;
    And obtaining the first compensation feature according to the second-scale compensation feature and the upsampled feature of the second scale.
  24. Decoder according to claim 23, characterized in that the determining unit is specifically adapted to:
    Fusing the second-scale compensation feature and the upsampled feature of the second scale to obtain a fused feature of the second scale;
    Upsampling the fused feature of the second scale to obtain an upsampled feature of the third scale;
    And determining the first compensation feature according to the upsampled feature of the third scale.
  25. Decoder according to claim 24, characterized in that the determining unit is specifically adapted to:
    Compensating the third-scale reference feature F3 according to the third-scale optical flow information to obtain a third-scale compensation feature, wherein the optical flow information of the at least two different scales includes the third-scale optical flow information, and the reference features of the at least two different scales include the third-scale reference feature F3;
    And obtaining the first compensation feature according to the third-scale compensation feature and the upsampled feature of the third scale.
  26. Decoder according to claim 24 or 25, characterized in that the resolution of the third scale is 2 times the resolution of the second scale.
  27. Decoder according to any of claims 20-26, wherein the second decoding unit is specifically configured to:
    And up-sampling the first decoding features in different scales to obtain optical flow information in at least two different scales, wherein up-sampling in one scale obtains optical flow information in a corresponding scale.
  28. Decoder according to claim 27, characterized in that the second decoding unit is specifically configured to:
    upsampling the first decoding feature in a first scale to obtain a decoding feature in the first scale;
    Obtaining optical flow information of the first scale according to the decoding characteristics of the first scale, wherein the optical flow information of the at least two different scales comprises the optical flow information of the first scale;
    upsampling the first scale decoding feature to obtain a second scale decoding feature;
    And obtaining optical flow information of the second scale according to the decoding characteristics of the second scale, wherein the optical flow information of the at least two different scales comprises the optical flow information of the second scale.
  29. Decoder according to any of claims 20-28, characterized in that the feature extraction unit is specifically configured to:
    And carrying out multi-scale feature extraction and downsampling on the reconstructed image to obtain at least two reference features of different scales of the reconstructed image, wherein the feature extraction and downsampling of one scale obtain the reference features of the corresponding scale.
  30. Decoder according to claim 22, characterized in that the determining unit is specifically adapted to:
    And performing motion compensation on the reconstructed image according to the maximum-scale optical flow information in the at least two different-scale optical flow information to obtain the first compensation pixel.
  31. Decoder according to claim 22, characterized in that the determining unit is specifically adapted to:
    cascading the channel dimension of the first compensation feature and the first compensation pixel to obtain mixed input of a feature channel and a pixel channel;
    And transforming the mixed input to a pixel domain to obtain the first predicted image.
  32. Decoder according to claim 23 or 28, characterized in that the resolution of the second scale is 2 times the resolution of the first scale.
  33. The decoder according to any of claims 20-32, wherein the second picture is a previous frame picture of the first picture.
  34. The decoder according to any of claims 20-33, wherein the optical flow information of one scale corresponds to a reference feature of the same scale.
  35. The decoder according to any of claims 20-34, wherein a resolution corresponding to the largest scale in the optical flow information of the at least two scales is the same as a resolution of the first image.
  36. The decoder according to any of claims 20-35, wherein a resolution corresponding to a largest scale of the at least two scale reference features is the same as a resolution of the first image.
  37. An encoder, comprising:
    The feature extraction unit is used for extracting features of the reconstructed images of the first image and the second image to obtain coding features of the first image;
    the quantization unit is used for quantizing the coding features to obtain quantized coding features;
    And the coding unit is used for coding the quantized coding features to obtain a code stream.
  38. The encoder according to claim 37, wherein the feature extraction unit is specifically configured to:
    Carrying out channel dimension cascading on the reconstructed images of the first image and the second image to obtain high-dimensional input;
    And carrying out inter-frame feature coding on the high-dimensional input to obtain coding features of the first image.
  39. A codec system comprising the decoder of any one of claims 20-36 and the encoder of claim 37 or 38.
  40. An electronic device comprising a processor and a memory;
    The memory is for storing a computer program, and the processor is for invoking and running the computer program stored in the memory to cause the electronic device to perform the method of any of claims 1-19.
  41. A computer readable storage medium storing a computer program for causing a computer to perform the method of any one of claims 1-19.
  42. A computer program product comprising computer program code which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1-19.
CN202180104061.0A 2021-11-25 2021-11-25 Decoding method, encoding method, decoder, encoder, and encoding/decoding system Pending CN118216149A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/133139 WO2023092388A1 (en) 2021-11-25 2021-11-25 Decoding method, encoding method, decoder, encoder, and encoding and decoding system

Publications (1)

Publication Number Publication Date
CN118216149A true CN118216149A (en) 2024-06-18

Family

ID=86538444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180104061.0A Pending CN118216149A (en) 2021-11-25 2021-11-25 Decoding method, encoding method, decoder, encoder, and encoding/decoding system

Country Status (2)

Country Link
CN (1) CN118216149A (en)
WO (1) WO2023092388A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10701394B1 (en) * 2016-11-10 2020-06-30 Twitter, Inc. Real-time video super-resolution with spatio-temporal networks and motion compensation
WO2019208677A1 (en) * 2018-04-27 2019-10-31 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Coding device, decoding device, coding method, and decoding method
KR102646695B1 (en) * 2019-01-15 2024-03-12 포틀랜드 스테이트 유니버시티 Feature pyramid warping for video frame interpolation
US20220222776A1 (en) * 2019-05-03 2022-07-14 Huawei Technologies Co., Ltd. Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
CN110913218A (en) * 2019-11-29 2020-03-24 合肥图鸭信息科技有限公司 Video frame prediction method and device and terminal equipment
US11405626B2 (en) * 2020-03-03 2022-08-02 Qualcomm Incorporated Video compression using recurrent-based machine learning systems

Also Published As

Publication number Publication date
WO2023092388A1 (en) 2023-06-01

Similar Documents

Publication Publication Date Title
US20230069953A1 (en) Learned downsampling based cnn filter for image and video coding using learned downsampling feature
TWI834087B (en) Method and apparatus for reconstruct image from bitstreams and encoding image into bitstreams, and computer program product
JP5579936B2 (en) Optimized deblocking filter
US10542265B2 (en) Self-adaptive prediction method for multi-layer codec
TWI468018B (en) Video coding using vector quantized deblocking filters
JP7439841B2 (en) In-loop filtering method and in-loop filtering device
WO2023000179A1 (en) Video super-resolution network, and video super-resolution, encoding and decoding processing method and device
WO2022155974A1 (en) Video coding and decoding and model training method and apparatus
US20230076920A1 (en) Global skip connection based convolutional neural network (cnn) filter for image and video coding
WO2023279961A1 (en) Video image encoding method and apparatus, and video image decoding method and apparatus
CN115442618A (en) Time domain-space domain self-adaptive video compression based on neural network
CN113747242B (en) Image processing method, image processing device, electronic equipment and storage medium
US20240242467A1 (en) Video encoding and decoding method, encoder, decoder and storage medium
WO2022269415A1 (en) Method, apparatus and computer program product for providng an attention block for neural network-based image and video compression
CN116349225A (en) Content adaptive online training method and apparatus for deblocking in block-by-block image compression
CN115956363A (en) Content adaptive online training method and device for post filtering
WO2022266955A1 (en) Image decoding method and apparatus, image processing method and apparatus, and device
KR20230108286A (en) Video encoding using preprocessing
KR20230107627A (en) Video decoding using post-processing control
JP2024513693A (en) Configurable position of auxiliary information input to picture data processing neural network
WO2021056220A1 (en) Video coding and decoding method and apparatus
WO2023208638A1 (en) Post processing filters suitable for neural-network-based codecs
CN118216149A (en) Decoding method, encoding method, decoder, encoder, and encoding/decoding system
JP2024511587A (en) Independent placement of auxiliary information in neural network-based picture processing
WO2024140951A1 (en) A neural network based image and video compression method with integer operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination