CN114868390A - Video encoding method, decoding method, encoder, decoder, and AI accelerator - Google Patents


Info

Publication number: CN114868390A
Application number: CN202080081315.7A
Authority: CN (China)
Prior art keywords: neural network model, information, video, coding
Other languages: Chinese (zh)
Inventors: 周焰, 郑萧桢
Current Assignee: SZ DJI Technology Co Ltd
Original Assignee: SZ DJI Technology Co Ltd
Application filed by SZ DJI Technology Co Ltd
Publication of CN114868390A
Legal status: Pending


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: using adaptive coding
    • H04N19/102: characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117: Filters, e.g. for pre-processing or post-processing
    • H04N19/13: Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • H04N19/70: characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N19/80: Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82: involving filtering within a prediction loop


Abstract

The present disclosure provides a video encoding method, the method comprising: performing encoding processing on video data, the encoding processing including encoding processing using a neural network model; and generating a code stream carrying syntax elements based on the video data after the coding processing, wherein the syntax elements comprise information of parameter sets representing the neural network model.

Description

Video encoding method, decoding method, encoder, decoder, and AI accelerator Technical Field
The present disclosure relates to the field of information technology, and in particular, to a video encoding method, a neural network-based video encoding method, a video decoding method, a neural network-based video decoding method, a video encoder, a video decoder, an AI accelerator for video encoding, and an AI accelerator for video decoding.
Background
With the continuous development of computer science and technology, neural network-based technologies are increasingly applied in various fields. Because neural network-based technologies can generally achieve better results than conventional technologies, some neural network-based coding tools have also been introduced into video coding. However, most existing video coding technologies that use neural networks are implemented on the basis of a specific coding standard and are highly coupled with that standard, which limits the application range of neural network-based video coding technologies.
Disclosure of Invention
The present disclosure provides a video encoding method, a neural network-based video encoding method, a video decoding method, a neural network-based video decoding method, a video encoder, a video decoder, an AI accelerator for video encoding, and an AI accelerator for video decoding, which can enable a video encoding technique using a neural network to be used independently of an existing encoding and decoding standard or compatibly with an existing encoding standard, and overcome the drawbacks of the related art.
In a first aspect, an embodiment of the present disclosure provides a video encoding method, where the method includes: performing encoding processing on video data, the encoding processing including encoding processing using a neural network model; and generating a code stream carrying syntax elements based on the video data after the coding processing, wherein the syntax elements comprise information of parameter sets representing the neural network model.
In a second aspect, an embodiment of the present disclosure provides a method for encoding a video based on a neural network, the method including: coding the video data by using a neural network model; and sending the parameter set of the neural network model to a video encoder so that the video encoder generates a code stream carrying syntax elements based on the video data after encoding processing, wherein the syntax elements comprise information representing the parameter set of the neural network model.
In a third aspect, an embodiment of the present disclosure provides a video decoding method, where the method includes: analyzing the received video code stream to obtain a syntax element of the video code stream, wherein the syntax element comprises information of a parameter set representing a neural network model; and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
In a fourth aspect, an embodiment of the present disclosure provides a method for decoding a video based on a neural network, where the method includes: obtaining a syntax element obtained after a video decoder analyzes a video code stream, wherein the syntax element comprises information of a parameter set representing a neural network model; and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
In a fifth aspect, the disclosed embodiments provide a video encoder, the encoder comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program: performing encoding processing on video data, the encoding processing including encoding processing using a neural network model; and generating a code stream carrying syntax elements based on the video data after the coding processing, wherein the syntax elements comprise information of parameter sets representing the neural network model.
In a sixth aspect, the disclosed embodiments provide a video decoder, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program: analyzing the received video code stream to obtain a syntax element of the video code stream, wherein the syntax element comprises information of a parameter set representing a neural network model; and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
In a seventh aspect, an embodiment of the present disclosure provides an AI accelerator for video coding, including: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program: performing encoding processing on the video data; and sending the parameter set of the neural network to a video encoder so that the video encoder generates a code stream carrying syntax elements based on the video data after encoding processing, wherein the syntax elements contain information representing the parameter set of the neural network.
In an eighth aspect, an embodiment of the present disclosure provides an AI accelerator for video decoding, the AI accelerator including: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program: obtaining a syntax element obtained after a video decoder analyzes a video code stream, wherein the syntax element comprises information of a parameter set representing a neural network; and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
In a ninth aspect, an embodiment of the present disclosure provides a machine-readable storage medium, on which computer instructions are stored, and when executed, the computer instructions implement the method of any one of the first to fourth aspects of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
In the embodiments of the present disclosure, for the neural network-based intelligent coding technology applied in the encoding process, neural network-based coding of the video data is realized by carrying, in the code stream, a syntax element containing information characterizing the parameter set of the neural network model. Because the code stream carries the syntax element containing the information characterizing the parameter set of the neural network model, the syntax element can exist in the code stream without depending on an existing video coding standard, so that the neural network-based video coding technology can be implemented independently; the syntax element may also be located in multiple existing coding standards and thus be compatible with existing video coding standards. In this way, the coupling between the neural network-based intelligent video coding technology and a specific video coding standard can be reduced, and the application range of the neural network-based video coding technology is expanded.
Drawings
Fig. 1 is a schematic diagram of a conventional hybrid video coding framework according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a conventional hybrid video decoding framework according to an embodiment of the disclosure.
Fig. 3 is a flowchart of a video encoding method according to an embodiment of the disclosure.
Fig. 4 is a schematic diagram of a designated position of a reserved field in a codestream according to an embodiment of the present disclosure.
Fig. 5 is a schematic view of a hierarchical code stream structure of a video coding standard according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a code stream structure generated by a video encoding process according to an embodiment of the disclosure.
Fig. 7 is a schematic diagram of a partial structure of a code stream generated by a video coding process according to an embodiment of the present disclosure.
Fig. 8 is a schematic diagram of a video coding framework incorporating a neural network-based loop filtering technique according to an embodiment of the present disclosure.
Fig. 9 is a schematic diagram of a video coding framework incorporating two neural network-based techniques according to an embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of a syntax element located in a reserved field according to an embodiment of the present disclosure.
Fig. 11 is a flowchart of a method for encoding a video based on a neural network according to an embodiment of the present disclosure.
Fig. 12 is a flowchart of a video decoding method according to an embodiment of the disclosure.
Fig. 13 is a flowchart of a method for decoding a video based on a neural network according to an embodiment of the present disclosure.
Fig. 14 is a schematic structural diagram of an encoder according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
Due to the large data size of the uncoded video, the uncoded video occupies a large storage space when stored and occupies a large communication resource when transmitted. Therefore, for uncoded video data, video coding techniques are usually adopted to reduce the storage space of the video data and the communication resources used in the transmission process.
Video coding compresses and encodes the video data to be coded through certain data processing methods to form a code stream, which is then stored or sent to a decoding end. The stored code stream, or the code stream received by the decoding end, is reconstructed by a decoding technique to obtain reconstructed video data.
Through the continuous development of conventional video coding technology, a mature video coding framework has gradually formed. Referring to fig. 1, a conventional hybrid video coding framework includes predictive coding 101, transform coding 102, quantization 103, and entropy coding 104.
Predictive coding 101 removes redundant information of the video data to be coded in the time domain and the spatial domain by utilizing the spatial correlation of intra-frame pixels and the temporal correlation of inter-frame pixels of the image frames in the video data to be coded.
The currently more common predictive coding methods include intra-frame prediction and inter-frame prediction. The intra-frame prediction is used for removing redundant information of the video data in a spatial domain; inter prediction is used to remove redundant information of video data in the time domain. Prior to predictive coding, an image frame of video data currently to be coded is typically partitioned. The intra-frame prediction method generates a prediction block based on pixels of adjacent blocks around a block to be coded; the inter-frame prediction method obtains a prediction block by searching an image block which is most matched with a current block to be coded in a reference frame image.
For a prediction block obtained by intra-frame prediction or inter-frame prediction, the corresponding pixel values of the block to be coded and the prediction block are subtracted to obtain a residual, and the residuals corresponding to the blocks to be coded are combined to obtain the residual data of the image frame to be coded.
The transform coding 102 transforms the residual data of the image frame to be coded from the spatial domain to the frequency domain to obtain a transform coefficient, so as to remove the correlation of the spatial signal and improve the coding efficiency.
Quantization 103 is a process that reduces the accuracy of the data representation. For the transform coefficient obtained by transform coding 102, the quantized transform coefficient is obtained by quantization, and the amount of data to be coded can be reduced, thereby further improving the compression efficiency.
The entropy coding 104 performs rate compression by using the information entropy of the source. And performing entropy coding on the quantized transform coefficients and prediction mode information (including intra-frame prediction mode, motion vector information, reference frame and other information) generated in the prediction coding process, so as to remove the statistical redundant information still existing after prediction, transformation and quantization and obtain a code stream.
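As a concrete illustration of how these steps relate to one another, the following sketch shows a prediction residual being transformed and quantized before entropy coding. It is a minimal Python example, not the framework's implementation: the 8x8 block size, the orthonormal DCT-II basis, and the fixed quantization step are assumptions made only for this sketch.

```python
# Minimal sketch of the hybrid pipeline in fig. 1: predict -> residual ->
# transform -> quantize (entropy coding 104 would follow). Illustrative only.
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # Orthonormal DCT-II basis, standing in for transform coding 102.
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def encode_block(block: np.ndarray, prediction: np.ndarray, qstep: float = 8.0):
    residual = block.astype(np.float64) - prediction      # predictive coding 101
    d = dct_matrix(block.shape[0])
    coeffs = d @ residual @ d.T                           # transform coding 102
    return np.round(coeffs / qstep).astype(np.int32)      # quantization 103

block = np.random.randint(0, 256, (8, 8))
pred = np.full((8, 8), block.mean())
print(encode_block(block, pred))
```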
Referring to fig. 2, a decoding framework corresponding to the conventional hybrid video coding framework described above is presented: including entropy decoding 201, inverse quantization 202, inverse transformation 203, and prediction reconstruction 204. Video decoding is the inverse of video encoding and will not be described in detail here.
In recent years, some neural network-based encoding techniques have been introduced in video encoding techniques. Neural network-based coding techniques are often able to achieve superior results relative to conventional techniques. However, most of the existing neural network-based video coding techniques are implemented based on a specific existing coding standard, and have high coupling with the specific coding standard, which reduces the application range of the neural network-based video coding techniques.
Based on this, the present disclosure provides a video encoding method to overcome the drawbacks of the related art. As shown in fig. 3, a flow chart of a video encoding method provided by the present disclosure is given, the method includes:
step 301: the video data is subjected to an encoding process including an encoding process using a neural network model.
Step 302: and generating a code stream carrying syntax elements based on the video data after the coding processing, wherein the syntax elements comprise information of parameter sets representing the neural network model.
The encoding of the video data can be implemented with an end-to-end neural network, without relying on a traditional hybrid video coding framework. That is, a user may select a specific deep learning framework, construct a data set from video data, and train a neural network model; the video data to be coded is then input into the trained neural network model to generate a code stream that meets certain requirements. The obtained code stream can be decoded, and relevant parameters such as the compression ratio and the degree of distortion of the video data reconstructed from the code stream can meet the user's requirements.
Of course, it should be understood by those skilled in the art that the encoding process for the video data may also be implemented based on a conventional hybrid video encoding framework.
In some embodiments, the conventional hybrid video coding framework may be a coding framework as shown in fig. 1, in which some of the steps may be replaced by neural network based techniques. For example, replacing predictive coding in a traditional hybrid video framework with neural network-based predictive coding techniques, replacing transform coding in a traditional hybrid video framework with neural network-based transform coding, replacing quantization in a traditional hybrid video framework with neural network-based quantization, and so forth; neural network based techniques can also be incorporated into traditional hybrid coding frameworks, such as: adding neural network based coding techniques after entropy coding, etc., and the disclosure is not limited thereto.
Of course, those skilled in the art should understand that the conventional hybrid video coding framework may also be modified, and some or all of the steps in the modified hybrid video coding framework may be replaced by the neural network-based coding technology, or the neural network-based coding technology may be added to the modified hybrid video coding framework, which is not limited by the present disclosure.
Of course, the video coding framework used by the coding process may be in other forms, and the disclosure is not limited thereto.
After the video data to be coded is coded by using a coding technology containing a neural network model, a code stream subjected to coding processing can be obtained. In order to be able to obtain reconstructed video data based on the code stream, the generated code stream carries syntax elements that contain information characterizing the parameter set of the neural network model. Based on the information of the parameter set representing the neural network model, the coded code stream can be decoded by using the neural network model, and the reconstructed video data can be obtained.
Because the code stream carries the syntax element containing the information of the parameter set representing the neural network model, the syntax element can exist in the code stream without depending on the existing video coding standard, thereby independently realizing the video coding technology based on the neural network; the syntax elements may also be located in existing multiple coding standards and thus be compatible with existing video coding standards. By the method, the coupling between the intelligent video coding technology based on the neural network and the specific video coding standard can be reduced, and the application range of the video coding technology based on the neural network is expanded.
In some embodiments, the syntax element containing parameter set information characterizing the neural network model is located at a specified position of the codestream, and the specified position is a reserved field of the codestream.
In fig. 4, a schematic diagram of a code stream 401 having a reserved field after encoding is given, where the area of the code stream 401 other than the reserved field may be entirely video data, or may be video data together with other information fields, and the disclosure is not limited thereto. In addition, it should be understood by those skilled in the art that the code stream 401 may also have two or more reserved fields, which separately carry the syntax elements containing the parameter set information characterizing the neural network model, and the disclosure is not limited thereto.
In some embodiments, the location of the reserved field in the codestream may be determined based on a predefined independent video coding standard.
The predefined independent video coding standard can be used independently of other video coding standards, including H.264, H.265, VVC, AVS+, AVS2, AVS3, or AV1, etc. established by organizations such as ITU-T and ISO/IEC. For example, when the video data to be encoded is encoded based on an end-to-end neural network model, the parameter set of the neural network model may be placed directly in the reserved field of the generated bitstream. When the encoded code stream is decoded, the specific position of the reserved field can be determined based on this predefinition, so that the parameter set of the neural network model in the reserved field can be extracted and the encoded code stream decoded.
The predefined independent video coding standard can also be referred to and used by other video coding standards. For example: the predefined independent video coding standard defines field "010100011110" as the starting position of the reserved field and field "101011100001" as the ending position of the reserved field. Other video coding standards may also refer to this definition to determine the specific location of the reserved field in the code stream undergoing the encoding process. It will be appreciated by those skilled in the art that the above examples are merely illustrative, and that the independent video coding standard may define the specific location of the reserved field in the bitstream in other forms, and thus may be referred to by other video coding standards, and the disclosure is not limited thereto.
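A minimal sketch of how a decoder-side tool might locate such a reserved field is given below. It assumes that the two example bit patterns quoted above delimit the field and that a plain bit-level scan over the code stream is sufficient; both are illustrative simplifications, not definitions taken from any standard.

```python
# Hedged sketch: find the payload between the example start/end bit patterns.
def find_reserved_field(bitstream: bytes,
                        start: str = "010100011110",
                        end: str = "101011100001"):
    bits = "".join(f"{b:08b}" for b in bitstream)   # flatten to a bit string
    s = bits.find(start)
    if s < 0:
        return None
    e = bits.find(end, s + len(start))
    if e < 0:
        return None
    return bits[s + len(start):e]                   # bits carrying the syntax element

payload = find_reserved_field(bytes([0b01010001, 0b11100000, 0b00010101, 0b11000010]))
print(payload)
```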
In some embodiments, the codestream is generated based on a specified encoding standard, and the reserved field is a specific field of the specified encoding standard.
The specified encoding standard may include: H.264, H.265, VVC, AVS+, AVS2, AVS3, AV1, etc.
The specified coding standard may have a certain code stream structure. As shown in fig. 5, a specific code stream structure is given: a hierarchical structure including a group-of-pictures layer, a picture layer, a slice layer, a macroblock layer, and a block layer. Each layer of data in turn includes header information and video data information. Thus, the specific field may be header information of the specified encoding standard. Of course, the specified encoding standard may also have other code stream structures, and the specific field may be a field predefined by the specified encoding standard.
For specified coding standards, including video coding standards such as H.264, H.265, VVC, AVS+, AVS2, AVS3, or AV1, the specific field may be a Video Parameter Set (VPS), and/or a Sequence Parameter Set (SPS), and/or a Neural Network model Parameter Set (NNPS), and/or a Picture Parameter Set (PPS), and/or a Slice Header, and/or Supplemental Enhancement Information (SEI), and/or extension data of a syntax parameter set, and/or user data, and/or Open Bitstream Units (OBU), a sequence header, a group-of-pictures header, a picture header, macroblock information, etc.
Of course, those skilled in the art should understand that the given coding standard is given by way of illustration only and is not exhaustive, and the disclosure does not limit the given coding standard and other coding standards. The specific fields are also exemplary, not exhaustive, and may be other specific fields of a given encoding standard, as the present disclosure is not limited thereto.
In some embodiments, the specific position of the reserved field in the code stream after the encoding process may also be determined by: before encoding video data and generating a code stream carrying syntax elements, an encoding end may negotiate with an opposite end device that wants to obtain the code stream in a wired or wireless communication manner, and determine a specific position of the reserved field. The peer device may be a memory, a decoder, etc., and the disclosure is not limited thereto.
Therefore, because the syntax element carried in the code stream can exist in the reserved field of the code stream without depending on the existing video coding standard, the video coding technology based on the neural network can be independently realized; the syntax element may also be located in a specific field of existing multiple coding standards or reference a predefined independent video coding standard, thereby being compatible with existing video coding standards. By the method, the coupling between the intelligent video coding technology based on the neural network and the specific video coding standard can be reduced, and the application range of the video coding technology based on the neural network is expanded.
In some embodiments, the reserved field is located in a header of a data packet of the codestream.
When the designated position of the syntax element is determined based on the predefined independent video coding standard, the reserved field may be located in the header of a data packet of the code stream, with the code stream structure shown in fig. 6. When the reserved field is located in the header of a data packet of the code stream, a specific character or character group can be used as an end mark of the syntax element; alternatively, a specified byte length can be set, where the content within that byte length is the syntax element and, when the syntax element does not fill the byte length, it is padded with zeros.
Of course, the reserved field may also be located in the middle or at the end of a data packet of the code stream. When the reserved field is located in the middle or at the end of a data packet, the start and end of the field can be marked by specific characters or character groups; alternatively, a certain number of bytes in the packet header can directly indicate the number of bytes of the reserved field that stores the syntax element containing the information characterizing the parameter set of the neural network model. Of course, other approaches are possible, and the disclosure is not limited thereto.
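The placement options just described can be sketched as follows. The 64-byte fixed field length and the 2-byte length prefix are assumptions of this sketch, not values specified by the disclosure.

```python
# Hedged sketch of two ways to carry the syntax element in a packet.
def pack_fixed(syntax: bytes, field_len: int = 64) -> bytes:
    # Fixed-length reserved field, padded with zeros when the syntax is shorter.
    assert len(syntax) <= field_len
    return syntax + b"\x00" * (field_len - len(syntax))

def pack_length_prefixed(syntax: bytes) -> bytes:
    # Header bytes directly state how many bytes the reserved field occupies.
    return len(syntax).to_bytes(2, "big") + syntax

header_field = pack_fixed(b"\x01\x01" + b"model-params")
body_field = pack_length_prefixed(b"model-params")
```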
For a specified coding standard, the reserved field is located in the header of a data packet of the code stream, and the header of the data packet may refer to the header information syntax parameter sets of the specified coding standard. For some specified coding standards, e.g., the H.26x series (including H.264, H.265, VVC, etc.), the reserved field is typically located in the extension data of the VPS, SPS, NNPS, PPS, or Slice Header, and may also be located in the SEI; for some specified coding standards, e.g., the AVS series (including AVS, AVS+, AVS2, AVS3, etc.), the reserved field may be located in the extension and user data syntax; for some specified coding standards, e.g., the AV1 standard, the reserved field may be located in the OBU or in its extension data. Of course, those skilled in the art should understand that the above examples are only illustrative and not exhaustive, and the reserved field may also be located at other header positions of the specified coding standard, and the disclosure is not limited thereto.
As can be seen from the above embodiments, because syntax elements carried in a code stream may exist in a packet header of a reserved field of the code stream without depending on the existing video coding standard, a video coding technology based on a neural network can be independently implemented; the syntax element may also be located in a header syntax parameter set of an existing multiple coding standard, thereby being compatible with an existing video coding standard. By the method, the coupling between the intelligent video coding technology based on the neural network and the specific video coding standard can be reduced, and the application range of the video coding technology based on the neural network is expanded.
The neural network model that can be used for the encoding process is generally determined by a parameter set composed of a plurality of parameters. In some embodiments, the set of parameters of the neural network model includes at least one or more of input parameters, number of layers, weight parameters, hyper-parameters, number of nodes per layer, and activation functions.
When the neural network models utilized by the encoding process are different, the set of parameters of the neural network models contain different content. The number of parameters of the parameter set of the neural network model is not particularly limited in the present disclosure. Furthermore, it should be understood by those skilled in the art that the above parameters, such as the number of layers, the weighting parameter, the hyper-parameter, the number of nodes per layer, and the activation function, are merely exemplary and not exhaustive, and the parameter set of the neural network model may further include other parameters for determining the neural network model, and the disclosure is not limited thereto.
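For illustration, the parameters listed above could be gathered into a container such as the following sketch; the field names and types are assumptions made for this example only and are not a layout defined by the disclosure.

```python
# Illustrative container for a neural network model parameter set.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NeuralNetworkParameterSet:
    input_shape: List[int]                  # input parameters
    num_layers: int                         # number of layers
    nodes_per_layer: List[int]              # number of nodes in each layer
    activation: str                         # activation function, e.g. "relu"
    hyper_parameters: Dict[str, float]      # hyper-parameters
    weights: List[bytes] = field(default_factory=list)  # serialized weight tensors

nnps = NeuralNetworkParameterSet([1, 3, 64, 64], 4, [64, 64, 32, 1], "relu", {})
```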
With the continuous development of deep learning technology, a variety of Neural Network common formats, such as Neural Network Exchange Format (NNEF) and Open Neural Network Exchange (ONNX), etc., are emerging for mapping the Neural Network framework onto the inference engine. The universal formats provide interfaces for conversion of the neural network model generated by the common deep learning framework, so as to realize interaction and universality of the neural network model among different deep learning frameworks.
In order to enable the neural network model utilized by the encoding process to be deployed and applied under different platforms and frameworks, in some embodiments, the video encoding method includes converting the neural network model into a common format.
In some embodiments, the common format may be NNEF or ONNX. Of course, the common format may also be another common format that allows the neural network model to be shared between different deep learning frameworks, which is not limited by the present disclosure.
In some embodiments, the information characterizing the parameter set of the neural network model included in the syntax element may be corresponding information obtained by converting the parameter set of the neural network model into a general format, that is, based on the corresponding information, the parameter set of the neural network model applied in the encoding process under different deep learning frameworks can be obtained by converting again.
In some embodiments, the syntax element carried by the code stream generated through the encoding process further includes a format conversion enabling identifier, which is used to indicate whether the neural network model is converted into a general format during the encoding process. The format conversion enable flag may be represented by "1" indicating that format conversion is used, and "0" indicating that format conversion is not used. Of course, one skilled in the art will appreciate that other representations may also be used to indicate whether the neural network model is converted to a common format during the encoding process.
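As an illustration of such a conversion, the following hedged sketch exports a small stand-in model to the ONNX common format using PyTorch and records the format conversion enable flag. The model, tensor shape, and file name are placeholders, and the availability of torch with ONNX export support is assumed; the disclosure does not mandate this particular toolchain.

```python
# Hedged sketch: convert a placeholder model to a common format (ONNX) and
# record the format conversion enable flag ("1" = converted).
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3, padding=1),
                            torch.nn.ReLU(),
                            torch.nn.Conv2d(8, 1, 3, padding=1))
dummy = torch.randn(1, 1, 64, 64)                     # assumed input shape
torch.onnx.export(model, dummy, "loop_filter.onnx")   # common-format conversion

format_conversion_enable = 1   # written into the syntax element as "1"
```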
With the continuous development of deep learning technology, there are currently some compression technologies for compressing the parameters of a neural network model. For example, Neural Network Representation (NNR) is a compression framework that compresses and represents neural network models in a manner similar to video compression coding, and includes a variety of compression techniques, such as the Deflate compression technique. In addition to NNR, there are other neural network compression frameworks, such as that of AITISA, among others.
In order to reduce the complexity of the information characterizing the parameter sets of the neural network model, and save the overhead of storage resources and bandwidth resources for transmission, in some embodiments, the information characterizing the parameter sets of the neural network model may be compressed.
In some embodiments, the compression is a compression process based on compression techniques in an NNR-based compression framework or an AITISA compression framework. Of course, those skilled in the art will appreciate that the compression may also be achieved by other compression techniques for neural network models, and the disclosure is not limited thereto.
In some embodiments, the information characterizing the parameter set of the neural network model is corresponding information obtained by compressing the parameter set of the neural network model, so as to save storage resources and reduce bandwidth resources occupied by video data transmission.
In some embodiments, the syntax element further comprises a compression enable flag indicating whether to compress the set of parameters of the neural network model during the encoding process. The compression-enabling flag may be represented by "1" for the compression technique used and "0" for the compression technique not used. Of course, those skilled in the art will appreciate that other representations may be used to indicate whether the set of parameters of the neural network model are compressed during the encoding process.
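A hedged sketch of this idea follows. It uses the general-purpose zlib (Deflate) codec purely as a stand-in for the NNR or AITISA compression tools mentioned above, and the byte buffer is a placeholder for the serialized parameter set.

```python
# Hedged sketch: compress the serialized parameter set and record the
# compression enable flag ("1" = compressed). zlib is an illustrative stand-in.
import zlib

raw = b"\x00" * 4096                       # placeholder serialized parameter set
compressed = zlib.compress(raw, level=9)   # Deflate-style compression
compression_enable = 1
print(len(raw), "->", len(compressed), "bytes")
```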
In some embodiments, during the encoding process, the neural network model may first be converted into a common format, and the parameter set of the neural network model in the common format may then be compressed, so as to realize interaction and universality of the neural network model among different deep learning frameworks, save storage resources, and reduce the bandwidth resources occupied by video data transmission.
In some embodiments, the information characterizing the parameter set of the neural network model is information corresponding to the neural network model after the neural network model is converted into a general format and the parameter set of the converted neural network model is compressed.
In some embodiments, the syntax element further includes a format conversion enable flag and a compression enable flag.
As shown in fig. 7, a schematic diagram of a partial code stream 401a carrying syntax elements obtained by applying a neural network model format conversion technique and a parameter set compression technique to encode video data to be encoded by using a neural network model is given, where the syntax elements include three sub-syntax elements, which are respectively a format conversion enable identifier, a compression enable identifier, and information representing a parameter set of the neural network model. The format conversion enabling identification indicates that the utilized neural network model is converted into a general format in the encoding process of the code stream; the compression enabling identification indicates that the parameter set of the neural network model converted into the common format is compressed; the information of the parameter set for representing the neural network model is the parameter set after the neural network model is converted and compressed in the encoding process. The three sub-syntax elements may be located in the same reserved field or in multiple reserved fields, for example, the three sub-syntax elements are located in three reserved fields respectively. The specific position of the reserved field may be, as described above, located at a code stream position defined by a predefined independent coding standard, or located at a specific field of another standard, and so on, which is not described herein again. The corresponding relationship between the reserved field and the sub-syntax element may be determined based on a predefined independent coding standard, may also be additionally defined and confirmed in other standards, may also be determined by referring to a predefined independent coding standard by other standards, and the like, which is not limited by the present disclosure.
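A possible serialization of the three sub-syntax elements of fig. 7 is sketched below. The single-byte flags and the 4-byte length prefix are assumptions of the sketch rather than a layout defined by the disclosure.

```python
# Hedged sketch of the fig. 7 layout: format conversion enable flag,
# compression enable flag, then the (converted and compressed) parameter set.
def build_syntax_element(param_set: bytes,
                         format_conversion_enable: int = 1,
                         compression_enable: int = 1) -> bytes:
    return (bytes([format_conversion_enable, compression_enable])
            + len(param_set).to_bytes(4, "big")   # assumed length prefix
            + param_set)

reserved_field = build_syntax_element(b"compressed-onnx-parameter-set")
```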
In some embodiments, the syntax element further includes an enable identification of an encoding process using a neural network model for determining whether the encoding process uses the neural network model.
The following is described with reference to a specific embodiment:
referring to fig. 8, a video coding framework incorporating neural network-based loop filtering techniques. When the encoding process is performed based on the video encoding framework, the procedure is as follows: partitioning an image frame of video data to be processed, and then performing predictive coding 101, wherein the predictive coding 101 comprises intra-frame prediction and inter-frame prediction; for the data after the predictive coding process, continuing to perform transform coding 102 and quantization 103 to obtain quantized transform coefficients; meanwhile, inverse quantization 202 and inverse transform coding 203 are carried out on the quantized data, prediction reconstruction is carried out on the basis of the data obtained after prediction coding processing, and then in-loop filtering 801 is carried out to improve the image quality of the reconstructed video data; finally, entropy coding 104 is performed based on the quantized transform coefficients, the prediction mode information, the relevant information of the prediction reconstruction process, and the like, and a code stream after coding processing is obtained.
In encoding processing based on a conventional hybrid video encoding framework, the in-Loop filtering 801 generally includes conventional in-Loop filtering techniques such as Deblocking Filtering (DF) technique, Sample Adaptive Offset (SAO) technique, and Adaptive Loop Filter (ALF) technique.
Compared with the conventional hybrid video coding framework, the video coding framework shown in fig. 8, which incorporates a neural network-based in-loop filtering technique, adds a neural network-based loop filtering (NNF) technique after the deblocking filtering in the in-loop filtering 801. The reason is that conventional filtering techniques have a limited ability to improve the image quality of reconstructed data, whereas a neural network-based filtering technique processes the data by applying latent "rules" that the neural network model has "learned" from the data during training, and its processing effect is generally better than that of conventional filtering techniques. Using the NNF technique in combination with conventional filtering techniques therefore tends to yield reconstructed data of better quality.
Of course, those skilled in the art will appreciate that the location of the NNF in the in-loop filter 801 in fig. 8 is merely exemplary, that the NNF may be located elsewhere in the in-loop filter 801, such as before DF, after ALF, between ALF and SAO of the in-loop filter 801, and so on, and that multiple NNFs may be included in the in-loop filter 801, and the disclosure is not limited thereto.
When the NNF technique is added to the conventional video coding framework, the NNF technique can be used by default in the coding process. Of course, when the processing resources of the processor are insufficient or the quality requirement for the reconstructed video data is low during the encoding process, the NNF technique may not be used during the encoding process. Therefore, the syntax element carried in the code stream obtained through the encoding process may further include an enable flag for performing the encoding process using the neural network model, and the enable flag is used to determine whether the encoding process uses the neural network model. Taking the encoding framework shown in fig. 8 as an example, for the code stream generated after entropy encoding 104, the syntax element carried by the code stream contains, in addition to the information characterizing the parameter set of the neural network model, an enable flag for the encoding process using the neural network model, for example, a "1" indicates that the neural network model is used in the encoding process, and a "0" indicates that the neural network model is not used in the encoding process. Of course, other representations are possible to indicate whether neural network based techniques are used in the encoding process.
Therefore, when the syntax element further includes an enable identifier for performing encoding processing by using the neural network model, the opposite-end device receiving the generated code stream can quickly determine whether the encoding processing uses the neural network model according to the enable identifier, thereby accelerating the decoding speed of the generated code stream.
It should be understood by those skilled in the art that, when the encoding of the video data is implemented with an end-to-end neural network technology and does not rely on a traditional hybrid video coding framework, the syntax elements carried by the code stream may omit the enable flag for encoding with a neural network model, and the peer device that obtains the encoded code stream assumes by default that the code stream was generated by a neural network-based technology.
In the case where the encoding process is implemented based on an improved conventional video coding framework, the improved conventional video coding framework may include at least one of the steps of prediction coding, transform coding, quantization, entropy coding, prediction reconstruction, in-loop filtering, and the like, and may also include other encoding process steps. Thus, the neural network model utilized by the encoding process may be located within a particular step, between two particular steps, and so on.
In some embodiments, the syntax element carried by the code stream further includes processing timing information of the neural network model in the encoding process, and the processing timing information is used to indicate a specific position of the neural network model in the encoding process.
For example, when the encoding processing step is implemented based on a modified conventional video encoding framework, which includes A, B, C and D four steps, the video data to be encoded is processed sequentially through A, B, C, D four steps to obtain a code stream. The syntax element carried by the code stream may further include processing timing information of the neural network model in the encoding process. For example, when the encoding process includes using an in-loop filtering technique based on a neural network between steps a and B, the syntax element carried by the code stream may include a processing timing information indicating: the in-loop filtering technology based on the neural network is between A and B steps of an encoding process. The processing timing information may be in the form of a specific character, for example, indicating that a neural network based technique is located before step a of the encoding process in "0001"; "0010" indicates that a neural network based technique is located in step A of the encoding process; "0011" indicates that a neural network-based technique is located between step a and step B of the encoding process; and so on. Of course, it should be understood by those skilled in the art that the processing timing information may also be represented in other forms, and the disclosure is not limited thereto.
For another example, when the encoding processing step is implemented based on a conventional video coding framework, the processing timing information includes at least any one of the following: information indicating use within predictive coding, information indicating use within transform coding, information indicating use within quantization, information indicating use within entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, information indicating a position after entropy coding, and so on.
The specific form of the processing timing information may be a specific character, for example: a neural network based technique is indicated with "000001" for use in predictive coding, a neural network based technique is indicated with "000010" for use in transform coding, a neural network based technique is indicated with "001000" for use between predictive coding and transform coding, and so on. Of course, it should be understood by those skilled in the art that the specific form of the processing timing information may be other representations, and the disclosure is not limited thereto.
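Collecting the example codes quoted above into a lookup table, a decoder-side helper might look like the following sketch. The choice of 6-bit strings and the dictionary form are illustrative assumptions; only the three codes named in the text are listed.

```python
# Hedged sketch: map the example processing timing codes to positions.
PROCESSING_TIMING = {
    "000001": "within predictive coding",
    "000010": "within transform coding",
    "001000": "between predictive coding and transform coding",
}

def describe_timing(code: str) -> str:
    return PROCESSING_TIMING.get(code, "unspecified position")

print(describe_timing("001000"))
```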
Furthermore, it should be understood by those skilled in the art that, for the encoding process using the neural network model based on the conventional video encoding framework, the above-mentioned process timing information is only an exemplary illustration and is not exhaustive, and the process timing information may further include other contents to indicate the specific position of the neural network model in the encoding process using the neural network model based on the conventional video encoding framework.
Therefore, when the syntax element further includes the processing timing information, the peer device receiving the generated code stream can determine which specific position in the encoding processing process is used by the neural network based technology according to the timing information, and can further correctly decode the generated code stream based on the information.
In the encoding process based on the conventional video coding framework, as described above, a neural network-based in-loop filtering technique may be added to the in-loop filtering 801 shown in fig. 8 to improve the filtering effect on the reconstructed data; accordingly, the processing timing information indicates a position after prediction reconstruction.
In addition to utilizing neural network-based in-loop filtering techniques, other neural network-based techniques may also be used. Continuing with the example of FIG. 8:
In the entropy coding 104, a context probability estimation method based on neural network technology can be adopted to replace the traditional rule-based context probability prediction model, so as to improve the efficiency of entropy coding; accordingly, the processing timing information indicates use within entropy coding.
In the inter-frame prediction stage of the predictive coding 101, a neural network-based image super-resolution technique can be adopted to replace the traditional motion estimation technique, so as to improve motion estimation performance and thus inter-frame prediction efficiency; accordingly, the processing timing information indicates the inter-frame prediction stage within predictive coding.
In the intra-frame prediction stage of the predictive coding 101, a neural network-based intra-frame prediction technique can be adopted which, compared with traditional intra-frame prediction techniques, decides the optimal intra-frame prediction method; accordingly, the processing timing information indicates the intra-frame prediction stage within predictive coding.
In the prediction reconstruction, a neural network-based super-resolution reconstruction technique can be added to obtain reconstructed video data of higher quality; accordingly, the processing timing information indicates a position after entropy coding.
It should be understood by those skilled in the art that the specific neural network model that may be utilized in the above encoding process is merely an exemplary illustration and is not an enumeration, and the encoding process of the video data to be encoded may also include other neural network-based technologies, and accordingly, the processing timing information may also be processing timing information of other contents, and the disclosure is not limited thereto.
In the encoding process using the neural network model based on the conventional video encoding framework, a plurality of neural network models may be included, and the plurality of neural network models may be a plurality of specific neural network models described above, or may be other neural network models.
In the case that the encoding process uses a plurality of neural network models to encode the video data to be coded, the syntax element may include multiple enable flags for encoding with a neural network model, which respectively indicate whether the corresponding neural network model is used to encode the video data. Alternatively, the syntax element may contain one overall enable flag together with a number of enable flags in one-to-one correspondence with the neural network models, which is not limited by this disclosure.
In the case that the encoding process uses a plurality of neural network models to encode the video data to be encoded, the syntax elements included in the code stream may further include identification information of the neural network models. Through the identification information, a specific neural network model can be uniquely determined.
In some embodiments, the identification information may be an index value. For example, the neural network model used by the neural network-based in-loop filtering technique is indicated with an index value of "00000001"; a neural network model used by the neural network-based intra prediction technique is indicated with an index value of "00000010"; a neural network model used by the neural network-based image super-resolution technique is indicated with an index value of "00000011"; and so on.
Of course, the identification information may also be in other forms, and is used to uniquely determine the specific neural network model corresponding to the neural network-based technology, so that the peer device that acquires the code stream can uniquely determine which specific neural network model is used in the encoding process based on the identification information.
In the encoding process using the neural network model, when the encoding process does not convert the neural network model into a general format, the syntax elements carried by the code stream may further include frame information of deep learning of the neural network model used for the encoding process, indicating a frame used by the neural network model.
At present, there are many deep learning frameworks used by neural network models, including TensorFlow, PyTorch, Caffe2, Microsoft Cognitive Toolkit, Apache MXNet, AI hardware accelerators, etc. Each deep learning framework has its own advantages and disadvantages. For example, TensorFlow is an open-source software library that can satisfy almost all machine learning development needs, but its interfaces are relatively low-level and its learning cost is high; PyTorch is a fast and convenient framework for deep learning experimentation, but it is highly encapsulated and less flexible; and so on. Therefore, the deep learning framework of the neural network applied in the encoding process may be a specific deep learning framework, chosen on the basis of the application scenario, processor resources, and other factors. The peer device receiving the code stream may, for some reason, not adopt the same deep learning framework as the one used in the encoding process, or there may be several selectable deep learning frameworks. Therefore, the syntax element carried by the code stream generated by the encoding process may further include the deep learning framework information of the neural network model used in the encoding process, so that the peer device that obtains the code stream can determine, according to the framework information, whether decoding is possible. Further, if decoding is impossible because the corresponding framework is absent, decoding failure information including the reason for the decoding failure may be generated. The deep learning framework information may indicate a specific deep learning framework by characters, or may take other forms of representation, and the disclosure is not limited thereto.
In some embodiments, the encoding process further comprises: determining a neural network framework and a video encoder; and training based on the neural network framework and the video encoder to obtain the neural network model.
In step 301, the neural network model used in the encoding process is a trained neural network model. To determine the neural network model, the deep learning framework and the video encoder used by the neural network model may be selected in advance and sample data constructed. Training is then performed on the constructed sample data based on the selected deep learning framework and video encoder, and when a preset condition is met, the trained neural network model for encoding the video data to be encoded is obtained.
The deep learning framework may be TensorFlow, PyTorch, Caffe2, Microsoft Cognitive Toolkit, Apache MXNet, or the like, and may also be one provided by an AI hardware accelerator. Of course, other types of deep learning frameworks are also possible. The video encoder may be a video coding standard reference software platform such as VTM, HM or JM. The preset condition may be that the loss function reaches a minimum or converges, etc. It should be understood by those skilled in the art that the foregoing examples are merely illustrative, and the present disclosure does not limit the specific types of deep learning frameworks, video encoders and preset conditions.
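As a rough illustration of this training procedure, the following PyTorch-style sketch assumes the neural network model is an in-loop filter trained on pairs of (reconstructed frame, original frame) produced offline with a reference encoder; the architecture, data pipeline and convergence threshold are placeholders chosen for this sketch, not anything specified by the disclosure.

```python
import torch
import torch.nn as nn

class InLoopFilterNet(nn.Module):
    """Placeholder in-loop filter network (residual CNN on luma samples)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # predict the restoration residual

def train_model(pairs, epochs=100, tol=1e-4):
    """pairs: list of (reconstructed, original) tensors of shape (N, 1, H, W),
    e.g. produced with a reference software platform such as VTM/HM/JM."""
    model, loss_fn = InLoopFilterNet(), nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    prev_loss = float("inf")
    for _ in range(epochs):
        total = 0.0
        for recon, orig in pairs:
            opt.zero_grad()
            loss = loss_fn(model(recon), orig)
            loss.backward()
            opt.step()
            total += loss.item()
        if abs(prev_loss - total) < tol:  # "preset condition": the loss has converged
            break
        prev_loss = total
    return model
```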
In the following, the video encoding method of the present disclosure is described with reference to fig. 9, taking as an example an encoding process based on the conventional hybrid video coding framework that includes two neural-network-based technologies: the neural-network-based in-loop filtering technique in the encoding process shown in fig. 9, and the neural-network-based super-resolution reconstruction technique applied after data reconstruction. The encoding process includes the predictive coding 101, transform coding 102, quantization 103 and predictive reconstruction shown in fig. 8. The neural-network-based in-loop filtering technique replaces the conventional in-loop filtering technique in hybrid video coding. The neural-network-based super-resolution reconstruction technique implements up-sampling to obtain reconstructed video data of higher image quality.
Before the video data to be encoded is encoded, it is first down-sampled to obtain sampled video data, which is then encoded by the encoding process. After predictive reconstruction, the data is filtered using the neural-network-based in-loop filtering technique, and the filtered data is then up-sampled using the neural-network-based super-resolution reconstruction technique to obtain the reconstructed video data. After the video data to be encoded has been encoded, the generated code stream carries syntax elements.
As shown in fig. 10, the syntax element may include a plurality of syntax sub-elements, such as identification information of the neural network model, an enable flag for encoding with the neural network model, a format conversion enable flag, a compression enable flag, processing timing information, and information obtained by format-converting and compressing the parameter set of the neural network. Each syntax sub-element may contain two sets of information which, in order, indicate the related information of the neural-network-based in-loop filtering technique and of the neural-network-based super-resolution reconstruction technique, respectively. For example, in the "identification information of the neural network model", the first identification information is "00000001", indicating that the corresponding neural network model is the one used by the neural-network-based in-loop filtering technique; the second identification information is "00000011", indicating that the corresponding neural network model is the one used by the neural-network-based image super-resolution technique. The correspondence between the information in the other syntax sub-elements and the neural network models is the same as for the "identification information of the neural network model" and is not described again here. The syntax sub-elements may be located in the same reserved field of the code stream or in multiple reserved fields of the code stream. The reserved field is as described above and is not described again here.
The syntax element shown in fig. 10 may also be organized as syntax parameter units, each syntax parameter unit storing the following information of one neural network model: identification information of the neural network model, an enable flag for encoding with the neural network model, a format conversion enable flag, a compression enable flag, processing timing information, and information obtained by format-converting and compressing the parameter set of the neural network. For an encoding process involving two neural networks, two syntax parameter units are included in the code stream. The two syntax parameter units may be located, in order, in the same reserved field or in different reserved fields. The reserved field is as described above and is not described again here.
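For illustration only, one possible in-memory representation of such a syntax parameter unit is sketched below; the field names and types are assumptions of this sketch, not normative syntax.

```python
from dataclasses import dataclass

@dataclass
class SyntaxParameterUnit:
    """Illustrative container for one neural network model's syntax information."""
    model_id: int                   # identification information of the neural network model
    nn_enable: bool                 # enable flag: encoding used this neural network model
    format_conversion_enable: bool  # parameter set was converted to a common format
    compression_enable: bool        # parameter set was compressed
    processing_timing: int          # position of the model within the (de)coding pipeline
    parameter_set: bytes            # converted and/or compressed parameter-set payload

# A code stream that uses two neural-network-based techniques would carry two such
# units, e.g. one for in-loop filtering and one for super-resolution reconstruction.
```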
It will be appreciated by those skilled in the art that the above examples are merely illustrative, and that the syntax element and the reserved field may further include the related contents described above.
Through the above embodiments, the syntax elements carried in the code stream can exist in the code stream without depending on existing video coding standards, so that the neural-network-based video coding technology can be implemented independently; the syntax elements may also be located in existing coding standards and thus remain compatible with them. In this way, the coupling between neural-network-based intelligent video coding technology and any specific video coding standard is reduced, and the application range of neural-network-based video coding technology is expanded.
The syntax element may include, in addition to information characterizing a parameter set of the neural network model, an enable flag for performing encoding processing using the neural network model to indicate whether the encoding processing uses a neural network-based technique, so that an opposite-end device that acquires the code stream performs fast decoding; the method can also comprise a format conversion enabling identifier to indicate whether the neural network model carries out format conversion or not in the encoding processing process so as to realize the universality in different deep learning frameworks; the method can also comprise a compression enabling identifier to indicate whether parameters of the neural network model are compressed in the encoding process so as to save storage space and communication resources; processing timing information may also be included to indicate a specific location in the encoding process where the neural network based technique is applied; identification information of the neural network model can be further included, so that when a plurality of neural network models are utilized in the encoding process, a single neural network model is uniquely indicated; and so on.
Those skilled in the art will appreciate that the above examples of the syntax elements are merely illustrative and not exhaustive. Based on the syntactic element, the relevant information of the neural network model in the encoding process can be represented, so that the opposite terminal equipment receiving the code stream can accurately and quickly decode the code stream, and a high-quality decoding effect is obtained.
In addition, as shown in fig. 11, the present disclosure also provides a neural network-based video encoding method, including:
step 1101: coding the video data by using a neural network model;
step 1102: and sending the parameter set of the neural network model to a video encoder so that the video encoder generates a code stream carrying syntax elements based on the video data after encoding processing, wherein the syntax elements comprise information representing the parameter set of the neural network model.
In step 1101, the neural network model used for encoding the video data may be determined by a parameter set composed of a plurality of parameters. In some embodiments, the parameter set of the neural network model includes one or more of input parameters, number of layers, weight parameters, hyper-parameters, number of nodes per layer, and activation functions.
When the neural network models utilized by the encoding process are different, the parameter sets of the neural network models contain different contents. The number of parameters of the parameter set of the neural network model is not particularly limited in the present disclosure. Furthermore, it should be understood by those skilled in the art that the above parameters, such as the number of layers, the weighting parameter, the hyper-parameter, the number of nodes per layer, and the activation function, are merely exemplary and not exhaustive, and the parameter set of the neural network model may further include other parameters for determining the neural network model, and the disclosure is not limited thereto.
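Purely as an illustration of what such a parameter set might contain, a toy example is given below; the keys and values are invented for this sketch and real models would carry the full weight tensors.

```python
# Hypothetical parameter set for a small neural network model.
parameter_set = {
    "input_shape": [1, 1, 64, 64],             # input parameters
    "num_layers": 8,                           # number of layers
    "nodes_per_layer": [32] * 8,               # number of nodes per layer
    "activation": "relu",                      # activation function
    "hyper_parameters": {"learning_rate": 1e-4},
    "weights": "tensor data omitted here",     # weight parameters (binary blobs in practice)
}
```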
In some embodiments, the encoding of the video data using the neural network model in step 1101 may be implemented with an end-to-end neural network, without relying on a conventional hybrid video coding framework. That is, a user may select a specific deep learning framework, construct a data set from video data, and train a neural network model; the video data to be encoded can then be fed into the trained neural network model to generate a code stream that meets given requirements. The obtained code stream can be decoded, and related parameters such as the compression ratio and distortion of the video data reconstructed from the code stream can meet the user's requirements.
In some embodiments, the encoding process of the video data using the neural network model in step 1101 may be implemented in conjunction with a conventional hybrid encoding framework as shown in fig. 1. The encoding processing of the video data by using the neural network model comprises at least one of the following steps:
performing a neural network-based intra-prediction technique during an intra-prediction phase of predictive coding;
in the inter-frame prediction stage of prediction coding, an image super-resolution technology based on a neural network is executed for carrying out inter-frame motion estimation;
after predictive reconstruction, performing a neural network-based in-loop filtering technique;
after entropy encoding, performing a neural network-based image super-resolution technique for obtaining a reconstructed image;
in entropy coding, a neural network based context probability estimation technique is performed.
It should be understood by those skilled in the art that the specific neural network model that may be utilized in the above-described encoding process is merely an exemplary illustration and is not an enumeration, and the encoding process of the video data to be encoded may also include other neural network-based technologies, and accordingly, the specific location of the neural network model in the conventional encoding framework may also be adaptively determined according to the actual situation, and the present disclosure is not limited thereto.
Of course, the encoding process performed on the video data by using the neural network model in step 1101 may also be implemented by combining with other improved hybrid encoding frameworks, and accordingly, in the encoding process, the specific neural network model and the specific position in the hybrid encoding framework that are adopted may be adaptively determined according to the actual situation, which is not limited in this disclosure.
The related content of the syntax element carried by the code stream in the method may be the same as the related content of the syntax element in the video coding method described above.
In some embodiments, the syntax element carried by the generated code stream may be located at a specified position of the code stream, where the specified position is a reserved field of the code stream.
In some embodiments, the reserved field in which the codestream is located may be located in a header of a data packet of the codestream.
In some embodiments, the syntax element includes information characterizing the parameter set of the neural network model, which corresponds to the parameter set of the neural network model after being converted into a common format.
In some embodiments, the syntax element further includes a format conversion enable flag for indicating conversion of the neural network model to the common format.
In some embodiments, the information included in the syntax element and characterizing the parameter set of the neural network model is corresponding information obtained by compressing the parameter set of the neural network model.
In some embodiments, the syntax element further includes a compression enable flag indicating compression of the parameter set of the neural network model.
In some embodiments, the information included in the syntax element and characterizing the parameter set of the neural network model is corresponding information obtained by converting the parameter set of the neural network model into a general format and compressing the parameter set.
In some embodiments, the syntax element further includes an enable identification of an encoding process using a neural network model for determining whether the encoding process uses the neural network.
In some embodiments, the syntax element further includes processing timing information of the neural network model in the encoding process, the processing timing information indicating a specific position of the neural network model during the encoding process.
In some embodiments, when the encoding process of the video data using the neural network model is based on a conventional hybrid video coding framework or a modified hybrid video coding framework, the processing timing information included in the syntax element at least includes any one of the following information: information indicative in predictive coding, information indicative in transform coding, information indicative in quantization, information indicative in entropy coding, information indicative before predictive coding, information indicative between predictive coding and transform coding, information indicative between transform coding and quantization, information indicative between quantization and entropy coding, information indicative after entropy coding.
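One way to carry such processing timing information is as an enumerated code point. The values below are an assumption of this sketch and are not fixed by the disclosure.

```python
from enum import IntEnum

class ProcessingTiming(IntEnum):
    """Illustrative code points for the processing timing information."""
    IN_PREDICTIVE_CODING = 0
    IN_TRANSFORM_CODING = 1
    IN_QUANTIZATION = 2
    IN_ENTROPY_CODING = 3
    BEFORE_PREDICTIVE_CODING = 4
    BETWEEN_PREDICTION_AND_TRANSFORM = 5
    BETWEEN_TRANSFORM_AND_QUANTIZATION = 6
    BETWEEN_QUANTIZATION_AND_ENTROPY = 7
    AFTER_ENTROPY_CODING = 8
```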
In some embodiments, the syntax element further includes identification information of the neural network model.
In some embodiments, the syntax element further includes framework information of the neural network model, the framework information indicating a framework used by the neural network model.
In some embodiments, the code stream is generated based on a specified coding standard, and the reserved field where the syntax element is located is the video parameter set and/or the sequence parameter set and/or the neural network model parameter set and/or the picture parameter set and/or the slice header and/or the supplemental enhancement information and/or the extension data of a syntax element parameter set and/or the user data and/or the open bitstream unit and/or the sequence header and/or the picture group header and/or the slice header and/or the macroblock information.
Specified coding standards include video coding standards such as H.264, H.265, VVC, AVS+, AVS2, AVS3 or AV1, and the reserved field may be a specific field of these video coding standards, for example the Video Parameter Set (VPS), and/or the Sequence Parameter Set (SPS), and/or the Neural Network model Parameter Set (NNPS), and/or the Picture Parameter Set (PPS), and/or the Slice Header, and/or the Supplemental Enhancement Information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or Open Bitstream Units (OBUs), the sequence header, the picture group header, macroblock header information, and the like.
Of course, those skilled in the art should understand that the listed coding standards are given by way of illustration only and are not exhaustive; the present disclosure does not limit which coding standard is used. The specific fields are likewise exemplary rather than exhaustive and may be other specific fields of the specified coding standard; the present disclosure is not limited in this respect.
The details of the above embodiments have been described in detail in the foregoing video encoding method, and are not repeated herein.
In order to make the neural network model utilized by the encoding process deployable and applicable under different platforms and frameworks, in some embodiments, the method for encoding video data by utilizing the neural network model further includes: converting the neural network model to a generic format.
By converting the neural network model to a generic format, the neural network framework can be mapped onto an inference engine. Generic formats provide interfaces for converting neural network models generated by common deep learning frameworks, so that the models can be exchanged and reused across different deep learning frameworks.
In some embodiments, the generic format comprises NNEF or ONNX. Of course, other generic formats that allow the neural network model to be used across different deep learning frameworks are also possible, and the present disclosure is not limited in this respect.
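For example, a model trained in PyTorch can be serialized to the ONNX common format with the standard export call; the helper function and file names below are chosen for this sketch only.

```python
import torch

def export_to_onnx(model: torch.nn.Module, example_input: torch.Tensor,
                   path: str = "nn_model.onnx") -> str:
    """Convert a trained PyTorch model into the ONNX common format."""
    model.eval()
    torch.onnx.export(model, example_input, path)  # traces the model and writes ONNX
    return path
```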
In order to reduce complexity of information characterizing the parameter set of the neural network model, and save overhead of storage resources and bandwidth resources for transmission, in some embodiments, the neural network-based video encoding method further includes: compressing the information characterizing the set of parameters of the neural network.
In some embodiments, the compression is performed using the compression techniques of the NNR compression framework or the AITISA compression framework. Of course, those skilled in the art will appreciate that the compression may also be achieved by other compression techniques for neural network models, and the disclosure is not limited thereto.
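The following sketch only illustrates the idea of pairing a compression enable flag with a compressed parameter-set payload; a generic lossless compressor (zlib) stands in for the dedicated NNR or AITISA tools, which are not reproduced here.

```python
import zlib

def compress_parameter_set(serialized_model: bytes) -> tuple[bool, bytes]:
    """Return (compression_enable, payload); zlib is a stand-in for NNR/AITISA coding."""
    payload = zlib.compress(serialized_model, level=9)
    return True, payload
```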
In some embodiments, the neural network model is determined by: determining a neural network framework and a video encoder; and training based on the neural network framework and the video encoder to obtain the neural network model.
In step 301, the neural network model used in the encoding process is a trained neural network model. To determine the neural network model, the deep learning framework and the video encoder used by the neural network model may be selected in advance and sample data constructed. Training is then performed on the constructed sample data based on the selected deep learning framework and video encoder, and when a preset condition is met, the trained neural network model for encoding the video data to be encoded is obtained.
In some embodiments, the deep learning framework of the neural network model may be TensorFlow, PyTorch, Caffe2, Microsoft Cognitive Toolkit, Apache MXNet, or the like, and may also be one provided by an AI hardware accelerator. Of course, other types of deep learning frameworks are also possible. The video encoder may be a video coding standard reference software platform such as VTM, HM or JM. The preset condition may be that the loss function reaches a minimum or converges, etc. It should be understood by those skilled in the art that the foregoing examples are merely illustrative, and the present disclosure does not limit the specific types of deep learning frameworks, video encoders and preset conditions.
As can be seen from the above embodiments, since the code stream carries syntax elements including information characterizing the parameter set of the neural network model, the syntax elements can exist in the code stream without depending on existing video coding standards, so that the neural-network-based video coding technology can be implemented independently; the syntax elements may also be located in existing coding standards and thus remain compatible with them. In this way, the coupling between neural-network-based intelligent video coding technology and any specific video coding standard is reduced, and the application range of neural-network-based video coding technology is expanded.
Corresponding to a video encoding method provided by the present disclosure, as shown in fig. 12, the present disclosure also provides a video decoding method, including:
step 1201: analyzing the received video code stream to obtain a syntax element of the video code stream, wherein the syntax element comprises information of a parameter set representing a neural network model;
step 1202: and decoding the video code stream by using a neural network model corresponding to the information of the characterization parameter set according to the syntax element.
The decoding of the video data, corresponding to the encoding process, can be implemented using an end-to-end neural network without relying on a conventional hybrid video decoding framework.
Of course, it should be understood by those skilled in the art that the decoding process performed on the video data may also be implemented based on a conventional hybrid video decoding framework. In some embodiments, the conventional hybrid video decoding framework may be a decoding framework as shown in fig. 2.
Of course, those skilled in the art will appreciate that the conventional hybrid video decoding framework may be modified into other forms as well, and the present disclosure does not limit the video decoding framework for decoding.
The syntax element corresponds to the syntax element carried by the code stream produced by the corresponding encoding process, and its related content is the same as that of the syntax element in the video encoding method described above.
In some embodiments, the syntax element may be located at a specified position of the codestream, where the specified position is a reserved field of the codestream.
In some embodiments, the reserved field in which the syntax element is located may be located in a header of a data packet of the codestream.
In some embodiments, the set of parameters of the neural network model includes one or more of input parameters, number of layers, weight parameters, hyper-parameters, number of nodes per layer, and activation functions.
It should be understood by those skilled in the art that the above parameters, such as the number of layers, the weight parameter, the hyper-parameter, the number of nodes per layer, and the activation function, are merely illustrative and not exhaustive, and the parameter set of the neural network model may further include other parameters for determining the neural network model, and the disclosure is not limited thereto.
In some embodiments, the syntax element includes information characterizing the parameter set of the neural network model, which corresponds to the parameter set of the neural network model after being converted into a common format. When decoding, the decoding end may convert the parameter set of the neural network model in the format into the neural network model that can be used by the decoding end by default for decoding.
In some embodiments, the syntax element further includes a format conversion enable flag for indicating conversion of the neural network model to the common format. Therefore, the decoding end can determine whether to convert the parameter set information of the neural network model in the decoding process based on the format conversion enabling identification.
In some embodiments, the syntax element may further include information characterizing a framework of the neural network model.
In some embodiments, the framework of the neural network model may include: TensorFlow, PyTorch, Caffe2 or an AI hardware accelerator. Of course, other types of deep learning frameworks are also possible, and the present application is not limited in this respect.
In some embodiments, the information included in the syntax element and characterizing the parameter set of the neural network model is corresponding information obtained by compressing the parameter set of the neural network model. When the decoding end decodes, the parameter set of the neural network model can be decompressed by default for decoding.
In some embodiments, the syntax element further includes a compression enable flag indicating compression of the parameter set of the neural network model. Therefore, the decoding end can determine whether to correspondingly decompress the parameter set information of the neural network model in the decoding process based on the compression enabling identification.
In some embodiments, the information included in the syntax element and characterizing the parameter set of the neural network model is corresponding information obtained by converting the parameter set of the neural network model into a general format and compressing the parameter set. When the decoding end decodes, the parameter set of the neural network model can be converted and decompressed by default for decoding.
In some embodiments, the syntax element further includes an enable identification of an encoding process using a neural network model for determining whether the encoding process uses the neural network. Therefore, when decoding, the decoding end can confirm whether the decoding process uses the corresponding neural network based on the enabling identification.
In some embodiments, the syntax element further includes processing timing information of the neural network model in the encoding process, the processing timing information indicating a specific position of the neural network model during the encoding process. Therefore, when decoding, the decoding end can determine which specific positions in the decoding process to perform corresponding decoding by using a neural network based technology based on the processing time sequence information.
In some embodiments, when the encoding process of the video data using the neural network model is based on a conventional hybrid video coding framework or a modified hybrid video coding framework, the processing timing information included in the syntax element at least includes any one of the following information: information indicative in predictive coding, information indicative in transform coding, information indicative in quantization, information indicative in entropy coding, information indicative before predictive coding, information indicative between predictive coding and transform coding, information indicative between transform coding and quantization, information indicative between quantization and entropy coding, information indicative after entropy coding. Therefore, the decoding end can determine specific positions in the decoding process to perform corresponding decoding by using a neural network-based technology based on the processing time sequence information.
In some embodiments, the syntax element further includes identification information of the neural network model.
In some embodiments, the code stream is generated based on a specified coding standard, and the reserved field where the syntax element is located is the video parameter set and/or the sequence parameter set and/or the neural network model parameter set and/or the picture parameter set and/or the slice header and/or the supplemental enhancement information and/or the extension data of a syntax element parameter set and/or the user data and/or the open bitstream unit and/or the sequence header and/or the picture group header and/or the slice header and/or the macroblock information.
Specified coding standards include video coding standards such as H.264, H.265, VVC, AVS+, AVS2, AVS3 or AV1, and the reserved field may be a specific field of these video coding standards, for example the Video Parameter Set (VPS), and/or the Sequence Parameter Set (SPS), and/or the Picture Parameter Set (PPS), and/or the Neural Network model Parameter Set (NNPS), and/or the Slice Header, and/or the Supplemental Enhancement Information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or Open Bitstream Units (OBUs), the sequence header, the picture group header, macroblock header information, and the like.
Of course, those skilled in the art should understand that the listed coding standards are given by way of illustration only and are not exhaustive; the present disclosure does not limit which coding standard is used. The specific fields are likewise exemplary rather than exhaustive and may be other specific fields of the specified coding standard; the present disclosure is not limited in this respect.
The details of the above embodiments have been described in detail in the foregoing video encoding method, and are not repeated herein.
In some embodiments, in order to enable the neural network model utilized by the encoding process to be deployed and applied under different platforms and frameworks, the encoding process performed by the encoding end on the video data further includes: converting the neural network model to a generic format. Accordingly, the video decoding method further comprises: when the format conversion enable flag indicates that the neural network model has been converted into the generic format, converting the neural network model in the generic format into a neural network model of the specified framework to be applied to decoding of the code stream.
In some embodiments, the generic format comprises NNEF or ONNX. Of course, other generic formats that allow the neural network model to be used across different deep learning frameworks are also possible, and the present disclosure is not limited in this respect.
In order to reduce the complexity of the information characterizing the parameter set of the neural network model and save the overhead of storage resources and bandwidth resources for transmission, in some embodiments, the encoding end compresses the information characterizing the parameter set of the neural network. Accordingly, the video decoding method further comprises: when the compression enable flag indicates that the parameter set of the neural network model has been compressed, decompressing the compressed parameter set of the neural network model.
In some embodiments, the decompressing is a decompression process based on a decompression technique corresponding to a compression technique in a compression framework of the NNR or the AITISA. Of course, those skilled in the art will appreciate that the decompression may also be achieved by other decompression techniques for neural network models, and the present disclosure is not limited thereto.
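Putting the decoder-side steps together, a minimal sketch follows, assuming the parameter set was exported to ONNX and compressed with the stand-in compressor shown earlier, and that an ONNX-capable inference engine such as onnxruntime is available; none of these choices are mandated by the disclosure.

```python
import zlib
import onnxruntime as ort  # assumed inference engine for the common format

def load_decoder_model(unit):
    """unit: a parsed syntax parameter unit (see the earlier SyntaxParameterUnit sketch)."""
    payload = unit.parameter_set
    if unit.compression_enable:             # compression enable flag is set
        payload = zlib.decompress(payload)  # stand-in for an NNR/AITISA decompressor
    if unit.format_conversion_enable:       # parameter set is in the common format
        return ort.InferenceSession(payload)  # ONNX bytes can be loaded directly
    raise ValueError("framework-specific parameter set: the matching framework is required")
```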
In some embodiments, the encoding end may encode the video data based on a neural network technology. Accordingly, the video decoding method may further include at least one of the following steps:
the decoding processing using the neural network model is a neural-network-based in-loop filtering technique, and the processing timing information indicates a position after predictive reconstruction;
the decoding processing using the neural network model is a neural-network-based intra prediction technique, and the processing timing information indicates the intra prediction stage of predictive reconstruction;
the decoding processing using the neural network model is a neural-network-based image super-resolution technique for inter-frame motion estimation, and the processing timing information indicates the inter prediction stage of predictive reconstruction;
the decoding processing using the neural network model is a neural-network-based image super-resolution technique for obtaining a reconstructed image, and the processing timing information indicates a position after predictive reconstruction;
the decoding processing using the neural network model is a neural-network-based context probability estimation technique, and the processing timing information indicates entropy decoding; and so on.
It should be understood by those skilled in the art that the specific neural network models that may be utilized in the decoding process described above are merely exemplary illustrations and not an enumeration; the decoding processing of the code stream may also include other neural-network-based technologies, and accordingly the processing timing information may indicate other positions, and the disclosure is not limited thereto.
In some embodiments, the decoding process is implemented based on a video decoder. The decoder includes: memory, a processor, and a computer program stored on the memory and executable on the processor.
Therefore, because the syntax elements carried in the code stream can exist in a reserved field of the code stream without depending on existing video coding standards, the neural-network-based video decoding technology can be implemented independently; the syntax elements may also be located in specific fields of existing coding standards or follow a predefined independent video coding standard, and thus remain compatible with existing video coding standards. In this way, the coupling between neural-network-based intelligent video decoding technology and any specific video coding standard is reduced, and the application range of neural-network-based video decoding technology is expanded.
In addition, as shown in fig. 13, the present disclosure also provides a neural network-based video decoding method, including:
step 1301: obtaining a syntax element obtained after a video decoder analyzes a video code stream, wherein the syntax element comprises information of a parameter set representing a neural network model;
step 1302: and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
In step 1301, the neural network model used for encoding the video data may be determined by a parameter set composed of a plurality of parameters. In some embodiments, the set of parameters of the neural network model includes one or more of input parameters, number of layers, weight parameters, hyper-parameters, number of nodes per layer, and activation functions.
It should be understood by those skilled in the art that the above parameters, such as the number of layers, the weight parameter, the hyper-parameter, the number of nodes per layer, and the activation function, are merely illustrative and not exhaustive, and the parameter set of the neural network model may further include other parameters for determining the neural network model, and the disclosure is not limited thereto.
In some embodiments, the decoding process of the video bitstream using the neural network model corresponding to the parameter set in step 1302 may be implemented in combination with a conventional hybrid coding framework as shown in fig. 2. The decoding process of the video data by using the neural network model may include at least one of the following steps:
in the intra-frame prediction stage of prediction reconstruction, performing a neural network-based intra-frame prediction technology;
in the inter-frame prediction stage of prediction reconstruction, an image super-resolution technology based on a neural network is executed for performing inter-frame motion estimation;
after predictive reconstruction, performing a neural network-based in-loop filtering technique;
after predictive reconstruction, performing a neural network-based image super-resolution technique for obtaining a reconstructed image;
in entropy decoding, a neural network based context probability estimation technique is performed.
It should be understood by those skilled in the art that, in the above decoding process, a specific neural network model may be utilized, which is merely an exemplary illustration and is not an enumeration, and the decoding process of the video code stream may further include other neural network-based technologies, accordingly, a specific position of the neural network model in a conventional decoding framework may also be adaptively determined according to an actual situation of an encoding end, and the present disclosure does not limit this.
Of course, the decoding processing performed on the video code stream by using the neural network model in step 1302 may also be implemented by combining with other improved hybrid decoding frames, and accordingly, in the decoding processing process, the specific neural network model and the specific position in the hybrid decoding frame that are adopted may be adaptively determined according to the actual situation of the encoding end, which is not limited in this disclosure.
The related content of the syntax element carried by the code stream in the method may be the same as the related content of the syntax element in the video coding method described above.
In some embodiments, the syntax element carried by the generated code stream may be located at a specified position of the code stream, where the specified position is a reserved field of the code stream.
In some embodiments, the reserved field in which the codestream is located may be located in a header of a data packet of the codestream.
In some embodiments, the syntax element includes information characterizing the parameter set of the neural network model, which corresponds to the parameter set of the neural network model after being converted into a common format. When decoding, the decoding end may convert the parameter set of the neural network model in the format into the neural network model that can be used by the decoding end by default for decoding.
In some embodiments, the syntax element further includes a format conversion enable flag for indicating conversion of the neural network model to the common format. Therefore, the decoding end can determine whether to convert the parameter set information of the neural network model in the decoding process based on the format conversion enabling identification.
In some embodiments, the syntax element may further include information characterizing a framework of the neural network model.
In some embodiments, the framework of the neural network model may include: TensorFlow, PyTorch, Caffe2 or an AI hardware accelerator. Of course, other types of deep learning frameworks are also possible, and the present application is not limited in this respect.
In some embodiments, the information included in the syntax element and characterizing the parameter set of the neural network model is corresponding information obtained by compressing the parameter set of the neural network model. When the decoding end decodes, the parameter set of the neural network model can be decompressed by default for decoding.
In some embodiments, the syntax element further includes a compression enable flag indicating compression of the parameter set of the neural network model. Therefore, the decoding end can determine whether to correspondingly decompress the parameter set information of the neural network model in the decoding process based on the compression enabling identification.
In some embodiments, the information included in the syntax element and characterizing the parameter set of the neural network model is corresponding information obtained by converting the parameter set of the neural network model into a general format and compressing the parameter set. When the decoding end decodes, the parameter set of the neural network model can be converted and decompressed by default for decoding.
In some embodiments, the syntax element further includes an enable identification of an encoding process using a neural network model for determining whether the encoding process uses the neural network. Therefore, when decoding, the decoding end can confirm whether the decoding process uses the corresponding neural network based on the enabling identification.
In some embodiments, the syntax element further includes processing timing information of the neural network model in the encoding process, the processing timing information indicating a specific position of the neural network model during the encoding process. Therefore, when decoding, the decoding end can determine which specific positions in the decoding process to perform corresponding decoding by using a neural network based technology based on the processing time sequence information.
In some embodiments, when the encoding process of the video data using the neural network model is based on a conventional hybrid video coding framework or a modified hybrid video coding framework, the processing timing information included in the syntax element at least includes any one of the following information: information indicative in predictive coding, information indicative in transform coding, information indicative in quantization, information indicative in entropy coding, information indicative before predictive coding, information indicative between predictive coding and transform coding, information indicative between transform coding and quantization, information indicative between quantization and entropy coding, information indicative after entropy coding. Therefore, the decoding end can determine specific positions in the decoding process to perform corresponding decoding by using a neural network-based technology based on the processing time sequence information.
In some embodiments, the syntax element further includes identification information of the neural network model.
In some embodiments, the code stream is generated based on a specified coding standard, and the reserved field where the syntax element is located is the video parameter set and/or the sequence parameter set and/or the neural network model parameter set and/or the picture parameter set and/or the slice header and/or the supplemental enhancement information and/or the extension data of a syntax element parameter set and/or the user data and/or the open bitstream unit and/or the sequence header and/or the picture group header and/or the slice header and/or the macroblock information.
Specified coding standards include video coding standards such as H.264, H.265, VVC, AVS+, AVS2, AVS3 or AV1, and the reserved field may be a specific field of these video coding standards, for example the Video Parameter Set (VPS), and/or the Sequence Parameter Set (SPS), and/or the Neural Network model Parameter Set (NNPS), and/or the Picture Parameter Set (PPS), and/or the Slice Header, and/or the Supplemental Enhancement Information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or Open Bitstream Units (OBUs), the sequence header, the picture group header, macroblock header information, and the like.
Of course, those skilled in the art should understand that the listed coding standards are given by way of illustration only and are not exhaustive; the present disclosure does not limit which coding standard is used. The specific fields are likewise exemplary rather than exhaustive and may be other specific fields of the specified coding standard; the present disclosure is not limited in this respect.
The details of the above embodiments have been described in detail in the foregoing video encoding method, and are not repeated herein.
In some embodiments, in order to enable the neural network model utilized by the encoding process to be deployed and applied under different platforms and frameworks, the encoding process performed by the encoding end on the video data further includes: converting the neural network model to a generic format. Accordingly, the video decoding method further comprises: when the format conversion enable flag indicates that the neural network model has been converted into the generic format, converting the neural network model in the generic format into a neural network model of the specified framework to be applied to decoding of the code stream.
In some embodiments, the generic format comprises NNEF or ONNX. Of course, other generic formats that allow the neural network model to be used across different deep learning frameworks are also possible, and the present disclosure is not limited in this respect.
In order to reduce the complexity of the information characterizing the parameter set of the neural network model and save the overhead of storage resources and bandwidth resources for transmission, in some embodiments, the encoding end compresses the information characterizing the parameter set of the neural network. Accordingly, the video decoding method further comprises: when the compression enable flag indicates that the parameter set of the neural network model has been compressed, decompressing the compressed parameter set of the neural network model.
In some embodiments, the decompressing is a decompression process based on a decompression technique corresponding to a compression technique in a compression framework of the NNR or the AITISA. Of course, those skilled in the art will appreciate that the decompression may also be achieved by other decompression techniques for neural network models, and the present disclosure is not limited thereto.
In some embodiments, the encoding end may encode the video data based on a neural network technology. Accordingly, the video decoding method may further include at least one of the following steps:
the decoding processing using the neural network model is a neural-network-based in-loop filtering technique, and the processing timing information indicates a position after predictive reconstruction;
the decoding processing using the neural network model is a neural-network-based intra prediction technique, and the processing timing information indicates the intra prediction stage of predictive reconstruction;
the decoding processing using the neural network model is a neural-network-based image super-resolution technique for inter-frame motion estimation, and the processing timing information indicates the inter prediction stage of predictive reconstruction;
the decoding processing using the neural network model is a neural-network-based image super-resolution technique for obtaining a reconstructed image, and the processing timing information indicates a position after predictive reconstruction;
the decoding processing using the neural network model is a neural-network-based context probability estimation technique, and the processing timing information indicates entropy decoding; and so on.
It should be understood by those skilled in the art that the specific neural network models that may be utilized in the decoding process described above are merely exemplary illustrations and not an enumeration; the decoding processing of the code stream may also include other neural-network-based technologies, and accordingly the processing timing information may indicate other positions, and the disclosure is not limited thereto.
In some embodiments, the decoding process is implemented based on a video decoder. The decoder includes: a memory, a processor, and a computer program stored on the memory and executable on the processor.
Therefore, because the syntax elements carried in the code stream can exist in a reserved field of the code stream without depending on existing video coding standards, the neural-network-based video decoding technology can be implemented independently; the syntax elements may also be located in specific fields of existing coding standards or follow a predefined independent video coding standard, and thus remain compatible with existing video coding standards. In this way, the coupling between neural-network-based intelligent video decoding technology and any specific video coding standard is reduced, and the application range of neural-network-based video decoding technology is expanded.
Accordingly, the present disclosure also provides a video encoder, whose schematic structural diagram is shown in fig. 14, the video encoder includes: a memory 1401, a processor 1402 and a computer program stored on the memory and executable on the processor, which when executing the program implements the method of:
performing encoding processing on video data, the encoding processing including encoding processing using a neural network model;
and generating a code stream carrying syntax elements based on the video data after the coding processing, wherein the syntax elements comprise information of parameter sets representing the neural network model.
The specific implementation method of video coding is as described above, and details are not repeated in this disclosure. It will be appreciated by those skilled in the art that the encoder may also be used to implement the various video encoding embodiments described earlier in this disclosure.
Correspondingly, the present disclosure also provides a video decoder, whose structural schematic diagram can also be as shown in fig. 14, the video decoder includes: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program:
analyzing the received video code stream to obtain a syntax element of the video code stream, wherein the syntax element comprises information of a parameter set representing a neural network model;
and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
The specific implementation method of video decoding is as described above, and details are not repeated in this disclosure. It will be appreciated by those skilled in the art that the decoder may also be used to implement the various video decoding embodiments described earlier in this disclosure.
Accordingly, the present disclosure also provides an AI accelerator for video coding, whose structural schematic diagram can also be as shown in fig. 14, the AI accelerator including: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program:
performing encoding processing on the video data;
and sending the parameter set of the neural network to a video encoder so that the video encoder generates a code stream carrying syntax elements based on the video data after encoding processing, wherein the syntax elements contain information representing the parameter set of the neural network.
The specific implementation of the method is as described above, and details are not repeated in this disclosure. It will be appreciated by those skilled in the art that the AI accelerator may also be used to implement the various video encoding embodiments described earlier in this disclosure.
Accordingly, the present disclosure also provides an AI accelerator for video decoding, whose structural schematic diagram can also be as shown in fig. 14, the AI accelerator including: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program:
obtaining a syntax element obtained after a video decoder analyzes a video code stream, wherein the syntax element comprises information of a parameter set representing a neural network;
and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
The specific implementation of the method is as described above, and details are not repeated in this disclosure. It will be appreciated by those skilled in the art that the AI accelerator may also be used to implement the various video decoding embodiments described earlier in this disclosure.
The embodiments of the present disclosure also provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method of any of the preceding embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The various technical features in the above embodiments may be combined arbitrarily as long as the combinations involve no conflict or contradiction between the features; they are not described one by one only for reasons of space. Therefore, any combination of the technical features in the above embodiments also falls within the scope of the present disclosure.

Claims (93)

  1. A method of video encoding, the method comprising:
    performing encoding processing on video data, the encoding processing including encoding processing using a neural network model;
    and generating a code stream carrying syntax elements based on the video data after the coding processing, wherein the syntax elements comprise information of parameter sets representing the neural network model.
  2. The method of claim 1, wherein the syntax element is located at a specified position of the codestream, and wherein the specified position is a reserved field of the codestream.
  3. The method of claim 2, wherein the reserved field is located in a header of a data packet of the codestream.
  4. The method of claim 1, wherein the set of parameters of the neural network model comprises at least one of: the input parameters, the number of layers, the weight parameters, the hyper-parameters, the number of nodes on each layer and the activation functions of the neural network.
  5. The method of claim 1, further comprising: converting the neural network model into a common format.
  6. The method of claim 5, wherein the information characterizing the set of parameters of the neural network model is corresponding information after converting the set of parameters of the neural network model into a common format.
  7. The method of claim 5, wherein the syntax element further comprises a format conversion enable flag indicating to convert the neural network model into the common format.
  8. The method of claim 1, further comprising: compressing the information characterizing the set of parameters of the neural network model.
  9. The method of claim 8, wherein the information characterizing the set of parameters of the neural network model is information corresponding to the compressed set of parameters of the neural network model.
  10. The method of claim 8, wherein the syntax element further comprises a compression enable flag indicating compression of the parameter set of the neural network model.
  11. The method of claim 1, wherein the information characterizing the parameter set of the neural network model is corresponding information obtained by converting the parameter set of the neural network model into a common format and then compressing the converted parameter set.
  12. The method according to any of claims 5 to 7, 11, wherein the common format comprises NNFF or ONNX.
  13. The method according to any of claims 8 to 11, wherein the compression is performed by a compression technique in an NNR-based compression framework or an AITISA compression framework.
  14. The method of claim 1, wherein the syntax element further comprises an enable flag for a coding process using a neural network model for determining whether the coding process uses the neural network model.
  15. The method of claim 1, wherein the syntax element further comprises processing timing information of the neural network model in the encoding process, the processing timing information indicating a specific position of the neural network model in the encoding process.
  16. The method according to claim 15, wherein the processing timing information includes at least any one of:
    information indicating a position in predictive coding, information indicating a position in transform coding, information indicating a position in quantization, information indicating a position in entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, and information indicating a position after entropy coding.
  17. The method according to claim 16, wherein the encoding process using the neural network model is a neural network-based in-loop filtering technique, and the processing timing information is information indicating a position after prediction reconstruction;
    and/or,
    the encoding process using the neural network model is a neural network-based intra-frame prediction technique, and the processing timing information is information indicating an intra-frame prediction stage of predictive coding;
    and/or,
    the encoding process using the neural network model is a neural network-based image super-resolution technique for performing inter-frame motion estimation, and the processing timing information is information indicating an inter-frame prediction stage of predictive coding;
    and/or,
    the encoding process using the neural network model is a neural network-based image super-resolution technique for obtaining a reconstructed image, and the processing timing information is information indicating a position after entropy encoding;
    and/or,
    the encoding process using the neural network model is a neural network-based context probability estimation technique, and the processing timing information is information indicating a position in entropy encoding.
  18. The method of claim 1, wherein the syntax element further comprises identification information of the neural network model.
  19. The method of claim 1, wherein the syntax element further comprises framework information of the neural network model, and wherein the framework information indicates a framework used by the neural network model.
  20. The method of claim 19, wherein the framework of the neural network model comprises: TensorFlow or PyTorch or Caffe2 or AI hardware accelerators.
  21. The method of claim 1, wherein the encoding process further comprises:
    determining a neural network framework and a video encoder;
    and training based on the neural network framework and the video encoder to obtain the neural network model.
  22. The method of claim 2, wherein the bitstream is generated based on a specified coding standard, and the reserved field is located in a video parameter set, and/or a sequence parameter set, and/or a neural network model parameter set, and/or a picture parameter set, and/or a slice header, and/or auxiliary enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or an open bitstream unit, and/or a sequence header, and/or a picture group header, and/or a slice header, and/or macro block information of the bitstream.
  23. A method for neural network-based video encoding, the method comprising:
    coding the video data by using a neural network model;
    and sending the parameter set of the neural network model to a video encoder so that the video encoder generates a code stream carrying syntax elements based on the video data after encoding processing, wherein the syntax elements comprise information representing the parameter set of the neural network model.
  24. The method of claim 23, wherein the set of parameters of the neural network model comprises at least one of: the input parameters, the number of layers, the weight parameters, the hyper-parameters, the number of nodes of each layer and the activation functions of the neural network.
  25. The method of claim 23, wherein said encoding video data using a neural network model comprises:
    performing a neural network-based intra-frame prediction technique in an intra-frame prediction stage of predictive coding;
    and/or,
    performing, in an inter-frame prediction stage of predictive coding, a neural network-based image super-resolution technique for performing inter-frame motion estimation;
    and/or,
    performing a neural network-based in-loop filtering technique after prediction reconstruction;
    and/or,
    performing, after entropy encoding, a neural network-based image super-resolution technique for obtaining a reconstructed image;
    and/or,
    performing a neural network-based context probability estimation technique in entropy encoding.
  26. The method of claim 23, wherein the syntax element is located in a specified location of the codestream, and wherein the specified location is a reserved field of the codestream.
  27. The method of claim 26, wherein the reserved field is located in a header of a data packet of the codestream.
  28. The method of claim 23, further comprising: converting the neural network model to a common format.
  29. The method of claim 28, wherein the information characterizing the set of parameters of the neural network model is corresponding information obtained by converting the set of parameters of the neural network model into a common format.
  30. The method of claim 28, wherein the syntax element further comprises a format conversion enable flag indicating conversion of the neural network model into the common format.
  31. The method of claim 23, further comprising: compressing the information characterizing the set of parameters of the neural network model.
  32. The method of claim 31, wherein the information characterizing the set of parameters of the neural network model is information corresponding to the compressed set of parameters of the neural network model.
  33. The method of claim 31, wherein the syntax element further comprises a compression enable flag indicating compression of the parameter set of the neural network model.
  34. The method of claim 23, wherein the information characterizing the set of parameters of the neural network model is corresponding information obtained by converting the set of parameters of the neural network model into a common format and then compressing the converted set of parameters.
  35. The method according to any of claims 29 to 30, 34, wherein the common format comprises NNFF or ONNX.
  36. The method of any of claims 31 to 34, wherein the compression is performed by a compression technique in an NNR-based compression framework or an AITISA compression framework.
  37. The method of claim 23, wherein the syntax element further comprises an enable flag for an encoding process using a neural network model, for determining whether the encoding process uses the neural network model.
  38. The method of claim 23, wherein the syntax element further comprises processing timing information of the neural network model in the encoding process, the processing timing information indicating a specific position of the neural network model in the encoding process.
  39. The method of claim 38, wherein the processing timing information comprises at least any one of:
    information indicating a position in predictive coding, information indicating a position in transform coding, information indicating a position in quantization, information indicating a position in entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, and information indicating a position after entropy coding.
  40. The method of claim 23, wherein the syntax element further comprises identification information of the neural network model.
  41. The method of claim 23, wherein the syntax element further comprises framework information of the neural network model, and wherein the framework information indicates a framework used by the neural network model.
  42. The method of claim 41, wherein the framework of the neural network model comprises: TensorFlow or PyTorch or Caffe2 or AI hardware accelerators.
  43. The method of claim 26, wherein the neural network model is determined by:
    determining a neural network framework and a video encoder;
    and training based on the neural network framework and the video encoder to obtain the neural network model.
  44. The method of claim 26, wherein the bitstream is generated based on a specified coding standard, and the reserved field is located in a video parameter set, and/or a sequence parameter set, and/or a neural network model parameter set, and/or a picture parameter set, and/or a slice header, and/or auxiliary enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or an open bitstream unit, and/or a sequence header, and/or a picture group header, and/or a slice header, and/or macro block information of the bitstream.
  45. A method of video decoding, the method comprising:
    parsing the received video code stream to obtain a syntax element of the video code stream, wherein the syntax element comprises information characterizing a parameter set of a neural network model;
    and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
  46. The method of claim 45, wherein the syntax element is located in a specified position of the codestream, and wherein the specified position is a reserved field of the codestream.
  47. The method of claim 46, wherein the reserved field is located in a header of a data packet of the codestream.
  48. The method of claim 45, wherein the set of parameters of the neural network model includes at least one of: the input parameters, the number of layers, the weight parameters, the hyper-parameters, the number of nodes on each layer and the activation functions of the neural network.
  49. The method of claim 45, wherein the information characterizing the set of parameters of the neural network model is corresponding information after converting the set of parameters of the neural network model into a common format.
  50. The method of claim 49, wherein the syntax element further comprises a format conversion enable flag for indicating the conversion of the neural network model into the common format.
  51. The method of claim 50, further comprising:
    and when the format conversion enable flag indicates that the neural network model has been converted into the common format, converting the neural network model in the common format into a neural network model of a specified framework.
  52. The method of claim 51, wherein the syntax element further comprises information characterizing a framework of the neural network model.
  53. The method of claim 51, wherein the framework of the neural network model comprises: TensorFlow or PyTorch or Caffe2 or AI hardware accelerators.
  54. The method of claim 45, wherein the information characterizing the set of parameters of the neural network model is information corresponding to the compressed set of parameters of the neural network model.
  55. The method of claim 54, wherein the syntax element further comprises a compression enable flag indicating compression of the parameter set of the neural network model.
  56. The method of claim 55, further comprising:
    and when the compression enable flag indicates that the parameter set of the neural network model is compressed, decompressing the compressed parameter set of the neural network model.
  57. The method of claim 45, wherein the information characterizing the set of parameters of the neural network model is corresponding information obtained by converting the set of parameters of the neural network model into a common format and then compressing the converted set of parameters.
  58. The method according to any of claims 49 to 51, 57, wherein the common format comprises NNFF or ONNX.
  59. The method of any of claims 52 to 57, wherein the decompression is performed by a decompression technique corresponding to a compression technique in an NNR-based compression framework or an AITISA compression framework.
  60. The method of claim 45, wherein the syntax element further comprises an enable flag for a coding process using a neural network model, for determining whether the coding process uses the neural network model.
  61. The method of claim 45, wherein the syntax element further comprises processing timing information of the neural network model in the encoding process, the processing timing information indicating a specific position of the neural network model in the encoding process.
  62. The method according to claim 61, wherein the processing timing information comprises at least any one of the following information:
    information indicating a position in predictive coding, information indicating a position in transform coding, information indicating a position in quantization, information indicating a position in entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, and information indicating a position after entropy coding.
  63. The method of claim 45, wherein the syntax element further comprises identification information of the neural network model.
  64. The method according to claim 62, wherein the decoding process using the neural network model is a neural network-based in-loop filtering technique, and the processing timing information is information indicating a position after prediction reconstruction;
    and/or,
    the decoding process using the neural network model is a neural network-based intra-frame prediction technique, and the processing timing information is information indicating an intra-frame prediction stage of prediction reconstruction;
    and/or,
    the decoding process using the neural network model is a neural network-based image super-resolution technique for performing inter-frame motion estimation, and the processing timing information is information indicating an inter-frame prediction stage of prediction reconstruction;
    and/or,
    the decoding process using the neural network model is a neural network-based image super-resolution technique for obtaining a reconstructed image, and the processing timing information is information indicating a position after prediction reconstruction;
    and/or,
    the decoding process using the neural network model is a neural network-based context probability estimation technique, and the processing timing information is information indicating a position in entropy decoding.
  65. The method of claim 45, wherein the decoding process is implemented based on a video decoder.
  66. The method of claim 46, wherein the bitstream is generated based on a specified coding standard, and the reserved field is located in a video parameter set, and/or a sequence parameter set, and/or a neural network model parameter set, and/or a picture parameter set, and/or a slice header, and/or auxiliary enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or an open bitstream unit, and/or a sequence header, and/or a picture group header, and/or a slice header, and/or macro block information of the bitstream.
  67. A method for decoding a video based on a neural network, the method comprising:
    obtaining a syntax element obtained after a video decoder parses a video code stream, wherein the syntax element comprises information characterizing a parameter set of a neural network model;
    and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
  68. The method of claim 67, wherein the set of parameters of the neural network model includes at least one of: the input parameters, the number of layers, the weight parameters, the hyper-parameters, the number of nodes on each layer and the activation functions of the neural network.
  69. The method of claim 67, wherein said decoding video data using the neural network model comprises:
    performing a neural network-based intra-frame prediction technique in an intra-frame prediction stage of prediction reconstruction;
    and/or,
    performing, in an inter-frame prediction stage of prediction reconstruction, a neural network-based image super-resolution technique for performing inter-frame motion estimation;
    and/or,
    performing a neural network-based in-loop filtering technique after prediction reconstruction;
    and/or,
    performing, after prediction reconstruction, a neural network-based image super-resolution technique for obtaining a reconstructed image;
    and/or,
    performing a neural network-based context probability estimation technique in entropy decoding.
  70. The method of claim 67, wherein the syntax element is located in a specified position of the code stream, and wherein the specified position is a reserved field of the code stream.
  71. The method of claim 70, wherein the reserved field is located in a header of a data packet of the codestream.
  72. The method of claim 67, wherein the information characterizing the set of parameters of the neural network model is corresponding information after converting the set of parameters of the neural network model into a common format.
  73. The method according to claim 72, wherein the syntax element further comprises a format conversion enable flag for indicating the conversion of the neural network model into the common format.
  74. The method of claim 73, further comprising:
    and when the format conversion enable flag indicates that the neural network model has been converted into the common format, converting the neural network model in the common format into a neural network model of a specified framework.
  75. The method of claim 74, wherein the syntax element further comprises information characterizing a framework of the neural network model.
  76. The method of claim 75, wherein the framework of the neural network model comprises: TensorFlow or PyTorch or Caffe2 or AI hardware accelerators.
  77. The method of claim 67, wherein the information characterizing the set of parameters of the neural network model is corresponding information obtained by compressing the set of parameters of the neural network model.
  78. The method of claim 77, wherein the syntax element further comprises a compression enable flag indicating compression of the set of parameters of the neural network model.
  79. The method of claim 78, further comprising:
    and when the compression enable flag indicates that the parameter set of the neural network model is compressed, decompressing the compressed parameter set of the neural network model.
  80. The method of claim 67, wherein the information characterizing the set of parameters of the neural network model is corresponding information obtained by converting the set of parameters of the neural network model into a common format and then compressing the converted set of parameters.
  81. The method of any of claims 72 to 74 or 80, wherein the common format comprises NNFF or ONNX.
  82. The method of any of claims 77 to 80, wherein the decompression is performed by a decompression technique corresponding to a compression technique in an NNR-based compression framework or an AITISA compression framework.
  83. The method of claim 67, wherein the syntax element further comprises an enable flag for a coding process using a neural network model, for determining whether the coding process uses the neural network model.
  84. The method of claim 67, wherein the syntax element further comprises processing timing information of the neural network model in the encoding process, the processing timing information indicating a specific position of the neural network model in the encoding process.
  85. The method according to claim 84, wherein the processing timing information comprises at least any one of:
    information indicating a position in predictive coding, information indicating a position in transform coding, information indicating a position in quantization, information indicating a position in entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, and information indicating a position after entropy coding.
  86. The method of claim 67, wherein the syntax element further comprises identification information of the neural network model.
  87. The method of claim 67, wherein the decoding process is implemented based on a video decoder.
  88. The method of claim 70, wherein the bitstream is generated based on a specified coding standard, and the reserved field is located in a video parameter set, and/or a sequence parameter set, and/or a neural network model parameter set, and/or a picture parameter set, and/or a slice header, and/or auxiliary enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or an open bitstream unit, and/or a sequence header, and/or a picture group header, and/or a slice header, and/or macro block information of the bitstream.
  89. A video encoder, characterized in that the encoder comprises: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program:
    performing encoding processing on video data, the encoding processing including encoding processing using a neural network model;
    and generating a code stream carrying syntax elements based on the video data after the encoding processing, wherein the syntax elements comprise information characterizing a parameter set of the neural network model.
  90. A video decoder, characterized in that the decoder comprises: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program:
    parsing the received video code stream to obtain a syntax element of the video code stream, wherein the syntax element comprises information characterizing a parameter set of a neural network model;
    and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
  91. An AI accelerator for video coding, the AI accelerator comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the following method when executing the program:
    performing encoding processing on the video data;
    and sending the parameter set of the neural network to a video encoder so that the video encoder generates a code stream carrying syntax elements based on the video data after encoding processing, wherein the syntax elements contain information representing the parameter set of the neural network.
  92. An AI accelerator for video decoding, the AI accelerator comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following method when executing the program:
    obtaining a syntax element obtained after a video decoder parses a video code stream, wherein the syntax element comprises information characterizing a parameter set of a neural network;
    and decoding the video code stream by using a neural network model corresponding to the parameter set according to the syntax element.
  93. A machine-readable storage medium having stored thereon computer instructions which, when executed, perform the method of any of claims 1, 21, 45 and 67.
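As a counterpart to the encoder-side sketch given before the claims, the following is a minimal decoder-side sketch of parsing such a reserved field: it decompresses the parameter set when the compression enable flag is set and reads the processing timing information indicating where in the pipeline the model is applied. The layout parsed here mirrors the hypothetical layout assumed in that earlier sketch and, again, is not the syntax defined by this disclosure or by any coding standard; the timing codes are likewise illustrative.

    # Decoder-side counterpart of the earlier hypothetical layout; for illustration only.
    import json
    import struct
    import zlib

    TIMING_DESCRIPTIONS = {
        0: "in-loop filtering after prediction reconstruction",
        1: "intra-frame prediction stage of prediction reconstruction",
        2: "inter-frame prediction stage of prediction reconstruction",
        3: "context probability estimation in entropy decoding",
    }

    def parse_reserved_field(data: bytes) -> dict:
        """Recover the neural network parameter-set information from a reserved-field payload."""
        model_id, flags, timing, length = struct.unpack(">IBBI", data[:10])
        payload = data[10:10 + length]
        if flags & 0x4:                          # compression enable flag is set
            payload = zlib.decompress(payload)   # stand-in for the decompressor matching the encoder side
        return {
            "model_id": model_id,
            "nn_enable": bool(flags & 0x1),          # whether the coding process used a neural network model
            "format_converted": bool(flags & 0x2),   # if set, convert the model from the common format (e.g. ONNX)
            "timing": TIMING_DESCRIPTIONS.get(timing, "unknown"),
            "parameters": json.loads(payload.decode("utf-8")),
        }

    # Example: parse the field produced by the encoder-side sketch and rebuild the model accordingly.
    # parsed = parse_reserved_field(reserved_field)

A decoder following this sketch would then instantiate or select the neural network model identified by model_id, convert it from the common format to the framework it runs on if the format conversion enable flag is set, and apply it at the indicated stage of the decoding process.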
CN202080081315.7A 2020-12-04 2020-12-04 Video encoding method, decoding method, encoder, decoder, and AI accelerator Pending CN114868390A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/133979 WO2022116165A1 (en) 2020-12-04 2020-12-04 Video encoding method, video decoding method, encoder, decoder, and ai accelerator

Publications (1)

Publication Number Publication Date
CN114868390A true CN114868390A (en) 2022-08-05

Family

ID=81852780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080081315.7A Pending CN114868390A (en) 2020-12-04 2020-12-04 Video encoding method, decoding method, encoder, decoder, and AI accelerator

Country Status (2)

Country Link
CN (1) CN114868390A (en)
WO (1) WO2022116165A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116781910A (en) * 2023-07-03 2023-09-19 江苏汇智达信息科技有限公司 Information conversion system based on neural network algorithm


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101974261B1 (en) * 2016-06-24 2019-04-30 한국과학기술원 Encoding method and apparatus comprising convolutional neural network(cnn) based in-loop filter, and decoding method and apparatus comprising convolutional neural network(cnn) based in-loop filter
US11470356B2 (en) * 2018-04-17 2022-10-11 Mediatek Inc. Method and apparatus of neural network for video coding
JP7318314B2 (en) * 2019-05-30 2023-08-01 富士通株式会社 Encoding program, decoding program, encoding device, decoding device, encoding method and decoding method
CN111064958B (en) * 2019-12-28 2021-03-30 复旦大学 Low-complexity neural network filtering algorithm for B frame and P frame

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019209913A1 (en) * 2018-04-27 2019-10-31 Interdigital Vc Holdings, Inc. Method and apparatus for video encoding and decoding based on neural network implementation of cabac
US10499081B1 (en) * 2018-06-19 2019-12-03 Sony Interactive Entertainment Inc. Neural network powered codec
CN110401834A (en) * 2019-08-06 2019-11-01 杭州微帧信息科技有限公司 A kind of adaptive video coding method based on deep learning
CN110648278A (en) * 2019-09-10 2020-01-03 网宿科技股份有限公司 Super-resolution processing method, system and equipment for image

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HUJUN YIN et al.: "AHG9: Adaptive convolutional neural network loop filter" *
YIMING LI et al.: "CE10: Summary Report on Neural Network based Filter for Video Coding" *
YIMING LI et al.: "Description of Core Experiment 10 (CE10): Neural Network based Filter for Video Coding" *
YIMING LI et al.: "Description of Core Experiment 13 (CE13): Neural Network based Filter for Video Coding" *

Also Published As

Publication number Publication date
WO2022116165A1 (en) 2022-06-09

Similar Documents

Publication Publication Date Title
US11936884B2 (en) Coded-block-flag coding and derivation
TWI753356B (en) Method and apparatuses for coding transform blocks
CN108696761B (en) Picture file processing method, device and system
CN107211136B (en) The method of the entropy coding and entropy decoding of source sample with big alphabet
US9723318B2 (en) Compression and decompression of reference images in a video encoder
US20090147856A1 (en) Variable color format based video encoding and decoding methods and apparatuses
US20220116591A1 (en) Colour component prediction method, encoder, decoder, and storage medium
US10911783B2 (en) Method and apparatus for processing video signal using coefficient-induced reconstruction
CN108353175B (en) Method and apparatus for processing video signal using coefficient-induced prediction
US11477465B2 (en) Colour component prediction method, encoder, decoder, and storage medium
CN110740319B (en) Video encoding and decoding method and device, electronic equipment and storage medium
CN114467306A (en) Image prediction method, encoder, decoder, and storage medium
CN109151503B (en) Picture file processing method and equipment
US20220046253A1 (en) Video encoding and decoding methods and apparatuses, device, and storage medium
US11388397B2 (en) Video picture component prediction method and apparatus, and computer storage medium
CN114868390A (en) Video encoding method, decoding method, encoder, decoder, and AI accelerator
WO2024078066A1 (en) Video decoding method and apparatus, video encoding method and apparatus, storage medium, and device
US10869030B2 (en) Method of coding and decoding images, a coding and decoding device, and corresponding computer programs
WO2020177545A1 (en) Palette size constraint in palette mode for video compression system
CN116982262A (en) State transition for dependent quantization in video coding
CN116803078A (en) Encoding/decoding method, code stream, encoder, decoder, and storage medium
CN112543324A (en) Video decoding method, encoding method, coder-decoder and storage device
KR100804338B1 (en) Moving image reproducing method, apparatus and program
US12034950B2 (en) Video decoding method and apparatus, video encoding method and apparatus, device, and storage medium
CN117793355A (en) Multimedia data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination