CN117716687A - Implicit image and video compression using machine learning system


Info

Publication number: CN117716687A
Application number: CN202280035149.6A
Authority: CN (China)
Prior art keywords: model, image, neural network, weight values, weight
Legal status: Pending
Other languages: Chinese (zh)
Inventors: Yunfan Zhang, T. J. van Rozendaal, T. S. Cohen, M. Nagel, J. H. Brehmer
Current and original assignee: Qualcomm Inc
Application filed by: Qualcomm Inc
Priority claimed from: US17/645,018 (published as US20220385907A1)
Publication of: CN117716687A


Abstract

Techniques for compressing and decompressing data using a machine learning system are described. An example process may include receiving a plurality of images for compression by a neural network compression system. The process may include determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images. The process may include generating a first bit stream including a compressed version of a first plurality of weight values. The process may include outputting the first bit stream for transmission to a recipient.

Description

Implicit image and video compression using machine learning system
Technical Field
The present disclosure relates generally to data compression. For example, aspects of the present disclosure include using a machine learning system to compress image and/or video content.
Background
Many devices and systems allow media data (e.g., image data, video data, audio data, etc.) to be processed and output for consumption. Media data includes large amounts of data to meet ever-increasing demands in image/video/audio quality, performance, and features. For example, consumers of video data often desire video of high quality, with high fidelity, resolution, frame rate, and the like. Large amounts of video data are often required to meet these demands, which places a significant burden on communication networks and on the devices that process and store the video data. Video coding techniques may be used to compress video data. One example goal of video coding is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality. As evolving video services become available and demand for large amounts of video data continues to increase, coding techniques with better performance and efficiency are needed.
SUMMARY
In some examples, systems and techniques for data compression and/or decompression using one or more machine learning systems are described. In some examples, a machine learning system (e.g., that uses one or more neural network systems) for compressing and/or decompressing media data (e.g., video data, image data, audio data, etc.) is provided. According to at least one illustrative example, a method of processing image data is provided. The method may include: receiving a plurality of images for compression by a neural network compression system; determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images; generating a first bit stream comprising a compressed version of a first plurality of weight values; and outputting the first bit stream for transmission to the recipient.
In another example, an apparatus for processing media data is provided that includes at least one memory and at least one processor (e.g., configured in circuitry) communicatively coupled to the at least one memory. The at least one processor may be configured to: receiving a plurality of images for compression by a neural network compression system; determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images; generating a first bit stream comprising a compressed version of a first plurality of weight values; and outputting the first bit stream for transmission to the recipient.
In another example, a non-transitory computer-readable medium is provided that includes at least one instruction stored thereon that, when executed by one or more processors, causes the one or more processors to: receiving a plurality of images for compression by a neural network compression system; determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images; generating a first bit stream comprising a compressed version of a first plurality of weight values; and outputting the first bit stream for transmission to the recipient.
In another example, an apparatus for processing image data is provided. The apparatus may include: means for receiving a plurality of images for compression by a neural network compression system; means for determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images; means for generating a first bit stream comprising a compressed version of a first plurality of weight values; and means for outputting the first bit stream for transmission to the recipient.
In another example, a method for processing media data is provided. The method may include: receiving a compressed version of a first plurality of neural network weight values associated with a first image from the plurality of images; decompressing the first plurality of neural network weight values; and processing the first plurality of neural network weight values using the first neural network model to produce a first image.
In another example, an apparatus for processing image data is provided that includes at least one memory and at least one processor (e.g., configured in circuitry) communicatively coupled to the at least one memory. The at least one processor may be configured to: receiving a compressed version of a first plurality of neural network weight values associated with a first image from the plurality of images; decompressing the first plurality of neural network weight values; and processing the first plurality of neural network weight values using the first neural network model to produce a first image.
In another example, a non-transitory computer-readable medium is provided that includes at least one instruction stored thereon that, when executed by one or more processors, causes the one or more processors to: receiving a compressed version of a first plurality of neural network weight values associated with a first image from the plurality of images; decompressing the first plurality of neural network weight values; and processing the first plurality of neural network weight values using the first neural network model to produce a first image.
In another example, an apparatus for processing image data is provided. The apparatus may include: means for receiving a compressed version of a first plurality of neural network weight values associated with a first image from the plurality of images; means for decompressing the first plurality of neural network weight values; and means for processing the first plurality of neural network weight values using the first neural network model to produce a first image.
In some aspects, an apparatus (device) may be, or be part of, a camera (e.g., an IP camera), a mobile device (e.g., a mobile phone or so-called "smart phone", or other mobile device), a smart wearable device, an extended reality (XR) device (e.g., a Virtual Reality (VR) device, an Augmented Reality (AR) device, or a Mixed Reality (MR) device), a personal computer, a laptop computer, a server computer, a 3D scanner, a multi-camera system, or other device. In some aspects, the apparatus (device) includes one or more cameras for capturing one or more images. In some aspects, the apparatus (device) further comprises a display for displaying one or more images, notifications, and/or other displayable data. In some aspects, the apparatus may include one or more sensors.
This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The subject matter should be understood with reference to appropriate portions of the entire specification of this patent, any or all of the accompanying drawings, and each claim.
The foregoing and other features and embodiments will become more apparent upon reference to the following description, claims and appended drawings.
Brief Description of Drawings
Illustrative embodiments of the present application are described in detail below with reference to the following drawings:
fig. 1 is a diagram illustrating an example of an image processing system according to some examples of the present disclosure;
fig. 2A is a diagram illustrating an example of a fully connected neural network, according to some examples of the present disclosure;
fig. 2B is a diagram illustrating an example of a locally connected neural network according to some examples of the present disclosure;
fig. 2C is a diagram illustrating an example of a convolutional neural network, according to some examples of the present disclosure;
fig. 2D is a diagram illustrating an example of a Deep Convolutional Network (DCN) for identifying visual features from an image, according to some examples of the present disclosure;
fig. 3 is a block diagram illustrating an example Deep Convolutional Network (DCN) according to some examples of the present disclosure;
fig. 4 is a diagram illustrating an example of a system including a sender device for compressing video content and a receiver device for decompressing a received bitstream into video content, according to some examples of the present disclosure;
fig. 5A and 5B are diagrams illustrating example rate-distortion autoencoder systems according to some examples of the present disclosure;
FIG. 6 is a diagram illustrating an example inference process implemented by an example neural network compression system that is fine-tuned using model priors, according to some examples of the present disclosure;
fig. 7A is a diagram illustrating an example image compression codec based on an implicit neural representation in accordance with some examples of the present disclosure;
fig. 7B is a diagram illustrating another example image compression codec based on an implicit neural representation in accordance with some examples of the present disclosure;
fig. 8A is a diagram illustrating an example of a compression pipeline for a group of pictures using implicit neural representations, according to some examples of the present disclosure;
fig. 8B is a diagram illustrating another example of a compression pipeline for a group of pictures using implicit neural representations, according to some examples of the present disclosure;
fig. 8C is a diagram illustrating another example of a compression pipeline for a group of pictures using implicit neural representations, according to some examples of the present disclosure;
Fig. 9 is a diagram illustrating video frame coding order according to some examples of the present disclosure;
fig. 10 is a diagram illustrating an example process for performing implicit neural compression in accordance with some examples of the present disclosure;
fig. 11 is a flowchart illustrating an example of a process for compressing image data based on an implicit neural representation, according to some examples of the present disclosure;
fig. 12 is a flowchart illustrating another example of a process for compressing image data based on an implicit neural representation, according to some examples of the present disclosure;
fig. 13 is a flowchart illustrating an example of a process for decompressing image data based on an implicit neural representation in accordance with some examples of the present disclosure;
fig. 14 is a flowchart illustrating an example of a process for compressing image data based on an implicit neural representation, according to some examples of the present disclosure;
fig. 15 is a flowchart illustrating an example of a process for decompressing image data based on an implicit neural representation in accordance with some examples of the present disclosure; and
fig. 16 illustrates an example computing system according to some examples of this disclosure.
Detailed Description
Certain aspects and embodiments of the disclosure are provided below. It will be apparent to those skilled in the art that some of these aspects and embodiments may be applied independently and that some of them may be applied in combination. In the following description, for purposes of explanation, specific details are set forth in order to provide a thorough understanding of the embodiments of the present application. It may be evident, however, that the embodiments may be practiced without these specific details. The drawings and descriptions are not intended to be limiting.
The following description merely provides example embodiments and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of the example embodiments will provide those skilled in the art with an enabling description for implementing the example embodiments. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the application as set forth in the appended claims.
As mentioned above, media data (e.g., image data, video data, and/or audio data) may include a large amount of data, particularly as the demand for high quality video data continues to grow. For example, consumers of image, audio, and video data typically desire increasingly higher levels of quality, such as high fidelity, resolution, frame rate, and the like. However, the large amount of data required to meet such demands can place a significant burden on communication networks (such as high bandwidth and network resource requirements) and devices that process and store video data. Therefore, compression algorithms (also known as coding algorithms or coding tools) for reducing the amount of data required to store and/or transfer image and video data are advantageous.
Various techniques may be used to compress the media data. Compression of image data has been accomplished using algorithms such as Joint Photographic Experts Group (JPEG), Better Portable Graphics (BPG), and the like. In recent years, neural network-based compression methods have shown great promise in compressing image data. Video coding may be performed according to a particular video coding standard. Example video coding standards include High Efficiency Video Coding (HEVC), Essential Video Coding (EVC), Advanced Video Coding (AVC), Moving Picture Experts Group (MPEG) coding, and Versatile Video Coding (VVC). However, such conventional image and video coding techniques can generate artifacts in the reconstructed image after decoding is performed.
In some aspects, systems, devices, processes (also known as methods), and computer-readable media (collectively referred to herein as "systems and techniques") for performing compression and decompression (also known as encoding and decoding, collectively referred to as coding) of data (e.g., images, video, audio, etc.) using one or more machine learning systems are described herein. For example, these systems and techniques may be implemented using implicit neural models. The implicit neural model may be based on an Implicit Neural Representation (INR). As described herein, the implicit neural model may take coordinate locations as input (e.g., coordinates within an image or video frame) and may output pixel values (e.g., color values of the image or video frame, such as color values for each coordinate location or pixel). In some cases, the implicit neural model may also be based on an IPB frame scheme. In some examples, the implicit neural model may be modified to model optical flow.
In some examples, the implicit neural model may model optical flow with an implicit neural representation, in which a local translation may be equivalent to an element-wise addition. In some cases, the implicit model may model optical flow by adjusting the input coordinate locations to produce corresponding output pixel values. For example, an element-wise addition at the inputs may result in a local translation at the output, which may eliminate the need to move pixels and the associated computational complexity.
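The following is a minimal sketch (not taken from the patent) of the coordinate-to-pixel mapping described above, using a small coordinate-based network in PyTorch; the class name, layer sizes, and the two-pixel offset are illustrative assumptions. It also shows how adding an offset to the input coordinates shifts the reconstructed content without explicitly moving pixels.

```python
import torch
import torch.nn as nn

class ImplicitImageModel(nn.Module):
    """Coordinate-based model: (x, y) location in, (R, G, B) value out."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB values in [0, 1]
        )

    def forward(self, coords):       # coords: (N, 2) pixel locations
        return self.net(coords)      # (N, 3) color values

# A regular grid of pixel coordinates for an H x W image.
H, W = 32, 32
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

model = ImplicitImageModel()
image = model(grid).reshape(H, W, 3)

# Element-wise addition of an offset to the input coordinates translates the
# output content (here, two pixels to the right) without moving any pixels.
offset = torch.tensor([2.0 / W, 0.0])
shifted = model(grid + offset).reshape(H, W, 3)
```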
One or more machine learning systems may be trained as described herein and used to perform data compression and/or decompression, such as image, video, and/or audio compression and decompression. The machine learning system described herein may be trained to perform compression/decompression techniques that produce high quality data output. The systems and techniques described herein may perform compression and/or decompression of any type of data. For example, in some cases, the systems and techniques described herein may perform compression and/or decompression of image data. As another example, in some cases, the systems and techniques described herein may perform compression and/or decompression of video data. As used herein, the terms "image" and "frame" are used interchangeably to refer to either a free-standing image or frame (e.g., a photograph), or a group or sequence of images or frames (e.g., making up a video or other image/frame sequence). As another example, in some cases, the systems and techniques described herein may perform compression and/or decompression of audio data. For simplicity, illustration, and explanation, the systems and techniques described herein are discussed with reference to compression and/or decompression of image data (e.g., images or frames, video, etc.). However, as mentioned above, the concepts described herein may also be applied to other modalities, such as audio data and any other type of data.
The compression model used by the encoder and/or decoder may be generalized to different types of data. Furthermore, by utilizing the implicit neural models described herein, with their various characteristics, the machine learning system can improve compression and/or decompression performance, bit rate, quality, and/or efficiency for a particular data set. For example, a machine learning system based on an implicit neural model can eliminate the need to store a pre-trained neural network on the receiver side (and in some cases on the transmitter side). The neural networks on the transmitting side and the receiving side can be implemented with a lightweight framework. Another advantage of such machine learning systems is that the actual machine learning system (e.g., neural network) does not have to operate on flow, which may be difficult to implement in some situations (e.g., in hardware). In addition, the decoding function can be faster than in a standard machine-learning-based encoder-decoder (codec). In some cases, the implicit neural model-based machine learning systems described herein do not require a separate training data set, as they can be implicitly trained using the data to be encoded (e.g., a coordinate grid and the current instance of an image, video frame, video, etc.). The configuration of the implicit neural model described herein may also avoid potential privacy concerns. The system also works well for data from different domains, including domains where no suitable training data is available.
In some examples, the machine learning system may include one or more neural networks. Machine Learning (ML) is a subset of Artificial Intelligence (AI). ML systems include algorithms and statistical models that computer systems can use to perform various tasks by relying on patterns and inference, without using explicit instructions. One example of an ML system is a neural network (also known as an artificial neural network), which may include a group of interconnected artificial neurons (e.g., neuron models). Neural networks may be used for various applications and/or devices, such as image analysis and/or computer vision applications, Internet Protocol (IP) cameras, Internet of Things (IoT) devices, autonomous vehicles, service robots, and the like.
Individual nodes in a neural network can mimic biological neurons by taking input data and performing simple operations on the data. The results of the simple operations performed on the input data are selectively passed on to other neurons. Weight values are associated with each vector and node in the network, and these values constrain how the input data is related to the output data. For example, the input data of each node may be multiplied by a corresponding weight value, and the products may be summed. The sum of the products may be adjusted by an optional bias, and an activation function may be applied to the result, yielding the node's output signal or "output activation" (sometimes referred to as an activation map or feature map). The weight values may be initially determined by an iterative flow of training data through the network (e.g., the weight values are established during a training phase in which the network learns how to identify particular classes by their typical input data characteristics).
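As a hedged illustration of the per-node computation just described (weighted sum, optional bias, activation), consider the following small numeric example; all values and the choice of ReLU activation are arbitrary.

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus a bias, passed through a ReLU activation."""
    pre_activation = np.dot(inputs, weights) + bias
    return max(0.0, pre_activation)            # the node's output activation

inputs = np.array([0.5, -1.2, 3.0])            # activations arriving from the previous layer
weights = np.array([0.8, 0.1, -0.4])           # learned weight values
print(node_output(inputs, weights, bias=0.2))  # 0.4 - 0.12 - 1.2 + 0.2 = -0.72 -> ReLU -> 0.0
```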
There are different types of neural networks, such as deep generative neural network models (e.g., generative adversarial networks (GANs)), Recurrent Neural Network (RNN) models, Multi-Layer Perceptron (MLP) neural network models, Convolutional Neural Network (CNN) models, autoencoders (AEs), and so forth. For example, a GAN is a form of generative neural network that can learn patterns in input data so that the neural network model can generate new synthetic outputs that could plausibly have come from the original dataset. A GAN may include two neural networks operating together. One of the neural networks, referred to as the generative neural network or generator, denoted G(z), generates a synthetic output, and the other neural network, referred to as the discriminative neural network or discriminator, denoted D(X), evaluates the output for authenticity (whether the output is from the original data set, such as the training data set, or is generated by the generator). As an illustrative example, the training inputs and outputs may include images. The generator is trained to try to fool the discriminator into determining that a synthetic image generated by the generator is a real image from the dataset. The training process continues, and the generator becomes better at generating synthetic images that look like real images. The discriminator continues to look for flaws in the synthetic images, and the generator learns what the discriminator looks for when determining that an image is flawed. Once the network is trained, the generator can produce realistic images that the discriminator cannot distinguish from real images.
An RNN works on the principle of saving the output of a layer and feeding this output back to the input to help predict an outcome of the layer. In MLP neural networks, data may be fed into an input layer, and one or more hidden layers provide levels of abstraction of the data. Predictions may then be made on an output layer based on the abstracted data. MLPs may be particularly suited to classification prediction problems, where inputs are assigned a class or label. A Convolutional Neural Network (CNN) is a type of feed-forward artificial neural network. CNNs may include collections of artificial neurons that each have a receptive field (e.g., a spatially localized region of an input space) and that collectively tile the input space. CNNs have numerous applications, including pattern recognition and classification.
In a hierarchical neural network architecture (referred to as a deep neural network when there are multiple hidden layers), the output of a first layer of artificial neurons becomes the input of a second layer of artificial neurons, the output of a second layer of artificial neurons becomes the input of a third layer of artificial neurons, and so on. Convolutional neural networks may be trained to identify feature hierarchies. The computations in the convolutional neural network architecture may be distributed over a population of processing nodes, which may be configured in one or more computational chains. These multi-layer architectures may be trained one layer at a time and may be fine-tuned using back propagation.
An autoencoder (AE) can learn efficient data codings in an unsupervised manner. In some examples, an AE may learn a representation of a data set (e.g., a data coding) by training the network to ignore signal noise. An AE may include an encoder and a decoder. The encoder may map input data to a code, and the decoder may map the code to a reconstruction of the input data. In some examples, a rate-distortion autoencoder (RD-AE) may be trained to minimize the average rate-distortion loss over a dataset of data points, such as image and/or video data points. In some cases, the RD-AE can be run in a forward pass at inference time to encode a new data point.
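Below is a compact sketch of an autoencoder trained with a rate-distortion style objective of the kind mentioned above. It is only illustrative: the rate term is a stand-in (the squared norm of the latent code, i.e., a unit-Gaussian negative log-likelihood up to constants) rather than an actual learned prior, and the layer sizes and trade-off weight are arbitrary.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim=784, code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, code))
        self.decoder = nn.Sequential(nn.Linear(code, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x):
        z = self.encoder(x)            # map the input to a latent code z
        return self.decoder(z), z      # reconstruction and code

def rate_distortion_loss(x, x_hat, z, beta=0.01):
    distortion = torch.mean((x - x_hat) ** 2)    # distortion D (mean squared error)
    rate = 0.5 * torch.mean(z ** 2)              # proxy rate term R for the code cost
    return distortion + beta * rate              # rate-distortion trade-off

model = AutoEncoder()
x = torch.rand(8, 784)                           # a small batch of flattened inputs
x_hat, z = model(x)
loss = rate_distortion_loss(x, x_hat, z)
loss.backward()                                  # gradients for one training step
```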
In some examples, a machine learning system for data compression and/or decompression may include a neural network that is implicitly trained (e.g., using the image data to be compressed). In some cases, Implicit Neural Representation (INR) based data compression and/or decompression may be implemented using a convolution-based architecture. In some aspects, encoding the image data may include selecting a neural network architecture and overfitting the network weights to the image data. In some examples, the decoder may include the neural network architecture and receive the network weights from the encoder. In other examples, the decoder may receive the neural network architecture from the encoder.
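The following self-contained sketch illustrates the "encoding by overfitting" idea described above: a small coordinate network is fit to the single image being compressed, and its weight values are what would be transmitted to the decoder. The architecture, iteration count, and learning rate are illustrative assumptions, and the random `target` tensor stands in for a real image.

```python
import torch
import torch.nn as nn

H, W = 32, 32
target = torch.rand(H, W, 3)                     # stand-in for the image to be compressed
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

# Small coordinate network; after overfitting, its weights are the compressed representation.
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 3), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):                          # "encoding" = overfitting on this one image
    optimizer.zero_grad()
    loss = torch.mean((model(grid).reshape(H, W, 3) - target) ** 2)
    loss.backward()
    optimizer.step()

# The decoder only needs the architecture and these weight values; it reconstructs
# the image by evaluating the network on the same coordinate grid.
weights_to_send = {name: p.detach() for name, p in model.named_parameters()}
```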
In some cases, the neural network weights may be large, which may increase the bit rate and/or computational overhead required to send these weights to the decoder. In some examples, the weights may be quantized to reduce the overall size. In some aspects, the quantized weights may be compressed using a weight prior. The weight priors may reduce the amount of data sent to the decoder. In some cases, the weight priors may be designed to reduce the cost of transmitting model weights. For example, the weight priors may be used to reduce and/or limit the bit rate overhead of the weights.
In some cases, the design of the weight priors may be improved, as further described herein. In some illustrative examples, the weight prior design may include an independent Gaussian weight prior. In other illustrative examples, the weight prior design may include an independent Laplace weight prior. In other illustrative examples, the weight prior design may include an independent spike-and-slab prior. In some illustrative examples, the weight priors may include complex dependencies learned by a neural network.
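As a rough illustration of the weight quantization and weight-prior ideas above (and not the patent's actual scheme), the sketch below uniformly quantizes a set of weights and estimates the number of bits an entropy coder would need for them under an independent zero-mean Gaussian prior; the step size and prior scale are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def quantize(weights, step=0.02):
    """Uniform scalar quantization: round each weight to the nearest grid point."""
    return np.round(weights / step) * step

def bits_under_gaussian_prior(q_weights, step=0.02, sigma=0.1):
    """Estimate the bits needed to entropy-code the quantized weights under an
    independent zero-mean Gaussian prior, using the probability mass of each bin."""
    upper = norm.cdf(q_weights + step / 2, scale=sigma)
    lower = norm.cdf(q_weights - step / 2, scale=sigma)
    prob = np.clip(upper - lower, 1e-12, 1.0)
    return -np.sum(np.log2(prob))

weights = np.random.normal(0.0, 0.1, size=1000)    # stand-in for trained model weights
q = quantize(weights)
print(f"estimated weight bitstream size: {bits_under_gaussian_prior(q):.0f} bits")
```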
Fig. 1 is a diagram illustrating an example of an image processing system 100 according to some examples of the present disclosure. In some cases, image processing system 100 may include a Central Processing Unit (CPU) 102 or a multi-core CPU, which CPU 102 or multi-core CPU is configured to perform one or more of the functions described herein. Variables (e.g., neural signals and synaptic weights), system parameters associated with a computing device (e.g., neural network with weights), delay, frequency bin information, task information, and other information may be stored in a memory block associated with a Neural Processing Unit (NPU) 108, a memory block associated with a CPU 102, a memory block associated with a Graphics Processing Unit (GPU) 104, a memory block associated with a Digital Signal Processor (DSP) 106, a memory block 118, or distributed across multiple blocks. Instructions executed at CPU 102 may be loaded from a program memory associated with CPU 102 and/or from memory block 118.
The image processing system 100 may include additional processing blocks tailored to specific functions, such as the GPU 104, the DSP 106, a connectivity block 110 (which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, etc.), and/or a multimedia processor 112 that may, for example, detect and recognize features. In one implementation, the NPU 108 is implemented in the CPU 102, DSP 106, and/or GPU 104. The image processing system 100 may also include a sensor processor 114, one or more Image Signal Processors (ISPs) 116, and/or a storage 120. In some examples, the image processing system 100 may be based on the ARM instruction set.
The image processing system 100 may be part of one or more computing devices. In some examples, the image processing system 100 may be part of an electronic device (or multiple electronic devices), such as a camera system (e.g., digital camera, IP camera, video camera, security camera, etc.), a telephone system (e.g., smart phone, cellular phone, conferencing system, etc.), a desktop computer, an XR device (e.g., head mounted display, etc.), a smart wearable device (e.g., smart watch, smart glasses, etc.), a laptop or notebook computer, a tablet computer, a set top box, a television, a display device, a digital media player, a game console, a video streaming device, an unmanned aerial vehicle, a computer in an automobile, a System On Chip (SOC), an internet of things (IoT) device, or any other suitable electronic device.
Although image processing system 100 is shown as including certain components, one of ordinary skill in the art will appreciate that image processing system 100 may include more or fewer components than those shown in fig. 1. For example, in some instances, image processing system 100 may also include one or more memory devices (e.g., RAM, ROM, cache, etc.), one or more networking interfaces (e.g., wired and/or wireless communication interfaces, etc.), one or more display devices, and/or other hardware or processing devices not shown in fig. 1. An illustrative example of computing devices and hardware components that may be implemented with image processing system 100 will be described below with reference to fig. 16.
The image processing system 100 and/or components thereof may be configured to perform compression and/or decompression (also referred to as encoding and/or decoding, which are collectively referred to as image coding) using the machine learning systems and techniques described herein. In some cases, image processing system 100 and/or components thereof may be configured to perform image or video compression and/or decompression using the techniques described herein. In some examples, the machine learning system may utilize a deep learning neural network architecture to perform compression and/or decompression of image, video, and/or audio data. By using a deep learning neural network architecture, the machine learning system can increase the efficiency and speed of content compression and/or decompression on a device. For example, a device using the described compression and/or decompression techniques may efficiently compress one or more images using machine learning based techniques, the compressed one or more images may be transmitted to a recipient device, and the recipient device may efficiently decompress the one or more compressed images using machine learning based techniques described herein. As used herein, an image may refer to a still image and/or a video frame associated with a sequence of frames (e.g., video).
As mentioned above, neural networks are examples of machine learning systems. The neural network may include an input layer, one or more hidden layers, and an output layer. The data is provided from input nodes of the input layer, the processing is performed by hidden nodes of one or more hidden layers, and the output is generated by output nodes of the output layer. Deep learning networks typically include multiple hidden layers. Each layer of the neural network may include a feature map or activation map, which may include artificial neurons (or nodes). The feature map may include filters, kernels, etc. The nodes may include one or more weights for indicating the importance of the nodes of one or more of the layers. In some cases, the deep learning network may have a series of many hidden layers, with early layers used to determine simple and low-level characteristics of the input, and later layers building a hierarchy of more complex and abstract characteristics.
The deep learning architecture may learn a feature hierarchy. For example, if visual data is presented to the first layer, the first layer may learn to identify relatively simple features (such as edges) in the input stream. In another example, if auditory data is presented to the first layer, the first layer may learn to identify spectral power in a particular frequency. A second layer, taking the output of the first layer as input, may learn to identify feature combinations, such as simple shapes for visual data or sound combinations for auditory data. For example, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases.
Deep learning architecture may perform particularly well when applied to problems with natural hierarchical structures. For example, classification of motor vehicles may benefit from first learning to identify wheels, windshields, and other features. These features may be combined at higher layers in different ways to identify cars, trucks, and planes.
Neural networks may be designed with various connectivity patterns. In a feed-forward network, information is passed from lower layers to higher layers, with each neuron in a given layer communicating to neurons in the higher layer. As described above, a hierarchical representation may be built up in successive layers of a feed-forward network. Neural networks may also have recurrent or feedback (also known as top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one chunk of the input data that is delivered to the neural network in sequence. A connection from a neuron in a given layer to a neuron in a lower layer is referred to as a feedback (or top-down) connection. A network with many feedback connections may be beneficial when the recognition of a high-level concept may assist in discriminating particular low-level features of an input.
The connections between the layers of a neural network may be fully connected or locally connected. Fig. 2A illustrates an example of a fully connected neural network 202. In the fully connected neural network 202, a neuron in a first layer may communicate its output to every neuron in a second layer, so that each neuron in the second layer receives input from every neuron in the first layer. Fig. 2B illustrates an example of a locally connected neural network 204. In the locally connected neural network 204, a neuron in a first layer may be connected to a limited number of neurons in the second layer. More generally, the locally connected layers of the locally connected neural network 204 may be configured such that each neuron in a layer has the same or a similar connectivity pattern, but with connection strengths that may have different values (e.g., 210, 212, 214, and 216). The locally connected connectivity pattern may give rise to spatially distinct receptive fields in higher layers, because the higher-layer neurons in a given region may receive inputs that are tuned through training to the properties of a restricted portion of the total input to the network.
One example of a locally connected neural network is a convolutional neural network. Fig. 2C illustrates an example of a convolutional neural network 206. The convolutional neural network 206 may be configured such that the connection strength associated with the input for each neuron in the second layer is shared (e.g., 208). Convolutional neural networks may be well suited to problems in which the spatial location of the input is significant. Convolutional neural network 206 may be used to perform one or more aspects of video compression and/or decompression in accordance with aspects of the present disclosure.
One type of convolutional neural network is a Deep Convolutional Network (DCN). Fig. 2D illustrates a detailed example of a DCN 200 designed to identify visual features from an image 226 input from an image capturing device 230 (such as an in-vehicle camera). The DCN 200 of the current example may be trained to identify traffic signs and numbers provided on traffic signs. Of course, the DCN 200 may be trained for other tasks, such as identifying lane markers or identifying traffic lights.
The DCN 200 may be trained with supervised learning. During training, an image, such as the image 226 of a speed limit sign, may be presented to the DCN 200, and a "forward pass" may then be computed to produce an output 222. The DCN 200 may include a feature extraction section and a classification section. Upon receiving the image 226, a convolution layer 232 may apply convolution kernels (not shown) to the image 226 to generate a first set of feature maps 218. As an example, the convolution kernels for the convolution layer 232 may be 5x5 kernels that generate 28x28 feature maps. In this example, because four different feature maps are generated in the first set of feature maps 218, four different convolution kernels are applied to the image 226 at the convolution layer 232. The convolution kernels may also be referred to as filters or convolution filters.
The first set of feature maps 218 may be sub-sampled by a max pooling layer (not shown) to generate a second set of feature maps 220. The max pooling layer reduces the size of the first set of feature maps 218. That is, the size of the second set of feature maps 220 (such as 14x14) is smaller than the size of the first set of feature maps 218 (such as 28x28). The reduced size provides similar information to subsequent layers while reducing memory consumption. The second set of feature maps 220 may be further convolved via one or more subsequent convolution layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
In the example of fig. 2D, the second set of feature maps 220 is convolved to generate a first feature vector 224. In addition, the first feature vector 224 is further convolved to generate a second feature vector 228. Each feature of the second feature vector 228 may include a number that corresponds to a possible feature of the image 226, such as "sign", "60", and "100". A softmax function (not shown) may convert the numbers in the second feature vector 228 to probabilities. As such, an output 222 of the DCN 200 is a probability of the image 226 including one or more features.
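For concreteness, here is a rough PyTorch sketch with the same shape bookkeeping as the forward pass described above: a 5x5 convolution producing four 28x28 feature maps from a 32x32 input, 2x2 max pooling down to 14x14, fully connected processing, and a softmax over candidate features. The input size, class count, and single fully connected layer are illustrative assumptions, not details from the patent figures.

```python
import torch
import torch.nn as nn

class TrafficSignNet(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=5)    # four 5x5 kernels: 32x32 input -> four 28x28 feature maps
        self.pool = nn.MaxPool2d(2)                   # sub-sampling: 28x28 -> 14x14
        self.fc = nn.Linear(4 * 14 * 14, num_classes) # feature vector -> one number per candidate feature

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = x.flatten(1)
        return torch.softmax(self.fc(x), dim=1)       # probabilities, e.g. for "sign", "60", "100"

net = TrafficSignNet()
image = torch.rand(1, 3, 32, 32)                      # stand-in for the captured image
probabilities = net(image)                            # the "forward pass" producing the output
```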
In this example, the probabilities in the output 222 for "sign" and "60" are higher than the probabilities of the other entries of the output 222 (such as "30", "40", "50", "70", "80", "90", and "100"). Before training, the output 222 produced by the DCN 200 is likely to be incorrect. Thus, an error may be calculated between the output 222 and a target output. The target output is the ground truth of the image 226 (e.g., "sign" and "60"). The weights of the DCN 200 may then be adjusted so that the output 222 of the DCN 200 is more closely aligned with the target output.
To adjust the weights, the learning algorithm may calculate gradient vectors for the weights. The gradient may indicate the amount by which the error will increase or decrease if the weight is adjusted. At the top layer, the gradient may directly correspond to the value of the weight connecting the activated neurons in the penultimate layer with the neurons in the output layer. In lower layers, the gradient may depend on the value of the weight and the calculated error gradient of the higher layer. The weights may then be adjusted to reduce the error. This way of adjusting weights may be referred to as "back propagation" because it involves back-propagation ("backward pass") in the neural network.
In practice, the error gradient of the weights may be calculated over a small number of examples, so that the calculated gradient approximates the true error gradient. This approximation method may be referred to as stochastic gradient descent. Stochastic gradient descent may be repeated until the achievable error rate of the entire system has stopped decreasing or until the error rate has reached a target level. After learning, a new image may be presented to the DCN, and a forward pass through the network may yield an output 222 that may be considered an inference or a prediction of the DCN.
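The following is a minimal sketch of one such gradient-based weight adjustment (a "backward pass" followed by an update), using plain stochastic gradient descent on a small batch; the model, loss, batch contents, and learning rate are all placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)                     # placeholder network with trainable weights
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.1

inputs = torch.rand(16, 10)                  # a small batch of examples
targets = torch.randint(0, 3, (16,))         # their target (ground-truth) labels

output = model(inputs)                       # forward pass
error = loss_fn(output, targets)             # error between output and target output
error.backward()                             # backward pass: gradient of the error w.r.t. each weight

with torch.no_grad():
    for p in model.parameters():
        p -= learning_rate * p.grad          # adjust weights to reduce the error
        p.grad.zero_()
```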
A Deep Belief Network (DBN) is a probabilistic model that includes multiple layers of hidden nodes. A DBN may be used to extract a hierarchical representation of a training dataset. A DBN may be obtained by stacking layers of Restricted Boltzmann Machines (RBMs). An RBM is a type of artificial neural network that can learn a probability distribution over a set of inputs. RBMs are often used in unsupervised learning because they can learn a probability distribution without information about the class to which each input should be assigned. Using a hybrid unsupervised and supervised paradigm, the bottom RBMs of a DBN may be trained in an unsupervised manner and may serve as feature extractors, while the top RBM may be trained in a supervised manner (on a joint distribution of inputs from the previous layer and target classes) and may serve as a classifier.
A Deep Convolutional Network (DCN) is a network of convolutional networks, configured with additional pooling and normalization layers. DCNs have achieved state-of-the-art performance on many tasks. DCNs may be trained using supervised learning, in which both the input and output targets are known for many exemplars and are used to modify the weights of the network by use of gradient descent methods.
A DCN may be a feed-forward network. In addition, as described above, the connections from a neuron in a first layer of a DCN to a group of neurons in the next higher layer are shared across the neurons in the first layer. The feed-forward and shared connections of DCNs can be exploited for fast processing. The computational burden of a DCN may be much less than that of, for example, a similarly sized neural network that includes recurrent or feedback connections.
The processing of each layer of a convolutional network may be considered a spatially invariant template or basis projection. If the input is first decomposed into multiple channels, such as the red, green, and blue channels of a color image, then the convolutional network trained on that input may be considered three-dimensional, with two spatial dimensions along the axes of the image and a third dimension capturing color information. The outputs of the convolutional connections may be considered to form a feature map in the subsequent layer, with each element of the feature map (e.g., 220) receiving input from a range of neurons in the previous layer (e.g., feature maps 218) and from each of the multiple channels. The values in the feature map may be further processed with a non-linearity, such as a rectification, max(0, x). Values from adjacent neurons may be further pooled (which corresponds to down-sampling) and may provide additional local invariance as well as dimensionality reduction.
Fig. 3 is a block diagram illustrating an example of a deep convolutional network 350. The deep convolutional network 350 may include a plurality of different types of layers based on connectivity and weight sharing. As shown in fig. 3, the deep convolutional network 350 includes convolutional blocks 354A, 354B. Each of the convolution blocks 354A, 354B may be configured with a convolution layer (CONV) 356, a normalization layer (LNorm) 358, and a MAX pooling layer (MAX POOL) 360.
Convolution layer 356 may include one or more convolution filters that may be applied to input data 352 to generate feature maps. Although only two convolution blocks 354A, 354B are shown, the present disclosure is not so limited, and instead any number of convolution blocks (e.g., blocks 354A, 354B) may be included in the deep convolutional network 350 depending on design preferences. The normalization layer 358 may normalize the output of the convolution filter. For example, normalization layer 358 may provide whitening or lateral inhibition. The max-pooling layer 360 may provide spatial downsampling aggregation to achieve local invariance as well as dimension reduction.
For example, parallel filter banks of a deep convolutional network may be loaded onto the CPU 102 or GPU 104 of the image processing system 100 to achieve high performance and low power consumption. In alternative embodiments, the parallel filter bank may be loaded onto the DSP 106 or ISP 116 of the image processing system 100. In addition, the deep convolutional network 350 may access other processing blocks that may be present on the image processing system 100, such as the sensor processor 114.
The deep convolutional network 350 may also include one or more fully connected layers, such as a layer 362A (labeled "FC1") and a layer 362B (labeled "FC2"). The deep convolutional network 350 may further include a Logistic Regression (LR) layer 364. Between each of the layers 356, 358, 360, 362, 364 of the deep convolutional network 350 are weights (not shown) that are to be updated. The output of each of the layers (e.g., 356, 358, 360, 362, 364) may serve as an input to a succeeding one of the layers (e.g., 356, 358, 360, 362, 364) in the deep convolutional network 350 to learn hierarchical feature representations from the input data 352 (e.g., images, audio, video, sensor data, and/or other input data) supplied at the first of the convolution blocks 354A. The output of the deep convolutional network 350 is a classification score 366 for the input data 352. The classification score 366 may be a set of probabilities, where each probability is the probability of the input data including a feature from a set of features.
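An illustrative PyTorch sketch mirroring the layer types named above (a convolution, a normalization layer, and max pooling in each block, followed by two fully connected layers and a final logistic-regression-style output) is shown below; the channel counts, input size, normalization choice, and number of classes are made-up stand-ins rather than values from the figure.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # CONV -> normalization (LNorm) -> MAX POOL, loosely mirroring blocks 354A/354B
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(1, out_ch),                 # stand-in for the normalization layer
        nn.MaxPool2d(2),                         # spatial down-sampling
    )

class DeepConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.blocks = nn.Sequential(conv_block(3, 16), conv_block(16, 32))
        self.fc1 = nn.Linear(32 * 8 * 8, 128)    # "FC1"
        self.fc2 = nn.Linear(128, 64)            # "FC2"
        self.lr = nn.Linear(64, num_classes)     # logistic-regression-style output layer

    def forward(self, x):                        # x: (N, 3, 32, 32)
        x = self.blocks(x).flatten(1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.softmax(self.lr(x), dim=1)  # classification scores (probabilities)

scores = DeepConvNet()(torch.rand(1, 3, 32, 32))
```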
Image, audio, and video content may be stored and/or may be shared among devices. For example, image, audio, and video content may be uploaded to media hosting services and sharing platforms, and may be transmitted to a variety of devices. Recording uncompressed image, audio, and video content generally results in large file sizes, with file size increasing substantially as the resolution of the image, audio, and video content increases. For example, uncompressed 16-bit-per-channel video recorded at 1080p/24 (e.g., a resolution of 1920 pixels in width and 1080 pixels in height, with 24 frames per second captured) may occupy 12.4 megabytes per frame, or 297.6 megabytes per second. Uncompressed 16-bit-per-channel video recorded at 4K resolution at 24 frames per second may occupy 49.8 megabytes per frame, or 1195.2 megabytes per second.
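The per-frame figures above follow directly from the pixel count, three color channels, and 16 bits per channel; the short script below reproduces the arithmetic (the per-second numbers in the text multiply the rounded per-frame values by 24).

```python
def raw_video_size(width, height, channels=3, bits_per_channel=16, fps=24):
    """Return (megabytes per frame, megabytes per second) for uncompressed video."""
    bytes_per_frame = width * height * channels * bits_per_channel / 8
    return bytes_per_frame / 1e6, bytes_per_frame * fps / 1e6

print(raw_video_size(1920, 1080))   # ~12.44 MB/frame (12.4 x 24 = 297.6 MB/s as quoted)
print(raw_video_size(3840, 2160))   # ~49.77 MB/frame (49.8 x 24 = 1195.2 MB/s as quoted)
```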
Because uncompressed image, audio, and video content can result in large files that can involve significant memory for physical storage and significant bandwidth for transmission, techniques can be utilized to compress such video content. For example, to reduce the size of image content, and thus the amount of memory involved in storing image content and the amount of bandwidth involved in delivering video content, various compression algorithms may be applied to image, audio, and video content.
In some cases, image content may be compressed using a priori defined compression algorithms, such as Joint Photographic Experts Group (JPEG), Better Portable Graphics (BPG), and the like. For example, JPEG is a lossy form of compression that is based on the Discrete Cosine Transform (DCT). For instance, a device performing JPEG compression of an image may transform the image into an optimal color space (e.g., the YCbCr color space, which includes luminance (Y), chrominance-blue (Cb), and chrominance-red (Cr) components), may downsample the chrominance components by averaging groups of pixels together, and may apply a DCT function to blocks of pixels to remove redundant image data and thereby compress the image data. The compression is based on identifying similar regions inside the image and converting those regions to a common color code (based on the DCT function). Video content may also be compressed using a priori defined compression algorithms, such as Moving Picture Experts Group (MPEG) algorithms, H.264, or the High Efficiency Video Coding algorithm.
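As a simplified, hedged illustration of the block-transform step at the heart of JPEG-style compression, the sketch below applies a 2-D DCT to an 8x8 luminance block and coarsely quantizes the coefficients; real JPEG uses standard quantization tables, zig-zag scanning, and entropy coding, none of which are shown here.

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_block(block, q=20):
    """2-D DCT of an 8x8 pixel block followed by coarse uniform quantization."""
    coeffs = dctn(block, norm="ortho")
    return np.round(coeffs / q)                    # many high-frequency coefficients become zero

def decompress_block(q_coeffs, q=20):
    """Dequantize and invert the DCT to get an approximate (lossy) block back."""
    return idctn(q_coeffs * q, norm="ortho")

luma = np.random.randint(0, 256, (8, 8)).astype(float)   # stand-in 8x8 luminance block
reconstructed = decompress_block(compress_block(luma))
```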
These a priori defined compression algorithms may be capable of retaining most of the information in the original image and video content and may be a priori defined based on signal processing and information theory ideas. However, while these predefined compression algorithms may be widely applicable (e.g., to any type of image/video content), the compression algorithms may not account for similarity of content, new resolution or frame rate for video capture and delivery, unnatural images (e.g., radar images or other images captured via various sensors), and so forth.
Such a priori defined compression algorithms are generally lossy compression algorithms. In lossy compression, an input image (or video frame) cannot be encoded and then decoded/reconstructed in a way that recovers the exact input image. Instead, an approximate version of the input image is generated by decoding/reconstructing the compressed input image. Lossy compression results in a reduced bit rate at the expense of distortion, which can lead to artifacts being present in the reconstructed image. There is therefore a rate-distortion trade-off in lossy compression systems. For some compression methods (e.g., JPEG, BPG, etc.), the distortion-based artifacts may take the form of blocking artifacts or other artifacts. In some cases, neural network-based compression may be used and may result in high-quality compression of image data and video data. Blurring and color shift are further examples of artifacts.
It may be difficult or impossible to reconstruct the exact input data whenever the bit rate is below the true entropy of the input data. However, the fact that there is distortion/loss resulting from data compression/decompression does not mean that the reconstructed image or frame must have artifacts. In particular, it may be possible to reconstruct a compressed image as a different, but similar, image that has high visual quality.
In some cases, compression and decompression may be performed using one or more Machine Learning (ML) systems. In some examples, such ML-based systems may provide image and/or video compression that produces high-quality visual output. In some examples, such systems may perform compression and decompression of content (e.g., image content, video content, audio content, etc.) using deep neural network(s), such as a rate-distortion autoencoder (RD-AE). The deep neural network may include an autoencoder (AE) that maps images into a latent code space (e.g., which includes a set of codes z). The latent code space may include the code space used by the encoder and the decoder, in which the content has been encoded as codes z. The codes (e.g., codes z) may also be referred to as latents, latent variables, or latent representations. The deep neural network may include a probabilistic model (also known as a prior or a code model) that can losslessly compress the codes z from the latent code space. The probabilistic model may generate a probability distribution over the set of codes z that can represent encoded data based on the input data. In some cases, the probability distribution may be denoted as P(z).
In some examples, the deep neural network may include an arithmetic coder that generates, based on the probability distribution P(z) and/or the set of codes z, a bitstream that includes the compressed data to be output. The bitstream including the compressed data may be stored and/or may be transmitted to a recipient device. The recipient device may perform an inverse process to decode or decompress the bitstream using, for example, an arithmetic decoder, a probabilistic (or code) model, and a decoder of the AE. The device that generates the bitstream including the compressed data may also perform a similar decoding/decompression process when retrieving the compressed data from storage. A similar technique may be performed to compress/encode and decompress/decode updated model parameters.
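A conceptual sketch of this encoder-side pipeline follows: an encoder produces a discrete code z, a code model assigns it a probability P(z), and an entropy coder would then spend roughly -log2 P(z) bits on it. The arithmetic-coding step is only estimated here rather than implemented, and both the encoder and the Gaussian code model are toy stand-ins.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))

    def forward(self, x):
        return torch.round(self.net(x))       # quantize to a discrete latent code z

def code_probability(z, sigma=2.0):
    """Toy code model P(z): probability mass of each integer code value under a
    zero-mean Gaussian, using bins of width 1 centered on the integers."""
    prior = torch.distributions.Normal(0.0, sigma)
    return (prior.cdf(z + 0.5) - prior.cdf(z - 0.5)).clamp_min(1e-12)

x = torch.rand(1, 784)                        # flattened input image (stand-in)
z = Encoder()(x)
p = code_probability(z)
estimated_bits = -torch.log2(p).sum()         # roughly what an arithmetic coder would spend on z
```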
In some examples, the RD-AE may be trained and operated as a multi-rate AE (supporting both high-rate operation and low-rate operation). For example, the latent code space generated by the encoder of the multi-rate AE may be divided into two or more chunks (e.g., the code z is divided into chunks z1 and z2). In high-rate operation, the multi-rate AE sends a bitstream based on the entire latent space (e.g., the code z, which includes z1, z2, etc.), which may be used by the receiving device to decompress the data, similar to the operation described above with respect to the RD-AE. In low-rate operation, the bitstream sent to the recipient device is based on a subset of the latent space (e.g., chunk z1 but not z2). The recipient device may infer the remainder of the latent space based on the transmitted subset, and may use the subset of the latent space and the inferred remainder of the latent space to generate reconstructed data.
By compressing (decompressing) content using RD-AE or multi-rate AE, the encoding and decoding mechanisms can be adapted to various use cases. Compression techniques based on machine learning may generate compressed content with high quality and/or reduced bit rate. In some examples, RD-AEs may be trained to minimize the average rate distortion loss over a dataset of data points (such as image and/or video data points). In some cases, the RD-AE may also be fine-tuned for the particular data point to be sent to and decoded by the recipient. In some examples, the RD-AE may achieve high compression (rate/distortion) performance by fine tuning the RD-AE over the data points. An encoder associated with the RD-AE may send the AE model or a portion of the AE model to a recipient (e.g., decoder) to decode the bitstream.
In some cases, the neural network compression system may reconstruct an input instance (e.g., an input image, video, audio, etc.) from a (quantized) latent representation. The neural network compression system may also use a prior to losslessly compress the latent representation. In some cases, the neural network compression system may determine that the test-time data distribution is known and has relatively low entropy (e.g., a camera watching a static scene, a dash camera in an autonomous car, etc.), and the system may be fine-tuned or adapted to such a distribution. The fine-tuning or adaptation may lead to improved rate-distortion (RD) performance. In some examples, the model of the neural network compression system may be adapted to a single input instance to be compressed. The neural network compression system may provide model updates, which in some examples may be quantized and compressed using a parameter-space prior, along with the latent representation.
The fine-tuning may take into account the effects of model quantization and the additional cost incurred by sending the model updates. In some examples, the neural network compression system may be fine-tuned using the RD loss together with an additional model rate term M that measures the number of bits required to send the model updates under the model prior, resulting in a combined RDM loss.
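A sketch of such a combined rate-distortion-model (RDM) objective is given below: distortion D of the reconstruction, rate R of the latent code, and a model-rate term M for the bits needed to send the model updates under a model prior. The weighting factors and the Gaussian stand-ins for both priors are illustrative assumptions, not the patent's actual formulation.

```python
import torch

def rdm_loss(x, x_hat, z, model_updates, beta=0.01, gamma=0.001,
             latent_sigma=1.0, update_sigma=0.05):
    distortion = torch.mean((x - x_hat) ** 2)              # D: reconstruction error
    rate = torch.sum(z ** 2) / (2 * latent_sigma ** 2)     # R: latent cost under a Gaussian prior (up to constants)
    model_rate = sum(torch.sum(d ** 2) for d in model_updates) / (2 * update_sigma ** 2)
    return distortion + beta * rate + gamma * model_rate   # combined RDM loss

# model_updates would be the per-parameter deltas from the globally trained model
# (e.g., fine-tuned weights minus global weights); random placeholders are used here.
x, x_hat = torch.rand(4, 784), torch.rand(4, 784)
z = torch.randn(4, 16)
updates = [torch.randn(64, 784) * 0.01, torch.randn(64) * 0.01]
loss = rdm_loss(x, x_hat, z, updates)
```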
Fig. 4 is a diagram illustrating a system 400 according to some examples of the present disclosure, the system 400 including a transmitting device 410 and a receiving device 420. The transmitting device 410 and the receiving device 420 may each be referred to as RD-AE in some cases. The transmitting device 410 may compress the image content and may store the compressed image content and/or transmit the compressed image content to the receiving device 420 for decompression. The recipient device 420 may decompress the compressed image content and may output the decompressed image content on the recipient device 420 (e.g., for display, editing, etc.) and/or may output the decompressed image content to other devices (e.g., televisions, mobile devices, or other devices) connected to the recipient device 420. In some cases, the recipient device 420 may become the transmitting device by compressing the image content (using encoder 422) and storing and/or transmitting the compressed image content to another device, such as transmitting device 410 (in which case transmitting device 410 would become the recipient device). Although system 400 is described herein with respect to image compression and decompression, those skilled in the art will appreciate that system 400 may use the techniques described herein to compress and decompress video content.
As illustrated in fig. 4, the transmitting device 410 includes an image compression pipeline and the receiving device 420 includes an image bitstream decompression pipeline. In accordance with aspects of the present disclosure, the image compression pipeline in the transmitting device 410 and the bitstream decompression pipeline in the receiving device 420 typically use one or more artificial neural networks to compress image content and/or to decompress a received bitstream into image content. The image compression pipeline in the transmitting device 410 includes a self-encoder 401, a code model 404, and an arithmetic coder 406. In some implementations, the arithmetic coder 406 is optional and may be omitted in some cases. The image decompression pipeline in the receiving device 420 includes a self-encoder 421, a code model 424, and an arithmetic decoder 426. In some implementations, the arithmetic decoder 426 is optional and may be omitted in some cases. The self-encoder 401 and the code model 404 of the transmitting device 410 are illustrated in fig. 4 as machine learning systems that have been previously trained, and thus are configured to perform operations during inference or operation of the trained machine learning systems. The self-encoder 421 and the code model 424 are also illustrated as machine learning systems that have been previously trained.
The self-encoder 401 includes an encoder 402 and a decoder 403. The encoder 402 may perform lossy compression on the received uncompressed image content by mapping pixels in one or more images of the uncompressed image content to an implicit code space (which includes code z). In general, the encoder 402 may be configured such that the code z representing the compressed (or encoded) image is discrete or binary. These codes may be generated based on random perturbation techniques, soft vector quantization, or other techniques that may generate different codes. In some aspects, the self-encoder 401 may map an uncompressed image to code with a compressible (low entropy) distribution. These codes may be close to a predefined or learned prior distribution in cross entropy.
In some examples, the self-encoder 401 may be implemented using a convolutional architecture. For example, in some cases, the self-encoder 401 may be configured as a two-dimensional convolutional neural network (CNN) such that the self-encoder 401 learns spatial filters for mapping image content to the implicit code space. In examples where the system 400 is used to code video data, the self-encoder 401 may be configured as a three-dimensional CNN such that the self-encoder 401 learns spatio-temporal filters for mapping video to the implicit code space. In such networks, the self-encoder 401 may encode video in terms of: key frames (e.g., an initial frame that marks the beginning of a sequence of frames, where subsequent frames in the sequence are described as differences relative to the initial frame in the sequence), warping (or differences) between key frames and other frames in the video, and residual factors. In other aspects, the self-encoder 401 may be implemented as a two-dimensional neural network conditioned on previous frames and on residual factors between frames, with the conditioning performed by stacking channels or by including recurrent layers.
The encoder 402 of the self-encoder 401 may receive as input a first image (designated as image x in fig. 4) and may map the first image x to a code z in the implicit code space. As mentioned above, the encoder 402 may be implemented as a two-dimensional convolutional network such that the implicit code space has, at each (x, y) position, a vector that describes the block of image x centered on that position. The x-coordinate may represent a horizontal pixel location in the block of image x and the y-coordinate may represent a vertical pixel location in the block of image x. When coding video data, the implicit code space may additionally have a t variable or position, where the t variable represents a timestamp in a block of video data (in addition to the spatial x and y coordinates). By indexing the two spatial dimensions with horizontal and vertical pixel locations, the vector may describe an image patch in image x.
The decoder 403 of the self-encoder 401 may then decompress the code z to obtain a reconstruction $\hat{x}$ of the first image x. In general, the reconstruction $\hat{x}$ may be an approximation of the uncompressed first image x and need not be an exact copy of the first image x. In some cases, the reconstructed image $\hat{x}$ may be output as a compressed image file for storage in the transmitting device.
The code model 404 receives the code z representing an encoded image, or a portion thereof, and generates a probability distribution P(z) over a set of compressed codewords that may be used to represent the code z. In some examples, the code model 404 may include a probabilistic autoregressive generative model. In some cases, the codes for which the probability distribution may be generated include a learned distribution that controls bit assignment by the arithmetic coder 406. For example, using the arithmetic coder 406, the compressed code for a first code z may be predicted separately; the compressed code for a second code z may be predicted based on the compressed code for the first code z; the compressed code for a third code z may be predicted based on the compressed codes for the first code z and the second code z, and so on. The compressed codes generally represent spatio-temporal chunks of a given image to be compressed.
In some aspects, z may be represented as a three-dimensional tensor. The three dimensions of the tensor may include a feature channel dimension, a height spatial dimension, and a width spatial dimension (e.g., represented as code $z_{c,w,h}$). Each code $z_{c,w,h}$ (which represents the code indexed by channel and by horizontal and vertical position) can be predicted based on previous codes (which may follow a fixed and theoretically arbitrary code ordering). In some examples, these codes may be generated by analyzing a given image file from beginning to end and analyzing each block in the image in raster scan order.
The code model 404 may learn the probability distribution of the input code z using a probabilistic autoregressive model. The probability distribution can be conditioned on its previous values (as described above). In some examples, the probability distribution may be represented by the following formula:

$$P(z) = \prod_{c=0}^{C} \prod_{w=0}^{W} \prod_{h=0}^{H} p\left(z_{c,w,h} \mid z_{0:c,\,0:w,\,0:h}\right),$$
where c is the channel index over all image channels C (e.g., the R, G, and B channels, the Y, Cb, and Cr channels, or other channels), w is the width index over the total image frame width W, and h is the height index over the total image frame height H.
In some examples, the probability distribution P(z) may be predicted by a causal convolutional neural network. In some aspects, the kernels of each layer of the convolutional neural network may be masked so that the convolutional network is aware of the previous values $z_{0:c,\,0:w,\,0:h}$ and not of other values when calculating the probability distribution. In some aspects, the final layer of the convolutional network may include a softmax function that determines the probability that a code in the implicit space applies to an input value (the likelihood that a given code can be used to compress a given input).
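As a non-limiting sketch of how a causal convolutional code model of this kind could be structured (the layer sizes, kernel size, and the strictly causal mask are illustrative assumptions, not the architecture of the code model 404), consider the following PyTorch example:

```python
# Minimal sketch of a causally masked convolution followed by a softmax over
# code values; the kernel mask hides the current position and every position
# that has not yet been decoded (to its right or below).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0   # hide the centre position and everything to its right
        mask[kh // 2 + 1:, :] = 0     # hide every row below the centre
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding)

# Hypothetical sizes: 1 code channel, 16 hidden channels, 256 possible code values.
code_model = nn.Sequential(
    MaskedConv2d(1, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    MaskedConv2d(16, 256, kernel_size=5, padding=2),
)
z = torch.randint(0, 256, (1, 1, 8, 8)).float()
logits = code_model(z)
P = logits.softmax(dim=1)  # per-position distribution over the 256 code values
```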
The arithmetic coder 406 uses the probability distribution P(z) generated by the code model 404 to generate a bitstream 415 (shown as "0010011 …" in fig. 4) corresponding to a prediction of the code z. The prediction of the code z may be represented as the code with the highest probability score in the probability distribution P(z) generated over the set of possible codes. In some aspects, the arithmetic coder 406 may output a variable-length bitstream based on the accuracy of the prediction of the code z relative to the actual code z generated by the self-encoder 401. For example, if the prediction is accurate, the bitstream 415 may correspond to a short codeword, while the bitstream 415 may correspond to a longer codeword as the magnitude of the difference between the predicted code z and the actual code z increases.
In some cases, the bitstream 415 may be output by the arithmetic coder 406 for storage in a compressed image file. The bitstream 415 may also be output for transmission to a requesting device (e.g., the receiving device 420 as illustrated in fig. 4). In general, the bitstream 415 output by the arithmetic coder 406 may encode z losslessly such that z can be accurately recovered during the decompression process applied to the compressed image file.
The bitstream 415 generated by the arithmetic coder 406 and transmitted from the transmitting device 410 may be received by the receiving device 420. The transmission between the transmitting device 410 and the receiving device 420 may be made using any of a variety of suitable wired or wireless communication techniques. Communication between the transmitting device 410 and the receiving device 420 may be direct or may be performed through one or more network infrastructure components (e.g., base stations, relay stations, mobile stations, network hubs, routers, and/or other network infrastructure components).
As illustrated, the recipient device 420 may include an arithmetic decoder 426, a code model 424, and a self-encoder 421. The self-encoder 421 includes an encoder 422 and a decoder 423. For a given input, the decoder 423 may produce the same or similar output as the decoder 403. Although the self-encoder 421 is illustrated as including the encoder 422, the encoder 422 need not be used during the decoding process to obtain a reconstruction $\hat{x}$ (e.g., an approximation of the original image x compressed at the transmitting device 410) from the code z received from the transmitting device 410.
The received bitstream 415 may be input into the arithmetic decoder 426 to obtain one or more codes z from the bitstream. The arithmetic decoder 426 may extract the decompressed code z based on the probability distribution P(z) generated by the code model 424 over the set of possible codes and on information associating each generated code z with the bitstream. Given a received portion of the bitstream and the probability prediction of the next code z, the arithmetic decoder 426 may generate a new code z as encoded by the arithmetic coder 406 at the transmitting device 410. Using the new code z, the arithmetic decoder 426 may make a probability prediction for a successive code z, read an additional portion of the bitstream, and decode the successive code z, until the entire received bitstream is decoded. The decompressed code z may be provided to the decoder 423 in the self-encoder 421. The decoder 423 decompresses the code z and outputs an approximation $\hat{x}$ of the image content x (which may be referred to as a reconstructed image or a decoded image). In some cases, the approximation $\hat{x}$ of the content x can be stored for later retrieval. In some cases, the recipient device 420 may recover the approximation $\hat{x}$ and display it on a screen communicatively coupled to or integrated with the recipient device 420.
As mentioned above, the self-encoder 401 and the code model 404 of the transmitting device 410 are illustrated in fig. 4 as machine learning systems that have been previously trained. In some aspects, the self-encoder 401 and the code model 404 may be trained together using image data. For example, the encoder 402 of the self-encoder 401 may receive a first training image n as an input and may map the first training image n to a code z in the implicit code space. The code model 404 may learn the probability distribution P(z) of the code z using a probabilistic autoregressive model (similar to the techniques described above). The arithmetic coder 406 may use the probability distribution P(z) generated by the code model 404 to generate an image bitstream. Using the bitstream and the probability distribution P(z) from the code model 404, the arithmetic coder 406 may generate the code z and may output the code z to the decoder 403 of the self-encoder 401. Subsequently, the decoder 403 may decompress the code z to obtain a reconstruction $\hat{n}$ of the first training image n (where the reconstruction $\hat{n}$ is an approximation of the uncompressed first training image n).
In some cases, a back propagation engine used during training of the transmitting device 410 may perform a back propagation process to tune the parameters (e.g., weights, biases, etc.) of the neural networks of the self-encoder 401 and the code model 404 based on one or more loss functions. In some cases, the back propagation process may be based on a stochastic gradient descent technique. The back propagation may include a forward pass, one or more loss functions, a backward pass, and a weight (and/or other parameter) update. The forward pass, loss function, backward pass, and parameter update may be performed for one training iteration. For each training data set, this process may be repeated for a certain number of iterations until the weights and/or other parameters of the neural networks are accurately tuned.
For example, the self-encoder 401 may compare n and $\hat{n}$ to determine a loss between the first training image n and the reconstructed first training image $\hat{n}$ (e.g., represented by a distance vector or other difference). The loss function may be used to analyze the error in the output. In some examples, the loss may be based on a maximum likelihood. Using an uncompressed image n as input and a reconstructed image $\hat{n}$ as output, as one illustrative example, the neural network system of the self-encoder 401 and the code model 404 may be trained using the loss function $\text{Loss} = D + \beta \cdot R$, where R is the rate, D is the distortion, $\cdot$ represents a multiplication function, and $\beta$ is a trade-off parameter set to a value defining the bit rate. In another example, a different loss function may be used to train the neural network system of the self-encoder 401 and the code model 404. Other loss functions may be used in some situations, such as when other training data is used. One example of another loss function is the mean squared error (MSE), defined as $E_{total} = \sum \tfrac{1}{2}(\text{target} - \text{output})^2$; the MSE calculates the sum of one-half times the square of the difference between the actual (target) answer and the predicted (output) answer.
Based on the determined loss (e.g., distance vector or other difference), and using a back propagation process, the parameters (e.g., weights, biases, etc.) of the neural network system of the self-encoder 401 and the code model 404 may be adjusted (effectively adjusting the mapping between the received image content and the implicit code space) to reduce the loss between the uncompressed input image and the compressed image content output by the self-encoder 401.
For the first training image, the loss (or error) may be high because the actual output value (the reconstructed image) may be quite different from the input image. The goal of training is to minimize the amount of loss of the predicted output. The neural network may perform a backward pass by determining which nodes of the neural network (with their corresponding weights) contributed most to the loss of the neural network, and the weights (and/or other parameters) may be adjusted to reduce and ultimately minimize the loss. The derivative of the loss with respect to a weight (denoted dL/dW, where W is the weight of a particular layer) may be calculated to determine the weights that contributed most to the loss of the neural network. For example, the weights may be updated so that they change in the direction opposite to the gradient. The weight update may be represented as $w = w_i - \eta \frac{dL}{dW}$, where w denotes a weight, $w_i$ denotes the initial weight, and $\eta$ denotes the learning rate. The learning rate may be set to any suitable value, where a high learning rate results in larger weight updates and a lower value results in smaller weight updates.
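The following PyTorch sketch illustrates one such training iteration (forward pass, loss Loss = D + β·R, backward pass, and gradient-descent weight update); the toy encoder/decoder modules and the rate proxy are illustrative assumptions, not the networks of fig. 4:

```python
# A minimal sketch of one training iteration: forward pass, loss = D + beta * R,
# backward pass, and a gradient-descent weight update w <- w - lr * dL/dw.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1))   # stand-in for the encoder
decoder = nn.Sequential(nn.Conv2d(8, 3, 3, padding=1))   # stand-in for the decoder
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3)              # eta: learning rate
beta = 0.01                                               # rate/distortion trade-off

x = torch.rand(1, 3, 32, 32)      # uncompressed training image n
z = encoder(x)                    # map the image to the implicit code space
x_hat = decoder(z)                # reconstruction n_hat

distortion = ((x - x_hat) ** 2).mean()     # D, e.g. mean squared error
rate_proxy = z.abs().mean()                # illustrative stand-in for the rate term R
loss = distortion + beta * rate_proxy      # Loss = D + beta * R

optimizer.zero_grad()
loss.backward()     # backward pass: dL/dW for every weight
optimizer.step()    # weight update: w = w_i - eta * dL/dW
```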
The neural network system of the self-encoder 401 and the code model 404 may continue to be trained in this manner until a desired output is achieved. For example, the self-encoder 401 and the code model 404 may repeat the back propagation process to minimize or otherwise reduce the differences between the input image n and the reconstructed image $\hat{n}$ resulting from decompression of the generated code z.
The self-encoder 421 and code model 424 may be trained using techniques similar to those described above for training the self-encoder 401 and code model 404 of the transmitting device 410. In some cases, the self-encoder 421 and the code model 424 may be trained using the same or different training data sets as are used to train the self-encoder 401 and the code model 404 of the transmitting device 410.
In the example shown in fig. 4, the rate-distortion self-encoders (of the sender device 410 and the receiver device 420) are trained and run at a given bit rate for inference. In some implementations, the rate-distortion self-encoder may be trained at multiple bit rates to allow high-quality reconstructed images or video frames to be generated and output (e.g., with no or few artifacts due to distortion relative to the input image) as the amount of information provided in the implicit code z varies.
In some implementations, the implicit code z may be partitioned into at least two chunks $z_1$ and $z_2$. When using the RD-AE model at a high-rate setting, both chunks are transmitted to the device for decoding. When using the rate-distortion self-encoder at a low-rate setting, only chunk $z_1$ is transmitted, and chunk $z_2$ is inferred from $z_1$ on the decoder side. Inferring chunk $z_2$ from $z_1$ may be performed using a variety of techniques, as described in more detail below.
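As a minimal illustration of the chunking described above (the channel split point is an assumed choice, not specified by the disclosure), the implicit code tensor could be divided as follows:

```python
# Splitting an implicit code tensor into two chunks along the channel axis.
import torch

z = torch.randn(1, 16, 8, 8)     # implicit code z
z1, z2 = z[:, :8], z[:, 8:]      # chunks z1 and z2
# high-rate operation: transmit both z1 and z2
# low-rate operation: transmit only z1; the decoder infers z2 from z1
```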
In some implementations, a set of continuous implicit values (e.g., which may convey a large amount of information) and corresponding quantized discrete implicit values (e.g., which contain less information) may be used. After training the RD-AE model, an auxiliary dequantization model may be trained. In some cases, when the RD-AE is used, only the discrete implicit values are transmitted, and the auxiliary dequantization model is used at the decoder side to infer the continuous implicit values from the discrete implicit values.
Although system 400 is shown as including certain components, one of ordinary skill in the art will appreciate that system 400 may include more or fewer components than those shown in fig. 4. For example, the sender device 410 and/or the receiver device 420 of the system 400 may also include, in some examples, one or more memory devices (e.g., RAM, ROM, cache, etc.), one or more networking interfaces (e.g., wired and/or wireless communication interfaces, etc.), one or more display devices, and/or other hardware or processing devices, which are not shown in fig. 4. The components shown in fig. 4 and/or other components of system 400 may be implemented using one or more computing or processing components. The one or more computing components may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), and/or an Image Signal Processor (ISP). An illustrative example of computing devices and hardware components that may be implemented with system 1600 will be described below with reference to fig. 16.
The system 400 may be part of or implemented by a single computing device or multiple computing devices. In some examples, the transmitting device 410 may be part of a first device and the receiving device 420 may be part of a second computing device. In some examples, the transmitting device 410 and/or the receiving device 420 may be included as part of an electronic device (or multiple electronic devices), such as a telephone system (e.g., smart phone, cellular phone, conference system, etc.), desktop computer, laptop or notebook computer, tablet computer, set-top box, smart television, display device, game console, video streaming device, SOC, IoT (Internet of Things) device, smart wearable device (e.g., head mounted display (HMD), smart glasses, etc.), camera system (e.g., digital camera, IP camera, video camera, security camera, etc.), or any other suitable electronic device. In some cases, the system 400 may be implemented by the image processing system 100 shown in FIG. 1. In other cases, the system 400 may be implemented by one or more other systems or devices.
Fig. 5A is a diagram illustrating an example neural network compression system 500. In some examples, the neural network compression system 500 may include an RD-AE system. In fig. 5A, a neural network compression system 500 includes an encoder 502, an arithmetic encoder 508, an arithmetic decoder 512, and a decoder 514. In some cases, encoder 502 and/or decoder 514 may be identical to encoder 402 and/or decoder 403, respectively. In other cases, encoder 502 and/or decoder 514 may be different from encoder 402 and/or decoder 403, respectively.
The encoder 502 may receive an image 501 (image $x_i$) as an input and may map the image 501 to an implicit code 504 (implicit value $z_i$). The image 501 may represent a still image and/or a video frame associated with a sequence of frames (e.g., a video). In some cases, the encoder 502 may perform a forward pass to generate the implicit code 504. In some examples, the encoder 502 may implement a learnable function. In some cases, the encoder 502 may implement a learnable function parameterized by $\phi$. For example, the encoder 502 may implement the function $q_\phi(z|x)$. In some examples, the learnable function need not be shared with or known to the decoder 514.
The arithmetic encoder 508 may generate a bitstream 510 based on the implicit code 504 (implicit value $z_i$) and an implicit prior 506. In some examples, the implicit prior 506 may implement a learnable function. In some cases, the implicit prior 506 may implement a learnable function parameterized by $\psi$. For example, the implicit prior 506 may implement the function $p_\psi(z)$. The implicit prior 506 may be used to convert the implicit code 504 (implicit value $z_i$) into the bitstream 510 using lossless compression. The implicit prior 506 may be shared and/or made available at both the sender side (e.g., the encoder 502 and/or the arithmetic encoder 508) and the receiver side (e.g., the arithmetic decoder 512 and/or the decoder 514).
The arithmetic decoder 512 may receive the encoded bitstream 510 from the arithmetic encoder 508 and use the implicit prior 506 to decode the implicit code 504 (implicit value $z_i$). The decoder 514 may decode the implicit code 504 (implicit value $z_i$) into an approximate reconstructed image 516 (reconstruction $\hat{x}_i$). In some cases, the decoder 514 may implement a learnable function parameterized by $\theta$. For example, the decoder 514 may implement the function $p_\theta(x|z)$. The learnable function implemented by the decoder 514 may be shared and/or made available at both the sender side (e.g., the encoder 502 and/or the arithmetic encoder 508) and the receiver side (e.g., the arithmetic decoder 512 and/or the decoder 514).
The neural network compression system 500 may be trained to minimize a rate-distortion objective. In some examples, the rate reflects the length of the bitstream 510 (bitstream b), and the distortion reflects the distortion between the image 501 (image $x_i$) and the reconstructed image 516 (reconstruction $\hat{x}_i$). The parameter $\beta$ may be used to train the model to obtain a particular rate-distortion ratio. In some examples, the parameter $\beta$ may be used to define and/or implement a certain trade-off between rate and distortion.
In some examples, the loss may be expressed as follows:

$$\mathcal{L}(x; \phi, \theta, \psi) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p_\theta(x|z) - \beta \log p_\psi(z)\right],$$

where $\mathbb{E}$ denotes the expectation. The distortion $D(x|z; \theta)$ may be determined based on a loss function such as, for example, a mean squared error (MSE). In some examples, the term $-\log p_\theta(x|z)$ may indicate and/or represent the distortion $D(x|z; \theta)$.
The rate for transmitting the implicit value may be expressed as $R_z(z; \psi)$. In some examples, the term $-\log p_\psi(z)$ may indicate and/or represent the rate $R_z(z; \psi)$. In some cases, the loss may be minimized over the complete data set D as follows:

$$\min_{\phi,\, \theta,\, \psi} \; \mathbb{E}_{x \sim D}\left[\mathcal{L}(x; \phi, \theta, \psi)\right].$$
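As a simple numeric illustration (with assumed probability values, not taken from the disclosure), the following Python snippet shows how the rate term $R_z(z; \psi) = -\log p_\psi(z)$ translates probabilities under the implicit prior into code lengths in bits, so that less probable codes cost more bits:

```python
# Code lengths in bits for a short sequence of codes under an assumed prior.
import math

p_under_prior = [0.5, 0.25, 0.125, 0.125]     # assumed prior probabilities of 4 codes
bits = [-math.log2(p) for p in p_under_prior]
print(bits)        # [1.0, 2.0, 3.0, 3.0]
print(sum(bits))   # total ideal code length, in bits, for this sequence of codes
```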
fig. 5B is a diagram illustrating an example neural network compression system 530 for implementing the inference process. As shown, the encoder 502 may convert the image 501 into an implicit code 504. In some examples, image 501 may represent a still image and/or a video frame associated with a sequence of frames (e.g., video).
In some examples, the encoder 502 may encode the image 501 using a single forward pass through the encoder network. Subsequently, the arithmetic encoder 508 may perform arithmetic encoding of the implicit code 504 (implicit value $z_i$) under the implicit prior 506 to generate a bitstream 520. In some examples, the arithmetic encoder 508 may generate the bitstream 520 by entropy coding the implicit value $z_i$ under the implicit prior $p_\psi(z)$.
The arithmetic decoder 512 may receive the bitstream 520 from the arithmetic encoder 508 and perform arithmetic decoding of the implicit code 504 (implicit value $z_i$) under the implicit prior 506. In some examples, the arithmetic decoder 512 may decode the implicit code 504 from the bitstream 520 by entropy decoding under the implicit prior $p_\psi(z)$. The decoder 514 may decode the implicit code 504 (implicit value $z_i$) and generate a reconstructed image 516 (reconstruction $\hat{x}_i$). In some examples, the decoder 514 may decode the implicit code 504 (implicit value $z_i$) using a single forward pass through the decoder network.
In some examples, the RD-AE system may be trained using a training data set and further fine-tuned for the data points (e.g., image data, video data, audio data) to be transmitted to and decoded by a recipient (e.g., a decoder). For example, at inference time, the RD-AE system may be fine-tuned on the image data that is transmitted to the recipient. Because compression models are typically large, sending the parameters associated with the model to the recipient can be very expensive in terms of resources such as network (e.g., bandwidth, etc.), storage, and computing resources. In some cases, the RD-AE system may also be fine-tuned on a single data point that is compressed and sent to the recipient for decompression. This may limit the amount of information (and the associated costs) sent to the recipient while maintaining and/or improving compression/decompression efficiency, performance, and/or quality.
Fig. 6 is a diagram illustrating an example inference process implemented by an example neural network compression system 600 that is fine-tuned using a model prior. In some examples, the neural network compression system 600 may include an RDM-AE system that has been fine-tuned using an RDM-AE model prior. In some cases, the neural network compression system 600 may include an AE model that is fine-tuned using a model prior.
In this illustrative example, the neural network compression system 600 includes an encoder 602, an arithmetic encoder 608, an arithmetic decoder 612, a decoder 614, a model prior 616, and an implicit prior 606. In some cases, encoder 602 may be the same as or different from encoder 402 or encoder 502, and decoder 614 may be the same as or different from decoder 403 or decoder 514. The arithmetic encoder 608 may be the same as or different from the arithmetic decoder 406 or the arithmetic encoder 508, and the arithmetic decoder 612 may be the same as or different from the arithmetic decoder 426 or the arithmetic decoder 512.
The neural network compression system 600 may generate an implicit code 604 (implicit value $z_i$) for an image 601 (image $x_i$). The neural network compression system 600 may encode the image 601 (image $x_i$) using the implicit code 604 and an implicit prior 606, and may generate a bitstream 610, which may be used by a recipient to generate a reconstructed image 620 (reconstruction $\hat{x}_i$). In some examples, the image 601 may represent a still image and/or a video frame associated with a sequence of frames (e.g., a video).
In some examples, the neural network compression system 600 may be fine-tuned using an RDM loss. The neural network compression system 600 may be trained by minimizing a rate-distortion-model-rate (RDM) loss. In some examples, on the encoder side, the AE model may be fine-tuned on the image 601 (image $x_i$) by minimizing the RDM loss for that image.
The fine-tuned encoder 602 can encode the image 601 (image $x_i$) to generate the implicit code 604. In some cases, the fine-tuned encoder 602 may encode the image 601 (image $x_i$) using a single forward pass. The arithmetic encoder 608 may use the implicit prior 606 to convert the implicit code 604 into the bitstream 610 for the arithmetic decoder 612. The arithmetic encoder 608 may entropy code the parameters of the fine-tuned decoder 614 and the fine-tuned implicit prior 606 under a model prior 616, and may generate a bitstream 611 comprising the compressed parameters of the fine-tuned decoder 614 and the fine-tuned implicit prior 606. In some examples, the bitstream 611 may include updated parameters for the fine-tuned decoder 614 and the fine-tuned implicit prior 606. The updated parameters may include, for example, parameter updates relative to a baseline decoder and a baseline implicit prior, such as the decoder 614 and the implicit prior 606 prior to fine-tuning.
In some cases, on the encoder side, the fine-tuned implicit prior 606 may be entropy coded under the model prior 616, the fine-tuned decoder 614 may be entropy coded under the model prior 616, and the implicit code 604 (implicit value $z_i$) may be entropy coded under the fine-tuned implicit prior 606. On the decoder side, the corresponding operations are mirrored: the fine-tuned implicit prior 606 and the fine-tuned decoder 614 are entropy decoded under the model prior 616, and the implicit code 604 (implicit value $z_i$) is entropy decoded under the fine-tuned implicit prior 606.
The decoder 614 may decode the implicit code 604 (implicit value $z_i$) into an approximate reconstructed image 620 (reconstruction $\hat{x}_i$). In some examples, the decoder 614 may decode the implicit code 604 using a single forward pass of the fine-tuned decoder.
As previously explained, the neural network compression system 600 may be trained by minimizing the RDM loss. In some cases, the rate may reflect the length of the bitstream b (e.g., bitstreams 610 and/or 611), the distortion may reflect the distortion between the input image 601 (image $x_i$) and the reconstructed image 620 (reconstruction $\hat{x}_i$), and the model rate may reflect the length of the bitstream used and/or required to send the model updates (e.g., updated parameters) to the recipient (e.g., the decoder 614). The parameter $\beta$ may be used to train the model to obtain a particular rate-distortion ratio.
In some examples, the loss for a data point x may be minimized at inference time as follows:

$$\min_{z,\, \psi,\, \theta} \; \mathcal{L}_{RDM}(x, z; \theta, \psi, \omega).$$

In some examples, the RDM loss may be expressed as follows:

$$\mathcal{L}_{RDM}(x, z; \theta, \psi, \omega) = -\log p_\theta(x|z) - \beta \log p_\psi(z) - \beta \log p_\omega(\psi, \theta).$$

In some cases, the distortion $D(x|z; \theta)$ may be determined based on a loss function such as, for example, a mean squared error (MSE).
The term $-\log p_\theta(x|z)$ may indicate and/or represent the distortion $D(x|z; \theta)$. The term $-\beta \log p_\psi(z)$ may indicate and/or represent the rate $R_z(z; \psi)$ for sending the implicit value, and the term $-\beta \log p_\omega(\psi, \theta)$ may indicate and/or represent the rate $R_{\psi,\theta}(\psi, \theta; \omega)$ for sending the fine-tuned model updates.
In some cases, the model prior 616 may reflect the bit rate overhead for sending the model updates. In some examples, the bit rate for sending the model updates may be described as the code length of the updated parameters under the model prior, i.e., $R_{\psi,\theta}(\psi, \theta; \omega) = -\log p_\omega(\psi, \theta)$. In some cases, the model prior may be chosen such that sending a model without updates is inexpensive, i.e., such that the corresponding bit length (model rate loss) is small.
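As a non-limiting illustration of the model-rate idea, the following Python sketch assigns a code length to a single parameter update under an assumed zero-centred Laplace-style model prior; the prior shape, scale, and quantization bin width are illustrative assumptions, not the disclosure's model prior:

```python
# Code length of one quantized parameter update under an assumed zero-centred prior:
# an update of zero (no change to the parameter) is the cheapest to send, and the
# cost in bits grows as the update gets larger.
import math

def update_bits(delta, b=1.0, bin_width=0.1):
    # probability mass of the quantization bin around delta (illustrative only)
    p = bin_width * (1.0 / (2.0 * b)) * math.exp(-abs(delta) / b)
    return -math.log2(p)

print(update_bits(0.0))   # smallest cost: no update to this parameter
print(update_bits(0.5))   # a moderate update costs more bits
print(update_bits(2.0))   # a large update costs the most bits
```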
In some cases, using the RDM loss function, the neural network compression system 600 adds bits to the bitstream for a model update only if doing so reduces the implicit rate or the distortion by at least as many bits. This may provide an improvement in rate-distortion (R/D) performance. For example, the neural network compression system 600 may increase the number of bits in the bitstream 611 used for sending a model update if it is also able to reduce the rate or the distortion by at least the same number of bits. In other cases, the neural network compression system 600 may add bits to the bitstream for a model update even if the implicit rate or the distortion is not reduced by at least as many bits.
The neural network compression system 600 may be trained end-to-end. In some cases, the RDM loss may be minimized end-to-end at inference time. In some examples, a certain amount of computation (e.g., fine-tuning of the model) may be spent once, and a high compression rate may then be achieved without additional cost on the receiver side. For example, for a video to be provided to a large number of recipients, the content provider may expend a high amount of computation to more extensively train and fine-tune the neural network compression system 600. The highly trained and fine-tuned neural network compression system 600 may provide high compression performance for the video. After this computation has been spent, the video provider may store the updated parameters under the model prior and efficiently provide them to each recipient of the compressed video so that the recipient can decompress the video. The video provider can achieve a large benefit in compression (and a reduction in network and computing resources) with each transmission of the video, which can significantly outweigh the initial computing cost of training and fine-tuning the model.
The training/learning and fine tuning approach described above may be very advantageous for video compression and/or high resolution images due to the large number of pixels in the video and images (e.g., high resolution images). In some cases, complexity and/or decoder computation may be used as additional considerations for overall system design and/or implementation. For example, fine-tuning can be done on very small networks that are fast to infer. As another example, a cost term for receiver complexity may be added, which may force and/or cause the model to remove one or more layers. In some examples, machine learning may be used to learn more complex model priors to achieve even greater gains.
The model prior design may include various attributes. In some examples, the model prior may assign a high probability, and thus a low bit rate, to sending a model without any updates. In some cases, the model prior may assign non-zero probabilities to values around the un-updated (globally trained) parameters, so that different instances of the fine-tuned model can be encoded in practice. In some cases, the model prior may be quantized at inference time and used for entropy coding.
Despite rapid research progress, compression coder-decoders based on deep learning (referred to as "codecs") have not been deployed in commercial or consumer applications. One reason for this is that neural codecs have not been robustly superior to traditional codecs in terms of rate-distortion. In addition, existing neural-based codecs present further implementation challenges. For example, a neural-based codec requires a trained neural network on all recipients. Thus, all users across different platforms must store identical copies of such neural networks in order to perform decoding functions. Storing such neural networks consumes a large amount of memory, is difficult to maintain, and is vulnerable to corruption.
As mentioned above, systems and techniques including implicit neural compression codecs are described herein that can address the above issues. For example, aspects of the present disclosure include video compression codecs based on Implicit Neural Representations (INRs), which may be referred to as implicit neural models. As described herein, an implicit neural model may take as input coordinate locations (e.g., coordinates within an image or video frame) and may output pixel values (e.g., color values of the image or video frame, such as red, green, blue (RGB) values for each coordinate location or pixel). In some cases, the implicit neural model may also be based on an IPB frame scheme. In some examples, the implicit neural model may modify the input data into a model optical flow (referred to as Implicit Neural Optical Flow (INOF)).
For example, an implicit neural model may model optical flow with an implicit neural representation, where a local translation may be equivalent to an element-wise addition. In some cases, optical flow may correspond to local translation (e.g., movement of pixels as a function of position). In some aspects, the optical flow may be modeled across video frames to improve compression performance. In some cases, the implicit model may model the optical flow by adjusting the input coordinate positions to produce the corresponding output pixel values. For example, an element-wise addition to the inputs may result in a local translation at the output, which may eliminate the need for pixel movement and the associated computational complexity. In one illustrative example, a translation between a first frame having three pixels (e.g., P1|P2|P3) and a second frame having three pixels (e.g., P0|P1|P2) may be modeled by the implicit neural model by modifying the input (e.g., without shifting the positions of the pixels), such as by performing an element-wise subtraction or addition. The following diagram illustrates this example:
1|2|3→P1|P2|P3
0|1|2→P0|P1|P2
As mentioned above, the implicit neural model may take as input the coordinate positions of an image or video frame and may output the pixel values of the image or video frame. In this case, the inputs (1|2|3 and 0|1|2) represent inputs to the implicit neural model and include coordinates within the image. The outputs (P1|P2|P3 and P0|P1|P2) represent outputs of the implicit neural model and may include RGB values. Each of the two lines (1|2|3 → P1|P2|P3 and 0|1|2 → P0|P1|P2) corresponds to the same model, where changing the input by a value of '1' results in a corresponding shift in the output. In the case of conventional optical flow, the machine learning model itself must shift the positions of the pixels from one frame to the next. Because the implicit machine learning model takes coordinates as input, the input can be preprocessed (prior to processing by the codec) to subtract 1 from each input value, in which case the output will be shifted and the optical flow will thus be effectively modeled. In some cases, an element-wise addition may be performed (e.g., when an object in a frame moves in a particular direction), where a value (e.g., a value of 1) is added to the input values.
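The following Python sketch illustrates the coordinate-shift idea with a toy stand-in for the implicit model (the lookup-table "model" is purely illustrative and not the disclosure's network):

```python
# For an implicit model that maps a coordinate to a pixel value, subtracting 1
# from every input coordinate shifts the reconstructed pixels by one position,
# without the model itself having to move any pixels.
def implicit_model(coord):
    pixels = {0: "P0", 1: "P1", 2: "P2", 3: "P3"}
    return pixels[coord]

frame_t = [implicit_model(c) for c in [1, 2, 3]]              # ['P1', 'P2', 'P3']
frame_t_plus_1 = [implicit_model(c - 1) for c in [1, 2, 3]]   # ['P0', 'P1', 'P2']
print(frame_t, frame_t_plus_1)
```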
In some examples, the residual may be modeled across frames with a weight update of the implicit neural model. In some cases, the present techniques may be used to reduce the bit rate required to compress inter-prediction frames (e.g., unidirectional frames (P-frames) and/or bi-directional frames (B-frames)). In some examples, intra frames (e.g., intra frames or I frames) may be processed using a convolution-based architecture. Convolution-based architectures can be used to solve the decoding computational bottleneck of implicit models, resulting in a resulting fast encoded and decoded model. In some aspects, converting the data into a bitstream may be performed by post-training quantization of I frames and quantization-aware training of P and B frames.
In some cases, the model may be quantized and/or encoded to form a complete neural compression codec. In some examples, the model may be sent to the recipient. In some cases, fine-tuning of the model may be performed for P-frames and B-frames, and the converged updates may be sent to the recipient. In some aspects, the model may be fine-tuned using a sparsity-inducing prior and/or a quantization-aware procedure (which may minimize the bit rates of P-frames and B-frames). Compared to existing neural compression codecs, neural compression codecs based on implicit neural models eliminate the need for a pre-trained network on the receiver side (and in some cases also on the transmitter side). The performance of the present techniques compares favorably with both traditional and neural-based codecs, and is enhanced relative to previous INR-based neural codecs, on both image and video datasets.
In some aspects, implicit neural representation (INR) methods/models may be used for video and image compression. The video or image may be represented as a function, which may be implemented as a neural network. In some examples, encoding the image or video may include selecting an architecture and overfitting the network weights on a single image or video. In some examples, decoding may include a neural network forward pass. One challenge for implicit neural models used for compression is decoding computational efficiency. Most existing implicit neural models require one forward pass for each pixel in the input data. In some aspects, the present techniques include a generalization of convolutional architectures as implicit neural representation models, which may reduce the computational overhead associated with decoding high-resolution video or images, thereby reducing decoding time and memory requirements.
In some examples, the bit rate may be determined by the size of the stored model weights. In some cases, to improve the performance of the implicit neural approaches disclosed herein, the model size may be reduced to improve the bit rate. In some configurations, the model size may be reduced by quantizing the weights and fitting a weight prior that can be used to losslessly compress the quantized network weights.
In some cases, the present techniques may match the compression performance of state-of-the-art neural image and video codecs. One example advantage of a codec as disclosed herein is that it may eliminate the need to store a neural network on the receiver side and may be implemented within a lightweight framework. Another advantage (e.g., compared to a scale-space flow (SSF) neural codec) is the absence of flow operations, which can be difficult to implement in hardware. In addition, the decoding function may be faster than that of conventional decoders. Furthermore, the present techniques do not require a separate training data set, since the model can be trained implicitly using the data to be encoded (e.g., the current instance of an image, video frame, video, etc.). The configuration of the implicit neural model described herein may help avoid potential privacy concerns and may perform well on data from different domains, including domains for which no suitable training data is available.
In one example relating to neural compression codecs, neural video compression may be implemented using the framework of variational or compressive autoencoders. Such a model is configured to optimize a rate-distortion (RD) loss of the following form:

$$\mathcal{L} = \mathbb{E}_{x \sim D}\, \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p(x|z) - \beta \log p(z)\right].$$
In this example, the encoder $q_\phi$ maps each instance x to an implicit value z, and the decoder p recovers the reconstruction. Assuming that a trained decoder is available on the receiver side, the transmitted bitstream includes the encoded implicit value z. Examples of this type of configuration include a 3D convolutional architecture and an IP-frame flow architecture that conditions each P-frame on the previous frame. Another example includes instance-adaptive fine-tuning, where the model is fine-tuned on each test instance and the model update is transmitted together with the implicit values. While this approach may include advantages over previous works (e.g., robustness to domain shift and a reduction in model size), it still requires that a pre-trained global decoder be available on the receiver side.
In another example related to neural compression codecs, a model may be used to compress an image by implicitly representing the image as neural network weights. This configuration implements models based on a sinusoidal representation network (SIREN) with different numbers of layers and channels and quantizes them to 16-bit precision. The implicit neural codecs described herein differ from such systems, in which the SIREN model is applied directly to image compression tasks. For example, in some examples, the implicit neural codecs described herein may include a convolutional architecture with positional encoding, may implement higher-level compression schemes including quantization and entropy coding, and may perform video compression.
In one example related to implicit neural representations, implicit representations have been used to learn three-dimensional structures and light fields. In some examples, these configurations may train a neural network on a single scene such that the scene is encoded by the network weights. A new view of the scene may then be generated by a forward pass through the network. In some aspects, these methods may be more efficient than discrete methods because, when the object data lies on a low-dimensional manifold in a high-dimensional coordinate system, there is high redundancy in a discrete representation that associates one value with each set of coordinates. In some examples, implicit neural representations can take advantage of such redundancy and thereby learn more efficient representations.
Although implicit representations may be applied to data having lower-dimensional coordinates, such as images and video, their relative efficiency compared to discrete representations has not been established. Furthermore, the performance of existing configurations using implicit representations has yet to match or exceed the performance of configurations using discrete representations or of established compression codecs.
Regardless of the dimensionality of the input data, selecting the correct encoding of the input coordinates is important. In some examples, Fourier-domain features facilitate learning an implicit neural model of the structure of a real scene. For example, Fourier-domain features have been implemented for natural language processing, where Fourier positional encoding of the words in a sentence has been shown to enable language modeling with a full-attention architecture that was state-of-the-art at the time. In addition, with respect to implicit neural modeling of visual tasks, one configuration may use randomly sampled Fourier frequencies as an encoding prior to input into an MLP model. In addition, some configurations make all MLP activations sinusoidal functions, provided that the weights are carefully initialized.
In some examples, neural network quantization may be used to reduce model size so that models can run more efficiently on resource-constrained devices. Examples of neural network quantization include vector quantization (which may represent a quantized tensor using a codebook) and fixed-point quantization (which may represent a tensor with fixed-point numbers comprising an integer tensor and a scaling factor). In fixed-point quantization, the quantization function may be defined as follows:

$$Q_\tau(\theta) = s \cdot \theta_{int}, \qquad \theta_{int} = \mathrm{clip}\!\left(\left\lfloor \frac{\theta}{s} \right\rceil,\; -2^{b-1},\; 2^{b-1}-1\right), \tag{2}$$
where $\theta_{int}$ is an integer tensor with b bits and s is a floating-point scaling factor (or vector). In some aspects, the symbol $\tau = (s, b)$ may be used to refer to the set of all quantization parameters.
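As a non-limiting sketch of the fixed-point quantization in equation 2 (assuming clipping to a signed b-bit range, which is one common convention and not necessarily the disclosure's exact quantizer), the quantizer and dequantizer could be written as:

```python
# Fixed-point quantization: the weight tensor is represented by an integer
# tensor theta_int and a floating-point scale s, with tau = (s, b).
import numpy as np

def quantize(theta, s, b):
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    theta_int = np.clip(np.round(theta / s), qmin, qmax).astype(np.int32)
    return theta_int

def dequantize(theta_int, s):
    return theta_int.astype(np.float32) * s

theta = np.array([0.03, -0.12, 0.47, -0.5], dtype=np.float32)
s, b = 0.05, 8                          # scale and bit width (tau)
theta_int = quantize(theta, s, b)
theta_hat = dequantize(theta_int, s)    # quantized weights used by the model
print(theta_int, theta_hat)
```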
In some examples, low-bit quantization of the weight tensors (e.g., all weight tensors) in a neural network may incur significant quantization noise. With quantization-aware training, the neural network can adapt to the quantization noise by training it end-to-end with the quantization operations in place. Since the rounding operation in equation 2 is not differentiable, a straight-through estimator (STE) is typically used to approximate its gradient. In some cases, the bit width of each tensor in each layer may be learned jointly with the network, in addition to learning the scaling factors. In some aspects, the present techniques may express the quantization bit width as part of a rate loss and may minimize the RD loss to implicitly learn the best trade-off between bit rate and distortion in pixel space.
Fig. 7A is a diagram illustrating an example codec of an implicit neural network based compression system 700. In some aspects, the implicit neural compression system 700 may include a pipeline for training an implicit compression model configured to optimize distortion and/or bit rate. In some examples, the distortion may be minimized by training the weights w 706 of the implicit model ψ(w) 704 on a distortion target. In some aspects, the rate may be minimized by quantizing the weights 706 using a quantization function $Q_\tau(w)$ and fitting a weight prior 712 on the quantized weights 708. In some examples, these components may be combined into a single objective reflecting a rate-distortion loss as follows:

$$\mathcal{L}(w, \tau) = D\!\left(x,\; \psi(Q_\tau(w))\right) + \beta\, R\!\left(Q_\tau(w)\right), \tag{3}$$

where the rate term R is the code length of the quantized weights under the weight prior 712.
In some examples, the first step in "encoding" a data point x (e.g., corresponding to the input image data 702, which may include one or more images) is to find the minimum loss for that data point (e.g., the input image data 702) in equation (3). In some cases, the minimum loss may be obtained using search and/or training algorithms. For example, as shown in fig. 7A, to train the implicit neural model 704 on the transmitter side, the coordinate grid 703 is input into the implicit model 704. The weights of the implicit model 704 are initialized to initial values prior to training. The initial values of the weights are used to process the coordinate grid 703 and generate reconstructed output values (e.g., RGB values for each pixel) for the input image data 702, represented in equation (3) as $\psi(Q_\tau(w))$. The actual input image data 702 being compressed may be used as the known output (or label), represented as the data point x in equation (3). Subsequently, a loss between the reconstructed output values ($\psi(Q_\tau(w))$) and the known output (the data point x, which is the input image data 702 in fig. 7A) may be determined. Based on this loss, the weights of the implicit model 704 may be tuned (e.g., based on a back propagation training technique). Such a process may be performed over a number of iterations until the weights are tuned such that a certain loss value is obtained (e.g., the loss value is minimized). Once the implicit model 704 is trained, the weights w 706 of the implicit model 704 may be output, as shown in fig. 7A. On the receiver side, the coordinate grid 703 may be processed using the implicit model 704 tuned with the decoded weights w after dequantization (or using the quantized weights 708). In some cases, the architecture parameters (ψ(w)) of the implicit model 704 may be determined based on the decoding of the bitstream 720 by the architecture decoder 726.
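The following PyTorch sketch illustrates the "encoding" step described above for a single image: a small implicit model (a stand-in with assumed layer sizes, and omitting the quantization and rate terms of equation (3)) is overfit on the coordinate grid so that its weights come to represent the image:

```python
# Overfitting the weights of a small implicit model to one image by minimizing
# a distortion loss between the model's output on the coordinate grid and the image.
import torch
import torch.nn as nn

H, W = 32, 32
image = torch.rand(1, 3, H, W)                     # data point x (input image data 702)

ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
grid = torch.stack([xs, ys], dim=0).unsqueeze(0)   # coordinate grid 703, shape (1, 2, H, W)

implicit_model = nn.Sequential(                    # stand-in for psi(w)
    nn.Conv2d(2, 32, 1), nn.ReLU(),
    nn.Conv2d(32, 32, 1), nn.ReLU(),
    nn.Conv2d(32, 3, 1),
)
opt = torch.optim.Adam(implicit_model.parameters(), lr=1e-3)

for step in range(500):                            # minimize the distortion D
    recon = implicit_model(grid)
    loss = ((recon - image) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

weights = [p.detach() for p in implicit_model.parameters()]  # weights w to be quantized and entropy coded
```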
In some aspects, the first step may comprise determining the optimal implicit model 704 to be used for compressing the input image data 702 (from among the available implicit models) by searching over the network architecture ψ(·), and training the weights w 706 for each candidate model by minimizing the distortion loss D without quantization. In some examples, this process may be used to select the implicit model 704.
In some cases, a quantizer may be implemented to achieve an optimal distortion D based on the quantizer hyper-parameter τ. In some aspects, the implicit model ψ(w) 704 may be fine-tuned based on the quantized weights 708.
In some examples, the weight prior 712 may be fitted while fixing the quantizer parameters and the implicit model weights (e.g., the quantized weights 708 or the weights 706). In some aspects, the weight prior 712 may be used to determine an optimal setting (including the weights w 706) that minimizes the rate loss R.
In some aspects, the implicit neural network compression system 700 may be used as an image or video codec that is configured to encode (using the prior encoder 714) the weight prior parameters 712 in a bitstream 722, and to encode the quantized weights 708 in a bitstream 724 using entropy coding (by an arithmetic encoder (AE) 710) under the weight prior 712. In some examples, decoding may be implemented in the inverse manner. For example, on the receiver/decoder side, an arithmetic decoder (AD) 730 may perform entropy decoding using the decoded weight prior (decoded by the prior decoder 728) to decode the bitstream 724 and generate the weights (e.g., the weights 706 or the quantized weights 708). Using the weights and the neural network model architecture (e.g., ψ(w)), the implicit model 704 may generate the output image data 732. In one example, once ψ(·) and the quantized weights are decoded, the reconstruction $\hat{x} = \psi(Q_\tau(w))$ can be obtained using a forward pass.
As mentioned above, the implicit model 704 may include one or more neural network architectures that may be selected by training the weights w 706 and determining the minimum distortion. In one example, the implicit model 704 may include a multi-layer perceptron (MLP) that takes coordinates within an image as input and returns RGB values (or other color values) as output:

$$(R, G, B) = \psi_w(x, y). \tag{4}$$
In some aspects, the implicit model 704 may implement a SIREN architecture, which may use periodic activation functions to ensure that fine details in images and video can be accurately represented. In some examples, decoding the image may include evaluating the MLP at each pixel location (x, y) of interest. Because the representation is continuous, it may be trained or evaluated at different resolution settings or on any type of pixel grid (e.g., an irregular grid).
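As a non-limiting sketch of an MLP-style implicit model of the kind in equation 4 with sinusoidal (SIREN-like) activations (the layer widths and the frequency factor are assumed values, not the disclosure's configuration):

```python
# An MLP that maps a pixel coordinate (x, y) to an (R, G, B) value, using
# sine activations in the hidden layers.
import torch
import torch.nn as nn

class SirenLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega_0=30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega_0 = omega_0

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

mlp = nn.Sequential(
    SirenLayer(2, 64),     # input: (x, y) pixel coordinates
    SirenLayer(64, 64),
    nn.Linear(64, 3),      # output: (R, G, B) values for that coordinate
)

coords = torch.tensor([[0.25, -0.5], [0.0, 0.0]])   # two query locations
rgb = mlp(coords)                                    # one forward pass per coordinate batch
```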
In some examples, the implicit model 704 may include a convolutional network that may be used to improve the computational efficiency of the code (e.g., particularly on the receiver side). In some cases, an implicit neural model based on MLP may require forward pass for each input pixel coordinate, which may result in many (e.g., about two million) forward passes to decode each frame of 1K resolution video.
In some aspects, an MLP-based implicit neural model can be considered as a convolution operation with a 1x1 kernel. In some examples, the techniques described herein may generalize an implicit model to a convolution architecture.
Unlike an MLP (which processes one coordinate at a time), the present techniques can arrange all coordinates at once, with the coordinate values on the channel axis. In some aspects, the present techniques may use a 3x3 kernel and a stride value of 2 for the transposed convolution blocks (e.g., indicating that the convolution kernel or filter moves two positions after each convolution operation), which may reduce the number of forward passes required to reconstruct the image by a factor of $2^{2L}$, where L is the number of convolutional layers.
In some examples, the random Fourier encoding and the SIREN architecture may be generalized in this manner. For example, a first layer in the convolutional architecture may include a positional encoding of the coordinates as follows:

$$\gamma_{c,i} = \begin{cases} \sin(\omega_{c/2}\, x_i), & c \text{ even} \\ \cos(\omega_{(c-1)/2}\, x_i), & c \text{ odd}, \end{cases} \tag{5}$$

where c and i are indices along the channel and spatial dimensions, respectively, $0 \le c < 2N_\omega$, and $\omega$ are $N_\omega$ frequencies sampled from a Gaussian distribution. The standard deviation and the number of frequencies are hyper-parameters. The positional encoding may be followed by alternating transposed convolutions and ReLU activations.
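The following sketch illustrates a random Fourier positional encoding of the kind in equation 5, with the number of frequencies and the Gaussian standard deviation as assumed hyper-parameters; here the sine and cosine features are simply concatenated along the channel axis rather than interleaved:

```python
# Random Fourier positional encoding of coordinate vectors.
import torch

def fourier_encode(coords, n_freq=8, sigma=10.0):
    # coords: (num_points, coord_dim); returns (num_points, 2 * n_freq)
    torch.manual_seed(0)                                   # fixed frequencies for illustration
    freqs = torch.randn(coords.shape[1], n_freq) * sigma   # omega ~ N(0, sigma^2)
    proj = coords @ freqs                                  # (num_points, n_freq)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

coords = torch.tensor([[0.1, 0.2], [0.3, 0.4]])
features = fourier_encode(coords)   # fed to the subsequent transposed-convolution / MLP layers
```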
In some aspects, a convolutional model according to the present techniques can process high-resolution images with an arbitrarily small number of forward passes, thereby speeding up both encoding and decoding. At high bit rates, memory efficiency is also much higher. In some examples, training the 3x3 convolution kernels at ultra-low bit rates may be facilitated by using different convolution kernels (e.g., 1x1 and/or 3x3 convolutions) in the pipeline.
As mentioned above, the input to the neural network compression system 700 may include image data 702 (e.g., to train an implicit model), which may include video data. In some examples, video data may have strong redundancy between subsequent frames. In existing video codecs, a group of pictures (GoP) is often compressed in a manner that makes each frame dependent on the previous frame. In particular, the new frame prediction may be formulated as the sum of the warp and residual of the previous frame. The present technology may implement a similar configuration for use with an implicit neural compression scheme. In some cases, the implicit model has been shown to accurately represent the distortion. In some aspects, the present techniques may use temporal redundancy, which may be implicitly utilized to share weights across frames. In some aspects, a completely implicit approach (as disclosed herein) may have the advantage of being conceptually simple and architecture free.
In some examples, the implicit video representation may be implemented using a group of pictures. For example, video may be divided into groups of N frames (or pictures), and each batch may be compressed with a separate network. In some cases, such an implementation reduces the expressiveness required for implicit representations. In some examples, such an implementation may enable buffered streaming, as only one small network needs to be sent before the next N frames can be decoded.
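For illustration, splitting a frame sequence into groups of pictures that are each handled by a separate small network can be sketched as follows (the group size of 5 is only an example):

```python
# Illustrative sketch: split a video into groups of N frames so that each group
# can be compressed with its own small implicit network and streamed one group
# at a time.
def split_into_gops(frames, gop_size=5):
    return [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]

gops = split_into_gops(list(range(23)), gop_size=5)  # 5 groups for 23 frames
```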
In some aspects, the implicit video representation may be implemented using 3D MLP. For example, the MLP representation can be easily extended to video data by adding a third input representing a frame number (or time component) t. In some examples, the SIREN architecture may be used in conjunction with sinusoidal activation.
In some cases, the implicit video representation may be implemented using a 3D convolutional network. As mentioned previously, a 3D MLP can be considered a 1x1x1 convolution operation. Similar to the 2-dimensional case, the present technique may implement the 3D MLP as a convolution operation with a 3-dimensional kernel. To keep the number of parameters to a minimum, the present technique may use a spatial kernel of size k x k x 1 followed by a frame-wise kernel of shape 1 x 1 x k'.
Regarding the Fourier coding in equation 5, the video case can be obtained by setting x_i to [t, x, y] and accordingly introducing additional frequencies to account for the additional coordinate. Since the temporal and spatial correlation scales are likely to be very different, the present technique can set the variance of the time-conjugate frequencies as a separate hyperparameter. A sequence of three-dimensional transposed convolutions alternating with ReLU activations may process the position-encoded features into a video sequence.
In some aspects, the implicit video representation may be implemented using a time-modulated network, which corresponds to an implicit representation that adapts to a data set rather than a single instance. In some examples, such methods may include using a hypernetwork, as well as latent-value-based methods. In some cases, the present technology may use a time-modulated network to generalize an instance model to the frames of a video (rather than to sets of data points). In some examples, the present technology may implement a composite modulator network architecture due to its conceptual simplicity and parameter-sharing efficiency. While prior implementations have found that a SIREN MLP cannot reconstruct high quality at high resolution, and thus partition images into overlapping spatial tiles for weight-sharing purposes, the present technique implements a SIREN architecture that can generate high-resolution frames. In some cases, the present technique may only preserve modulation along the frame axis. In this approach, the input to the model is still only the spatial coordinates (x, y). However, the k-th layer of the network is given by:
Here, σ(·) is the activation function, the inner term is a neural network layer comprising a 3x3 or 1x1 convolution, z_t is a learnable implicit vector for each frame, and g_k(·) represents the k-th layer output of the modulated MLP. The element-wise multiplicative interaction allows modeling of complex temporal dependencies.
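A hedged sketch of one possible form of such a time-modulated layer is shown below; the placement of the modulation, the use of a sine activation, and all shapes are assumptions rather than details taken from the disclosure:

```python
import numpy as np

# Hedged sketch of a time-modulated layer: the k-th layer output is the
# activation of a linear (1x1-convolution-like) layer, modulated element-wise by
# a per-frame learnable latent vector z_t.
def modulated_layer(h_prev, W, b, z_t):
    pre = h_prev @ W + b            # neural network layer (1x1 conv over pixels)
    act = np.sin(pre)               # periodic activation, as in SIREN
    return act * z_t                # element-wise modulation by the frame latent

rng = np.random.default_rng(0)
h = rng.normal(size=(4096, 64))     # features for 4096 pixel positions
W = rng.normal(scale=0.05, size=(64, 64))
b = np.zeros(64)
z_t = rng.normal(size=(64,))        # learnable latent vector for frame t
h_next = modulated_layer(h, W, b, z_t)
```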
In some examples, the implicit video representation may be implemented using configurations based on IPB frame decomposition and/or IP frame decomposition. Referring to fig. 9, a group of consecutive frames 902 may be encoded by first compressing the intermediate frames into I frames (e.g., using IPB frame decomposition). Next, starting from the trained I-frame implicit model, the present technique can fine-tune the first and last frames into P-frames. In some examples, fine-tuning the first and last frames may include using sparsity-inducing priors and quantization-aware fine-tuning to minimize the bit rate. In some aspects, the remaining frames may be encoded as B frames. In some examples, IPB frame decomposition may be achieved by initializing the model weights as an interpolation of the model weights on both sides of the frame. In some cases, the entire bitstream may include the quantized parameters of the I-frame model, encoded under a fitted model prior, and the quantized updates for the P-frames and B-frames, encoded under a sparsity-inducing prior. In some examples, the implicit video representation may be implemented using IP frame decomposition, as illustrated by frame 904 in fig. 9.
Returning to fig. 7A, the neural network compression system 700 may implement a quantization algorithm that may be used to quantize the weights 706 to produce quantized weights 708. In some aspects, network quantization may be used to reduce the model size by using a fixed-point representation for each weight tensor w^(i) ∈ w. In some cases, the quantization parameters and bit widths may be jointly learned, for example, by learning a scale s and a clipping threshold q_max. The bit width b is then implicitly defined as b(s, q_max) = log_2(q_max/s + 1). This parameterization is shown to be better than learning the bit width directly because it does not suffer from an unbounded gradient norm.
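A minimal sketch of this kind of fixed-point quantization, assuming a learned scale s and clipping threshold q_max and the implicit bit width described above, might look as follows (the specific values are illustrative):

```python
import numpy as np

# Minimal sketch (assumed form) of fixed-point weight quantization with a learned
# scale s and clipping threshold q_max; the bit width follows implicitly from the
# two parameters rather than being learned directly.
def fixed_point_quantize(w, s, q_max):
    w_clipped = np.clip(w, -q_max, q_max)
    w_int = np.round(w_clipped / s).astype(np.int32)   # integer tensor to transmit
    bit_width = np.log2(q_max / s + 1.0)                # implicit bit width b(s, q_max)
    return w_int, w_int * s, bit_width                  # ints, dequantized weights, b

w = np.random.default_rng(0).normal(scale=0.1, size=(64, 64))
w_int, w_hat, b = fixed_point_quantize(w, s=0.01, q_max=0.32)
```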
In some examples, encoding the bitstream may include encoding all quantization parameters and all integer tensors. All scales s^(i) are encoded as 32-bit floating point variables, the bit widths b^(i) are encoded as INT4, and the integer tensors are encoded according to their corresponding bit widths b^(i).
In some aspects, the neural network compression system 700 may implement entropy coding. For example, the final training phase may include: an Arithmetic Encoder (AE) 710 fits a prior over the weights (e.g., weights 706 or quantized weights 708) to generate a bitstream 724. As mentioned above, on the receiver/decoder side, an Arithmetic Decoder (AD) 730 may perform entropy decoding using the decoded weight prior (decoded by the prior decoder 728) to decode the bitstream 724 and generate the weights (e.g., weights 706 or quantized weights 708). Using the weights and the neural network model architecture, the implicit model 704 may generate output image data 732. In some cases, the weights may be approximately distributed as a Gaussian centered around 0 for most tensors. In some examples, the scale of each weight tensor may differ, but since the weight range is a (transmitted) quantization parameter, the weights can be normalized. In some cases, the network compression system 700 may then fit a Gaussian to the normalized weights and use this fit for entropy coding (e.g., to produce the bitstream 724).
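As an illustration of entropy coding under a fitted Gaussian weight prior, the sketch below estimates the bit cost of quantized weights from the Gaussian probability mass of each quantization bin; the arithmetic coder itself is not shown, and the bin-integration approach is an assumption:

```python
import numpy as np
from scipy.stats import norm

# Sketch: estimate the bit cost of quantized weights under a Gaussian prior
# fitted to the normalized weights. The probability mass of each quantization
# bin is integrated from the Gaussian CDF; an arithmetic coder would then spend
# approximately -log2(p) bits per symbol.
def rate_under_gaussian_prior(w_int, s):
    w_hat = w_int * s                              # dequantized (normalized) weights
    mu, sigma = w_hat.mean(), w_hat.std() + 1e-12  # fitted Gaussian prior
    upper = norm.cdf(w_hat + s / 2, mu, sigma)
    lower = norm.cdf(w_hat - s / 2, mu, sigma)
    p = np.clip(upper - lower, 1e-12, 1.0)
    return float(-np.log2(p).sum())                # estimated total bits

w_int = np.random.default_rng(0).integers(-16, 17, size=1000)
bits = rate_under_gaussian_prior(w_int, s=0.01)
```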
In some examples, some weights (e.g., weights 706 or quantized weights 708) are sparsely distributed. For sparsely populated weights, the neural network compression system 700 may transmit a binary mask that may be used to redistribute the probability mass only to the bins with content. In some cases, a single bit may be included to encode whether a mask is transmitted.
Fig. 7B is a diagram illustrating an example codec of an implicit neural network based compression system 700. In some aspects, the implicit neural compression system 700 may include a pipeline for training an implicit compression model configured to optimize distortion and/or bit rate. As mentioned above with respect to fig. 7A, the first step may include: determining the optimal implicit model 704 to be used for compression of the input image data 702 (from among the available implicit models) by searching over the network architectures ψ(·), and training the weights w 706 for each model by minimizing distortion loss without quantization. In some examples, this process may be used to select the implicit model 704. In some examples, the implicit model 704 may be associated with one or more model characteristics, which may include model width, model depth, resolution, size of the convolution kernel, input dimensions, and/or any other suitable model parameter or characteristic.
In some aspects, the receiver side (e.g., decoder) does not have a priori knowledge about the network architecture ψ (·) used to encode the input image data 702. In some cases, the implicit neural network compression system 700 may be configured (using the architecture encoder 716) to encode the model architecture ψ (·) 718 in the bitstream 720.
Fig. 8A is a diagram illustrating an example of a pipeline 800 for a group of pictures using implicit neural representations. In some aspects, pipeline 800 may be implemented by a video compression codec that may process images using a neural network that may map coordinates associated with an input image (e.g., I-frame 802 and/or P1-frame 808) to pixel values (e.g., RGB values). In some examples, the output of pipeline 800 may include a compressed file with a header (e.g., to identify a network architecture) and/or weights for the neural network for the corresponding input frame.
In some examples, pipeline 800 may include a base model 804 (e.g., base model f_θ), which can be used to compress one or more image frames from a group of frames associated with a video input. In some cases, the base model 804 may include an I-frame model trained using a first frame from a group of frames. In some aspects, training of the base model 804 may include compressing a first frame (e.g., an I-frame) from a group of frames by mapping input coordinate locations to pixel values (e.g., using equation (4)).
In some aspects, the size of the base model 804 may be reduced by quantizing one or more of the weight tensors associated with the base model 804. In some examples, the weight tensor may be quantized using a fixed-point quantization function, such as the function from equation (2). For example, equation (2) may be used to quantize the base model 804 to produce a quantized base model 806 (e.g., quantized base model f_{Q(θ)}). In some aspects, the quantized base model 806 may be compressed (e.g., using an arithmetic encoder) and sent to the recipient.
In some examples, pipeline 800 may include a flow model 810, which may be used to determine the optical flow field between two image frames (e.g., I-frame 802 and P1 frame 808). For example, the flow model 810 may be configured to determine optical flow fields or motion vectors (e.g., displacement vector fields) between consecutive image frames from a video. In some aspects, the flow model 810 may be trained using a second frame (e.g., P1 frame 808) from the group of frames. In some cases, the displacement vector field determined by the flow model 810 may be applied to a previous frame to model the current frame. In some cases, the displacement of the optical flow field may be applied by adding the displacement vector to the input variable according to:
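Equation (7) itself is not reproduced above; the following hedged sketch shows the general idea of applying a flow model's displacement field by adding it to the input coordinates before evaluating the base model. The `base_model` and `flow_model` names are placeholder callables, not the disclosed networks:

```python
import numpy as np

# Illustrative sketch: apply an implicit flow model's displacement field by
# adding the predicted displacement vectors to the input coordinates before
# evaluating the base model, so the current frame is modeled as a warp of the
# previous one.
def decode_p_frame(base_model, flow_model, coords):
    displacement = flow_model(coords)         # (num_pixels, 2) displacement vectors
    warped_coords = coords + displacement     # shift inputs by the optical flow
    return base_model(warped_coords)          # RGB values for the current frame

coords = np.stack(np.meshgrid(np.linspace(-1, 1, 32),
                              np.linspace(-1, 1, 32)), axis=-1).reshape(-1, 2)
rgb = decode_p_frame(lambda c: np.zeros((c.shape[0], 3)),
                     lambda c: np.zeros_like(c), coords)
```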
In some aspects, the size of the flow model 810 may be reduced by quantizing one or more of the weight tensors associated with the flow model 810. In some examples, the weight tensor may be quantized using a fixed-point quantization function, such as the function from equation (2). For example, equation (2) may be used to quantize the flow model 810 to produce a quantized flow model 812. In some aspects, the quantized flow model 812 may be compressed (e.g., using an arithmetic encoder) and sent to the recipient.
Fig. 8B is a diagram illustrating an example of a pipeline 840 for a group of pictures using implicit neural representations. In some aspects, pipeline 840 may represent a second pipeline stage that may follow pipeline 800. For example, pipeline 840 may be used to process and compress frames using a trained base model (e.g., base model 844) and a trained flow model (e.g., flow model 846).
In some examples, pipeline 840 may be used to encode additional frames from a group of frames by determining quantized updates to the parameters of the composite model. For example, pipeline 840 may be used to sequentially iterate over subsequent P frames (e.g., P1 frame 842) to learn the base model weight update δθ and the flow model weight update δφ relative to the previous frame. In some cases, the updates of the base model weights θ and the flow model weights φ may be determined as follows:
θ_t = θ_{t-1} + δθ_t and φ_t = φ_{t-1} + δφ_t
In some aspects, the updated weights of the base model 844 and the flow model 846 may be sent to the recipient. In some cases, the weight updates δθ and δφ can be quantized on a fixed grid of n equal-sized bins of width t centered on δθ = 0. In some examples, the weight updates may be entropy coded under a spike-and-slab prior, which is a mixture of a narrow Gaussian distribution and a wide Gaussian distribution given by:
In some aspects, the "slab" component in equation (9), which uses the variance σ_slab², may minimize the bit rate used to send the updated weights to the recipient. In some cases, the "spike" component, with a narrow standard deviation σ_spike << σ_slab, may minimize the cost associated with zero updates. In some examples, similar subsequent frames may have an update δθ that is sparse and associated with a relatively low bit rate cost. In some aspects, the quantization grid parameters n and t, the prior standard deviations σ_spike and σ_slab, and the spike-slab ratio correspond to hyperparameters. As shown in fig. 8B, the recipient outputs a reconstructed P1 frame 850 and a reconstructed I-frame 848.
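A hedged sketch of evaluating the rate cost of quantized weight updates under such a spike-and-slab prior is shown below; the mixing ratio, both standard deviations, and the bin-width approximation of probability mass are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

# Hedged sketch of a spike-and-slab prior over weight updates: a mixture of a
# narrow ("spike") and a wide ("slab") zero-mean Gaussian. The mixing weight
# alpha and both standard deviations are assumed hyperparameters; the rate is
# the negative log2-probability of each quantized update.
def spike_slab_rate(delta, bin_width, alpha=0.9, sigma_spike=1e-4, sigma_slab=1e-2):
    density = alpha * norm.pdf(delta, 0.0, sigma_spike) + \
              (1.0 - alpha) * norm.pdf(delta, 0.0, sigma_slab)
    p = np.clip(density * bin_width, 1e-30, 1.0)   # approximate bin probability
    return float(-np.log2(p).sum())

delta_theta = np.random.default_rng(0).normal(scale=1e-4, size=500)  # sparse updates
bits = spike_slab_rate(delta_theta, bin_width=1e-4)
```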
In some aspects, the pipeline 860 may include a second stage 868 that may process a second frame (e.g., P1 frame 862) to generate a reconstructed second frame 874. In some examples, pipeline 860 may include a third stage 870, which may process a third frame (e.g., P2 frame 864) to produce a reconstructed third frame 876. Those skilled in the art will recognize that pipeline 860 may be configured with any number of stages in accordance with the present technique.
In some examples, each stage of the pipeline 860 may include a base model (e.g., base model 804) and a flow model (e.g., flow model 810). In some aspects, the input to the base model may be the element-wise sum of the input coordinates, the current flow model output, and the previous flow model outputs. In some cases, an additional flow model may be implemented as an additional layer that may be added with a skip connection.
Fig. 10 is a diagram illustrating an example process 1000 for performing implicit neural compression. In one aspect, each block of process 1000 may be associated with a formula 1002, which formula 1002 may be implemented for minimizing rate distortion in a neural network compression system (e.g., system 700). In some examples, formula 1002 may have the following form:
referring to formula 1002, d may correspond to a distortion function (e.g., MSE, MS-SSIM); psi may correspond to implicit model categories (e.g., network type and architecture); q (Q) v May correspond to a weight quantizer; w may correspond to implicit model weights; i may correspond to an input image or video; beta may correspond to a compromise parameter; and p is ω May correspond to a weight prior.
Turning to process 1000, at block 1004, the process includes finding an optimal function class or model architecture. In some aspects, finding the optimal implicit model may include: searching over the network architectures and training the weights of each model by minimizing distortion loss (e.g., without quantizing the weights). In some examples, the optimal model is selected based on the minimized distortion loss. In some cases, the search may include a neural architecture search or a Bayesian optimization technique.
At block 1006, process 1000 includes finding optimal function parameters and/or weights. In some examples, finding the optimal weights may include using gradient descent or stochastic gradient descent.
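As a toy illustration of block 1006, the sketch below fits weights by plain gradient descent on an MSE distortion loss; a small linear model on pre-computed coordinate features stands in for the implicit network, and the learning rate and step count are arbitrary assumptions:

```python
import numpy as np

# Toy sketch of fitting implicit model weights by gradient descent on a
# distortion loss (MSE). A linear model on encoded coordinate features stands in
# for the implicit network.
def fit_weights(features, targets, steps=200, lr=0.1):
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(features.shape[1], targets.shape[1]))
    for _ in range(steps):
        pred = features @ W
        grad = 2.0 * features.T @ (pred - targets) / len(features)  # d(MSE)/dW
        W -= lr * grad
    return W

feats = np.random.default_rng(1).normal(size=(4096, 128))   # encoded coordinates
rgb = np.random.default_rng(2).uniform(size=(4096, 3))      # target pixel values
W = fit_weights(feats, rgb)
```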
At block 1008, process 1000 includes finding an optimal quantization setting. In some aspects, finding the optimal quantization settings may be performed using a trainable quantizer (e.g., trained using a machine learning algorithm). In some examples, the quantization settings may be determined using codebook quantization, learned fixed point quantization, and/or any other suitable quantization technique.
At block 1010, process 1000 includes finding an optimal weight prior. In some cases, the optimal weight prior may be found by searching over different distribution types (e.g., Gaussian distribution, beta distribution, Laplace distribution, etc.). In some aspects, finding the optimal weight prior may include: fitting the parameters of the weight distribution (e.g., mean and/or standard deviation) to minimize the rate loss. In some examples, a binary mask for transmission to the decoder may be included, which may provide an indication of the bins without weights.
In some examples, the steps in process 1000 may be performed sequentially or, where applicable, using parallel processing. In some aspects, one or more parameters may allow for backpropagation, which may facilitate combining one or more steps (e.g., when using a learnable quantizer, gradient descent may be used to jointly minimize blocks 1006 and 1008).
Fig. 11 is a diagram illustrating an example process 1100 for performing implicit neural compression. At block 1102, process 1100 may include receiving input video data for compression by a neural network compression system. In some examples, the neural network compression system may be configured to perform video and image compression using Implicit Frame Flow (IFF), based on implicit neural representations. For example, a full-resolution video sequence may be compressed by representing each frame with a neural network that maps coordinate locations to pixel values. In some aspects, the coordinate inputs may be modulated using a separate implicit network to enable motion compensation (e.g., optical flow warping) between frames. In some examples, IFF may be implemented such that the recipient does not need access to a pre-trained neural network. In some cases, IFF may be implemented without requiring a separate training data set (e.g., the network may be trained on the input frames themselves).
At block 1104, process 1100 includes dividing the input video into groups of frames (also referred to as "groups of pictures" or "GoPs"). In some examples, a frame group may include 5 or more frames. In some aspects, a first frame in a frame group may be compressed as a free-standing image (e.g., an I-frame), while other frames in the frame group may be compressed using information available from the other frames. For example, other frames in a frame group may be compressed into P frames that depend on the previous frame. In some aspects, a frame may be compressed into a B frame that depends on both a preceding frame and a subsequent frame.
At block 1106, the process 1100 includes training a base model (e.g., base model f_θ). In some examples, training the base model on the I-frame may include minimizing distortion. In some aspects, training the base model on the I-frame may be based on the following relationship:
In formula (11), t may correspond to a frame index; x, y may correspond to coordinates within a video frame; I_{t,x,y} may correspond to the true RGB values at coordinates (x, y); f_{θ_t}(x, y) may correspond to an implicit neural network with weights θ_t evaluated at coordinates (x, y); Q_ψ may correspond to a quantization function having parameters ψ; and p_ω may correspond to a prior used to compress the quantized weights ω.
At block 1108, process 1100 includes quantizing and entropy coding the I-frame weights θ_0 and writing them into the bitstream. In some aspects, to reduce the model size of the implicit model representing the I-frame, each weight tensor θ^(i) ∈ θ may be quantized using a fixed-point representation (e.g., using equation (2)). In some examples, the bit width may be implicitly defined as b(s, θ_max) = log_2(θ_max/s + 1), where s may correspond to a scale and θ_max may correspond to a clipping threshold. In some examples, per-channel quantization may be performed to obtain a separate range and bit width for each row in the matrix. In one aspect, the per-channel mixed-precision quantization function may be defined according to the following equation:
In some aspects, the quantization parameters and integer tensors may be encoded into the bitstream. For example, s^(l) can be encoded as a 32-bit floating point vector, the bit widths b^(l) can be encoded as a 5-bit integer vector, and the integer tensors can be encoded according to the corresponding per-channel bit widths b^(l).
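Equation (12) is not reproduced above; the following is an assumed sketch of per-channel fixed-point quantization consistent with the description, where each row of a weight matrix receives its own scale, clipping threshold, and implicit bit width:

```python
import numpy as np

# Sketch (assumed form) of per-channel fixed-point quantization: each output
# channel (row) of a weight matrix gets its own scale and clipping threshold,
# and therefore its own implicit bit width.
def per_channel_quantize(W, scales, thresholds):
    W_int = np.empty_like(W, dtype=np.int32)
    bit_widths = np.empty(W.shape[0])
    for c, (s, t) in enumerate(zip(scales, thresholds)):
        W_int[c] = np.round(np.clip(W[c], -t, t) / s).astype(np.int32)
        bit_widths[c] = np.log2(t / s + 1.0)           # per-channel implicit bit width
    return W_int, bit_widths

W = np.random.default_rng(0).normal(scale=0.1, size=(8, 64))
scales = np.full(8, 0.005)
thresholds = np.abs(W).max(axis=1)                     # simple per-channel clipping
W_int, bit_widths = per_channel_quantize(W, scales, thresholds)
```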
At block 1110, the process 1100 includes training a flow model. In some aspects, a P frame may correspond to the next sequential frame in a group of frames (e.g., the first P frame after the I frame). As mentioned above, optical flow may be modeled implicitly by exploiting continuity between implicit representations. Using IFF, a frame can be represented as a network taking image coordinates as input and returning pixel values as follows: (x, y) → f_θ(x, y) = (r, g, b). In some aspects, the displacement of the optical flow field may be applied by adding the displacement vector to the input variable (e.g., equation (7)). In some aspects, training the flow model on P frames may be based on the relationship in equation (11).
At block 1112, process 1100 includes quantizing and entropy coding the P-frame weights φ_0 and writing them into the bitstream. In some aspects, the P-frame weights φ_0 may be quantized and entropy coded using the method described above for the I-frame weights θ_0. For example, the P-frame weights φ_0 may be quantized using a fixed-point representation. In one example, per-channel quantization may be performed according to equation (12). In some aspects, the quantization parameters and integer tensors may be written or encoded into the bitstream and transmitted to the recipient. In some aspects, the learnable quantization parameters ω may also be encoded and written into the bitstream.
At block 1114, process 1100 includes loading the buffered model parameters for processing the current frame P_t. In some aspects, the current frame P_t may correspond to the next frame in the group of frames. For example, the current frame may correspond to a frame following the I-frame and the P-frame that were used to train the base model and the flow model, respectively. In some aspects, the existing model parameters may be represented as the base model weights of the previous frame (e.g., θ_{t-1}) and the flow model weights of the previous frame (e.g., φ_{t-1}).
At block 1116, the process 1100 includes training a base model and a flow model on the current frame. In some aspects, training the base model and the flow model on the current frame includes learning weight updates δθ and δΦ relative to the previous frame such that:
θ_t = θ_{t-1} + δθ_t and φ_t = φ_{t-1} + δφ_t
In some examples, updating the base model may correspond to modeling the residual. In some cases, modeling the update may avoid resending previously calculated flow information (e.g., optical flow between consecutive frames is likely to be similar). In some aspects, the implicit representation of P frame P_T may be given by:
In some examples, as illustrated by equation (15), the cumulative effect of all previous flow models is stored in a single tensor, which is the sum of the local displacements. In some cases, the tensor may be maintained by both the sender and the recipient. In some aspects, using a single tensor may avoid the need to store previous versions of the flow network in order to perform a forward pass through each network for each frame.
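A hedged sketch of maintaining such an accumulated displacement tensor on the decoder side is shown below; `base_model` and `flow_model` are placeholder callables, and the exact coordinates at which the flow is evaluated are an assumption:

```python
import numpy as np

# Hedged sketch: the cumulative effect of all previous flow models is kept in a
# single displacement tensor that both sender and recipient maintain, so earlier
# flow networks never need to be re-evaluated for later frames.
def decode_frame_with_accumulated_flow(base_model, flow_model, coords, total_disp):
    total_disp = total_disp + flow_model(coords)   # add the current local displacement
    rgb = base_model(coords + total_disp)          # evaluate the base model once
    return rgb, total_disp                         # carry the tensor to the next frame

coords = np.zeros((1024, 2))
total_disp = np.zeros_like(coords)
rgb, total_disp = decode_frame_with_accumulated_flow(
    lambda c: np.zeros((c.shape[0], 3)), lambda c: np.zeros_like(c),
    coords, total_disp)
```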
In some cases, the loss for frame P_T can be expressed according to the following relation:
In formula (16), D_T may correspond to the distortion expressed relative to frame P_T, and R(δθ, δφ) may represent the rate cost of the updates.
At block 1118, the process 1100 may include quantizing and entropy coding the weight updates δθ and δφ into the bitstream. In some examples, the updates δθ and δφ may be quantized on a fixed grid of n equal-sized bins of width t centered on δθ = 0. In some aspects, the quantized weight updates may be entropy coded under the spike-and-slab prior, as discussed with respect to equation (9). As mentioned above, in some aspects, the "slab" component in equation (9), which uses the variance σ_slab², may minimize the bit rate used to send the updated weights to the recipient. In some cases, the "spike" component, with a narrow standard deviation σ_spike << σ_slab, may minimize the cost associated with zero updates.
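For illustration, quantizing the weight updates onto a fixed symmetric grid of n bins of width t (both hyperparameters, values assumed here) can be sketched as:

```python
import numpy as np

# Sketch of quantizing weight updates onto a fixed grid of n equal-sized bins of
# width t centered on zero; the grid parameters n and t are hyperparameters.
def quantize_update(delta, n=33, t=1e-4):
    half = (n - 1) // 2
    idx = np.clip(np.round(delta / t), -half, half).astype(np.int32)  # bin indices
    return idx, idx * t                                               # symbols, dequantized

delta = np.random.default_rng(0).normal(scale=1e-4, size=256)
symbols, delta_hat = quantize_update(delta)
```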
At block 1120, the process 1100 includes updating the model parameters for the base model and the flow model. In some aspects, the update of the model parameters may be expressed as θ_t ← θ_{t-1} + δθ and φ_t ← φ_{t-1} + δφ. In some cases, the updates to the model parameters may be sent to the recipient.
At block 1122, the process 1100 includes updating the displacement tensor. In some aspects, the update to the displacement tensor may be represented as adding the current local displacement to the accumulated sum of previous displacements (e.g., as discussed with respect to equation (15)).
At block 1124, process 1100 may determine whether additional frames are present in the group of frames (e.g., GoP). If there are additional frames to process (e.g., additional P frames), process 1100 may repeat the operations discussed with respect to blocks 1114 through 1122. If the network compression system has completed processing the group of frames, process 1100 may proceed to block 1126 and determine whether there are more groups of frames associated with the video input. If there are additional frame groups to process, the method may return to block 1106 and begin training of the base model using a new I-frame corresponding to the next frame group. If no additional frame groups exist, process 1100 may return to block 1102 to receive new input data for compression.
Fig. 12 is a flowchart illustrating an example process 1200 for processing media data. At block 1202, the process 1200 may include receiving a plurality of images for compression by a neural network compression system. For example, the implicit neural network compression system 700 may receive image data 702. In some aspects, the implicit neural network compression system 700 may be implemented using a pipeline 800, and the plurality of images may include I-frame 802 and P1 frame 808.
At block 1204, the process 1200 may include determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images. For example, the base model 804 may determine a first plurality of weight values (e.g., weights w 706) based on the I-frame 802. In some aspects, at least one layer of the first model may include a positional encoding of a plurality of coordinates associated with the first image. For example, at least one layer of the base model 804 may include a positional encoding of coordinates associated with the I-frame 802.
In some cases, the first model may be configured to determine one or more pixel values corresponding to a plurality of coordinates associated with the first image. For example, the base model 804 may be configured to determine one or more pixel values (e.g., RGB values) corresponding to a plurality of coordinates associated with the I-frame 802.
At block 1206, the process 1200 may include generating a first bit stream including a compressed version of a first plurality of weight values. For example, the arithmetic encoder 710 may generate a bitstream 724, which bitstream 724 may include a compressed version of a plurality of weight values (e.g., weights w 706). At block 1208, the process 1200 may include outputting the first bit stream for transmission to a recipient. For example, the bitstream 724 may be output by the arithmetic encoder 710 for transmission to a recipient (e.g., the arithmetic decoder 730).
In some aspects, the process 1200 may include quantizing the first plurality of weight values under a weight prior to generate a plurality of quantized weight values. In some cases, the bitstream may include compressed versions of a plurality of quantized weight values. For example, the weights w 706 may be quantized under a weight prior 712 to produce quantized weights 708. In some examples, quantized weights 708 may be encoded into bitstream 724 by arithmetic encoder 710. In some aspects, the process 1200 may include entropy encoding the first plurality of weight values using a weight prior. For example, arithmetic encoder 710 may encode quantized weights 708 in bitstream 724 using entropy coding under weight prior 712.
In some cases, the weight a priori may be selected to minimize rate loss associated with transmitting the first bit stream to the recipient. For example, the weight priors 712 may be selected or configured to minimize rate loss associated with transmitting the bit stream 724 to the recipient. In some examples, the first plurality of weight values may be quantized using fixed-point quantization. In some aspects, fixed point quantization may be implemented using a machine learning algorithm. For example, the weights w 706 may be quantized using fixed-point quantization, which may represent a weight tensor with fixed-point numbers including integer tensors and scaling factors. In some cases, the implicit neural network compression system 700 may use a machine learning algorithm to implement fixed point quantization of the weights w 706.
In some aspects, the process 1200 may include determining a second plurality of weight values for use by a second model associated with the neural network compression system based on a second image from the plurality of images. For example, pipeline 800 may determine a second plurality of weight values for use by the flow model 810 based on P1 frame 808. In some cases, process 1200 may include generating a second bitstream comprising a compressed version of the second plurality of weight values, and outputting the second bitstream for transmission to the recipient. For example, an arithmetic encoder (e.g., arithmetic encoder 710) may generate a bit stream that may include a compressed version of the weight tensor used by the flow model 810.
In some examples, the second model may be configured to determine optical flow between the first image and the second image. For example, the flow model 810 may be used to determine the optical flow field between the I-frame 802 and the P1 frame 808. In some aspects, process 1200 may include determining at least one updated weight value from the first plurality of weight values based on the optical flow. For example, the flow model 810 may determine updated weight values based on the optical flow from the weight values used by the base model 804.
In some aspects, the process 1200 may include selecting a model architecture corresponding to the first model based on the first image. In some cases, selecting the model architecture may include tuning a plurality of weight values associated with one or more model architectures based on the first image, wherein each of the one or more model architectures is associated with one or more model characteristics. For example, the implicit neural compression system 700 may tune the weights w 706 for each model architecture based on the image data 702. In some examples, the one or more model characteristics may include at least one of: width, depth, resolution, size of convolution kernel, and input dimension.
In some cases, process 1200 may include determining at least one distortion between the first image and the reconstructed data output corresponding to each of the one or more model architectures. For example, the implicit neural compression system 700 may tune the weights w 706 associated with each model to minimize distortion loss without quantization. In some aspects, the process 1200 may include selecting a model architecture from the one or more model architectures based on the at least one distortion. For example, the implicit neural compression system 700 may select a model architecture based on the lowest distortion value.
In some examples, process 1200 may include generating a second bitstream that includes a compressed version of the model architecture, and outputting the second bitstream for transmission to a recipient. For example, the architecture encoder 716 may encode the model architecture ψ (·) 718 in a bitstream 720 and output the bitstream 720 for transmission to a recipient (e.g., the architecture decoder 726).
Fig. 13 is a flow chart illustrating an example process 1300 for processing media data. At block 1302, the process 1300 may include receiving a compressed version of a first plurality of neural network weight values associated with a first image from among a plurality of images. For example, the arithmetic decoder 730 may receive a bitstream 724, which bitstream 724 may include a plurality of weight values (e.g., weights w 706) associated with the image data 702.
At block 1304, the process 1300 may include decompressing the first plurality of neural network weight values. For example, the arithmetic decoder may decompress weights w 706 from bitstream 724. At block 1306, the process 1300 may include processing the first plurality of neural network weight values using the first neural network model to generate a first image. For example, the implicit neural compression system 700 may include a pipeline 800 with a quantized base model 806, which quantized base model 806 may be used to process weight tensors to produce a reconstructed version of the I-frame 802.
In some aspects, the process 1300 may include receiving a compressed version of a second plurality of neural network weight values associated with a second image from the plurality of images. In some cases, process 1300 may include decompressing the second plurality of neural network weight values and processing the second plurality of neural network weight values using the second neural network model to determine optical flow between the first image and the second image. For example, the implicit neural compression system 700 may include a pipeline 800 with a quantized flow model that may be used to process weight tensors associated with the flow model 810 to determine the optical flow between the I-frame 802 and the P1 frame 808.
In some cases, process 1300 may include determining at least one updated weight value based on the optical flow and the first plurality of neural network weight values associated with the first neural network model. For example, the flow model 810 may determine updated weight values from the weights associated with the flow model 810. In some aspects, the process 1300 may include processing the at least one updated weight value using the first neural network model to generate a reconstructed version of the second image. For example, the quantized base model 806 may use the updated weights (e.g., based on the optical flow) to generate a reconstructed version of the P1 frame 808.
In some examples, the first plurality of neural network weight values may be quantized under a weight prior. For example, the weights received by the quantized base model 806 may be quantized under a weight prior (e.g., weight prior 712). In some aspects, the compressed version of the first plurality of network weight values is received in an entropy encoded bitstream. For example, the arithmetic encoder 710 may perform entropy encoding of weights (e.g., weight w 706) or quantized weights (e.g., quantized weights 708) and output the bitstream 724.
In some cases, process 1300 may include receiving a compressed version of a neural network architecture corresponding to a first neural network model. For example, the architecture encoder 716 may encode the model architecture ψ (·) 718 in the bitstream 720 and send it to the architecture decoder 726.
Fig. 14 is a flow chart illustrating an example process 1400 for compressing image data based on an implicit neural representation. At block 1402, the process 1400 may include receiving input data for compression by a neural network compression system. In some aspects, the input data may correspond to media data (e.g., video data, picture data, audio data, etc.). In some examples, the input data may include a plurality of coordinates corresponding to image data used to train the neural network compression system.
At block 1404, process 1400 may include selecting a model architecture for use by the neural network compression system to compress the input data based on the input data. In some aspects, selecting the model architecture may include tuning a plurality of weight values associated with one or more model architectures based on the input data, wherein each of the one or more model architectures is associated with one or more model characteristics. In some examples, selecting the model architecture may further include determining at least one distortion between the input data and the reconstructed data output corresponding to each of the one or more model architectures. In some cases, selecting a model architecture from one or more model architectures may be based on at least one distortion. In some aspects, the one or more model characteristics may include at least one of: width, depth, resolution, size of convolution kernel, and input dimension.
At block 1406, process 1400 may include using the input data to determine a plurality of weight values corresponding to a plurality of layers associated with the model architecture. At block 1408, the process 1400 may include generating a first bit stream including a compressed version of the weight priors. In some examples, generating the first bit stream may include encoding the weight priors using an open neural network exchange (ONNX) format. At block 1410, the process 1400 may include generating a second bit stream comprising a compressed version of the plurality of weight values under the weight prior. In some aspects, generating the second bitstream may include entropy encoding the plurality of weight values using a weight prior. In some examples, the weight priors may be selected to minimize rate loss associated with transmitting the second bit stream to the recipient.
At block 1412, the process 1400 may include outputting the first bit stream and the second bit stream for transmission to a recipient. In some examples, the process may include generating a third bitstream including a compressed version of the model architecture, and outputting the third bitstream for transmission to a recipient. In some aspects, at least one layer of the model architecture includes a positional encoding of a plurality of coordinates associated with the input data.
In some examples, the process may include quantizing the plurality of weight values to produce a plurality of quantized weight values, wherein the second bitstream includes a compressed version of the plurality of quantized weight values under the weight prior. In some aspects, a plurality of weight values may be quantized using learned fixed-point quantization. In some cases, the learned fixed-point quantization may be implemented using a machine learning algorithm. In some examples, the second bitstream may include a plurality of encoded quantization parameters for quantizing the plurality of weight values.
Fig. 15 is a flowchart illustrating an example of a process 1500 for decompressing image data based on an implicit neural representation. At block 1502, the process 1500 may include receiving a compressed version of a weight prior and compressed versions of a plurality of weight values under the weight prior. In some aspects, the plurality of weights under the weight priors may be received in an entropy encoded bitstream. At block 1504, the process 1500 may include decompressing the weight prior and the compressed version of the plurality of weight values under the weight prior.
At block 1506, the process 1500 may include determining a plurality of neural network weights based on the weight prior and the plurality of weights under the weight prior. At block 1508, the process 1500 may include processing the plurality of neural network weights using a neural network architecture to generate reconstructed image content. In some aspects, the plurality of weight values under the weight priors may correspond to a plurality of quantized weights under the weight priors. In some examples, the process may include receiving a plurality of encoded quantization parameters for quantizing a plurality of quantized weights under a weight prior.
In some aspects, the process may include receiving a compressed version of a neural network architecture and decompressing the compressed version of the neural network architecture. In some examples, the process may include redistributing the plurality of weights under the weight a priori based on a binary mask.
In some examples, the processes described herein (e.g., process 1100, process 1200, process 1300, process 1400, process 1500, and/or other processes described herein) may be performed by a computing device or apparatus. In one example, processes 1100, 1200, 1300, 1400, and/or 1500 may be performed by a computing device in accordance with system 400 shown in fig. 4 or computing system 1600 shown in fig. 16.
The computing device may include any suitable device, such as a mobile device (e.g., a mobile phone), a desktop computing device, a tablet computing device, a wearable device (e.g., a VR headset, AR glasses, a networked watch or smart watch, or other wearable device), a server computer, an autonomous vehicle or a computing device of an autonomous vehicle, a robotic device, a television, and/or any other computing device having resource capabilities to perform the processes described herein (including process 1100, process 1200, process 1300, process 1400, process 1500, and/or other processes described herein). In some cases, the computing device or apparatus may include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and/or other component(s) configured to perform the steps of the processes described herein. In some examples, a computing device may include a display, a network interface configured to communicate and/or receive data, any combination thereof, and/or other component(s). The network interface may be configured to communicate and/or receive Internet Protocol (IP) based data or other types of data.
The components of the computing device may be implemented with circuitry. For example, the components may include and/or be implemented using electronic circuitry or other electronic hardware, which may include one or more programmable electronic circuits (e.g., microprocessors, graphics Processing Units (GPUs), digital Signal Processors (DSPs), central Processing Units (CPUs), and/or other suitable electronic circuits), and/or may include and/or be implemented using computer software, firmware, or any combination thereof to perform the various operations described herein.
The processes 1100, 1200, 1300, 1400, and 1500 are illustrated as logic flow diagrams whose operations represent sequences of operations that can be implemented in hardware, computer instructions, or a combination thereof. In the context of computer instructions, the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, etc. that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations may be combined and/or performed in parallel in any order to implement the processes.
Additionally, the processes 1100, 1200, 1300, 1400, 1500, and/or other processes described herein may be performed under control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing concurrently on one or more processors, by hardware, or a combination thereof. As mentioned above, the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory.
Fig. 16 is a diagram illustrating an example of a system for implementing certain aspects of the present technique. In particular, fig. 16 illustrates an example of a computing system 1600, which computing system 1600 can be, for example, any computing device that constitutes an internal computing system, a remote computing system, a camera, or any component thereof, with components of the system in communication with each other using a connection 1605. The connection 1605 may be a physical connection using a bus, or a direct connection to the processor 1610 (such as in a chipset architecture). The connection 1605 may also be a virtual connection, a networking connection, or a logical connection.
In some embodiments, computing system 1600 is a distributed system, where the functionality described in this disclosure may be distributed within a data center, multiple data centers, a peer-to-peer network, and so forth. In some embodiments, one or more of the described system components represent many such components, each of which performs some or all of the functions described for that component. In some embodiments, the components may be physical or virtual devices.
The example system 1600 includes at least one processing unit (CPU or processor) 1610 and connections 1605 that couple various system components including a system memory 1615, such as Read Only Memory (ROM) 1620 and Random Access Memory (RAM) 1625, to the processor 1610. The computing system 1600 may include a cache 1612 directly connected to the processor 1610, immediately adjacent to the processor 1610, or integrated as part of the processor 1610.
Processor 1610 may include any general purpose processor and hardware services or software services, such as services 1632, 1634, and 1636 stored in storage device 1630 configured to control processor 1610, as well as special purpose processors, where software instructions are incorporated into the actual processor design. Processor 1610 may be essentially a fully self-contained computing system comprising a plurality of cores or processors, a bus, a memory controller, a cache, etc. The multi-core processor may be symmetrical or asymmetrical.
To enable user interaction, computing system 1600 includes input device(s) 1645 that can represent any number of input mechanisms, such as a microphone for voice, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, motion input, voice, and so forth. The computing system 1600 may also include an output device 1635, which output device 1635 may be one or more of several output mechanisms. In some examples, a multi-modal system may enable a user to provide multiple types of input/output to communicate with computing system 1600. The computing system 1600 may include a communication interface 1640 that can generally manage and manage user inputs and system outputs.
The communication interface may perform or facilitate the use of a wired and/or wireless transceiver to receive and/or transmit wired or wireless communications, including utilizing an audio jack/plug, a microphone jack/plug, a Universal Serial Bus (USB) port/plug, an Ethernet port/plug, a fiber optic port/plug, a dedicated wired port/plug, Bluetooth radio signal transmission, Bluetooth Low Energy (BLE) radio signaling, Radio Frequency Identification (RFID) wireless signaling, Near Field Communication (NFC) wireless signaling, Dedicated Short Range Communication (DSRC) wireless signaling, 802.11 Wi-Fi wireless signaling, Wireless Local Area Network (WLAN) signaling, Visible Light Communication (VLC), Worldwide Interoperability for Microwave Access (WiMAX), Infrared (IR) communication wireless signaling, Public Switched Telephone Network (PSTN) signaling, Integrated Services Digital Network (ISDN) signaling, 3G/4G/5G/LTE cellular data network wireless signaling, ad hoc network signaling, radio wave signaling, microwave signaling, infrared signaling, visible light signaling, ultraviolet light signaling, wireless signaling along the electromagnetic spectrum, or some combination thereof.
The communication interface 1640 may also include one or more Global Navigation Satellite System (GNSS) receivers or transceivers that are used to determine the location of the computing system 1600 based on receiving one or more signals from one or more satellites associated with the one or more GNSS systems. GNSS systems include, but are not limited to, the united states based Global Positioning System (GPS), the russian based global navigation satellite system (GLONASS), the chinese based beidou navigation satellite system (BDS), and the european based galileo GNSS. There are no limitations to operating on any particular hardware arrangement, and thus the underlying features herein may be readily replaced to obtain an improved hardware or firmware arrangement as they are developed.
The storage 1630 may be a nonvolatile and/or non-transitory and/or computer-readable memory device and may be a hard disk or other type of computer-readable medium capable of storing data that is accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, floppy disks, hard disks, magnetic tape, magnetic stripes/strips, any other magnetic storage medium, flash memory, memory storage, any other solid state storage memory, compact disc read-only memory (CD-ROM) disc, rewritable compact disc (CD) disc, digital video disc (DVD) disc, Blu-ray disc (BDD) disc, holographic disc, another optical medium, Secure Digital (SD) card, micro secure digital (microSD) card, Memory Stick card, smart card chip, EMV chip, Subscriber Identity Module (SIM) card, mini/micro/nano/pico SIM card, another Integrated Circuit (IC) chip/card, Random Access Memory (RAM), Static RAM (SRAM), Dynamic RAM (DRAM), Read Only Memory (ROM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash EPROM (FLASHEPROM), cache memory (L1/L2/L3/L4/L5/L#), Resistive Random Access Memory (RRAM/ReRAM), Phase Change Memory (PCM), Spin Transfer Torque RAM (STT-RAM), another memory chip or cartridge, and/or combinations thereof.
Storage 1630 may include software services, servers, services, etc., that when executed by processor 1610 cause the system to perform functions. In some embodiments, a hardware service performing a particular function may include software components stored in a computer-readable medium that interfaces with the necessary hardware components (such as processor 1610, connection 1605, output device 1635, etc.) to perform the function. The term "computer-readable medium" includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying instruction(s) and/or data. Computer-readable media may include non-transitory media in which data may be stored and which do not include carrier waves and/or transitory electronic signals propagating wirelessly or through a wired connection. Examples of non-transitory media may include, but are not limited to, magnetic disks or tapes, optical storage media such as Compact Discs (CDs) or Digital Versatile Discs (DVDs), flash memory, or memory devices. The computer-readable medium may have code and/or machine-executable instructions stored thereon, which may represent procedures, functions, subroutines, programs, routines, subroutines, modules, software packages, classes, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.
In some embodiments, the computer readable storage devices, media, and memory may comprise a cable or wireless signal comprising a bit stream or the like. However, when referred to, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals themselves.
In the above description, specific details are provided to provide a thorough understanding of the embodiments and examples provided herein. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. For clarity of illustration, in some examples, the inventive techniques may be presented as including individual functional blocks that include devices, device components, steps or routines in a method implemented in software or a combination of hardware and software. Additional components other than those shown in the figures and/or described herein may be used. For example, circuits, systems, networks, processes, and other components may be shown in block diagram form in order to avoid obscuring the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.
Various embodiments may be described above as a process or method, which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. The process is terminated when its operations are completed, but may have additional steps not included in the figures. The process may correspond to a method, a function, a procedure, a subroutine, etc. When a process corresponds to a function, its termination corresponds to the function returning to the calling function or the main function.
The processes and methods according to the examples above may be implemented using stored computer-executable instructions or computer-executable instructions otherwise available from a computer-readable medium. Such instructions may include, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or processing device to perform a certain function or group of functions. Portions of the computer resources used are accessible over a network. The computer-executable instructions may be, for example, binary files, intermediate format instructions (such as assembly language), firmware, source code. Examples of computer readable media that may be used to store instructions, information used during a method according to the described examples, and/or created information include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and the like.
An apparatus implementing various processes and methods according to these disclosures may include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and may employ any of a variety of form factors. When implemented in software, firmware, middleware or microcode, the program code or code segments (e.g., a computer program product) to perform the necessary tasks may be stored in a computer-readable or machine-readable medium. The processor may perform the necessary tasks. Typical examples of the form factors include: laptop devices, smart phones, mobile phones, tablet devices, or other small form factor personal computers, personal digital assistants, rack-mounted devices, free-standing devices, and the like. The functionality described herein may also be implemented with a peripheral device or a plug-in card. As a further example, such functionality may also be implemented on different chips or circuit boards among different processes executing on a single device.
The instructions, the media used to convey these instructions, the computing resources used to execute them, and other structures used to support such computing resources are example means for providing the functionality described in this disclosure.
In the above description, aspects of the present application are described with reference to specific embodiments thereof, but those skilled in the art will recognize that the present application is not limited thereto. Thus, although illustrative embodiments of the present application have been described in detail herein, it is to be understood that the various inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations, except insofar as limited by the prior art. The various features and aspects of the application described above may be used individually or in combination. Furthermore, embodiments may be utilized in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of the specification. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. For purposes of illustration, the methods are described in a particular order. It should be appreciated that in alternative embodiments, the methods may be performed in a different order than described.
Those of ordinary skill in the art will appreciate that the less than ("<") and greater than (">") symbols or terms used herein may be substituted with less than or equal to ("≤") and greater than or equal to ("≥") symbols, respectively, without departing from the scope of the present description.
Where components are described as "configured to" perform certain operations, such configuration may be achieved, for example, by designing electronic circuitry or other hardware to perform the operations, by programming programmable electronic circuitry (e.g., a microprocessor, or other suitable electronic circuitry), or any combination thereof.
The phrase "coupled to" refers to any component being physically connected directly or indirectly to another component, and/or any component being in communication directly or indirectly with another component (e.g., being connected to the other component through a wired or wireless connection and/or other suitable communication interface).
Claim language or other language reciting "at least one of" a set and/or "one or more of" a set indicates that one member of the set or multiple members of the set (in any combination) satisfies the claim. For example, claim language reciting "at least one of A and B" or "at least one of A or B" means A, B, or A and B. In another example, claim language reciting "at least one of A, B, and C" or "at least one of A, B, or C" means A, B, C, or A and B, or A and C, or B and C, or A and B and C. The language "at least one of" a set and/or "one or more of" a set does not limit the set to the items listed in the set. For example, claim language reciting "at least one of A and B" or "at least one of A or B" may mean A, B, or A and B, and may additionally include items not listed in the set of A and B.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the examples disclosed herein may be implemented as electronic hardware, computer software, firmware, or combinations thereof. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. The techniques may be implemented in any of a variety of devices such as a general purpose computer, a wireless communication device handset, or an integrated circuit device having multiple uses including applications in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium comprising program code that includes instructions that, when executed, perform one or more of the methods, algorithms, and/or operations described above. The computer-readable data storage medium may form part of a computer program product, which may include packaging material. The computer-readable medium may include memory or data storage media such as random access memory (RAM), for example synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic or optical data storage media, and the like. The techniques may additionally or alternatively be realized at least in part by a computer-readable communication medium, such as propagated signals or waves, that carries or communicates program code in the form of instructions or data structures and that may be accessed, read, and/or executed by a computer.
The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Such processors may be configured to perform any of the techniques described in this disclosure. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term "processor" as used herein may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein.
Illustrative examples of the present disclosure include:
aspect 1: an apparatus, comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: receiving a plurality of images for compression by a neural network compression system; determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images; generating a first bit stream comprising a compressed version of a first plurality of weight values; and outputting the first bit stream for transmission to the recipient.
Aspect 2: the apparatus of aspect 1, wherein at least one layer of the first model includes a positioning encoding of a plurality of coordinates associated with the first image.
Aspect 3: the apparatus of aspect 2, wherein the first model is configured to determine one or more pixel values corresponding to the plurality of coordinates associated with the first image.
Aspect 4: the apparatus of any of aspects 1 through 3, wherein the at least one processor is further configured to: determining a second plurality of weight values for use by a second model associated with the neural network compression system based on a second image from the plurality of images; generating a second bitstream comprising a compressed version of the second plurality of weight values; and outputting the second bitstream for transmission to the recipient.
Aspect 5: the apparatus of aspect 4, wherein the second model is configured to determine optical flow between the first image and the second image.
Aspect 6: the apparatus of aspect 5, wherein the at least one processor is further configured to: at least one updated weight value is determined from the first plurality of weight values based on the optical flow.
Aspect 7: the apparatus of any of aspects 1 to 6, wherein the at least one processor is further configured to: the first plurality of weight values is quantized under a weight prior to generate a plurality of quantized weight values, wherein the first bit stream comprises compressed versions of the plurality of quantized weight values.
Aspect 8: the apparatus of aspect 7, wherein the weight a priori is selected to minimize a rate loss associated with transmitting the first bit stream to the recipient.
Aspect 9: the apparatus of any of aspects 7 to 8, wherein to generate the first bit stream, the at least one processor is further configured to: the first plurality of weight values are entropy encoded using the weight priors.
Aspect 10: the apparatus of any of aspects 7 to 9, wherein the first plurality of weight values are quantized using fixed-point quantization.
Aspect 11: the apparatus of aspect 10, wherein the fixed point quantization is implemented using a machine learning algorithm.
Aspect 12: the apparatus of any one of aspects 1 to 11, wherein the at least one processor is further configured to: a model architecture corresponding to the first model is selected based on the first image.
Aspect 13: the apparatus of aspect 12, wherein the at least one processor is further configured to: generating a second bitstream comprising a compressed version of the model architecture; and outputting the second bit stream for transmission to the recipient.
Aspect 14: the apparatus of any of aspects 12 to 13, wherein to select the model architecture, the at least one processor is further configured to: tuning a plurality of weight values associated with one or more model architectures based on the first image, wherein each of the one or more model architectures is associated with one or more model characteristics; determining at least one distortion between the first image and the reconstructed data output corresponding to each of the one or more model architectures; and selecting the model architecture from the one or more model architectures based on the at least one distortion.
Aspect 15: the apparatus of aspect 14, wherein the one or more model characteristics include at least one of: width, depth, resolution, size of convolution kernel, and input dimension.
Aspect 16: a method of performing any of the operations of aspects 1 to 15.
Aspect 17: a computer-readable storage medium storing instructions that, when executed, cause one or more processors to perform any of the operations of aspects 1 through 15.
Aspect 18: an apparatus comprising means for performing any of the operations of aspects 1 to 15.
Aspect 19: an apparatus, comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: receiving a compressed version of a first plurality of neural network weight values associated with a first image from the plurality of images; decompressing the first plurality of neural network weight values; and processing the first plurality of neural network weight values using the first neural network model to produce a first image.
Aspect 20: the apparatus of aspect 19, wherein the at least one processor is further configured to: receiving a compressed version of a second plurality of neural network weight values associated with a second image from the plurality of images; decompressing the second plurality of neural network weight values; and processing the second plurality of neural network weight values using the second neural network model to determine optical flow between the first image and the second image.
Aspect 21: the apparatus of aspect 20, wherein the at least one processor is further configured to: at least one updated weight value is determined from a first plurality of neural network weight values associated with the first neural network model based on the optical flow.
Aspect 22: the apparatus of aspect 21, wherein the at least one processor is further configured to: the at least one updated weight value is processed using the first neural network model to produce a reconstructed version of the second image.
Aspect 23: the apparatus of any of aspects 19-22, wherein the first plurality of neural network weight values are quantized under a weight prior.
Aspect 24: the apparatus of any of aspects 19-23, wherein the compressed version of the first plurality of neural network weight values is received in an entropy encoded bitstream.
Aspect 25: the apparatus of any of aspects 19 to 24, wherein the at least one processor is further configured to: a compressed version of a neural network architecture corresponding to a first neural network model is received.
Aspect 26: a method of performing any of the operations of aspects 19 to 25.
Aspect 27: a computer-readable storage medium storing instructions that, when executed, cause one or more processors to perform any of the operations of aspects 19 to 25.
Aspect 28: an apparatus comprising means for performing any of the operations of aspects 19 through 25.
Aspect 29: an apparatus, comprising: a memory; and one or more processors coupled to the memory, the one or more processors configured to: receiving input data for compression by a neural network compression system; selecting a model architecture for the neural network compression system to use to compress the input data based on the input data; determining a plurality of weight values corresponding to a plurality of layers associated with the model architecture using the input data; generating a first bit stream comprising a compressed version of the weight priors; generating a second bitstream comprising a compressed version of the plurality of weight values under the weight prior; and outputting the first bit stream and the second bit stream for transmission to a recipient.
Aspect 30: the apparatus of claim 29, wherein to select the model architecture for use by the neural network, the one or more processors are configured to: tuning a plurality of weight values associated with one or more model architectures based on the input data, wherein each of the one or more model architectures is associated with one or more model characteristics; determining at least one distortion between the input data and a reconstructed data output corresponding to each of the one or more model architectures; and selecting the model architecture from the one or more model architectures based on the at least one distortion.
Aspect 31: the apparatus of aspect 30, wherein the one or more model characteristics include at least one of: width, depth, resolution, size of convolution kernel, and input dimension.
Aspect 32: the apparatus of any of aspects 29 to 31, wherein the one or more processors are further configured to: the plurality of weight values are quantized to generate a plurality of quantized weight values, wherein a second bit stream includes compressed versions of the plurality of quantized weight values under the weight priors.
Aspect 33: the apparatus of aspect 32, wherein the plurality of weight values are quantized using learned fixed-point quantization.
Aspect 34: the apparatus of aspect 32, wherein the fixed point quantization is implemented using a machine learning algorithm.
Aspect 35: the apparatus of aspect 32, wherein the second bitstream includes a plurality of encoded quantization parameters for quantizing the plurality of weight values.
Aspect 36: the apparatus of any of aspects 29 to 35, wherein the one or more processors are further configured to: generating a third bitstream comprising a compressed version of the model architecture; and outputting a third bitstream for transmission to the recipient.
Aspect 37: the apparatus of any of claims 29 to 36, wherein at least one layer of the model architecture comprises a positioning encoding of a plurality of coordinates associated with the input data.
Aspect 38: the apparatus of any of claims 29 to 37, wherein to generate the first bitstream, the one or more processors are configured to: the weight priors are encoded using an open neural network switching format.
Aspect 39: the apparatus of any of aspects 29 to 38, wherein to generate the second bitstream, the one or more processors are configured to: the plurality of weight values are entropy encoded using the weight prior.
Aspect 40: the apparatus of any of aspects 29 to 39, wherein the weight a priori is selected to minimize a rate loss associated with transmitting the second bit stream to the recipient.
Aspect 41: the apparatus of any of aspects 29 to 40, wherein the input data comprises a plurality of coordinates corresponding to image data for training the neural network compression system.
Aspect 42: a method of performing any of the operations of aspects 29 to 41.
Aspect 43: a computer-readable storage medium storing instructions that, when executed, cause one or more processors to perform any of the operations of aspects 29 to 41.
Aspect 44: an apparatus comprising means for performing any of the operations of aspects 29 to 41.
Aspect 45: an apparatus, comprising: a memory; and one or more processors coupled to the memory, the one or more processors configured to: receiving a compressed version of a weight prior and a compressed version of a plurality of weight values under the weight prior; decompressing the weight prior and the compressed version of the plurality of weight values under the weight prior; determining a plurality of neural network weights based on the weight prior and the plurality of weights under the weight prior; and processing the plurality of neural network weights using the neural network architecture to produce reconstructed image content.
Aspect 46: the apparatus of aspect 45, wherein the one or more processors are further configured to: receiving a compressed version of the neural network architecture; and decompressing the compressed version of the neural network architecture.
Aspect 47: the apparatus of any of aspects 45 to 46, wherein the plurality of weight values under the weight prior correspond to a plurality of quantized weights under the weight prior.
Aspect 48: the apparatus of aspect 47, wherein the one or more processors are further configured to: a plurality of encoded quantization parameters are received for quantizing the plurality of quantized weights under the weight prior.
Aspect 49: the apparatus of any of aspects 45 to 48, wherein the compressed version of the plurality of weights under the weight priors is received in an entropy encoded bitstream.
Aspect 50: the apparatus of any one of aspects 45 to 49, wherein the one or more processors are further configured to: the plurality of weights under the weight priors are redistributed based on a binary mask.
Aspect 51: a method of performing any of the operations of aspects 45 to 50.
Aspect 52: a computer-readable storage medium storing instructions that, when executed, cause one or more processors to perform any of the operations of aspects 45 to 50.
Aspect 53: an apparatus comprising means for performing any of the operations of aspects 45 through 50.

Claims (44)

1. A method of processing media data, comprising:
receiving a plurality of images for compression by a neural network compression system;
determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images;
generating a first bit stream comprising a compressed version of the first plurality of weight values; and
outputting the first bit stream for transmission to a recipient.
2. The method of claim 1, wherein at least one layer of the first model comprises a positional encoding of a plurality of coordinates associated with the first image.
3. The method of claim 2, wherein the first model is configured to determine one or more pixel values corresponding to the plurality of coordinates associated with the first image.
4. The method of claim 1, further comprising:
determining a second plurality of weight values for use by a second model associated with the neural network compression system based on a second image from the plurality of images;
generating a second bitstream comprising a compressed version of the second plurality of weight values; and
outputting the second bit stream for transmission to the recipient.
5. The method of claim 4, wherein the second model is configured to determine optical flow between the first image and the second image.
6. The method of claim 5, further comprising:
determining at least one updated weight value from the first plurality of weight values based on the optical flow.
7. The method of claim 1, further comprising:
quantizing the first plurality of weight values under a weight prior to generate a plurality of quantized weight values, wherein the first bitstream comprises compressed versions of the plurality of quantized weight values.
8. The method of claim 7, wherein the weight prior is selected to minimize a rate loss associated with transmitting the first bit stream to the recipient.
9. The method of claim 7, wherein generating the first bit stream comprises:
entropy encoding the first plurality of weight values using the weight prior.
10. The method of claim 7, wherein the first plurality of weight values are quantized using fixed-point quantization.
11. The method of claim 10, wherein the fixed point quantization is implemented using a machine learning algorithm.
12. The method of claim 1, further comprising:
selecting a model architecture corresponding to the first model based on the first image.
13. The method of claim 12, further comprising:
generating a second bitstream comprising a compressed version of the model architecture; and
outputting the second bit stream for transmission to the recipient.
14. The method of claim 12, wherein selecting the model architecture comprises:
tuning a plurality of weight values associated with one or more model architectures based on the first image, wherein each of the one or more model architectures is associated with one or more model characteristics;
determining at least one distortion between the first image and a reconstructed data output corresponding to each of the one or more model architectures; and
selecting the model architecture from the one or more model architectures based on the at least one distortion.
15. The method of claim 14, wherein the one or more model characteristics comprise at least one of: width, depth, resolution, size of convolution kernel, and input dimension.
16. An apparatus, comprising:
at least one memory; and
at least one processor coupled to the at least one memory and configured to:
receiving a plurality of images for compression by a neural network compression system;
determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images;
generating a first bit stream comprising a compressed version of the first plurality of weight values; and
outputting the first bit stream for transmission to a recipient.
17. The apparatus of claim 16, wherein at least one layer of the first model comprises a positional encoding of a plurality of coordinates associated with the first image.
18. The apparatus of claim 17, wherein the first model is configured to determine one or more pixel values corresponding to the plurality of coordinates associated with the first image.
19. The apparatus of claim 16, wherein the at least one processor is further configured to:
determining a second plurality of weight values for use by a second model associated with the neural network compression system based on a second image from the plurality of images;
generating a second bitstream comprising a compressed version of the second plurality of weight values; and
outputting the second bit stream for transmission to the recipient.
20. The apparatus of claim 19, wherein the second model is configured to determine optical flow between the first image and the second image.
21. The apparatus of claim 20, wherein the at least one processor is further configured to:
determining at least one updated weight value from the first plurality of weight values based on the optical flow.
22. The apparatus of claim 16, wherein the at least one processor is further configured to:
quantizing the first plurality of weight values under a weight prior to generate a plurality of quantized weight values, wherein the first bitstream comprises compressed versions of the plurality of quantized weight values.
23. The apparatus of claim 22, wherein the weight prior is selected to minimize a rate loss associated with transmitting the first bit stream to the recipient.
24. The apparatus of claim 22, wherein to generate the first bitstream, the at least one processor is further configured to:
entropy encoding the first plurality of weight values using the weight prior.
25. The apparatus of claim 22, wherein the first plurality of weight values are quantized using fixed-point quantization.
26. The apparatus of claim 25, wherein the fixed point quantization is implemented using a machine learning algorithm.
27. The apparatus of claim 16, wherein the at least one processor is further configured to:
selecting a model architecture corresponding to the first model based on the first image.
28. The apparatus of claim 27, wherein the at least one processor is further configured to:
generating a second bitstream comprising a compressed version of the model architecture; and
outputting the second bit stream for transmission to the recipient.
29. The apparatus of claim 27, wherein to select the model architecture, the at least one processor is further configured to:
tuning a plurality of weight values associated with one or more model architectures based on the first image, wherein each of the one or more model architectures is associated with one or more model characteristics;
determining at least one distortion between the first image and a reconstructed data output corresponding to each of the one or more model architectures; and
selecting the model architecture from the one or more model architectures based on the at least one distortion.
30. The apparatus of claim 29, wherein the one or more model characteristics comprise at least one of: width, depth, resolution, size of convolution kernel, and input dimension.
31. A method for processing media data, comprising:
receiving a compressed version of a first plurality of neural network weight values associated with a first image from the plurality of images;
decompressing the first plurality of neural network weight values; and
processing the first plurality of neural network weight values using a first neural network model to produce the first image.
32. The method of claim 31, further comprising:
receiving a compressed version of a second plurality of neural network weight values associated with a second image from the plurality of images;
decompressing the second plurality of neural network weight values; and
processing the second plurality of neural network weight values using a second neural network model to determine optical flow between the first image and the second image.
33. The method of claim 32, further comprising:
determining at least one updated weight value from the first plurality of neural network weight values associated with the first neural network model based on the optical flow.
34. The method of claim 33, further comprising:
processing the at least one updated weight value using the first neural network model to generate a reconstructed version of the second image.
35. The method of claim 31, wherein the first plurality of neural network weight values are quantized under a weight prior.
36. The method of claim 31, wherein the compressed version of the first plurality of neural network weight values is received in an entropy encoded bitstream.
37. The method of claim 31, further comprising:
receiving a compressed version of a neural network architecture corresponding to the first neural network model.
38. An apparatus, comprising:
at least one memory; and
at least one processor coupled to the at least one memory and configured to:
receiving a compressed version of a first plurality of neural network weight values associated with a first image from the plurality of images;
decompressing the first plurality of neural network weight values; and
processing the first plurality of neural network weight values using a first neural network model to produce the first image.
39. The apparatus of claim 38, wherein the at least one processor is further configured to:
receiving a compressed version of a second plurality of neural network weight values associated with a second image from the plurality of images;
decompressing the second plurality of neural network weight values; and
processing the second plurality of neural network weight values using a second neural network model to determine optical flow between the first image and the second image.
40. The apparatus of claim 39, wherein the at least one processor is further configured to:
determining at least one updated weight value from the first plurality of neural network weight values associated with the first neural network model based on the optical flow.
41. The apparatus of claim 40, wherein the at least one processor is further configured to:
processing the at least one updated weight value using the first neural network model to generate a reconstructed version of the second image.
42. The apparatus of claim 38, wherein the first plurality of neural network weight values are quantized under a weight prior.
43. The apparatus of claim 38, wherein the compressed version of the first plurality of neural network weight values is received in an entropy encoded bitstream.
44. The apparatus of claim 38, wherein the at least one processor is further configured to:
receiving a compressed version of a neural network architecture corresponding to the first neural network model.
CN202280035149.6A 2021-05-21 2022-03-31 Implicit image and video compression using machine learning system Pending CN117716687A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/191,606 2021-05-21
US17/645,018 US20220385907A1 (en) 2021-05-21 2021-12-17 Implicit image and video compression using machine learning systems
US17/645,018 2021-12-17
PCT/US2022/022881 WO2022245434A1 (en) 2021-05-21 2022-03-31 Implicit image and video compression using machine learning systems

Publications (1)

Publication Number Publication Date
CN117716687A true CN117716687A (en) 2024-03-15

Family

ID=90162956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280035149.6A Pending CN117716687A (en) 2021-05-21 2022-03-31 Implicit image and video compression using machine learning system

Country Status (1)

Country Link
CN (1) CN117716687A (en)

Similar Documents

Publication Publication Date Title
US11924445B2 (en) Instance-adaptive image and video compression using machine learning systems
US11405626B2 (en) Video compression using recurrent-based machine learning systems
US20220385907A1 (en) Implicit image and video compression using machine learning systems
US11991368B2 (en) Video compression using deep generative models
US11729406B2 (en) Video compression using deep generative models
US20230074979A1 (en) Instance-adaptive image and video compression in a network parameter subspace using machine learning systems
US12003734B2 (en) Machine learning based flow determination for video coding
TW202318878A (en) Transformer-based architecture for transform coding of media
CN117716687A (en) Implicit image and video compression using machine learning system
US20240121398A1 (en) Diffusion-based data compression
US20240013441A1 (en) Video coding using camera motion compensation and object motion compensation
US20240015318A1 (en) Video coding using optical flow and residual predictors
KR20230150274A (en) Machine learning-based flow determination for video coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination