CN117750293A - Object audio coding - Google Patents

Object audio coding

Info

Publication number: CN117750293A
Application number: CN202311198575.4A
Authority: CN (China)
Legal status: Pending
Prior art keywords: audio, metadata, ambisonics, time, object audio
Inventors: S·扎马尼, M·Y·金, D·森, S·U·吕, J·O·梅里马, S·D·马尼亚斯
Current assignee: Apple Inc
Original assignee: Apple Inc
Application filed by Apple Inc

Landscapes

  • Stereophonic System (AREA)
Abstract

The present disclosure relates to object audio encoding. In one aspect, a computer-implemented method includes: obtaining object audio and metadata spatially describing the object audio; converting the object audio to time-frequency domain ambisonics audio based on the metadata; and encoding the time-frequency domain ambisonics audio and the subset of metadata into one or more bitstreams to be stored in a computer readable memory or transmitted to a remote device.

Description

Object audio coding
This patent application claims the benefit of the earlier filing dates of U.S. Provisional Application No. 63/376,520, filed on September 21, 2022, and U.S. Provisional Application No. 63/376,523, filed on September 21, 2022.
Technical Field
The present disclosure relates to techniques in digital audio signal processing, and in particular to encoding or decoding of object audio in the ambisonics domain.
Background
A processing device, such as a computer, smart phone, tablet, or wearable device, may output audio to a user. For example, a computer may launch an audio application, such as a movie player, a music player, a conference application, a telephone call, an alarm clock, a game, a user interface, a web browser, or other application that includes audio content for playback to a user through a speaker. Some audio content may include audio scenes having a spatial quality.
The audio signal may comprise an analog or digital signal that varies with time and frequency to represent sound or a sound field. The audio signal may be used to drive a sound transducer (e.g., a speaker) that replicates a sound or sound field. The audio signal may have a variety of formats. Traditional channel-based audio is recorded for a listening setup such as a 5.1 home theater with five speakers and a subwoofer arranged at specified locations. Object audio encodes an audio source as an "object". Each object may have associated metadata describing spatial information about the object. Ambisonics is a full-sphere surround sound format that covers sound in the horizontal plane as well as sound sources above and below the listener. With ambisonics, the sound field is decomposed into spherical harmonic components.
Disclosure of Invention
In some aspects, a computer-implemented method includes: obtaining object audio and metadata spatially describing the object audio; converting the object audio to time-frequency domain ambisonics audio based on the metadata; and encoding the time-frequency domain ambisonics audio and the subset of metadata into one or more bitstreams to be stored in a computer readable memory or transmitted to a remote device.
In some examples, the time-frequency domain ambisonics audio includes a plurality of time-frequency tiles (tiles), each tile of the plurality of time-frequency tiles representing audio in a subband of an ambisonics component. Each tile of the plurality of time-frequency tiles may include a portion of the metadata that spatially describes a corresponding portion of the object audio in the tile. The time-frequency domain ambisonics audio may include a set of the plurality of time-frequency tiles corresponding to an audio frame of the object audio.
In some aspects, a computer-implemented method includes: decoding one or more bitstreams to obtain time-frequency domain ambisonics audio and metadata; extracting object audio from the time-frequency domain ambisonics audio using the metadata spatially describing the object audio; and rendering the object audio with the metadata based on a desired output layout. In some examples, the metadata is used to extract the object audio directly from the time-frequency domain ambisonics audio. In other examples, extracting the object audio includes converting the time-frequency domain ambisonics audio to time-domain ambisonics audio, and extracting the object audio from the time-domain ambisonics audio using the metadata.
In some aspects, a computer-implemented method includes: obtaining object audio and metadata spatially describing the object audio; converting the object audio to ambisonics audio based on the metadata; encoding the ambisonics audio in a first bitstream (e.g., as time-frequency domain ambisonics audio); and encoding a subset of the metadata in a second bitstream. The subset of the metadata may be used by a decoder to convert the ambisonics audio back to the object audio.
In some aspects, a computer-implemented method includes: decoding the first bitstream to obtain ambisonics audio (e.g., as time-frequency domain ambisonics audio); decoding the second bitstream to obtain metadata; extracting object audio from the ambisonics audio using the metadata spatially describing the object audio; and rendering the object audio with the metadata based on a desired output layout.
In some aspects, a computer-implemented method includes: converting the object audio to time-frequency domain ambisonics audio based on metadata spatially describing the object audio, wherein the object audio is associated with a first priority; converting second object audio to time-domain ambisonics audio, wherein the second object audio is associated with a second priority different from the first priority; encoding the time-frequency domain ambisonics audio into a first bitstream; encoding the metadata into a second bitstream; and encoding the time-domain ambisonics audio into a third bitstream. The first priority may be a higher priority than the second priority. The time-domain ambisonics audio may be encoded at a lower resolution than the time-frequency domain ambisonics audio.
Aspects of the disclosure may be performed by a processing device or processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system-on-a-chip (SoC), a machine-readable memory, etc.), software (e.g., machine-readable instructions stored or executed by processing logic), or a combination thereof.
The above summary does not include an exhaustive list of all aspects of the disclosure. It is contemplated that the present disclosure includes all systems and methods that can be practiced by all suitable combinations of the various aspects summarized above, as well as those disclosed in the detailed description below and particularly pointed out in the claims section. Such combinations may have particular advantages not specifically set forth in the foregoing summary.
Drawings
Aspects of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements. It should be noted that references to "a" or "an" aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. In addition, for the sake of brevity and reducing the total number of drawings, features of more than one aspect of the disclosure may be illustrated using a given drawing, and not all elements in the drawing may be required for a given aspect.
Fig. 1 illustrates an exemplary system for encoding object audio in a time-frequency domain ambisonics audio format, according to some aspects.
Fig. 2 illustrates an example system for encoding object audio in a time-frequency domain ambisonics audio format and a time-domain ambisonics audio format, according to some aspects.
Fig. 3 illustrates an exemplary system for encoding object audio in the ambisonics domain with metadata, in accordance with some aspects.
Fig. 4 illustrates an example system for encoding object audio in the ambisonics domain based on priority, in accordance with some aspects.
Fig. 5 illustrates an example of time-frequency domain ambisonics audio in accordance with some aspects.
Fig. 6 illustrates an example of an audio processing system in accordance with some aspects.
Detailed Description
Humans can estimate the location of a sound by analyzing the sound at their two ears. This is known as binaural hearing, and the human auditory system can estimate the direction of sound using the way it diffracts around the body and reflects off of and interacts with the pinnae. These spatial cues may be generated artificially by applying a spatial filter, such as a Head Related Transfer Function (HRTF) or a Head Related Impulse Response (HRIR), to the audio signal. An HRTF is applied in the frequency domain, and an HRIR is applied in the time domain.
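As a rough, non-patent illustration of the preceding paragraph, the sketch below imparts spatial cues by convolving a mono signal with a left/right HRIR pair. The HRIR arrays here are placeholder noise bursts; an actual system would look up measured responses for the desired source direction.

```python
# Minimal sketch (placeholder data): binaural filtering with an HRIR pair.
import numpy as np

def binauralize(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray):
    """Return (left, right) channels carrying the HRIRs' spatial cues."""
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

# Dummy example: a short noise burst and toy 32-tap HRIRs.
rng = np.random.default_rng(0)
mono = rng.standard_normal(1024)
hrir_l = rng.standard_normal(32) * np.exp(-np.arange(32) / 8.0)
hrir_r = np.roll(hrir_l, 3)  # crude interaural delay, for illustration only
left, right = binauralize(mono, hrir_l, hrir_r)
```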
The spatial filter may artificially impart spatial cues into the audio that are similar to the diffraction, delay, and reflection naturally caused by our anatomy and pinnae. Spatially filtered audio may be produced by a spatial audio reproduction system (renderer) and output through headphones. Spatial audio may be rendered for playback such that the audio is perceived as having spatial quality, e.g., originating from a location above, below, or to one side of the listener.
Spatial audio may correspond to visual components that together form an audiovisual work. The audiovisual work may be associated with an application, a user interface, a movie, a live show, a sporting event, a game, a conference call, or other audiovisual experience. In some examples, the audiovisual work may be integrated into an extended reality (XR) environment, and a sound source of the audiovisual work may correspond to one or more virtual objects in the XR environment. The XR environment may include mixed reality (MR) content, augmented reality (AR) content, virtual reality (VR) content, and so forth. With an XR system, some of a person's physical movements, or representations thereof, may be tracked and, in response, characteristics of virtual objects simulated in the XR environment may be adjusted in a manner consistent with at least one law of physics. For example, the XR system may detect movements of the user's head and adjust the graphical content and auditory content presented to the user (similar to how such views and sounds change in a physical environment). As another example, the XR system may detect movement of an electronic device (e.g., mobile phone, tablet, laptop, etc.) presenting the XR environment, and adjust the graphical content and auditory content presented to the user (similar to how such views and sounds change in a physical environment). In some cases, the XR system may adjust features of the graphical content in response to other inputs, such as representations of physical movements (e.g., voice commands).
Many different types of electronic systems may enable a user to interact with and/or perceive an XR environment. Exemplary, non-exclusive examples include heads-up displays (HUDs), head-mounted systems, projection-based systems, windows or vehicle windshields with integrated display capabilities, displays formed as lenses placed on the eyes of a user (e.g., contact lenses), headphones/earphones, input systems with or without haptic feedback (e.g., wearable or handheld controllers), speaker arrays, smartphones, tablets, and desktop/laptop computers. A head-mounted system may have an opaque display and one or more speakers. Other head-mounted systems may be configured to accept an external opaque display (e.g., a smartphone). The head-mounted system may include one or more image sensors for capturing images or video of the physical environment, and/or one or more microphones for capturing audio of the physical environment. A head-mounted system may have a transparent or translucent display instead of an opaque display. The transparent or translucent display may have a medium through which light is directed to the eyes of the user. The display may utilize various display technologies, such as uLED, OLED, LED, liquid crystal on silicon, laser scanning light sources, digital light projection, or combinations thereof. The medium may use optical waveguides, optical reflectors, holographic media, optical combiners, combinations thereof, or other similar technologies. In some implementations, the transparent or translucent display may be selectively controlled to become opaque. Projection-based systems may utilize retinal projection techniques that project a graphical image onto a user's retina. Projection systems may also project virtual objects into a physical environment (e.g., as holograms or onto a physical surface). An immersive experience such as an XR environment or other audio work may include spatial audio.
Spatial audio reproduction may include spatialization of sound sources in a scene. The scene may be a three-dimensional representation, which may include the location of each sound source. In an immersive environment, in some cases, a user may be able to move around in and interact with the scene. Each sound source in a scene may be characterized by an object in object audio.
The object audio or object-based audio may include one or more audio signals and metadata associated with each object. The metadata may define whether the audio signal is an object (e.g., a sound source) and include spatial information such as an absolute position of the object, a relative direction from the listener to the object, a distance from the object to the listener, or other spatial information or combination thereof. The metadata may also include other audio information. Each audio signal with spatial information may be considered an "object" or sound source in the audio scene and rendered according to a desired output layout.
The renderer may render an object using the object's spatial information to impart spatial cues into the resulting spatial audio, giving the impression that the object has a position corresponding to the spatial information. For example, an object representing a bird may have spatial information indicating that the bird is high above the right side of the user. The object may be rendered with spatial cues such that the resulting spatial audio signal gives this impression when output by speakers (e.g., left and right speakers through headphones). Furthermore, by changing the spatial information of the metadata over time, objects in the audio scene may be moved.
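One possible in-memory representation of such object audio is sketched below. The field names (azimuth_deg, elevation_deg, distance_m) are assumptions for illustration and do not reflect any format defined in this disclosure; moving an object over time amounts to updating its metadata from frame to frame.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectMetadata:
    azimuth_deg: float      # direction relative to the listener (assumed fields)
    elevation_deg: float
    distance_m: float

@dataclass
class AudioObject:
    signal: np.ndarray      # time-domain samples for this sound source
    metadata: ObjectMetadata

# A bird high above and to the right of the listener.
bird = AudioObject(signal=np.zeros(48000),
                   metadata=ObjectMetadata(azimuth_deg=-60.0,
                                           elevation_deg=45.0,
                                           distance_m=10.0))
bird.metadata.azimuth_deg = -50.0  # the object moves between frames
```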
Ambisonics involves techniques for recording, mixing, and playing back three-dimensional, 360-degree audio in the horizontal and/or vertical planes. Ambisonics treats an audio scene as a 360-degree sphere of sound arriving from different directions around a center point. One example of an ambisonics format is the B-format, which may include first-order ambisonics consisting of four audio components W, X, Y, and Z. Each component may represent a different spherical harmonic component pointing in a particular direction, or a different microphone directivity, each directivity coincident at the center point of the sphere.
Ambisonics has an inherent hierarchical format. Each incremental order (e.g., first, second, third, etc.) increases the spatial resolution when played back to a listener. Ambisonics may be formatted with only low-order components, such as the first-order W, X, Y, and Z. This format has low bandwidth occupancy but provides low spatial resolution. Higher-order ambisonics components are typically applied for high-resolution, immersive spatial audio experiences.
Ambisonics audio can be extended to higher orders, thereby improving the quality or resolution of localization. As the order increases, additional ambisonics components are introduced. For example, for second-order ambisonics audio, 5 new components are introduced in the ambisonics audio. For third-order ambisonics audio, 7 additional components are introduced, and so on. For traditional ambisonics audio (which may be referred to herein as time-domain ambisonics), this results in an increase in the size or bandwidth occupancy of the audio information, which can quickly run into bandwidth limitations. Thus, if a high ambisonics order is needed to meet the desired spatial resolution, simply converting the object audio to ambisonics audio may run into bandwidth limitations.
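The order/component relationship above can be sketched as follows. The encoding gains assume an ACN-style first-order convention for a plane-wave source; normalization and ordering conventions vary, and this is not the encoder described in this disclosure.

```python
import numpy as np

def num_components(order: int) -> int:
    """Total ambisonics components for a given order: (order + 1)^2."""
    return (order + 1) ** 2     # 1st -> 4, 2nd -> 9 (5 new), 3rd -> 16 (7 new)

def encode_first_order(mono: np.ndarray, azimuth: float, elevation: float):
    """Mix a mono source into first-order components (ACN order: W, Y, Z, X)."""
    w = mono
    y = mono * np.cos(elevation) * np.sin(azimuth)
    z = mono * np.sin(elevation)
    x = mono * np.cos(elevation) * np.cos(azimuth)
    return np.stack([w, y, z, x])

assert [num_components(n) for n in (1, 2, 3)] == [4, 9, 16]
```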
Aspects of the present disclosure describe methods or apparatus (e.g., an encoder or decoder) that can encode and decode object audio in the ambisonics audio domain. Metadata may be used to map between object audio and a ambisonics audio representation of the object audio to reduce encoding overhead of the object audio.
In some aspects, the object audio is encoded as time-frequency domain (TF) ambisonics audio. In some aspects, at the decoding stage, the TF ambisonics audio is decoded and converted back to object audio. In some examples, the time-frequency domain ambisonics audio is decoded directly into object audio. In other examples, the time-frequency domain ambisonics audio is converted to time-domain (TD) ambisonics audio and then converted to object audio.
In some aspects, the object audio is encoded as TD ambisonics audio and the metadata is encoded in a separate bitstream. The decoder may use the object metadata to convert the TD ambisonics audio back to object audio.
In some aspects, the object audio is encoded as TF ambisonics audio or TD ambisonics audio based on a priority of the object audio. Objects associated with high priority may be encoded as TF ambisonics audio and objects not associated with high priority may be encoded as TD ambisonics audio.
At the decoder, once the object audio is extracted from the received ambisonics audio, the object audio may be rendered according to the desired output layout. In some examples, object audio may be spatialized and combined to form binaural audio, which may include left and right audio channels. The left and right audio channels may be used to drive left and right ear-worn speakers. In other examples, the object audio may be rendered according to a speaker layout format (e.g., 5.1, 6.1, 7.1, etc.).
Fig. 1 illustrates an exemplary system 100 for encoding object audio in a time-frequency domain ambisonics audio format, according to some aspects. Some aspects of the system may be implemented as an encoder 138 and other aspects of the system may be implemented as a decoder 140. Encoder 138 may include one or more processing devices that perform the described operations. Similarly, decoder 140 may include one or more processing devices that perform the described operations. The encoder 138 and decoder 140 may be communicatively coupled via a computer network, which may include wired or wireless communication hardware (e.g., transmitters and receivers). The encoder 138 and decoder 140 may communicate via one or more network communication protocols (such as an IEEE 802-based protocol) and/or other network communication protocols.
At the encoder 138, the object audio 102 and the metadata 104 spatially describing the object audio 102 are obtained by the encoder 138. The object audio 102 may include one or more objects, such as object 1, object 2, and the like. Each object may represent a sound source in a sound scene. The object metadata 104 may include information describing each object specifically and individually.
The encoder 138 may obtain the object audio 102 and the object metadata 104 as digital data. In some examples, the encoder 138 may generate the object audio 102 and the metadata 104 based on sensing sounds with microphones in a physical environment. In other examples, the encoder 138 may obtain the object audio 102 and the metadata 104 from another device (e.g., an encoding device, a capturing device, or an intermediary device).
The object audio 102 may be converted to time-frequency domain (TF) ambisonics audio 142. For example, at the ambisonics converter block 106, the object audio 102 may be converted to time-domain (TD) ambisonics audio 132 based on the object metadata 104. The TD ambisonics audio may include a time-varying audio signal for each of its ambisonics components. TD ambisonics audio may be understood as traditional ambisonics audio or Higher Order Ambisonics (HOA). At block 108, the TD ambisonics audio 132 may be converted to TF ambisonics audio 142. The TF ambisonics audio 142 may represent the TD ambisonics audio 132 and the object audio 102 with multiple time-frequency tiles. As further described in other sections, each tile may uniquely characterize an ambisonics component, subband, and time range of the object audio 102 and the TD ambisonics audio 132.
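A simplified, first-order sketch of the two conversion steps (blocks 106 and 108) is shown below, with an STFT standing in for the time-frequency tiling. The function names and parameters are invented for illustration; this is not the algorithm of the disclosure.

```python
import numpy as np
from scipy.signal import stft

def encoding_gains(azimuth: float, elevation: float) -> np.ndarray:
    """First-order (ACN: W, Y, Z, X) gains for a direction in radians.
    A real implementation would extend this to higher orders."""
    return np.array([1.0,
                     np.cos(elevation) * np.sin(azimuth),
                     np.sin(elevation),
                     np.cos(elevation) * np.cos(azimuth)])

def objects_to_td_ambisonics(signals, directions):
    """Block 106 analogue: mix each object into ambisonics components using its metadata."""
    hoa = np.zeros((4, len(signals[0])))
    for sig, (az, el) in zip(signals, directions):
        hoa += encoding_gains(az, el)[:, None] * np.asarray(sig)[None, :]
    return hoa

def td_to_tf_ambisonics(hoa, fs=48000, nperseg=1024):
    """Block 108 analogue: STFT each component into (components, subbands, frames) tiles."""
    _, _, tiles = stft(hoa, fs=fs, nperseg=nperseg)
    return tiles
```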
At blocks 108 and 110, the TF ambisonics audio 142 and the subset 134 of metadata 104 may be encoded into one or more bitstreams (e.g., bitstream 128 and bitstream 130), respectively. The bitstreams 128 and 130 may be stored in a computer readable memory and/or transmitted to a remote device, for example, the decoder 140 or an intermediary device that may pass data to the decoder 140.
TF ambisonics audio 142 may include a plurality of time-frequency tiles, each tile of the plurality of time-frequency tiles representing audio in a subband of an ambisonics component. Each of the plurality of time-frequency tiles may include a portion of metadata 104 that spatially describes a corresponding portion of the object audio in the tile. Further, the TF ambisonics audio 142 may include a set of the plurality of time-frequency tiles corresponding to an audio frame of the object audio. An example of TF ambisonics audio is shown in fig. 5.
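A possible in-memory layout for a tile and for a per-frame set of tiles is sketched below; the field names are assumptions chosen to mirror the description, not a format defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class TileMetadata:
    object_ids: List[int]                    # objects contributing audio to this tile
    directions: List[Tuple[float, float]]    # (azimuth, elevation) per contributing object

@dataclass
class TFTile:
    component: int        # ambisonics component index
    subband: int          # frequency band index
    audio: np.ndarray     # complex time-frequency bins for this tile
    metadata: TileMetadata

@dataclass
class TFFrame:            # one set of tiles corresponds to one audio frame
    tiles: List[TFTile] = field(default_factory=list)
```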
At block 106 of fig. 1, converting object audio 102 to TF ambisonics audio may include converting object audio 102 to TD ambisonics audio 132 and encoding time domain ambisonics audio 132 into TF ambisonics audio 142 using object metadata 104 or a subset 134 of object metadata.
The TF ambisonics audio 142 may be a compressed (bitrate reduced) version of the TD ambisonics audio 132. The TD ambisonics audio 132 and the TF ambisonics audio 142 may include Higher Order Ambisonics (HOA) components. For example, at block 106, the object audio 102 may be converted to TD ambisonics, which may include a first order ambisonics component, a second order ambisonics component, and a third order ambisonics component. Each component other than first order may be understood as an HOA component, and ambisonics audio having more than one order may be referred to as Higher Order Ambisonics (HOA) audio.
Metadata 104 and its subset 134 may include spatial information of the object, such as direction, distance, and/or location. In some examples, direction, distance, location, or other spatial information may be defined relative to listener position. The metadata may include other information about the object, such as loudness, object type, or other information that may be specific to the object.
At the ambisonics decoder block 112 of the decoder 140, one or more bitstreams, such as bitstreams 128 and 130, are decoded to obtain TF ambisonics audio 124 and metadata 136. The TF ambisonics audio 124 may be the same as the TF ambisonics audio 142 encoded at the encoder 138. Similarly, metadata 136 may be the same as subset 134 encoded at encoder 138.
At block 114, the bitstream 130 may be decoded to obtain metadata 136. Metadata 136 may be the same as subset 134 encoded into bitstream 130 by encoder 138. Metadata 136 may be a quantized version of object metadata 104. Metadata 136 may include at least one of a distance or a direction associated with an object of the object audio. In some examples, the metadata 136 spatially describes each object in the object audio 126.
At block 116, the object audio 126 may be extracted from the TF ambisonics audio 124 using metadata 136 that spatially describes the object audio. The object audio 126 may be a quantized version of the object audio 102.
Quantization may be referred to as a process of constraining an input from a continuous or otherwise large set of values (such as real numbers) to a discrete set (such as integers). The quantized object audio 126 may include a more abbreviated representation (e.g., lower audio resolution) than the original object audio 102. This may include downsampled versions of the audio signal of the object, or versions with a smaller granularity in terms of amplitude or phase of the audio signal. Similarly, the quantized version of the metadata may be a reduced version having less or more abbreviated information (e.g., lower spatial resolution) than the original object metadata 104.
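A minimal illustration of such quantization is given below: directions snapped to a coarse angular grid and samples reduced to 16-bit integers. The step sizes are arbitrary examples, not values from this disclosure.

```python
import numpy as np

def quantize_direction(azimuth_deg: float, step_deg: float = 5.0) -> float:
    """Snap a direction to a coarse grid (lower spatial resolution)."""
    return round(azimuth_deg / step_deg) * step_deg

def quantize_samples(x: np.ndarray) -> np.ndarray:
    """Reduce amplitude granularity by converting to 16-bit integers."""
    x = np.clip(x, -1.0, 1.0)
    return np.round(x * 32767).astype(np.int16)

assert quantize_direction(47.3) == 45.0
```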
In some aspects, as shown in fig. 1, metadata is used to extract object audio 126 directly from TF ambisonics audio 124. For example, TF ambisonics audio 124 is not first converted to TD ambisonics audio (unlike the example in fig. 2). Extracting the object audio at block 116 may include referencing metadata information contained in each tile of the TF ambisonics audio 124 to extract a relevant audio signal for each object and re-associating a direction from the metadata 136 with each object to reconstruct the object audio 126. Thus, the resulting object audio 126 may include each object from the object audio 102 and a direction and/or distance for each object.
At the box labeled object renderer 118, the metadata 136 may be utilized to render the object audio 126 based on the desired output layout 120. The desired output layout 120 may vary depending on the configuration of the playback device and speakers 122, which may include multi-speaker layouts (such as 5.1, 6.1, 7.1, etc.), earbuds, headsets, or other audio playback output formats. The resulting audio channels 144 generated by the object renderer 118 may be used to drive the speakers 122 to output a sound scene that replicates the sound scene of the original object audio 102.
For example, the desired output layout 120 may include a multi-speaker layout having preset locations of speaker channels (e.g., center, front left, front right, or other speaker channels in a surround sound audio format). The object audio signals may be combined or mixed into the audio channels according to a rendering algorithm that distributes each object audio signal across those preset positions according to the spatial information contained in the object metadata.
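A toy sketch of this mixing step is shown below, assuming a five-speaker azimuth table and a simple angular-distance weighting. Real renderers typically use panning schemes such as VBAP; this is illustration only.

```python
import numpy as np

# Assumed speaker azimuths for a 5-channel layout (degrees, counterclockwise).
SPEAKER_AZIMUTHS_DEG = {"L": 30.0, "R": -30.0, "C": 0.0, "Ls": 110.0, "Rs": -110.0}

def pan_object(signal: np.ndarray, azimuth_deg: float) -> dict:
    """Distribute one object's signal across the layout with normalized gains."""
    weights = {}
    for name, spk_az in SPEAKER_AZIMUTHS_DEG.items():
        diff = abs((azimuth_deg - spk_az + 180.0) % 360.0 - 180.0)  # angular distance
        weights[name] = max(0.0, 1.0 - diff / 90.0)                 # simple triangular weighting
    total = sum(weights.values()) or 1.0
    return {name: signal * (w / total) for name, w in weights.items()}
```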
In other examples, the desired output layout 120 may include a head-mounted speaker layout that outputs binaural audio. In such cases, the object renderer 118 may comprise a binaural renderer that applies HRTFs or HRIRs to the object audio 126 according to spatial information (e.g., direction and distance) contained in the metadata 136 of the object audio 126. The resulting left and right audio channels may include spatial cues imparted by the HRTF or HRIR to spatially output audio to a listener through left and right ear-worn speakers. The ear-worn speakers may be worn on, over, or in the ears of the user.
In this way, by encoding, decoding, and rendering object audio using object metadata, object audio may be converted from and to a ambisonics audio format. At encoder 138, each time-frequency (TF) tile may be represented by a set (or sets) of audio signals and metadata. The metadata may include direction, distance, or other audio or spatial information, or a combination thereof. The audio signal of the object audio 102 and the metadata 104 may be encoded and transmitted as a bitstream 128 (as TF ambisonics audio) along with a subset 134 of the original object metadata 104, which may be encoded and transmitted as a bitstream 130.
At decoder 140, a set (or sets) of object audio and metadata for each TF tile is reconstructed. At block 114, a quantized version of the object metadata may be reconstructed. Similarly, a quantized version of the object audio signal may be extracted at block 116 using the set (or sets) of audio signals and metadata for each TF tile. The object renderer 118 may synthesize speaker or headphone outputs based on the quantized object audio 126, the quantized metadata 136, and the desired output layout 120 or other output channel layout information.
In some aspects, methods may be performed using aspects such as those described with respect to fig. 1. The method may be performed by processing logic of the encoder 138 or decoder 140, other audio processing devices, or a combination thereof. Processing logic may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system-on-a-chip (SoC), etc.), software (e.g., instructions run/executed on a processing device), firmware (e.g., microcode), or a combination thereof.
Although certain functional blocks ("blocks") are described in the methods, such blocks are examples. That is, aspects are well suited to performing various other blocks or variations of the blocks recited in the methodologies. It should be understood that the blocks in the method may be performed in an order different than presented, and that not all of the blocks in the method may be performed.
In one approach, processing logic may obtain object audio 102 and metadata 104 spatially describing the object audio. Processing logic may convert the object audio 102 to time-frequency domain ambisonics audio 142 based on the metadata 104 or its subset 134 (e.g., at block 106 and block 108). Processing logic may encode the time-frequency domain ambisonics audio 142 and the subset 134 of metadata 104 into one or more bitstreams (e.g., 128 and 130) to be stored in a computer readable memory or transmitted to a remote device such as decoder 140 or an intermediary device.
In another approach, processing logic may decode one or more bitstreams (e.g., 128 and 130) to obtain the time-frequency domain ambisonics audio 124 and metadata 136. Processing logic may extract the object audio 126 from the time-frequency domain ambisonics audio 124 using metadata 136 that spatially describes the object audio 126. Processing logic may render object audio 126 with metadata 136 based on desired output layout 120. The metadata 136 may be used to extract the object audio 126 directly from the time-frequency domain ambisonics audio 124 (e.g., at block 116).
Fig. 2 illustrates an example system 200 for encoding object audio in a time-frequency domain ambisonics audio format and a time-domain ambisonics audio format, according to some aspects. Some aspects may be performed as encoder 244 and other aspects may be performed as decoder 242.
Encoder 244 may correspond to other examples of encoders, such as encoder 138 described with respect to fig. 1. For example, the encoder 244 may obtain the object audio 202 and the metadata 204 spatially describing the object audio 202. At blocks 206 and 208, the encoder 244 may convert the object audio 202 into TF ambisonics audio 246 based on the metadata 204 and its subset 234. At blocks 208 and 210, the TF ambisonics audio 246 and the subset 234 of metadata 204 are encoded into one or more bitstreams (e.g., 228 and 230) to be stored in a computer readable memory or transmitted to a remote device.
Decoder 242 may correspond to other examples of decoders, such as decoder 140. In addition to the blocks discussed with respect to decoder 140 and fig. 1, decoder 242 may also include a time domain ambisonics decoder 238. Decoder 242 may decode one or more bitstreams, such as bitstream 228 and bitstream 230, to obtain TF ambisonics audio 224 and metadata 236, respectively. The TF ambisonics audio 224 may correspond to or be the same as TF ambisonics audio 246. The decoder 242 may extract the object audio 226 from the TF ambisonics audio 224 using the metadata 236 spatially describing the object audio 226. The decoder 242 may utilize the metadata 236 to render the object audio 226 based on the desired output layout 220.
As shown in this example, extracting the object audio 226 may include converting the TF ambisonics audio 224 to TD ambisonics audio 240 at the decoder 238. At block 216, object audio 226 is extracted from the TD ambisonics audio 240 using metadata 236. The TD ambisonics audio may include a plurality of components, each component corresponding to a unique polar pattern. The number of components may vary depending on the resolution. These components may each comprise a time-varying audio signal. The TD ambisonics audio 240 may also be referred to as ambisonics audio or conventional ambisonics. Unlike the TF ambisonics audio 246 and 224, the TD ambisonics audio may not include time-frequency tiles.
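Assuming the tiles were produced by an STFT with known parameters, the decoder-side conversion at block 238 could be sketched as an inverse STFT per component, as below; the tiling scheme is an assumption, not the disclosure's method.

```python
import numpy as np
from scipy.signal import istft

def tf_to_td_ambisonics(tiles: np.ndarray, fs: int = 48000, nperseg: int = 1024):
    """tiles: complex array shaped (components, subbands, frames).
    Returns time-domain component signals shaped (components, samples)."""
    _, hoa = istft(tiles, fs=fs, nperseg=nperseg)
    return hoa
```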
A set (or sets) of audio signals and metadata for each object of each TF tile may be reconstructed (e.g., at block 212 and block 214, respectively). These may be used to reconstruct the TD ambisonics audio 240. The TD ambisonics audio 240 may correspond to the TD ambisonics audio 232. At block 214, metadata 236, which may be a quantized version of object metadata 204, may be reconstructed. Similarly, at block 216, a quantized version of the original object audio 202 (labeled object audio 226) may be extracted using the TD ambisonics audio 240 and metadata 236. The object renderer 218 may synthesize speaker or headphone outputs (e.g., output audio channels) based on the object audio 226, the metadata 236, and channel information of the desired output layout 220. The resulting output audio channels may be used to drive speakers 222 to match the output channel layout.
In some aspects, the method may be performed using aspects such as those described with respect to fig. 2. The method may be performed by processing logic of the encoder 244 or decoder 242, other audio processing devices, or a combination thereof. Processing logic may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system-on-a-chip (SoC), etc.), software (e.g., instructions run/executed on a processing device), firmware (e.g., microcode), or a combination thereof.
Although certain functional blocks ("blocks") are described in the methods, such blocks are examples. That is, aspects are well suited to performing various other blocks or variations of the blocks recited in the methodologies. It should be understood that the blocks in the method may be performed in an order different than presented, and that not all of the blocks in the method may be performed.
In one approach, processing logic may decode one or more bitstreams (e.g., 228 and 230) to obtain time-frequency domain ambisonics audio 224 and metadata 236. Processing logic may extract object audio 226 from time-frequency domain ambisonics audio 224 using metadata 236 spatially describing object audio 226. Extracting the object audio 226 may include converting the time-frequency domain ambisonics audio 224 to time-domain ambisonics audio or TD ambisonics audio 240 (e.g., at decoder 238), and extracting the object audio 226 from the TD ambisonics audio 240 using metadata 236. Processing logic may utilize metadata 236 to render object audio 226 based on desired output layout 220.
Fig. 3 illustrates an exemplary system for encoding object audio in the ambisonics domain with metadata, in accordance with some aspects. Some aspects may be performed as an encoder 340 and other aspects may be performed as a decoder 342. Encoder 340 may share common features with other encoders described herein. Similarly, decoder 342 may share common features with other decoders described herein.
In system 300, object audio 302 is converted to ambisonics (e.g., HOA). The system 300 uses the object metadata 304 to encode, decode, and render object audio. The HOA converted from object audio is encoded/decoded/rendered by using the object metadata 304.
At encoder 340, one or more bitstreams (e.g., 332 and 334) for both the HOA and the subset of original object metadata are generated and transmitted to decoder 342. At decoder 342, a quantized version of HOA may be reconstructed and a quantized version of object metadata may be reconstructed. The reconstructed HOA and the reconstructed metadata may be used to extract a quantized version of the object audio signal. The object renderer 318 can synthesize an audio channel 330 (headphone output or speaker output) based on the extracted object audio signal, the reconstructed metadata, and channel layout information of the desired output layout 320.
In particular, the encoder 340 may obtain the object audio 302 and the object metadata 304 spatially describing the object audio 302. The object audio 302 may be referred to as raw object audio and the object metadata 304 may be referred to as raw object metadata.
At block 306, the encoder 340 may convert the object audio 302 to ambisonics audio (e.g., HOA) based on the object metadata 304. The object metadata 304 may describe spatial information such as the relative direction and distance between the object and the listener. At the ambisonics converter block 306, the audio signal of each object in the object audio 302 may be converted into each ambisonics component by spatially mapping the acoustic energy of the object's audio signal, as described by its metadata, onto the unique pattern of each component. This may be performed for each object of object audio 302, resulting in ambisonics audio 338. Ambisonics audio 338 may be referred to as time-domain ambisonics audio. Depending on the distribution of audio objects in the audio scene, one or more components of the TD ambisonics audio 338 may have audio contributions from multiple objects in the object audio 302. Accordingly, encoder 340 may apply metadata 304 to map each object of object audio 302 to each component of the resulting ambisonics audio 338. This process may also be performed in other examples to convert object audio to TD ambisonics audio.
At block 308, ambisonics audio 338 is encoded in first bitstream 332 as ambisonics audio (e.g., TD ambisonics audio). At block 310, a subset 336 of the metadata 304 is encoded in the second bitstream 334. The metadata 304, or a subset 336 thereof, or both, may include at least one of a distance or direction specifically associated with the object of the object audio. Other spatial information may also be included.
The subset of metadata may be used by downstream devices (e.g., decoder 342) to convert the ambisonics audio in 332 back to object audio 302 (or a quantized version of the object audio). In some examples, bitstreams 332 and 334 are separate bitstreams. In other examples, the bitstreams may be combined (e.g., by multiplexing or other operations).
The decoder 342 may obtain one or more bitstreams, such as bitstream 332 and bitstream 334. At block 312, the first bitstream 332 may be decoded to obtain ambisonics audio 324. Ambisonics audio 324 may correspond to or be identical to ambisonics audio 338. In some examples, decoder 342 may decode bitstream 332 to reconstruct a quantized version of ambisonics audio 338.
At block 314, the decoder 342 may decode the second bitstream 334 to obtain the metadata 326. The metadata may correspond to or be the same as metadata subset 336. In some aspects, a quantized version of metadata subset 336 is reconstructed.
At block 316, object audio 328 is extracted from ambisonics audio 324 using metadata 326 spatially describing object audio 328. Extracting object audio 328 may include extracting acoustic energy from each component of ambisonics audio 324 according to the spatial location indicated in metadata 326 to reconstruct each object indicated in metadata 326. The metadata 326 may be used to extract object audio 328 directly from the ambisonics audio 324 (e.g., TD ambisonics audio). The extraction process may also correspond to other examples. The object audio 328 may be a quantized version of the object audio 302.
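For first-order ambisonics, one illustrative way to extract an object using its metadata is to steer a decode toward the direction given in the metadata, as in the sketch below. This is a simplified stand-in for the extraction at block 316, exact only for a single plane-wave source.

```python
import numpy as np

def extract_object_first_order(hoa: np.ndarray, azimuth: float, elevation: float):
    """hoa: array shaped (4, samples) in ACN order (W, Y, Z, X); angles in radians."""
    steering = np.array([1.0,
                         np.cos(elevation) * np.sin(azimuth),
                         np.sin(elevation),
                         np.cos(elevation) * np.cos(azimuth)])
    # A weighted sum of components approximates the source arriving from that direction.
    return (steering[:, None] * hoa).sum(axis=0) / (steering ** 2).sum()
```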
At a block labeled object renderer 318, the object audio 328 may be rendered with metadata based on the desired output layout 320. The object audio 328 may include separate audio signals for each object, and metadata 326 that may have portions associated with or specific to each of the respective audio signals.
The resulting audio channel 330 may be used to drive the speaker 322 to output sound that approximates or matches the original audio scene characterized by the original object audio 302 and the original object metadata 304.
In many of the examples described, encoding data into a bitstream may include performing one or more encoding algorithms that encapsulate data in the bitstream according to a defined digital format. Similarly, decoding data such as ambisonics audio and metadata from a bitstream may include applying one or more decoding algorithms to unpack the data according to a defined digital format.
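As a toy example of "a defined digital format", the sketch below packs one object's quantized metadata into bytes with a fixed field layout. The layout is invented for illustration; the disclosure does not specify a bitstream syntax.

```python
import struct

def pack_object_metadata(object_id: int, azimuth_deg: float, elevation_deg: float,
                         distance_m: float) -> bytes:
    # Assumed layout: uint16 id followed by three float32 spatial fields, little-endian.
    return struct.pack("<Hfff", object_id, azimuth_deg, elevation_deg, distance_m)

def unpack_object_metadata(payload: bytes):
    return struct.unpack("<Hfff", payload)

blob = pack_object_metadata(7, -50.0, 45.0, 10.0)
assert unpack_object_metadata(blob)[0] == 7
```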
In some aspects, methods may be performed using aspects such as those described with respect to fig. 3. The method may be performed by processing logic of the encoder 340 or decoder 342, other audio processing devices, or a combination thereof. Processing logic may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system-on-a-chip (SoC), etc.), software (e.g., instructions run/executed on a processing device), firmware (e.g., microcode), or a combination thereof.
Although certain functional blocks ("blocks") are described in the methods, such blocks are examples. That is, aspects are well suited to performing various other blocks or variations of the blocks recited in the methodologies. It should be understood that the blocks in the method may be performed in an order different than presented, and that not all of the blocks in the method may be performed.
In one approach, processing logic may obtain object audio 302 and metadata 304 that spatially describes object audio 302. Processing logic may convert object audio 302 into ambisonics audio 338 based on metadata 304. Processing logic may encode ambisonics audio 338 in first bitstream 332. Processing logic may encode metadata 304 or a subset 336 thereof in second bit stream 334.
In another approach, processing logic may decode the first bitstream 332 to obtain the ambisonics audio 324. Processing logic may decode the second bitstream 334 to obtain metadata 326. Processing logic may extract object audio 328 from ambisonics audio 324 using metadata 326 that spatially describes object audio 328. Processing logic may utilize metadata 326 to render object audio 328 based on desired output layout 320.
In some examples, the object with the higher priority may be encoded as first ambisonics audio. Objects that do not have a higher priority may be encoded as second ambisonics audio having a lower order than the first ambisonics audio. The first ambisonics audio may be encoded with bitstream 332 and the second ambisonics audio may be encoded with a third bitstream (not shown). Priority-based encoding is further described with reference to fig. 4.
Fig. 4 illustrates an example system 400 for encoding object audio in the ambisonics domain based on priority, in accordance with some aspects. Some aspects may be performed as an encoder 456 and other aspects may be performed as a decoder 458. Encoder 456 may share common features with other encoders described herein. Similarly, decoder 458 may share common features with other decoders described herein.
The system 400 may include a hybrid domain of object encoding. The object audio may have objects of different priorities. Objects with a first priority level (e.g., higher priority) may be converted, encoded, and decoded as TF ambisonics audio. Objects with a second priority level (e.g., lower priority) may be converted, encoded, and decoded as TD ambisonics audio (e.g., HOA). Regardless of the priority level, the objects may be reconstructed at the decoder and added together to produce the final speaker or headphone output signal. Lower priority objects may be converted to a low-resolution HOA (e.g., having a lower order, such as up to first-order ambisonics). Higher priority objects may have a high-resolution HOA (e.g., sixth-order ambisonics).
At encoder 456, object audio 402 may be obtained. The object audio 402 may be associated with a first priority (e.g., P1). In some examples, object audio 402 may be converted to TF ambisonics audio 460 based on metadata 436 spatially describing the object audio. For example, at block 406, object audio 402 may be converted to TD ambisonics audio 438, and then at block 408, the TD ambisonics audio may be converted to TF ambisonics audio 460.
At block 444, second object audio 440 may be converted to TD ambisonics audio 448. The second object audio 440 may be associated with a second priority different from the first priority. For example, the first priority of the object audio 402 may have a higher priority than the second priority of the object audio 440. The priority may be characterized by a value (e.g., a number) or an enumerated type.
The object audio 402 and the object audio 440 may be part of the same object audio (e.g., from the same audio scene). In some examples, the audio scene may indicate a priority for each object as determined during authoring of the audio scene. The audio authoring tool may embed the priorities or types of objects into the metadata. The decoder may obtain the priority of each object in its corresponding metadata or derive the priority from the type associated with the object.
At block 408, TF ambisonics audio 460 may be encoded as the first bitstream 432. In other examples, encoder 456 may encode TD ambisonics audio 438 as first bitstream 432 instead of converting to TF ambisonics audio. At block 410, metadata 436 associated with the first object audio 402 may be encoded into a second bitstream 434. At block 446, the TD ambisonics audio 448 may be encoded into a third bitstream 462. In some examples, in response to the priorities of the object audio 440 and its corresponding metadata 442 not meeting a threshold (e.g., indicating a low priority), the object metadata 442 is not encoded or transmitted to the decoder 458.
In some examples, encoder 456 may determine a priority for each object in the object audio. If the priority meets a threshold (e.g., indicates a high priority), the object may be encoded as first TF ambisonics audio or first TD ambisonics audio. If the priority does not meet the threshold, the object may be encoded as second TD ambisonics audio, which may have a lower order than the first TF ambisonics audio, the first TD ambisonics audio, or both. In this way, lower priority objects may be encoded with lower spatial resolution. Higher priority objects may be encoded as TF ambisonics audio or TD ambisonics audio with a higher order and higher resolution.
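The priority split might be expressed as in the sketch below; the priority scale and threshold are arbitrary placeholders, not values from this disclosure.

```python
# Assumed scale: higher number means higher priority.
PRIORITY_THRESHOLD = 3

def route_objects(objects):
    """objects: iterable of (audio_object, priority) pairs."""
    tf_path, td_path = [], []
    for obj, priority in objects:
        if priority >= PRIORITY_THRESHOLD:
            tf_path.append(obj)   # encode as high-resolution TF ambisonics audio
        else:
            td_path.append(obj)   # encode as lower-order TD ambisonics audio
    return tf_path, td_path
```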
At block 412, the decoder 458 may decode the first bitstream 432 to obtain TF ambisonics audio 460 (or TD ambisonics audio 438). At block 414, the second bitstream 434 is decoded to obtain metadata 426. Metadata 426 may correspond to metadata 436. Metadata 426 may be identical to metadata 436 or a quantized version of metadata 436.
At block 450, the third bitstream 462 may be decoded to obtain TD ambisonics audio 464. The TD ambisonics audio 464 may correspond to or be the same as the TD ambisonics audio 448.
At block 416, object audio 428 is extracted from audio 424, which may be TF ambisonics audio or TD ambisonics audio. The decoder 458 may extract the object audio 428 using metadata 426 that spatially describes the object audio, as described in other portions.
Decoder 458 may generate a plurality of output audio channels 468 based on object audio 428 and TD ambisonics audio 464. Generating the plurality of output audio channels 468 may include rendering the object audio 428 at the object renderer box 418 and rendering the TD ambisonics audio 464 at the TD ambisonics renderer 454. The rendered object audio 430 and the rendered ambisonics audio 466 may be combined (e.g., added together) at block 452 into respective output audio channels 468 to generate the plurality of audio channels 468. The object audio 430 and the ambisonics audio 466 may both be rendered based on the same desired output layout 420.
The output audio channels 468 can be used to drive the speakers 422. Speakers 422 may be integrated into decoder 458. In other examples, speakers 422 may be integrated into one or more remote playback devices. For example, each of speakers 422 may be an independent speaker. In another example, each of the speakers 422 may be integrated into a common playback device, such as a speaker array, earbuds, or other playback device.
In some aspects, methods may be performed using aspects such as those described with respect to fig. 4. The method may be performed by processing logic of the encoder 456 or decoder 458, other audio processing devices, or a combination thereof. Processing logic may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system-on-a-chip (SoC), etc.), software (e.g., instructions run/executed on a processing device), firmware (e.g., microcode), or a combination thereof.
Although certain functional blocks ("blocks") are described in the methods, such blocks are examples. That is, aspects are well suited to performing various other blocks or variations of the blocks recited in the methodologies. It should be understood that the blocks in the method may be performed in an order different than presented, and that not all of the blocks in the method may be performed.
In one approach, processing logic may convert the object audio 402 into TF domain ambisonics audio 460 based on metadata 436 spatially describing the object audio 402, wherein the object audio 402 is associated with a first priority. Processing logic may convert the second object audio 440 to TD ambisonics audio 448, where the second object audio is associated with a second priority different from the first priority.
Processing logic may encode the TF ambisonics audio 460 into the first bitstream 432. Alternatively, processing logic may encode the TD ambisonics audio 438 (converted from object audio 402) into the first bitstream 432. Processing logic may encode metadata 404 into a second bitstream 434. Processing logic may encode the TD ambisonics audio 448 (converted from the object audio 440) into the third bitstream 462. The first priority may be higher than the second priority.
In another approach, processing logic may decode the first bitstream 432 to obtain TF ambisonics audio, which may correspond to TF ambisonics audio 460. Alternatively, processing logic may decode the first bitstream 432 to obtain TD ambisonics audio, which may correspond to the TD ambisonics audio 438. This may depend on whether encoder 456 encodes the first bitstream 432 as TF or TD ambisonics audio. The resulting decoded audio 424 may correspond to the object audio 402, which may be associated with the first priority. Processing logic may decode the second bitstream 434 to obtain metadata 426. Metadata 426 may correspond to object metadata 436, which may be associated with object audio 402. Processing logic may decode the third bitstream 462 to obtain the TD ambisonics audio 464. The TD ambisonics audio 464 may correspond to the object audio 440, which may be associated with a second priority that may be different from the first priority. Processing logic may extract object audio 428 from audio 424, which may be TF ambisonics audio or TD ambisonics audio, using metadata 426 that spatially describes object audio 428. Processing logic may generate a plurality of output audio channels 468 based on object audio 428 (which is associated with the first priority) and TD ambisonics audio 464 (which is associated with the second priority).
In some aspects, multiple priority levels may be supported. For example, an object with priority 1 (lowest priority) may be encoded as first ambisonics audio. Objects with priority 3 (higher priority) may be encoded with second ambisonics audio having a higher order than the first ambisonics audio. Objects with priority 5 (higher than priorities 1 and 3) may be encoded as third ambisonics audio with higher order than the first and second ambisonics audio, and so on.
Fig. 5 illustrates an example of time-frequency domain ambisonics audio in accordance with some aspects. The TF ambisonics audio may correspond to the various examples described. The time-frequency domain (TF) ambisonics audio may include a time-frequency tiling of conventional ambisonics audio (which may be referred to as time-domain ambisonics audio). The time-frequency domain ambisonics audio may correspond to or characterize object audio 512.
The object audio 512 may include a plurality of frames, such as frame 508, frame 510, and so on. Each frame may include a time-varying block of each audio signal of each object and metadata of each object. For example, one second of audio may be divided into "X" frames. The audio signal of each object and the metadata of each object may change over time (e.g., from one frame to another).
Traditionally, ambisonics audio (such as HOA) includes multiple components, where each of those components may represent a unique polar pattern and direction of a microphone. The number of components increases as the order of the ambisonics audio format increases. Thus, the higher the order, the higher the spatial resolution of the ambisonics audio. For example, B-format ambisonics extended up to third order has 16 components, each with a unique polar pattern and direction. The audio signal of each component may vary over time. Thus, a traditional ambisonics audio format may be referred to as being in the time domain, or as time-domain (TD) ambisonics audio.
As described in many examples, conventional ambisonics audio may be converted, using time-frequency analysis, to time-frequency ambisonics audio that includes metadata of the object audio. A time-frequency representation characterizes a time-domain signal across both time and frequency. Each tile may represent a subband or a frequency range. Processing logic may generate TF ambisonics audio by converting object audio 512 to TD ambisonics using object metadata (e.g., metadata 516, 520). Processing logic may perform a time-frequency analysis to divide the components of the TD ambisonics audio into tiles and embed spatial information of the metadata in each tile, depending on which objects contribute to the tile. TF ambisonics audio may be converted back to object audio by performing the inverse operation using the same spatial information or a subset of the spatial information.
The TF ambisonics audio may include multiple time-frequency tiles, such as 502a, 502b, 502c, 502d, 502e, 502f, 502g, etc. Each of the plurality of time-frequency tiles may represent audio in a subband of an ambisonics component. TF tile 502a may represent audio in a subband of component A ranging from frequency A to frequency B. The audio in tile 502a may represent contributions of audio from each of objects 514, as spatially picked up in that subband (from frequency A to frequency B) by the polar pattern and direction of component A. Each tile may have contributions from different combinations of objects, depending on the manner in which the objects are spatially distributed in the sound field relative to the components and on the acoustic energy of the objects.
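One way such per-tile contributions could be determined is by checking each object's time-frequency energy in the tile against a threshold, as in the assumed sketch below (the STFT settings and threshold are arbitrary, and this is not a method specified by the disclosure).

```python
import numpy as np
from scipy.signal import stft

def contributing_objects(object_signals, subband, frame, fs=48000,
                         nperseg=1024, threshold=1e-6):
    """Return IDs of objects whose energy in (subband, frame) exceeds the threshold."""
    ids = []
    for obj_id, sig in enumerate(object_signals):
        _, _, z = stft(sig, fs=fs, nperseg=nperseg)
        if np.abs(z[subband, frame]) ** 2 > threshold:
            ids.append(obj_id)
    return ids
```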
For example, tile 502b may include contributions from one or more of the objects 514. Tile 502e may have contributions from a different set of objects 514. Some tiles may not have contributions from any object. In this example, tiles 502a through 502e may have different frequency ranges in component A. Each component (such as component A, component B, etc.) may have its own set of tiles. For example, tile 502f and tile 502e may cover the same frequency band, but for different components.
Further, each tile of the plurality of time-frequency tiles may include a portion of metadata that spatially describes a corresponding portion of the object audio in the tile. For example, if tile 502f includes contributions from one or more of objects 514 (e.g., a chirping bird), metadata 516 corresponding to the chirping bird may be included in tile 502f along with the audio contribution of the chirping bird. The metadata may identify the object (e.g., with an object ID) and/or provide spatial information for the bird. This may improve the mapping from TF ambisonics audio back to object audio.
Further, the TF ambisonics audio may include a set of the plurality of time-frequency tiles corresponding to an audio frame of the object audio. The set of tiles may cover each subband and each component of the TF ambisonics audio. For example, the set of time-frequency tiles 504 may include a tile for each subband of each component. The set may correspond to or characterize a portion or frame of object audio 512, such as frame 508. Another set 506 of time-frequency tiles may correspond to or characterize a subsequent portion of the object audio 512 (e.g., at the next frame 510). The set 506 may have tiles that cover the same subbands and components as the previous set. For example, tile 502g may cover the same subband and the same component as tile 502a in set 504. Thus, each set may represent a time dimension, and each tile in the set may represent a different component or subband.
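The grouping of tiles into per-frame sets, one tile per (component, subband) pair with its own slice of metadata, might be organized as in the sketch below; the Tile and TileSet names and the dictionary keyed by (component, subband) are illustrative assumptions rather than the disclosure's data format.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Tile:
    audio: np.ndarray       # complex STFT bins for one component/subband
    object_ids: list        # objects contributing to this tile
    spatial_metadata: dict  # e.g., object_id -> (azimuth, elevation, distance)

@dataclass
class TileSet:
    frame_index: int        # which audio frame of the object audio this set covers
    tiles: dict = field(default_factory=dict)  # (component, subband) -> Tile

    def add(self, component: int, subband: int, tile: Tile) -> None:
        self.tiles[(component, subband)] = tile
```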
For example, in set 504, object x and object y may contribute to audio in subband 1 (component A). In tile 502a, object audio from object x and object y may be represented in the audio signal of 502a along with metadata 516 that identifies and spatially describes object x and object y. In tile set 506, tile 502g may also represent subband 1 (component A), but characterizes a different time of object audio 512.
Furthermore, the object contributions in each tile may change from one set to another due to changes in the audio signal of an object over time, the position of each object, or both. For example, if object y becomes quieter or moves from frame 508 to frame 510, tile 502g may contain object x but no object y, or less of object y. The metadata 516, 520 may change from frame to frame to represent changes in the spatial information of each object over time. Similarly, objects 514 and objects 518 may change from frame to frame to represent changes in the audio signals of the objects over time.
Fig. 6 illustrates an example of an audio processing system 600 according to some aspects. The audio processing system may operate as an encoder and/or decoder as described in many examples. The audio processing system may be an electronic device, such as a desktop computer, tablet, smart phone, laptop, smart speaker, media player, home appliance, headset, Head Mounted Display (HMD), smart glasses, infotainment system for an automobile or other vehicle, or other computing device. The system may be configured to perform the methods and processes described in this disclosure.
Although various components of an audio processing system are shown that may be incorporated into headphones, speaker systems, microphone arrays, and entertainment systems, this illustration is merely one example of a particular implementation of the types of components that may be present in an audio processing system. This example is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the aspects described herein. It should also be appreciated that other types of audio processing systems having fewer or more components than shown may also be used. Thus, the processes described herein are not limited to use with the hardware and software shown.
The audio processing system may include one or more buses 616 for interconnecting the various components of the system. One or more processors 602 are coupled to the bus as is known in the art. The one or more processors may be a microprocessor or special purpose processor, a system on a chip (SOC), a central processing unit, a graphics processing unit, a processor created by an Application Specific Integrated Circuit (ASIC), or a combination thereof. The memory 608 may include Read Only Memory (ROM), volatile memory, and nonvolatile memory, or combinations thereof, coupled to the bus using techniques known in the art. The sensor 614 may include an IMU and/or one or more cameras (e.g., RGB camera, RGBD camera, depth camera, etc.) or other sensors described herein. The audio processing system can also include a display 612 (e.g., an HMD or touch-screen display).
Memory 608 may be connected to the bus and may include DRAM, a hard drive, flash memory, a magneto-optical drive, magnetic memory, an optical drive, or another type of memory system that maintains data even after the system is powered down. In one aspect, the processor 602 retrieves computer program instructions stored in a machine-readable storage medium (memory) and executes the instructions to perform the operations of an encoder or decoder described herein.
Although not shown, audio hardware may be coupled to the one or more buses to receive audio signals to be processed and output by the speakers 606. The audio hardware may include digital-to-analog converters and/or analog-to-digital converters. The audio hardware may also include audio amplifiers and filters. The audio hardware may also be connected to a microphone 604 (e.g., a microphone array) to receive audio signals (whether analog or digital), digitize them as appropriate, and transmit these signals to the bus.
The communication module 610 may communicate with remote devices and networks through wired or wireless interfaces. For example, the communication module may communicate via known technologies such as TCP/IP, Ethernet, Wi-Fi, 3G, 4G, 5G, Bluetooth, ZigBee, or other equivalent technologies. The communication module may include wired or wireless transmitters and receivers that may communicate (e.g., receive and transmit data) with networked devices such as servers (e.g., cloud) and/or other devices such as remote speakers and remote microphones.
It should be appreciated that aspects disclosed herein may utilize memory that is remote from the system, such as a network storage device coupled to the audio processing system through a network interface, such as a modem or Ethernet interface. Buses may be connected to each other through various bridges, controllers and/or adapters as is well known in the art. In one aspect, one or more network devices may be coupled to the bus. The network device may be a wired network device (e.g., Ethernet) or a wireless network device (e.g., Wi-Fi, Bluetooth). In some aspects, various aspects described (e.g., simulation, analysis, estimation, modeling, object detection, etc.) may be performed by a networked server in communication with the capture device.
Various aspects described herein may be at least partially embodied in software. That is, the techniques may be implemented in an audio processing system in response to its processor executing sequences of instructions contained in a storage medium, such as a non-transitory machine-readable storage medium (e.g., DRAM or flash memory). In various aspects, hard-wired circuitry may be used in combination with software instructions to implement the techniques described herein. Thus, these techniques are not limited to any specific combination of hardware circuitry and software nor to any particular source for the instructions executed by the audio processing system.
In this specification, certain terms are used to describe features of various aspects. For example, in some cases, the terms "decoder," "encoder," "converter," "renderer," "extract," "combiner," "unit," "system," "device," "filter," "block," "component" may represent hardware and/or software configured to perform one or more processes or functions. For example, examples of "hardware" include, but are not limited to, integrated circuits such as processors (e.g., digital signal processors, microprocessors, application specific integrated circuits, microcontrollers, etc.). Thus, as will be appreciated by those skilled in the art, different combinations of hardware and/or software may be implemented to perform the processes or functions described by the above terms. Of course, the hardware may alternatively be implemented as a finite state machine or even as combinatorial logic elements. Examples of "software" include executable code in the form of an application, applet, routine or even a series of instructions. As described above, the software may be stored in any type of machine-readable medium.
Some portions of the preceding detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the audio processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the claims below refer to the actions and processes of an audio processing system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the system's registers and memories into other data similarly represented as physical quantities within the system's memories or registers or other such information storage, transmission or display devices.
The processes and blocks described herein are not limited to the specific examples described, and are not limited to the specific order used herein as examples. Rather, any of the processing blocks may be reordered, combined, or removed, performed in parallel, or serially, as desired, to achieve the results described above. The processing blocks associated with implementing the audio processing system may be executed by one or more programmable processors executing one or more computer programs stored on a non-transitory computer readable storage medium to perform the functions of the system. All or part of the audio processing system may be implemented as dedicated logic circuits, e.g., FPGAs (field programmable gate arrays) and/or ASICs (application specific integrated circuits). All or part of the audio system may be implemented with electronic hardware circuitry comprising electronic devices such as, for example, at least one of a processor, memory, programmable logic device, or logic gate. Additionally, the processes may be implemented in any combination of hardware devices and software components.
In some aspects, the disclosure may include language such as "at least one of [element A] and [element B]." This language may refer to one or more of these elements. For example, "at least one of A and B" may refer to "A," "B," or "A and B." In particular, "at least one of A and B" may refer to "at least one of A and at least one of B," or "at least either A or B." In some aspects, the disclosure may include language such as "[element A], [element B], and/or [element C]." This language may refer to any one of, or any combination of, these elements. For example, "A, B, and/or C" may refer to "A," "B," "C," "A and B," "A and C," "B and C," or "A, B, and C."
While certain aspects have been described and shown in the accompanying drawings, it is to be understood that such aspects are merely illustrative and not restrictive, and that this disclosure is not to be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art.
To assist the patent office and any readers of any patent issued on this application in interpreting the appended claims, the applicant wishes to note that they do not intend any of the appended claims or claim elements to invoke 35 U.S.C. 112(f) unless the words "means for" or "steps for" are explicitly used in the particular claim.
It is well known that the use of personally identifiable information should follow privacy policies and practices that are believed to meet or exceed industry or government requirements for maintaining user privacy. In particular, personally identifiable information data should be managed and processed to minimize the risk of inadvertent or unauthorized access or use, and the nature of authorized use should be specified to the user.

Claims (40)

1. A computer-implemented method, comprising:
obtaining object audio and metadata spatially describing the object audio;
converting the object audio to time-frequency domain ambisonics audio based on the metadata; and
encoding the time-frequency domain ambisonics audio and a subset of the metadata into one or more bitstreams to be stored in a computer readable memory or transmitted to a remote device.
2. The method of claim 1, wherein the time-frequency domain ambisonics audio comprises a plurality of time-frequency tiles, each tile of the plurality of time-frequency tiles representing audio in a subband of an ambisonics component.
3. The method of claim 2, wherein each tile of the plurality of time-frequency tiles includes a portion of the metadata that spatially describes a corresponding portion of the object audio in the tile.
4. The method of claim 3, wherein the time-frequency domain ambisonics audio comprises a set of the plurality of time-frequency tiles corresponding to audio frames of the object audio.
5. The method of claim 1, wherein converting the object audio to the time-frequency domain ambisonics audio comprises: converting the object audio to time-domain ambisonics audio; and encoding the time-domain ambisonics audio into the time-frequency domain ambisonics audio.
6. The method of claim 5, wherein the time-frequency domain ambisonics audio is a compressed version of the time-domain ambisonics audio.
7. The method of claim 1, wherein the time-frequency domain ambisonics audio comprises Higher Order Ambisonics (HOA) components.
8. The method of claim 1, wherein the metadata comprises a direction associated with an object of the object audio.
9. The method of claim 8, wherein the metadata comprises a distance associated with an object in the object audio.
10. A processing device configured to:
obtain object audio and metadata spatially describing the object audio;
convert the object audio to time-frequency domain ambisonics audio based on the metadata;
encode the time-frequency domain ambisonics audio and a subset of the metadata into one or more bitstreams; and
transmit the one or more bitstreams to a remote device.
11. A computer-implemented method, comprising:
decoding one or more bitstreams to obtain time-frequency domain ambisonics audio and metadata;
extracting object audio from the time-frequency domain ambisonics audio using the metadata spatially describing the object audio; and
rendering the object audio, utilizing the metadata, based on a desired output layout.
12. The method of claim 11, wherein the metadata is used to extract the object audio directly from the time-frequency domain ambisonics audio.
13. The method of claim 11, wherein extracting the object audio comprises converting the time-frequency domain ambisonics audio to time-domain ambisonics audio, and extracting the object audio from the time-domain ambisonics audio using the metadata.
14. The method of claim 11, wherein the time-frequency domain ambisonics audio comprises a plurality of time-frequency tiles, wherein each tile of the plurality of time-frequency tiles represents audio in a subband of an ambisonics component, and each tile comprises a portion of the metadata that spatially describes a corresponding portion of the object audio in the tile.
15. The method of claim 14, wherein the time-frequency domain ambisonics audio comprises a set of the plurality of time-frequency tiles corresponding to audio frames of the object audio.
16. The method of claim 11, wherein the object audio is a quantized version of an original version of the object audio.
17. The method of claim 16, wherein the metadata comprises a quantized version of an original version of the metadata, the original version of the metadata being associated with the original version of the object audio.
18. The method of claim 11, wherein the metadata comprises at least one of a distance or a direction associated with an object of the object audio.
19. The method of claim 11, wherein the object audio is rendered as a plurality of audio channels corresponding to the desired output layout, the desired output layout being a multi-speaker layout.
20. The method of claim 11, wherein the object audio is rendered as binaural audio corresponding to the desired output layout, the desired output layout being a head mounted speaker layout.
21. A computer-implemented method, comprising:
obtaining object audio and metadata spatially describing the object audio;
converting the object audio to ambisonics audio based on the metadata;
encoding the ambisonics audio in a first bitstream; and
encoding at least a subset of the metadata in a second bitstream.
22. The method of claim 21, wherein encoding at least the subset of the metadata comprises encoding less than all of the metadata in the second bitstream.
23. The method of claim 22, wherein the metadata is used by a decoder to convert the ambisonics audio to the object audio.
24. The method of claim 23, wherein the ambisonics audio is time domain ambisonics audio.
25. The method of claim 24, wherein the metadata comprises at least one of a direction of an object relative to a listening position, a distance of the object relative to the listening position, or a position of the object relative to the listening position.
26. The method of claim 25, wherein converting the object audio to the ambisonics audio based on the metadata comprises mapping a contribution or acoustic energy of each object in the object audio to each component in the ambisonics audio using spatial information in the metadata.
27. A computer-implemented method, comprising:
decoding the first bitstream to obtain ambisonics audio;
decoding the second bitstream to obtain metadata;
extracting object audio from the ambisonics audio using the metadata spatially describing the object audio; and
rendering the object audio, utilizing the metadata, based on a desired output layout.
28. The method of claim 27, wherein the metadata is used to extract the object audio directly from the ambisonics audio.
29. The method of claim 28, wherein extracting the object audio comprises reconstructing a quantized version of the ambisonics audio and extracting the object audio from the quantized version of the ambisonics audio.
30. The method of claim 28, wherein extracting the object audio comprises reconstructing a quantized version of the metadata and extracting the object audio from the ambisonics audio using the quantized version of the metadata.
31. The method of claim 28, wherein rendering the object audio with the metadata comprises rendering a quantized version of the object audio based on a quantized version of the metadata and the desired output layout.
32. The method of claim 27, wherein the ambisonics audio is obtained as time-domain ambisonics audio when the first bitstream is decoded.
33. The method of claim 27, wherein the metadata comprises at least one of a direction of an object relative to a listening position, a distance of the object relative to the listening position, or a position of the object relative to the listening position.
34. A computer-implemented method, comprising:
converting first object audio to time-frequency domain ambisonics audio based on metadata spatially describing the first object audio, wherein the first object audio is associated with a first priority;
converting second object audio to time-domain ambisonics audio, wherein the second object audio is associated with a second priority different from the first priority;
encoding the time-frequency domain ambisonics audio into a first bitstream;
encoding the metadata into a second bitstream; and
encoding the time-domain ambisonics audio into a third bitstream.
35. The method of claim 34, wherein the first priority has a higher priority than the second priority.
36. The method of claim 35, wherein the time-domain ambisonics audio is encoded at a lower resolution than the time-frequency domain ambisonics audio.
37. The method of claim 36, wherein the time-frequency domain ambisonics audio comprises a plurality of time-frequency tiles, each tile of the plurality of time-frequency tiles representing audio in a subband of an ambisonics component.
38. A computer-implemented method, comprising:
decoding the first bitstream to obtain time-frequency domain ambisonics audio associated with the first priority;
decoding the second bitstream to obtain metadata;
decoding a third bitstream to obtain time-domain ambisonics audio associated with a second priority, the second priority being lower than the first priority;
extracting object audio from the time-frequency domain ambisonics audio using the metadata spatially describing the object audio; and
generating a plurality of output audio channels based on the object audio and the time-domain ambisonics audio.
39. The method of claim 38, wherein generating the plurality of output audio channels comprises rendering the object audio and the time-domain ambisonics audio to generate the plurality of output audio channels.
40. The method of claim 39, wherein the object audio and the time-domain ambisonics audio are rendered based on an audio playback output format.
CN202311198575.4A 2022-09-21 2023-09-18 Object audio coding Pending CN117750293A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202263376523P 2022-09-21 2022-09-21
US63/376,523 2022-09-21
US63/376,520 2022-09-21

Publications (1)

Publication Number Publication Date
CN117750293A true CN117750293A (en) 2024-03-22

Family

ID=90256828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311198575.4A Pending CN117750293A (en) 2022-09-21 2023-09-18 Object audio coding

Country Status (1)

Country Link
CN (1) CN117750293A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination