CN117768832A - Method and system for efficient encoding of scene locations - Google Patents

Info

Publication number
CN117768832A
Authority
CN
China
Prior art keywords
scene
origin
sound source
encoded
location
Prior art date
Legal status
Pending
Application number
CN202311226348.8A
Other languages
Chinese (zh)
Inventor
F. Baumgarte
D. Sen
Current Assignee
Apple Inc
Original Assignee
Apple Inc
Priority date
Filing date
Publication date
Application filed by Apple Inc
Publication of CN117768832A

Landscapes

  • Stereophonic System (AREA)

Abstract

The present disclosure relates to methods and systems for efficiently encoding scene locations. A method includes: receiving a bitstream comprising an encoded version of an audio signal associated with a sound source within a 3D scene, a scene tree structure comprising an origin of a first scene relative to an origin of a second scene, and a position of the sound source within the first scene relative to the origin of the first scene, wherein the position references the origin of the first scene using an identifier, and wherein the scene tree structure defines an initial configuration of the sound source relative to the first and second scenes; determining a position of a listener; generating a set of spatially rendered audio signals by spatially rendering the audio signal according to the position of the sound source relative to the position of the listener; and driving a speaker using the spatially rendered audio signals.

Description

Method and system for efficient encoding of scene locations
Related patent application
The present application claims the benefit of priority from U.S. provisional application No. 63/376,960, filed on September 23, 2022, which is incorporated herein by reference.
Technical Field
One aspect of the invention relates to a system that may include at least one of: an encoder that encodes a three-dimensional (3D) scene as a scene tree structure into a bitstream, and a decoder that receives the bitstream having the scene tree structure and spatially renders the 3D scene based on a position of a listener. Other aspects are also described.
Background
Today, many devices provide users with the ability to stream media content (such as sound programs that may include music, podcasts, live recordings, short video clips, or feature films) over the internet. For example, a playback device (such as a digital media player) may be electronically coupled to an output device (or a portion of an output device), such as a speaker, and may be configured to stream content for playback through the speaker. The content may be selected by a user (e.g., through a graphical user interface of the playback device) and streamed from one or more content providers that provide the content on a subscription basis.
Disclosure of Invention
One aspect of the invention includes a method (e.g., performed by a decoder side of an audio codec system), the method comprising: receiving a bitstream comprising: an encoded version of an audio signal associated with a sound source within a three-dimensional (3D) scene, a scene tree structure comprising an origin of a first 3D scene relative to an origin of a second 3D scene, and a position of the sound source within the first 3D scene relative to the origin of the first 3D scene, wherein the position references the origin of the first 3D scene using an identifier, and wherein the scene tree structure defines an initial configuration of the sound source relative to the first and second 3D scenes; determining a position of a listener relative to the origin of the first 3D scene; generating a set of spatially rendered audio signals by spatially rendering the audio signal according to the position of the sound source relative to the position of the listener; and driving a set of speakers to produce the sound source using the set of spatially rendered audio signals.
In one aspect, the identifier is a first identifier, wherein the origin of the first 3D scene comprises the first identifier and a position of the origin of the first 3D scene relative to the origin of the second 3D scene, wherein the position of the origin of the first 3D scene references the origin of the second 3D scene using a second identifier. In another aspect, the first and second identifiers are stored as six-bit integers within the bitstream. In some aspects, the bitstream is a first bitstream, wherein the method further comprises: receiving a second bitstream comprising a location update payload that includes a new location of the origin of the first 3D scene relative to the origin of the second 3D scene, the new location referencing the origin of the second 3D scene using the second identifier; determining that the position of the sound source has moved in accordance with the movement of the origin of the first 3D scene from its original location to its new location; and adjusting the spatial rendering of the audio signal based on the movement of the position of the sound source.
In one aspect, the position of the sound source comprises a maximum distance parameter and encoded position data, wherein the method further comprises determining a decoded position of the sound source at a spatial resolution based on the maximum distance parameter and the encoded position data, wherein the audio signal is spatially rendered using the decoded position of the sound source relative to the position of the listener. In another aspect, the 3D scene is part of an audio program received through the bitstream, wherein the spatial resolution remains constant as the position of the sound source changes within the 3D scene during a playback session of the audio program. In some aspects, the maximum distance parameter is a first maximum distance parameter and the spatial resolution is a first spatial resolution, the method further comprising: receiving a new position of the sound source, the new position comprising a second maximum distance parameter and new encoded position data; and determining a new decoded position of the sound source at a second spatial resolution based on the second maximum distance parameter and the new encoded position data, wherein the second spatial resolution is different from the first spatial resolution.
In one aspect, the audio signal is associated with an audio program, wherein the location is an initial location of a sound source within the first 3D scene at a beginning of the audio program. In another aspect, the bitstream is a first bitstream, wherein the method further comprises: obtaining a second bitstream, the second bitstream comprising: an encoded version of the audio signal and a new position of the sound source relative to the origin of the first 3D scene, the new position being different from the position of the sound source, the new position referencing the origin of the first 3D scene with an identifier; and adjusting the spatial rendering of the audio signal based on the new location. In some aspects, the first and second bitstreams include a single bit indicating whether the origin is to be updated. In some aspects, the second 3D scene is a 3D global scene and the first 3D scene is a 3D sub-scene located within the 3D global scene. On the other hand, the listener's position is within the 3D sub-scene.
In one aspect, the location of the sound source comprises at least one of: 1) A location of an origin of the first 3D scene referenced by the identifier; and 2) encoded coordinate data within the coordinate system relative to an origin of the first 3D scene, and encoded rotation data indicative of an orientation of the sound source relative to the origin of the first 3D scene.
In another aspect, the encoded location data comprises a maximum distance parameter and the encoded coordinate data comprises a set of encoded Cartesian coordinates, wherein the method further comprises determining a set of Cartesian coordinates of the location of the sound source within the coordinate system relative to the origin of the first 3D scene by scaling the normalized set of encoded Cartesian coordinates with the maximum distance parameter. In some aspects, each of the normalized encoded Cartesian coordinates is a ten-bit integer and the maximum distance parameter is a four-bit integer, which are stored within the bitstream.
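To make the Cartesian decoding described above concrete, the following Python sketch shows one plausible way a decoder could reconstruct a source position from a four-bit maximum distance parameter and three ten-bit normalized coordinates. The power-of-two mapping of the parameter to meters and the signed-integer normalization are assumptions for illustration, not the normative bitstream syntax.

# Hypothetical sketch of decoding a sound-source position from normalized,
# fixed-point Cartesian coordinates. The mapping of the 4-bit maximum-distance
# parameter to meters (here 2**param) and the signed 10-bit normalization are
# assumptions, not the normative bitstream syntax.

def decode_cartesian_position(coded_xyz, max_distance_param, coord_bits=10):
    """coded_xyz: three signed integers read from the bitstream.
    max_distance_param: 4-bit integer selecting the maximum distance."""
    max_distance = 2.0 ** max_distance_param          # assumed mapping to meters
    full_scale = (1 << (coord_bits - 1)) - 1          # e.g., 511 for 10-bit signed
    # Scale each normalized coordinate by the maximum distance.
    return tuple((c / full_scale) * max_distance for c in coded_xyz)

# Example: a source roughly (12.0, -3.0, 0.5) m from its sub-scene origin,
# encoded with max_distance_param = 4 (i.e., a 16 m full scale).
print(decode_cartesian_position((383, -96, 16), 4))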
In one aspect, the method further includes determining a number of added bits per encoded Cartesian coordinate based on a four-bit identifier in the received bitstream; and determining, for each encoded Cartesian coordinate, a total number of bits that includes the number of added bits, wherein the total number of bits includes at least six bits, and wherein the normalized set of encoded Cartesian coordinates is scaled according to the total number of bits. In another aspect, the encoded location data comprises a set of encoded spherical coordinates comprising an encoded azimuth value, an encoded elevation value, and an encoded radius, wherein the method further comprises determining a set of spherical coordinates of the location of the sound source within the coordinate system relative to the origin of the first 3D scene, the spherical coordinates comprising an azimuth value and an elevation value determined using a first normalization function based on the encoded azimuth value and the encoded elevation value, respectively, and a radius determined using a second normalization function based on the encoded radius. In one aspect, the encoded azimuth value is an integer of at least seven bits, the encoded elevation value is an integer of at least six bits, and the encoded radius value is an integer of at least five bits. In another aspect, the method further includes determining, based on a one-bit value, whether the position of the sound source includes rotation data; and, in response to determining that the position of the sound source includes rotation data, extracting from the bitstream four encoded quaternion components that indicate an orientation of the sound source, wherein each encoded quaternion component is an integer of at least eight bits in size, and wherein the set of spatially rendered audio signals is spatially rendered based on the four encoded quaternion components.
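The following sketch illustrates, under assumed normalization functions, how the spherical-coordinate fields and the optional quaternion described above might be decoded. Only the bit widths are taken from the description; the specific degree mapping and the exponential-style radius mapping are illustrative assumptions.

import math

# Hypothetical decoding of spherical coordinates and an optional orientation
# quaternion. The normalization functions below are illustrative assumptions.

def decode_spherical(az_code, el_code, radius_code,
                     az_bits=7, el_bits=6, radius_bits=5):
    azimuth = (az_code / ((1 << az_bits) - 1)) * 360.0 - 180.0   # degrees
    elevation = (el_code / ((1 << el_bits) - 1)) * 180.0 - 90.0  # degrees
    radius = 0.1 * (2.0 ** (radius_code / 4.0))                  # assumed mapping, meters
    return azimuth, elevation, radius

def decode_quaternion(q_codes, bits=8):
    # Map each encoded component from [0, 2**bits - 1] back to [-1, 1], then renormalize.
    q = [2.0 * c / ((1 << bits) - 1) - 1.0 for c in q_codes]
    norm = math.sqrt(sum(x * x for x in q)) or 1.0
    return [x / norm for x in q]

print(decode_spherical(96, 48, 20))   # e.g., ~92 deg azimuth, ~47 deg elevation, 3.2 m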
In one aspect, the encoded position data comprises a maximum distance parameter and a first set of spherical coordinates comprising an azimuth value, an elevation value, and a normalized radius, wherein the method further comprises determining a second set of spherical coordinates of the position of the sound source within the coordinate system relative to the origin of the first 3D scene, the second set of spherical coordinates comprising the azimuth value, the elevation value, and a radius that is the normalized radius scaled with the maximum distance parameter. In some aspects, the azimuth value is an eleven-bit integer, the elevation value is a ten-bit integer, and the normalized radius is an eight-bit integer. In another aspect, the position data further includes a rotation parameter indicating an orientation of the sound source relative to the origin of the first 3D scene. In another aspect, the rotation parameter comprises four eleven-bit quaternion components.
In one aspect, the audio signal is associated with an audio/video (A/V) program, wherein the method further comprises displaying video content of the A/V program on a display of the audio decoder device. In another aspect, the sound source is an active sound source associated with an object or location within the video content displayed on the display. In some aspects, the video content of the A/V program is an extended reality (XR) environment, wherein the set of spatially rendered audio signals is a first set of spatially rendered audio signals, and wherein the bitstream further comprises: a position of a passive sound source within the first 3D scene relative to the origin of the first 3D scene, the position of the passive sound source referencing the origin of the first 3D scene with an identifier, wherein the passive sound source is arranged to produce sound from the active sound source that is reflected or diffracted off a surface within the XR environment; and a set of acoustic parameters of the passive sound source, wherein a second set of spatially rendered audio signals is generated by spatially rendering the passive sound source, based on the position of the passive sound source, according to the set of acoustic parameters.
In one aspect, spatially rendering comprises: determining an audio filter based on the set of acoustic parameters; generating a filtered audio signal by applying the audio filter to the audio signal; and generating the set of spatially rendered audio signals by applying one or more spatial filters to the filtered audio signal. In some aspects, the set of acoustic parameters includes at least one of a diffusion level, a cut-off frequency, a frequency response, a geometry of the object, an acoustic surface parameter of the object, a reflectance value, an absorption value, and a material of the object.
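As an illustration of rendering a passive sound source, the sketch below derives a simple audio filter from two of the listed acoustic parameters (a cut-off frequency and a reflectance value), applies it to the active source's signal, and then applies a pair of spatial filters. The low-pass/gain model, the hrtf_pair input, and the use of NumPy/SciPy are assumptions made for illustration rather than the method of the disclosure.

import numpy as np
from scipy.signal import butter, lfilter

# Illustrative rendering of a passive (reflective) sound source: an audio filter
# is derived from transmitted acoustic parameters, applied to the active source's
# signal, and the result is spatially rendered at the passive source's position.

def render_passive_source(audio, sample_rate, cutoff_hz, reflectance, hrtf_pair):
    b, a = butter(2, cutoff_hz / (sample_rate / 2))     # filter from acoustic params
    filtered = reflectance * lfilter(b, a, audio)       # attenuated reflection
    left = np.convolve(filtered, hrtf_pair[0])          # spatial filters for the
    right = np.convolve(filtered, hrtf_pair[1])         # passive source's position
    return left, right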
According to another aspect of the present disclosure, a method performed by an encoder side of an audio codec system is disclosed, the method comprising: receiving an audio program comprising an audio signal associated with a sound source within a first three-dimensional (3D) scene; encoding the audio signal into a bitstream; adding the following to metadata of the bitstream: 1) A scene tree structure comprising an origin of a first 3D scene relative to an origin of a second 3D scene of the audio program, and 2) a position of a sound source relative to the origin of the first 3D scene, the position referencing the origin of the first 3D scene with an identifier, wherein the metadata defines an initial configuration of the sound source relative to the first and second 3D scenes to be rendered by the audio playback device; and transmitting the bitstream to an audio playback device.
In one aspect, the identifier is a first identifier, wherein the origin of the first 3D scene comprises the first identifier and a position of the origin of the first 3D scene relative to the origin of the second 3D scene, wherein the position of the origin of the first 3D scene references the origin of the second 3D scene using the second identifier. In one aspect, the first and second identifiers are stored as six-bit integers into the metadata. In some aspects, the bitstream is a first bitstream, wherein the method further comprises: determining whether the position of the origin of the first 3D scene is to be moved to a different position; encoding a position update of the origin of the first 3D scene relative to the origin of the second 3D scene into new metadata of the second bitstream in response to determining that the position of the origin of the first 3D scene is to be moved, the position update referencing the origin of the second 3D scene using the second identifier; and transmitting a second bitstream including the new metadata to the audio playback device. In another aspect, the new metadata includes a single bit having a value indicating that at least one origin will be updated.
In one aspect, the second 3D scene is a 3D global scene and the first 3D scene is a 3D sub-scene located within the 3D global scene. On the other hand, the initial configuration indicates that the position of the sound source is an initial position within the first 3D scene at the beginning of the audio program. In some aspects, the bitstream is a first bitstream, wherein the method further comprises: determining whether the position of the sound source is to be moved to a different position within the first 3D scene; in response to determining that the position of the sound source is to be moved, a new position of the sound source relative to the origin of the first 3D scene is encoded into new metadata of the second bitstream, the new position referencing the origin of the first 3D scene using the identifier.
In one aspect, encoding the location includes adding to the metadata: 1) a location of the origin of the first 3D scene referenced by the identifier, and 2) position data of the sound source within a coordinate system. In another aspect, the position data includes a maximum distance parameter and a set of Cartesian coordinates indicative of the position of the sound source relative to the origin of the first 3D scene in the coordinate system, wherein the set of Cartesian coordinates is normalized relative to the maximum distance parameter. In some aspects, each Cartesian coordinate in the set is a ten-bit integer and the maximum distance parameter is a four-bit integer, which are stored within the metadata. In another aspect, the position data includes a maximum distance parameter and a set of spherical coordinates indicative of the position of the sound source relative to the origin of the first 3D scene in the coordinate system, wherein the set of spherical coordinates includes an azimuth value, an elevation value, and a radius normalized relative to the maximum distance parameter. In some aspects, the azimuth value is an eleven-bit integer, the elevation value is a ten-bit integer, and the radius is an eight-bit integer.
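A hypothetical encoder-side counterpart to the Cartesian case above is sketched here: the encoder picks a maximum distance parameter large enough to cover the source position, normalizes the coordinates against it, and quantizes each coordinate to a ten-bit signed integer. The power-of-two mapping of the four-bit parameter to meters is an assumption consistent with the earlier decoding sketch.

# Hypothetical encoder-side quantization of a source position. The 2**param
# mapping of the 4-bit maximum-distance parameter to meters is an assumption.

def encode_cartesian_position(xyz, coord_bits=10, max_param_bits=4):
    needed = max(abs(v) for v in xyz)
    # Smallest 4-bit parameter whose maximum distance covers the position.
    param = 0
    while (2.0 ** param) < needed and param < (1 << max_param_bits) - 1:
        param += 1
    max_distance = 2.0 ** param
    full_scale = (1 << (coord_bits - 1)) - 1
    coded = tuple(int(round((v / max_distance) * full_scale)) for v in xyz)
    return param, coded

param, coded = encode_cartesian_position((12.0, -3.0, 0.5))
print(param, coded)   # -> 4, (383, -96, 16)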
In one aspect, the method further comprises determining an orientation of the sound source relative to the origin of the first 3D scene, wherein the position of the sound source comprises a rotation parameter indicative of the orientation of the sound source. In another aspect, the rotation parameter comprises four eleven-bit quaternion components.
In one aspect, the audio program is an audio/video (A/V) program, wherein the method further comprises: determining that the video content of the A/V program has a surface associated with a passive sound source arranged to produce reflected or diffracted sound; determining one or more acoustic parameters of the passive sound source based on the surface; and encoding the location of the passive sound source into the metadata by combining the identifier with the one or more acoustic parameters. In another aspect, the set of acoustic parameters includes at least one of a diffusion level, a cut-off frequency, a frequency response, a geometry of the object, an acoustic surface parameter of the object, a reflectance value, an absorption value, and a material of the object. In some aspects, the sound source is an active sound source that represents sound produced by an object or location within the video content. In another aspect, the video content of the A/V program is an extended reality (XR) environment that is audibly represented by the sound of the audio signal.
In one aspect, the method further comprises: determining that the second 3D scene is a 3D global scene and the first 3D scene is a 3D sub-scene located within the 3D global scene, wherein the origin of the 3D global scene is a 3D global origin; storing the second identifier as a one-bit integer in the metadata, the one-bit integer indicating that the origin of the 3D global scene is the 3D global origin; and storing the first identifier as a six-bit integer in the metadata. In another aspect, the location update is encoded into the new metadata using delta encoding, wherein the delta between the location and the previous location is encoded into the new metadata, and wherein the new metadata comprises less data than the metadata. In some aspects, the new metadata includes a single bit having a value indicating that the location update has been encoded using delta encoding.
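The delta-encoding idea can be sketched as follows, assuming the previously decoded position is available on both the encoder and decoder sides; the field names, the single "delta" flag bit, and the threshold for falling back to absolute coding are illustrative.

# Minimal sketch of delta-coding a position update relative to the previously
# transmitted quantized coordinates. Not the normative payload syntax.

def encode_position_update(prev_coded_xyz, new_coded_xyz):
    deltas = [n - p for n, p in zip(new_coded_xyz, prev_coded_xyz)]
    use_delta = all(abs(d) < 32 for d in deltas)       # small motion -> fewer bits
    if use_delta:
        return {"isDeltaCoded": 1, "deltas": deltas}   # e.g., short signed fields
    return {"isDeltaCoded": 0, "absolute": list(new_coded_xyz)}

print(encode_position_update((383, -96, 16), (390, -96, 18)))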
According to another aspect of the invention, a non-transitory machine-readable medium has instructions that, when executed by at least one processor of an electronic device, cause the electronic device to: receive a bitstream comprising: audio content of a first three-dimensional (3D) scene, and encoded metadata comprising an origin of the first 3D scene relative to an origin of a second 3D scene and a position of a sound source within the first 3D scene relative to the origin of the first 3D scene, wherein the position references the origin of the first 3D scene using an identifier; determine a listener position relative to the origin of the first 3D scene; and spatially render the audio content according to the position of the sound source relative to the listener position.
In one aspect, the identifier is a first identifier, wherein the origin of the first 3D scene comprises the first identifier and a position of the origin of the first 3D scene relative to the origin of the second 3D scene, wherein the position of the origin of the first 3D scene references the origin of the second 3D scene using the second identifier.
In another aspect, the encoded metadata includes: the second identifier, which is a one-bit integer indicating that the origin of the second 3D scene is the 3D global origin of a 3D global scene, and the first identifier, which is a six-bit integer. In one aspect, the instructions further cause the electronic device to: receive new metadata via the bitstream, the new metadata comprising a location update of the position of the sound source; and adjust the spatially rendered audio content in accordance with the location update. In another aspect, the location update is encoded into the new metadata using delta encoding, wherein the delta between the new position of the sound source and the previous position of the sound source is encoded into the new metadata, and wherein the new metadata comprises less data than the metadata. In some aspects, the new metadata includes a single bit having a value indicating that the location update has been encoded using delta encoding.
The above summary does not include an exhaustive list of all aspects of the disclosure. It is contemplated that the present disclosure includes all systems and methods that can be practiced by all suitable combinations of the various aspects summarized above, as well as those disclosed in the detailed description below and particularly pointed out in the claims. Such combinations may have particular advantages not specifically set forth in the foregoing summary.
Drawings
Aspects are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements. It should be noted that references to "a" or "an" aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. In addition, for simplicity and to reduce the total number of figures, a certain figure may be used to illustrate features of more than one aspect, and for a certain aspect, not all elements in the figure may be required.
Fig. 1 illustrates an example of a three-dimensional (3D) scene of a media program according to one aspect.
FIG. 2 illustrates a scene tree structure from the 3D scene of FIG. 1 and defining a relationship between locations of elements within the 3D scene (e.g., origin points associated with 3D sub-scenes), according to one aspect.
Fig. 3 shows a system for generating a bitstream comprising encoded metadata for spatially rendering a 3D scene.
Fig. 4 is a block diagram of an audio encoder system that generates a bitstream of encoded audio content and scene metadata at the encoder side and receives the bitstream at the decoder side and uses the scene metadata to spatially render the audio content from within the bitstream, according to one aspect.
Fig. 5 is a flow chart of one aspect of a process at the encoder side for encoding scene metadata and audio content into a bitstream for transmission to the decoder side.
Fig. 6 is a flow chart of one aspect of a process at the encoder side for encoding a scene tree structure as metadata into a bitstream.
Fig. 7 is a flow chart of one aspect of a process at a decoder side for receiving a bitstream and spatially rendering audio content of the bitstream using scene metadata encoded therein.
FIG. 8 illustrates a table of enhancements to the bitstream syntax of an MPEG-D DRC, according to some aspects.
Fig. 9 illustrates another table of enhancements to the bitstream syntax of an MPEG-D DRC, in accordance with some aspects.
FIGS. 10a-10c illustrate tables of enhancements to the bitstream syntax of an MPEG-D DRC, according to some aspects.
FIG. 11 illustrates a 3D scene having a scene tree structure from FIG. 2, wherein elements of the 3D scene have been moved, according to one aspect.
FIG. 12 is a system flow diagram of one aspect of a process in which an encoder side sends scene metadata updates that are used by a decoder side to adjust spatial rendering.
FIG. 13 illustrates another table of enhancements to the bitstream syntax of an MPEG-D DRC, according to some aspects.
Fig. 14 shows an example of system hardware.
Detailed Description
Aspects of the disclosure will now be explained with reference to the accompanying drawings. Whenever the shapes, relative positions, and other aspects of the parts described in a given aspect are not explicitly defined, the scope of the disclosure is not limited to the parts shown, which are meant merely for illustration. In addition, while numerous details are set forth, it should be understood that some aspects may be practiced without these details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description. Moreover, unless the meaning clearly indicates otherwise, all ranges set forth herein are deemed to be inclusive of each range's endpoints.
As used herein, an extended reality (XR) environment (or presentation) may refer to a fully or partially simulated environment in which people sense and/or interact via one or more electronic systems. For example, the XR environment may include augmented reality (AR) content, mixed reality (MR) content, virtual reality (VR) content, and the like. There are many different types of electronic systems that enable a person to sense and/or interact with various XR environments. Examples include head-mounted systems, projection-based systems, heads-up displays (HUDs), vehicle windshields having integrated display capability, windows having integrated display capability, displays formed as lenses designed to be placed on a person's eyes (e.g., similar to contact lenses), headphones/earphones, speaker arrays, input systems (e.g., wearable or handheld controllers with or without haptic feedback), smartphones, tablets, and desktop/laptop computers.
As referred to herein, a "media program" may be (and include) any type of media content, such as an audio program that may include audio content (e.g., as one or more audio signals with audio data). For example, the audio program may include a musical composition, a podcast, audio of an XR environment, a soundtrack of a movie, and the like. In another aspect, the media program may be an audio/video (A/V) program including audio content and/or video content (e.g., as one or more video signals with video (or image) data), etc. Examples of A/V programs may include movies that include the video or image content of the movie as well as the audio tracks of the movie (e.g., companion audio programs that include the audio tracks). As another example, the A/V program may include audio content and video content of an XR environment. In one aspect, a media program may include one or more audio components (or sound sources), each audio component being associated with (or having) one or more audio signals of the program. For example, an audio program of a movie soundtrack may include one audio component for the voice of a person within a scene and another audio component for the bark of a dog within the scene. As another example, one audio component may be the dialogue of a movie and another audio component may be the movie's soundtrack. As yet another example, an audio program may include multiple tracks (e.g., tracks of a music album or tracks of a movie series), where an audio component may be an individual track and/or may represent an entire group (or album) of tracks. In some aspects, the media program may be a program file (e.g., of any type) configured to include audio content and/or video content, as described herein.
In one aspect, an audio program may include audio content for spatial rendering in one or more (e.g., 3D) audio formats as one or more data files, such as having one or more audio channels. For example, the audio program may include a mono audio channel or may be in a multi-channel format (e.g., two stereo channels, six channels in a 5.1 surround format, etc.). In another aspect, an audio program may include one or more audio objects, each having at least one audio signal and position data in 3D space that is used to spatially render the object's audio signals. In another aspect, the media program may be represented in an ambisonics format, such as a first-order or Higher-Order Ambisonics (HOA) audio format.
In one aspect, a media program may include one or more 3D scenes, each of which may represent one or more portions (or segments) of the media program. For example, an audio program may include one or more 3D scenes as one or more audio signals (e.g., one or more sound clips thereof), where each 3D scene may include (or be characterized by) one or more audio scene components or elements, such as sound sources positioned within (and originating within) the 3D scene. For example, a 3D scene of a virtual living room (e.g., of an XR environment) may include the conversation of a person in the room as one sound source (e.g., an audio component) located at one location within the virtual living room, while the sound of a dog's bark may be located elsewhere in the room (e.g., originating from a virtual dog) as another sound source. Thus, when spatially rendering the audio content of a 3D scene, a listener may perceive a sound source as originating from a particular location within the acoustic (e.g., physical) space (e.g., around the listener), which may correspond to a location within the 3D scene (e.g., a location within the virtual living room, as if the listener were physically located in the living room, as described in the example above). In another aspect, an A/V program may include one or more 3D scenes as video content, where each 3D scene may include one or more visual objects (structures or locations). Thus, as described herein, a 3D scene may represent a 3D space that may include sound sources and/or visual objects (positioned therein), wherein the 3D scene may be rendered such that a user (listener) may perceive at least a portion of the 3D space from one or more perspectives.
Fig. 1 illustrates an example of a 3D global scene 30 (or 3D scene) of a media program according to one aspect. In particular, the figure shows a 3D scene 30 that includes a tour bus 31, a person 34 standing beside the tour bus 31, a ship 33, and a cabin 32 within the ship 33, inside of which a person is standing beside several windows. In one aspect, the 3D scene may be part of an A/V program (such as an XR environment), where the 3D scene includes a visual (e.g., virtual) environment having the listed elements, as well as audio components of the visual environment, where a listener may perceive sound within the 3D global scene 30 from various listener locations within the scene. For example, the 3D scene 30 may include several audio components as sound sources 46a-46g (each shown as having three curves, which may represent sound originating from the sound source). For example, the engine of the tour bus is making an engine sound 46a (e.g., engine noise), the tour guide on the top deck of the tour bus 31 is speaking 46b, and the person 34 is speaking 46c toward the tour guide; as the ship 33 travels in the water, the chimney of the ship 33 emits sound 46e and the paddle wheels that propel the ship emit sound 46d; and the person within the cabin 32 is speaking 46f, and there is a reflection 46g of the speaking person's voice off the windows of the cabin.
In one aspect, when spatially rendering the 3D scene, the position of a sound source may be perceived by the listener based on the position and orientation of the listener within the 3D scene 30. For example, if the listener position is facing the rear of the tour bus 31, the sound sources of the tour bus 31 (e.g., engine sound 46a) and the person 34 may be spatially rendered to originate in front of the listener, while the sound sources of the ship 33 (e.g., paddle wheel sound 46d and chimney sound 46e) may be spatially rendered to be perceived as located behind the listener. More is described herein regarding sound sources within the 3D scene 30.
As described herein, the 3D scene 30 may include a visual (3D) environment (as shown) and an audible (3D) acoustic space representing the sounds of the visual environment. In another aspect, a 3D scene may include an audio component that does not have associated video content, such as when the media program is a musical composition.
For the 3D scene 30, it is important to manage or track the relationship between the sound sources and the listener position (not shown) within the 3D scene in order to spatially render the 3D scene to the listener effectively and efficiently. This is especially the case when the listener position and/or the sound source positions are moving within the 3D global scene 30. For example, a listener location near the person 34 may remain stationary while the tour bus 31 and/or the ship 33 may move away from that location as the media program plays.
A scene graph is a data structure that arranges objects as a collection of nodes in a graph, where the positioning of objects in the graph defines a hierarchical relationship between the objects. Such graphs have been used in vector-based graphics editing applications, where each node of the graph may be associated with a graphics object that is related to any other node within the graph, or with a graphics object that is related to a single parent node to which all nodes are coupled. In one aspect, scene graphs may be used in audiovisual media to describe object positions. In particular, a scene graph may describe the positions of audio objects (e.g., positions within a virtual acoustic environment) and may be used to spatially render the audio objects at their positions relative to the position of a listener.
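For contrast with the scene tree structure introduced later, a generic scene-graph node of the kind described above might look like the following sketch; each node carries its own transform and payload relative to its parent, which is part of why such structures can be verbose to store or transmit. The field layout is an illustrative assumption, not a definition from the disclosure.

from dataclasses import dataclass, field

# Generic scene-graph node (not the patent's scene tree structure): each node
# carries a transform relative to its parent plus per-node payload and children.

@dataclass
class SceneGraphNode:
    name: str
    translation: tuple = (0.0, 0.0, 0.0)    # relative to the parent node
    rotation: tuple = (1.0, 0.0, 0.0, 0.0)  # quaternion (w, x, y, z)
    payload: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child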
However, there are drawbacks to using a scene graph to describe the positions of (audio) objects (e.g., sound sources). As described herein, a scene graph is a data structure that may describe a relationship between one or more (child) nodes and a single parent node. Each node (and parent node) within the data structure may require specific data for linking the nodes to each other (and for describing their features) as well as to the parent node. Such a data structure may use and require large amounts of data and may therefore not work well in situations where the memory capacity of a computer is limited. In addition, using such a data structure may be inefficient in situations where the data structure is to be transmitted between electronic devices (along with its associated media) at low bit rates (which may especially be the case when a media program is live or streamed in real time from one device to one or more other devices along with the data structure). Thus, there is a need for a scene tree structure that can be efficiently encoded by an encoder-side device as scene metadata (along with media, such as audio data) in a bitstream that is sent to a decoder side for spatially rendering the audio data of the bitstream.
To solve this problem, the present disclosure provides an audio codec system in which an encoder side (or encoder-side device) efficiently encodes one or more audio signals of an audio program into a bitstream along with scene metadata (hereinafter referred to as "metadata") arranged as a scene tree structure, and a decoder side (or decoder-side device) receives the bitstream and spatially renders the audio signals using the encoded metadata. In particular, the encoder side (which may be implemented by a programmed processor, e.g., one or more processors of the encoder-side device executing instructions stored in, or configured by, a memory of that device) receives an audio program that includes an audio signal associated with a sound source within a 3D scene of the audio program. For example, the audio program may be a musical composition in which the sound source is a sound (e.g., a singing voice) originating from a particular direction relative to the origin of the 3D scene. The encoder side encodes the audio signal of the audio program into the bitstream and encodes, into metadata of the bitstream, a scene tree structure comprising an origin of a first 3D scene (e.g., a 3D sub-scene) relative to an origin of a second 3D scene (e.g., the 3D global scene 30). The encoder side also encodes a position of the sound source relative to the origin of the first 3D scene, the position referencing the origin with an identifier. The encoded scene tree structure may define an initial configuration of the sound source relative to the first and second 3D scenes so that the sound source may be rendered by the decoder-side device. In particular, the scene tree structure defines an initial configuration such that when the decoder side starts spatially rendering the audio program (e.g., when the audio program is played back from a (e.g., starting) point along the playback duration), it may do so such that the sound source originates from its position within the first 3D scene (e.g., relative to the listener's position). The encoder-side device then transmits the bitstream to the decoder-side device. In one aspect, this scene tree structure, which uses identifiers to reference the origin of the first 3D scene in which the sound source is located, together with the other encoding aspects associated with the tree structure described herein, reduces the amount of data required to spatially describe the 3D scene, while still describing the relationship between the sound source and its respective origin in great detail. Thus, the encoding methods described herein describe positions using a minimal amount of data, which matters because capacity for storage or transmission of media content is limited. Due to this limited capacity, there is typically a tradeoff between bit-rate efficiency and the quality of experience of the media presentation, which means that the highest bit-rate efficiency provides the best overall result. The present disclosure provides an efficient way to encode 3D positions in a 3D scene at low bit rates for media applications.
Fig. 2 illustrates the 3D global scene 30 from Fig. 1 and a scene tree structure 36 defining a relationship between the locations of audio components (e.g., sound sources) within the 3D scene 30 (which may be associated with visual elements within the scene), according to one aspect. In particular, the figure shows a scene tree structure 36 that may be generated (e.g., and encoded into a bitstream) by the audio codec system described herein (e.g., its encoder side) to manage location information of the origins of one or more 3D scenes associated with one or more sound sources within the 3D global scene 30 of an audio program, for spatially rendering the audio program as described herein. In particular, the tree structure 36 may be (or include) a scene location payload within the bitstream that includes a number of leaves (or nodes), where each leaf within the tree structure may include a location (e.g., with location data) of an origin of a 3D sub-scene, through which one or more sound sources within the 3D scene are related to each other and/or to the scene as a whole. More is described herein regarding encoding locations within a 3D scene into a bitstream as a scene tree structure.
As shown, the 3D global scene 30 includes several 3D sub-scenes 35a-35c, each of which is a separate 3D scene within the (e.g., global) 3D scene 30 and may have one or more associated sound sources (located therein). In one aspect, a 3D sub-scene may represent a structure or location within the video content of a media program, where one or more audio signals of the media program include sound (e.g., sound sources) associated with the structure or location. For example, the 3D scene 30 includes a first sub-scene 35a of the tour bus 31 and the person 34 (e.g., including or associated with the tour bus and the person), a second sub-scene 35b of the ship 33, and a third sub-scene 35c of the cabin 32 within the ship 33. In one aspect, each sub-scene may include the location of a sound source. In this case, each of the sub-scenes may include the location of a sound source associated with a visual object within that sub-scene. For example, the first 3D sub-scene 35a includes sound source locations (or origins) 39a-39c of sound sources 46a-46c, the second 3D sub-scene 35b includes sound source locations 39d and 39e of sound sources 46d and 46e, and the third 3D sub-scene 35c includes sound source locations 39f and 40 of sound sources 46f and 46g.
As described herein, the present disclosure provides a scene tree structure 36 that describes the locations of sub-scenes relative to each other or relative to a 3D global scene. In particular, the structure includes one or more 3D sub-scene locations within the global 3D scene, wherein each of the locations of the 3D sub-scenes may be defined relative to a reference point (e.g., origin) of one or more other 3D sub-scenes and/or of the global scene. In particular, the scene tree structure may include position data (information) about each origin it represents, with the position data of one origin being relative to another origin. In one aspect, the sub-scenes 35a and 35b (e.g., their origins) are located relative to the global scene origin 37 of the 3D global scene 30. For example, the sub-scene origin 38a of the first sub-scene 35a is located relative to the global scene origin 37, and the sub-scene origin 38b of the second sub-scene 35b is located relative to the global scene origin 37. Alternatively, the origin of a sub-scene may be located relative to another sub-scene (and/or sound source). For example, the sub-scene origin 38c is located relative to the sub-scene origin 38b of the sub-scene 35b. This may be because the interior of the cabin 32 is part of (or associated with) the ship 33. Thus, any movement of a sub-scene origin will be relative to the sub-scene origin to which it refers, in this case the origin 38b. More is described herein regarding moving a sub-scene origin.
The figure also shows the locations of the sound sources in relation to the origins of the sub-scenes. Specifically, the sound source positions 39a-39c of the respective sound sources 46a-46c are defined relative to the sub-scene origin 38a of the first sub-scene 35a (e.g., within the tree structure), the sound source positions 39d and 39e of the respective sound sources 46d and 46e are defined relative to the sub-scene origin 38b of the second sub-scene 35b, and the sound source positions 39f and 40 of the respective sound sources 46f and 46g are defined relative to the sub-scene origin 38c of the third sub-scene 35c. In one aspect, the location of a sound source may also be defined relative to the global scene origin. As described herein, by defining the locations of sound sources relative to the origins of the scene tree structure, an audio renderer may spatially render the sound sources relative to the listener location based on the location of the listener relative to one or more origins of the structure. More is described herein regarding rendering.
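To illustrate how positions defined against chained origins can be resolved, the sketch below walks the chain of origin references from a sound source up to the global scene origin (rotation is ignored for simplicity). The identifier values and offsets are made up for illustration and are not taken from the figures.

# Illustrative resolution of a sound-source position into global-scene
# coordinates by following the chain of origin identifiers in the scene tree.
# Identifier 0 is assumed to denote the global scene origin; offsets are invented.

origins = {                       # id -> (parent id, offset relative to parent)
    0: (None, (0.0, 0.0, 0.0)),   # global scene origin 37
    1: (0, (-20.0, 5.0, 0.0)),    # sub-scene origin 38a (tour bus)
    2: (0, (150.0, 40.0, 0.0)),   # sub-scene origin 38b (ship)
    3: (2, (-8.0, 2.0, 1.5)),     # sub-scene origin 38c (cabin, relative to ship)
}

def resolve(origin_id, local_pos):
    x, y, z = local_pos
    while origin_id is not None:
        parent, (ox, oy, oz) = origins[origin_id]
        x, y, z = x + ox, y + oy, z + oz
        origin_id = parent
    return x, y, z

# Sound source 46f at position 39f, expressed relative to cabin origin 38c:
print(resolve(3, (1.0, 0.0, 0.0)))   # -> position in global-scene coordinates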
In one aspect, a 3D scene may have different types of sound sources whose locations are defined by the scene tree structure 36. For example, sound sources 46a-46f may be "active" sound sources, meaning that they each produce the sound of one or more audio signals associated with the source. For example, to generate the sound of the tour bus engine 46a, the audio program may include audio data (as one or more audio signals) containing the sound associated with the engine, which is perceived at location 39a when spatially rendered according to its location (relative to the listener's location). In another aspect, an active sound source may be defined based on a (e.g., visual) object within the 3D scene. In particular, the source may represent sound produced by an object (or location) within the video content of the media program that is arranged (e.g., in a real physical environment) to produce sound, such as an operating engine would produce sound in the physical world. In another aspect, a 3D scene may have other sound sources, such as sound source 46g, which is a "passive" sound source, meaning that it produces reflected or diffracted sound from one or more active sound sources. For example, the passive sound source may be a reflective surface arranged to reflect or diffract sound (e.g., in the physical world), which in the illustrated example is a window. For example, the passive sound source in the cabin 32 may be arranged to produce reflected or diffracted sound of the person's voice 46f when the person within the cabin speaks. More is described herein regarding passive and active sound sources.
In one aspect, the scene tree structure may enable efficient encoding of sub-scene origins and provide accurate rendering of sound source locations within the 3D scene 30. For example, when the scene tree structure is used by an audio renderer on the decoder side to spatially render the audio content of an XR environment, the audio renderer may adjust the spatial rendering based on the listener's position in the XR environment. For example, in an XR environment, a listener (e.g., a listener's avatar) may be able to move close to or far away from a sound source. Therefore, the position information should be provided with sufficient resolution even when the listener approaches the sound source down to the shortest allowable distance. Furthermore, if the ship 33 sails away from the port and can still be heard at a distance of a few kilometers, conventional distance coding based only on a global reference point may require a resolution better than one meter to allow an accurate presentation in case the listener has boarded the ship instead of staying in the port. Thus, for large distances, encoding sound source positions with respect to only a single reference point (e.g., within a 3D global scene) is not the most efficient method. Accordingly, the present disclosure provides a scene tree structure that includes location data for origins, such as defining the location of the ship as a sub-scene origin relative to the global scene origin 37. This also allows the encoder side to encode the sound source positions relative to the sub-scene origin within the scene tree structure.
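A back-of-the-envelope calculation illustrates the resolution argument: with a fixed bit budget, the quantization step grows with the chosen maximum distance, so coding a source against a nearby sub-scene origin yields a much finer resolution than coding it against a global origin kilometers away. The uniform ten-bit quantizer below is an assumption used only to make the comparison concrete.

# Quantization step of a signed, uniform quantizer over the chosen maximum distance.

def resolution(max_distance_m, bits=10):
    return 2.0 * max_distance_m / (1 << bits)

print(resolution(5000.0))   # ~9.8 m  : source coded against a global origin at km range
print(resolution(16.0))     # ~0.03 m : same source coded against the ship's own origin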
In another aspect, the present disclosure provides an encoding method that reduces redundancy of locations within a 3D scene (e.g., to be more efficient in low-bit-rate systems). For example, conventional distance coding methods update the position of each sound source relative to a reference point within the 3D scene. However, some sound sources may share (or be part of) a common sub-scene. For example, sound sources 46d and 46e are associated with the ship 33. As the ship moves, its associated sound sources also move relative to a reference point within the 3D global scene, but remain stationary relative to a reference point (e.g., origin) associated with the ship. Since these sound sources share the same motion as the ship, there is significant redundancy in separate updates for each sound source. The present disclosure reduces (or eliminates) this redundancy by relating sound sources to sub-scene origins, which provides a more efficient and intuitive encoding method.
As described herein, the decoder side (which may be implemented as (or by) a programmed processor of the decoder-side device) obtains a bitstream comprising an encoded audio signal and an encoded scene tree structure defining an initial configuration of the 3D scene. The decoder side determines the position of the listener relative to an origin of the 3D scene. In one aspect, the "listener position" may refer to a location and/or orientation within the 3D scene that the audio renderer may use to spatially render sound sources of the 3D scene such that a listener may spatially perceive the sound of the sound sources as if the listener were located at the listener position. In particular, the listener may be located at the origin itself, or may be located somewhere else within the 3D scene. For example, referring to Fig. 2, the listener position may be located within the 3D scene 30 (e.g., adjacent to the tour bus 31), or may be within the tour bus itself (e.g., sitting on its top deck). Where the 3D scene belongs to (or represents) an XR environment, the position of the listener may correspond to the position of the listener's avatar within the XR environment.
In addition, the present disclosure may provide efficient encoding for location updates. As described herein, the audio codec system may generate a scene tree structure defining an initial configuration of audio component locations within a 3D scene that an audio renderer (of a decoder-side device) may use to spatially render the audio components. However, some sub-scenes within a 3D scene may be dynamic such that they move within the 3D scene (e.g., relative to the origin of the 3D scene). For example, turning to Fig. 2, the ship 33 may be moving (e.g., relative to the global scene origin 37). In this case, the sound source locations associated with the ship (e.g., 39d and 39e) may remain stationary relative to the ship 33 (e.g., relative to the sub-scene origin 38b of its sub-scene 35b), as described herein. Thus, the encoder side of the audio codec system may be configured to provide a position update of the ship 33 (e.g., of its sub-scene origin 38b) without other position data, such as position data of the sub-scene origin 38a or position data of any of the ship's sound sources (which may have been provided with the initial configuration), since these sound sources have remained stationary relative to the origin 38b.
In particular, the present disclosure describes an encoder-side method of receiving an audio signal of an audio program (e.g., from or via a media application), wherein the audio signal is for a 3D scene of the audio program, and determining that a 3D sub-scene exists within the 3D scene. Referring to Fig. 2, the encoder side may determine that the ship is a sub-scene 35b based on determining that the ship 33 is associated with sound sources (e.g., the sound of the paddle wheels and the sound of the chimney). More is described herein regarding determining that a 3D scene includes sub-scenes. The encoder side determines 1) a position of the 3D sub-scene within the 3D scene (e.g., by determining a position of an origin of the sub-scene) and 2) a position of a sound source of the audio signal within the 3D sub-scene (e.g., relative to the origin of the sub-scene). The encoder side generates a first bitstream by encoding the audio signal and including a first set of metadata having the position of the 3D sub-scene and the position of the sound source (e.g., as an initial configuration of the 3D scene). The encoder side then determines that the position of the 3D sub-scene has changed. For example, the encoder side may determine that sub-scene 35b has moved based on the movement of the ship (e.g., as indicated by the audio program). The encoder side generates a second bitstream comprising the encoded audio signal and a second set of metadata having the changed position of the 3D sub-scene. Thus, the encoder side may reduce the required bit rate by sending a location update for the sound source (and/or sub-scene) locations that have moved within the 3D scene while omitting location data for stationary sources and sub-scenes. To describe the motion of the audio components within a moving sub-scene, it is sufficient to dynamically update the position of the sub-scene instead of having to update the positions of all audio components.
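A minimal sketch of this encoder-side update logic, assuming the encoder keeps the previously transmitted positions: only objects whose position changed relative to their parent since the last payload are written into the update, and everything else is omitted. The dictionary-based payload and the identifiers are illustrative.

# Sketch of building an update payload containing only origins/sources that moved
# relative to their parent since the previous payload.

def build_update_payload(prev_positions, curr_positions):
    payload = {}
    for object_id, pos in curr_positions.items():
        if prev_positions.get(object_id) != pos:     # moved relative to its parent
            payload[object_id] = pos
    return payload

prev = {"origin38a": (-20, 5, 0), "origin38b": (150, 40, 0), "src39d": (3, 0, -1)}
curr = {"origin38a": (-20, 5, 0), "origin38b": (180, 55, 0), "src39d": (3, 0, -1)}
print(build_update_payload(prev, curr))             # {'origin38b': (180, 55, 0)}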
The present disclosure also describes a decoder-side method for spatially rendering 3D scenes based on location updates, as described herein. For example, the decoder side receives a first bitstream comprising an encoded version of an audio signal of a 3D scene and a first set of metadata having 1) a position of a 3D sub-scene within the 3D scene and 2) a position, within the 3D sub-scene, of a sound source associated with the audio signal. The decoder side determines the position of the listener within the 3D scene (e.g., where the listener position may be relative to the global scene origin, or relative to the sub-scene origin when the listener is located within the sub-scene). The decoder side uses the audio signal to spatially render the 3D scene, producing the sound source at its position relative to the position of the listener. The decoder side then receives a second bitstream comprising a second set of metadata having a different position of the 3D sub-scene within the 3D scene, and adjusts the spatial rendering of the 3D scene such that the position of the sound source changes to correspond to the movement of the 3D sub-scene from its original position to its new position. Thus, the decoder side is configured to use the position update of the sub-scene to adjust the position of the sound source in accordance with the movement of its associated sub-scene.
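On the decoder side, applying such an update amounts to recomputing the affected sources' positions relative to the listener, since each source position is stored relative to its sub-scene origin; sources whose sub-scene did not move need no new data. The vector arithmetic below (rotation ignored) and the example coordinates are illustrative assumptions.

# Sketch of recomputing a source's listener-relative position after a sub-scene
# position update; the sub-scene-relative source offset itself never changes.

def relative_to_listener(subscene_origin, source_local, listener_global):
    source_global = tuple(o + s for o, s in zip(subscene_origin, source_local))
    return tuple(s - l for s, l in zip(source_global, listener_global))

listener = (0.0, 0.0, 0.0)                      # e.g., a listener standing at the port
wheel_local = (3.0, 0.0, -1.0)                  # source 39d, fixed relative to the ship
print(relative_to_listener((150.0, 40.0, 0.0), wheel_local, listener))  # initial config
print(relative_to_listener((180.0, 55.0, 0.0), wheel_local, listener))  # after update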
Fig. 3 shows a system 41 (e.g., an audio system) for generating a bitstream comprising encoded (e.g., scene) metadata for spatially rendering 3D scenes of an audio program. In particular, the system includes a playback (or audio playback) device 44, an output (or audio output) device 45, a (e.g., computer) network 43 (e.g., the internet), and a media content device (or server) 42. In one aspect, the system 41 may include more or fewer elements, such as having one or more (additional) servers, or no playback devices. In this case, the output device may be communicatively coupled (e.g., directly) to the media content device, as described herein.
In one aspect, the media content device 42 may be a stand-alone electronic server, a computer (e.g., a desktop computer), or a cluster of server computers configured to perform digital (audio) signal processing operations, as described herein. In particular, the content device may be configured to generate and/or receive media (e.g., audio, video, and/or audio/video (A/V)) programs, which may include one or more audio components, and may be configured to perform the encoder-side operations described herein to generate a bitstream having an encoded audio program and associated metadata (or scene metadata). As shown, the media content device 42 is communicatively coupled (e.g., via the network 43) to the playback device 44 for providing digital audio data and metadata using the encoded bitstream. More is described herein regarding the operations performed by the media content device 42.
In one aspect, the playback device 44 may be any electronic device (e.g., having electronic components such as a processor, memory, etc.) capable of performing decoding operations on the bitstream to decode the encoded audio signal and extract the metadata associated with the audio signal, and of performing audio signal processing operations on the decoded audio signal in accordance with the extracted metadata. In another aspect, the playback device may be capable of spatially rendering the audio signal for spatial audio playback (e.g., via one or more speakers that may be integrated within the playback device and/or within the output device, as described herein) using one or more spatial filters, such as head-related transfer functions (HRTFs). In another aspect, the playback device may be configured to perform at least some encoder-side operations, as described herein. In some aspects, the playback device may be a desktop computer, a laptop computer, a digital media player, or the like. In one aspect, the device may be a portable electronic device (e.g., capable of handheld operation), such as a tablet computer, a smartphone, or the like. In another aspect, the device may be a head-mounted device, such as smart glasses, or a wearable device, such as a smart watch.
In one aspect, the output device 45 may be any electronic device that includes at least one speaker and is configured to output (or playback) sound by driving the speaker with one or more (e.g., spatially rendered) audio signals. For example, as shown, the device is a wireless headset (e.g., an in-ear headset or a wireless ear bud) that is designed to be positioned on (or in) the user's ear and is designed to output sound into the user's ear canal. In some aspects, the earbud may be of the sealing type having a flexible earpiece tip for acoustically sealing an entrance of the user's ear canal from the surrounding environment by blocking or occluding in the ear canal. As shown, the output device includes a left earpiece for a left ear of the user and a right earpiece for a right ear of the user. In this case, each earpiece may be configured to output at least one audio channel of audio content (e.g., a right earpiece outputs a right audio channel of a dual channel input of a stereo recording (such as a musical piece) and a left earpiece outputs a left audio channel). In another aspect, each earpiece may be configured to play back one or more spatially rendered audio signals. In this case, the output device may play back binaural audio signals generated using one or more HRTFs, with the left earpiece playing back the left binaural signal and the right earpiece playing back the right binaural signal. In another aspect, the output device may be any electronic device comprising at least one speaker and arranged to be worn by a user and arranged to output sound by driving the speaker with an audio signal. As another example, the output device may be any type of headset, such as an ear-mounted (or on-the-ear) headset that at least partially covers the user's ear and is arranged to direct sound into the user's ear.
In some aspects, the output device 45 may be a head mounted device, as exemplified herein. In another aspect, the output device 45 may be any electronic device arranged to output sound into the surrounding environment. Examples may include a stand-alone speaker, a smart speaker, a home theater system, or an infotainment system integrated within a vehicle.
As described herein, the output device 45 may be a wireless headset. In particular, the output device 45 may be a wireless device communicatively coupled to the playback device 44 for exchanging digital data (e.g., audio data). For example, playback device 44 may be configured to establish a wireless connection with output device 45 via a wireless communication protocol (e.g., a bluetooth protocol or any other wireless communication protocol). During the established wireless connection, playback device 44 may exchange (e.g., send and receive) data packets (e.g., internet Protocol (IP) packets) with output device 45, which may include audio digital data in any audio format.
Alternatively, the playback device 44 may be communicatively coupled with the output device 45 via other methods. For example, both devices may be coupled via a wired connection. In this case, one end of the wired connection may be (e.g., fixedly) connected to the output device 45, while the other end may have a connector, such as a media jack or Universal Serial Bus (USB) connector, that plugs into a socket of the playback device 44. Once connected, playback device 44 may be configured to drive one or more speakers of output device 45 with one or more audio signals via a wired connection. For example, playback device 44 may transmit the audio signal as digital audio (e.g., PCM digital audio). In another aspect, the audio may be transmitted in analog format.
In some aspects, the playback device 44 and the output device 45 may be different (independent) electronic devices, as shown herein. Alternatively, the playback device 44 may be part of (or integrated with) the output device 45. For example, at least some of the components of the playback device 44 (such as one or more processors, memory, etc.) may be part of the output device 45, and/or at least some of the components of the output device 45 may be part of the playback device. In this case, at least some of the operations performed by the playback device 44 may be performed by the output device 45. For example, the output device 45 may be configured to perform one or more decoder-side operations described herein to decode the received bitstream for spatially rendering audio content using metadata of the received bitstream.
Fig. 4 is a block diagram of an audio codec system 10 (or system) described herein that generates a bitstream of encoded audio content and scene metadata at the encoder side, and that receives the bitstream at the decoder side and uses the scene metadata to spatially render the audio content, according to one aspect. The system 10 includes a media source 13, an encoder side 11, and a decoder side 12. In one aspect, media source 13 may be any type of electronic device (e.g., media content device 42) from which system 10 receives media programs 16 (e.g., audio programs, such as musical compositions). Alternatively, the media source may be a (e.g., internal) memory of the encoder-side device.
In one aspect, the encoder side 11 may be implemented, for example in one or more (encoder-side) devices, by one or more processors executing or being configured by instructions stored in a memory, which is commonly referred to herein as a "programmed processor." For example, the encoder side may be implemented by the media content device 42 and/or by one or more servers communicatively coupled to one or more devices via the internet. Similarly, the decoder side 12 may be implemented by a programmed processor of one or more (decoder-side) devices, such as the playback device 44 and/or the output device 45 of the system 41.
In one aspect, audio codec system 10 may perform operations for encoding and decoding audio data and/or metadata of an audio program in real-time. In this case, the digital signal processing operations described herein may be performed continuously (e.g., periodically) on the audio data stream of the media program. In particular, operations may be performed from the beginning of a media program (or the start time at which the media program will be streamed) to the end of the media program (or the stop time at which the media program is no longer streamed in real-time). In some aspects, the operations may be performed periodically such that the audio codec system performs the operations for one or more segments of the media program being received and streamed for playback at the decoder-side device.
The encoder side 11 will now be described. The encoder side 11 receives a media program comprising (e.g., optional) video content 17 (e.g., as one or more video signals) and audio content 18 (e.g., as one or more audio signals), wherein the audio content may comprise one or more audio components associated with (e.g., being part of or constituting) a 3D (e.g., acoustic) scene of the media program. In one aspect, video content 17 may optionally be received (as shown as being associated with a dashed line). For example, the media program may be an audio program that includes only audio content 18. In one aspect, a media program may include data associated with audio (and/or video) content. For example, the program may include spatial parameters (e.g., spatial characteristics) associated with the audio component. For example, the audio component may be an audio object, and the program may include one or more audio signals for the object and include spatial parameters indicating a location of the object (e.g., a sound source within a 3D scene associated therewith) for spatially rendering the audio object. In another aspect, the program may include additional data, such as acoustic parameters that may be associated with one or more sound sources of the program. More about acoustic parameters is described herein.
As shown, the encoder side 11 includes several operational blocks for performing one or more digital signal processing operations described herein. For example, the encoder side 11 includes a sound source/scene position (e.g., origin of scene) identifier 14 (hereinafter referred to as "identifier") and an encoder 15. The identifier 14 may be configured to identify sound sources within one or more 3D scenes of the media program. In one aspect, the identifier may identify one or more sound sources (e.g., as one or more audio signals) in a 3D scene of the program by performing a spectral analysis (e.g., blind source separation (BSS)) of the audio content of the program. Alternatively, the identifier may identify the sound sources contained within the program using any type of digital signal processing operation on the program (e.g., on one or more audio signals of the program).
On the other hand, the identification of the sound source may be based on information (e.g., data) within the program. For example, a program may include one or more audio objects, each object associated with a sound source of the program. On the other hand, the program may include data (e.g., metadata) indicating the presence of one or more sound sources within the 3D scene of the program. More content is described herein regarding identifying sound sources from data of programs.
In one aspect, the identifier 14 may be configured to determine the location of the identified sound source within the 3D scene. In particular, the identifier may determine a location within the scene relative to a reference point within the 3D scene (e.g., a global scene origin, such as origin 37 in fig. 2). In some aspects, the global scene origin may be located at the origin of the coordinate system (e.g., in cartesian coordinates, the location may be x=0, y=0, and z=0). Thus, the identified location may comprise coordinates of a coordinate system and/or may comprise a rotation parameter that may indicate the orientation (relative to the origin) of the sound source. In one aspect, the location of the sound source may be based on data received within the media program. For example, when the media program includes one or more audio objects, the objects may include location information that the identifier may use to determine the location of the objects within the 3D scene. In another aspect, the identifier may perform a sound localization function to determine the location of the identified sound source.
As described above, the identifier may perform acoustic analysis (e.g., spectral analysis) on audio data of the media program to identify one or more sound sources and/or source locations/orientations within the 3D scene. In one aspect, this may be the case when the media program is an audio program (e.g., has only audio content). In another aspect, the identifier may be configured to perform (e.g., in addition to or instead of acoustic analysis) one or more operations to identify the sound source (and/or its position/orientation) based on the video content of the media program (e.g., the a/V program). For example, the identifier may perform an image analysis function (e.g., an object recognition algorithm) on video content of the program to identify objects within the 3D scene that may be associated with the sound source. For example, referring to fig. 1, the identifier may determine that the mouth of person 34 is moving, indicating that person 34 is speaking. Thus, the identifier may determine that the person 34 is a sound source and that the position/orientation of the sound source is at (or near) his mouth when the person speaks. Thus, the identifier may determine whether an object within the 3D scene produces sound of the 3D scene and/or a location of a sound source (e.g., associated with the production) based on the identified visual object within the scene. In addition to (or instead of) identifying the sound source, the identifier may perform an image analysis function to identify the localization of the sound source (e.g., based on the localization of the object relative to the global scene origin of the 3D scene).
As described herein, a media program may include active sound sources and/or passive sound sources. As described above, the identifier is configured to identify (and/or determine the location of) the active sound source. In another aspect, the identifier may be configured to identify a passive sound source. For example, the identifier may receive an indication that the 3D scene includes a passive sound source with the media program (e.g., as metadata). On the other hand, the identifier may identify the passive sound source based on the performance of the image analysis function, as described herein. The passive sound source may be a source of reflected or diffracted sound arranged to produce sound within the 3D scene. Thus, the identifier may perform image analysis of at least a portion of the video content of the media program to identify objects (structures or locations) (e.g., surfaces thereof) within the 3D scene that may be associated with passive sound sources that may be arranged to reflect or diffract sound. For example, referring to fig. 2, the identifier may identify a passive sound source on a window of the interior cabin 32 based on determining that the window has a flat surface (e.g., typically) known to produce reflected or diffracted sound (e.g., produced in the physical world). Once identified, the identifier may determine the location of the passive sound source, as described herein (e.g., based on its localization within the 3D scene relative to a reference point).
In one aspect, the identifier may be configured to determine one or more acoustic parameters associated with the passive sound source. In particular, since the passive sound source produces reflections or diffractions of sounds produced within the 3D scene, the media program may not include audio data associated with the passive sound source. Instead, the identifier may be configured to determine acoustic parameters that may be used by a decoder side (e.g., an audio renderer at the decoder side) to generate one or more audio signals for the passive sound source based on one or more other spatialized sounds (e.g., of the one or more active sound sources) within the 3D scene. In one aspect, the acoustic parameters may include at least one of: a diffusion level (e.g., based on the 3D scene), a cut-off frequency, a frequency response, a geometry of surfaces and/or objects associated with passive sound sources, acoustic surface parameters of objects, reflectivity values, absorption values, and a material type of objects. In one aspect, at least some of these parameters may be predefined. In another aspect, at least some of these parameters may be determined based on image analysis of an object associated with the passive sound source. In some aspects, at least some of these parameters may be received (e.g., as part of the media program).
The identifier 14 is configured to determine whether the 3D scene includes one or more 3D scenes (e.g., 3D sub-scenes) within the 3D scene (e.g., of which one or more sound sources are a part (or in which they are located)). In one aspect, the identifier may determine whether the 3D scene includes a 3D sub-scene based on the behavior of sound sources within the 3D scene. For example, the identifier may determine whether a sound source within the 3D scene moves with (or has the same trajectory as) one or more other sound sources within the 3D scene. In particular, the identifier may determine whether the position of the sound source moves with the position of another sound source. If so, this may indicate that the two sound sources are associated with each other. Referring to fig. 2, the identification of the second sub-scene 35b may be based on the two sound sources 46d and 46e (e.g., in particular their respective positions 39d and 39e within the 3D scene 30) moving along the same trajectory (e.g., moving in a forward direction as the ship 33 is propelled forward). In another aspect, the determination may be based on the audio content and the video content of the media program. Returning to the previous example, the identifier may identify the second sub-scene based on identifying the ship 33 within the 3D scene 30 (e.g., based on an object recognition algorithm) and determining that the locations 39d and 39e are on the identified ship. Alternatively, the identifier may determine that the sound source is part of the same sub-scene based on the sound that the sound source is producing. For example, the identifier may perform an acoustic analysis of the sound source 46e of the chimney to determine that it produces the sound of a ship's chimney. The identifier may then determine that the location is associated with sub-scene 35b, since that sub-scene is the sub-scene of the ship that would produce such a sound.
In one aspect, the identifier may be configured to determine whether one or more 3D sub-scenes are nested within other 3D sub-scenes within the 3D scene. For example, the identifier may be configured to identify a 3D sub-scene that may be separated from (and/or contained within) another 3D sub-scene based on a logical separation between the scenes. For example, referring to fig. 2, the identifier may identify the 3D sub-scene 35c as being separate from the second sub-scene 35b (e.g., inside the ship 33), as the third sub-scene is within the cabin inside the ship 33.
As described above, the identifier may identify a sub-scene based on the behavior of the sound sources of the sub-scene (e.g., moving along the same trajectory). On the other hand, the identifier may determine that the 3D scene includes a 3D sub-scene based on the (e.g., new) behavior of a sound source relative to an already identified 3D sub-scene. For example, the identifier may determine that a sound source location has the same trajectory as an (e.g., existing) 3D sub-scene, and may include the sound source as part of that 3D sub-scene. For example, referring to fig. 2, the identifier may identify (or include) the tour bus 31 as a first sub-scene 35a, and may then determine that the sub-scene 35a includes the person 34 based on the movement of the person 34 toward (or with) the tour bus 31.
In some aspects, the identifier 14 may be configured to determine a location (e.g., origin) of the identified 3D sub-scene. In particular, the identifier 14 may determine a position of the identified 3D sub-scene as an origin of the 3D sub-scene relative to another origin within the 3D scene. For example, the identifier may assign (or specify) a location within the 3D scene as the origin of the sub-scene. In one aspect, the origin may be defined based on the positioning of one or more sound sources within the sub-scene. For example, the origin of a sub-scene may be centered with respect to the location of one or more sound sources of the sub-scene (e.g., centered within the 3D space of the sub-scene). In one aspect, the origin of the 3D sub-scene may be located relative to a reference point (e.g., global scene origin) of the (e.g., global) 3D scene. However, if the first 3D sub-scene is within the second 3D sub-scene, the origin of the first 3D sub-scene may be defined relative to the origin of the second 3D sub-scene, as shown in fig. 2, with the origin 38c relative to the origin 38b. More about defining the location of the origin of a 3D sub-scene is described herein.
In one aspect, the identifier may identify a location of an origin of one or more 3D sub-scenes to generate a scene tree structure (e.g., structure 36 in fig. 2), and from this structure, the identifier may determine a location of the sub-scene sound source such that the location of the sound source is relative to its respective sub-scene. On the other hand, the location of the sound source may be part of a scene tree structure. Alternatively, the locations of the sound sources may be separate from the scene tree structure (e.g., each sound source is associated with or has a separate payload that includes the location data of the corresponding source). More about determining a location is described herein.
As described above, the identification of sound sources and/or sub-scenes and their locations (origins) may be based on an analysis of the audio content and/or video content of the media program. On the other hand, the identification may be based on user input. For example, encoder side 11 may receive user input (e.g., via an input device such as a touch-sensitive display) identifying sources and/or sub-scenes and their positioning within the 3D scene of the program.
The identifier 14 may be configured to generate location data as location 19, the location data comprising: 1) a position (e.g., origin) and/or orientation of a sound source (e.g., origin relative to a 3D scene of a media program), 2) a position (e.g., origin) and/or orientation of one or more 3D scenes (e.g., 3D sub-scenes and 3D global scenes), and/or 3) other data, such as an indication of which sound sources/3D sub-scenes are associated with a position, one or more acoustic parameters, etc. The identifier 14 may be configured to provide the location 19 to the encoder 15. In particular, the location 19 may include an origin as a scene tree structure generated by identifiers identifying the locations (and/or characteristics) of origins of the 3D sub-scenes relative to each other. The location 19 may also include a (e.g., individual) location payload or location data of a location payload for each sound source, where each location payload references at least one origin from the scene tree structure.
In one aspect, encoder 15 may be configured to receive audio content 18 (e.g., as one or more audio signals) and data generated by identifier 14 (e.g., location 19), and may (optionally) receive video content 17 (e.g., as one or more video signals), and may be configured to generate bitstream 20 having the audio (and/or video) content and including (e.g., scene) metadata 21 based on location 19. For example, the encoder 15 may be configured to encode the scene tree structure as scene metadata (e.g., using the location of one or more origins of the 3D scene), and/or the encoder 15 may be configured to encode the location of sound sources within the 3D scene relative to the encoded scene tree structure as metadata 21. For example, referring to fig. 2, an encoder may encode the following as an encoded scene tree structure: 1) the global scene origin 37, 2) the origins (e.g., 38a and 38b) of 3D sub-scenes (e.g., 35a and 35b) relative to the global origin, and 3) the origins of other 3D sub-scenes relative to the origins of other sub-scenes (such as origin 38c of sub-scene 35c). The encoder may also encode the locations of the sound sources (e.g., 39a-39f and 40) relative to the encoded origins of the 3D scene and/or the 3D sub-scenes. More content is described herein regarding encoding scene tree structures defining relationships between audio components of 3D scenes of a media program.
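As an illustration only, the relationships described above might be held in memory (before encoding) roughly as follows; the class names, field names, and numeric offsets are hypothetical assumptions for this sketch and are not part of the bitstream syntax described herein.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SceneOrigin:
    # Hypothetical in-memory node of the scene tree; names are illustrative.
    origin_id: int                      # unique identifier referenced by children
    parent_id: Optional[int]            # None for the global scene origin (e.g., origin 37)
    position: tuple = (0.0, 0.0, 0.0)   # Cartesian offset relative to the parent origin
    rotation: tuple = (0.0, 0.0, 0.0)   # orientation relative to the parent origin

@dataclass
class SourcePosition:
    # Position payload of one sound source; it references an origin by identifier.
    source_id: int
    origin_id: int                      # identifier of the origin the coordinates are relative to
    position: tuple = (0.0, 0.0, 0.0)
    rotation: tuple = (0.0, 0.0, 0.0)

# An initial configuration loosely mirroring fig. 2: a global origin, two sub-scene
# origins linked to it, one nested sub-scene origin, and sources linked to origins.
scene_tree: List[SceneOrigin] = [
    SceneOrigin(0, None),                                  # global scene origin 37
    SceneOrigin(1, 0, position=(5.0, 0.0, 0.0)),           # sub-scene 35a (tour bus)
    SceneOrigin(2, 0, position=(-8.0, 0.0, 2.0)),          # sub-scene 35b (ship)
    SceneOrigin(3, 2, position=(1.0, -0.5, 0.0)),          # sub-scene 35c nested in 35b
]
source_positions: List[SourcePosition] = [
    SourcePosition(0, origin_id=2, position=(0.0, 1.5, 3.0)),  # e.g., the chimney source
    SourcePosition(1, origin_id=3, position=(0.5, 0.0, 0.2)),  # a source inside the cabin
]
```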
In one aspect, encoder 15 may encode audio content associated with a media program (e.g., when the media program is an audio program) according to any audio codec, such as Advanced Audio Coding (AAC). On the other hand, when the media program is an a/V program, the encoder may encode audio and video content according to an a/V codec, such as one defined by a Moving Picture Experts Group (MPEG) standard. The encoder 15 may be configured to send the bitstream 20 (together with the encoded metadata 21) to the decoder side 12 via the network 43. In particular, the electronic device performing the encoder operations may send the bitstream to another electronic device that will perform (or is performing) the decoder operations (and the spatial rendering operations to play back the spatial audio). In one aspect, the encoder side 11 may store (at least a portion of) the bitstream 20 in a memory (e.g., local or remote) (e.g., for later transmission).
The decoder side 12 will now be described. The decoder side 12 comprises several operational components such as a decoder 22, an audio renderer 23, a listener position estimator 25, an (optional) display 24, and two loudspeakers 26a and 26b. In one aspect, the decoder side may include more or fewer components, such as having more displays and/or speakers, or no displays (e.g., that are part of the decoder side device performing the decoder side operations).
The decoder side 12 receives a bitstream 20, which is generated and transmitted by the encoder side 11. In particular, decoder 22 may be configured to receive a bitstream 20 comprising an encoded version of audio content 18 (e.g., as one or more audio signals) and/or an encoded version of video content 17 (e.g., as one or more video signals) associated with media program 16, and (encoded) metadata 21 comprising: 1) an encoded scene tree structure describing a positional relationship between origins of one or more 3D scenes of the media program, and 2) a position of one or more sound sources of the media program. For example, when the media program is an audio program, the bitstream 20 received by the decoder side 12 may include encoded versions of one or more audio signals associated with one or more sound sources within a 3D scene of the audio program, along with a scene tree structure (e.g., representing an initial configuration of the 3D scene, as described herein). The decoder may be configured to decode the encoded content and metadata 21 (e.g., according to an audio codec used by the encoder to encode the bitstream 20).
In some aspects, the display 24 is designed to present (or display) digital images or video (or image) content of one or more video signals. In one aspect, display 24 is configured to receive decoded video content 17 from decoder 22 for display.
The listener position estimator 25 may be configured to estimate (determine) the position of the listener (e.g., within a 3D scene of the media program) and provide the position to the audio renderer 23 for spatially rendering the media program. For example, where the 3D scene of the media program includes an XR environment in which a listener (e.g., an avatar of the listener) is participating, the estimator 25 may determine the position of the listener within the XR environment. Such determination may be based on user input through one or more input devices (not shown) that may be used by a listener to navigate the XR environment. On the other hand, the listener position may be located at (or near) the location of the global scene origin within the 3D scene. On the other hand, the position of the listener may be predefined within the 3D scene.
On the other hand, the estimator 25 may be configured to determine a change in the position and/or orientation of the listener within the 3D scene. For example, the estimator 25 may receive head tracking data from a head tracking device that may be coupled to a listener (e.g., a tracking device on a headset worn by the listener) and estimate the change from the head tracking data, which may be provided to an audio renderer that may be configured to use the change to adjust spatial rendering of the media program, as described herein.
The audio renderer 23 may be configured to receive the audio content 18 of the media program and to receive the position 19 of the global scene origin of the 3D scene of the media program, the position of the origin of one or more sub-scenes within the 3D scene and/or the position of (active and/or passive) sound sources within the 3D scene as indicated by the scene metadata 21 within the bitstream 20. Using these locations, the audio renderer may determine the location of the sound source within the global 3D scene relative to the location of the listener. For example, referring to fig. 2, the audio renderer may determine the position of the listener relative to the global scene origin 37 (or may receive it from the estimator 25) and may determine the position of the sound source relative to the global scene origin (e.g., the sound source position 39a) (e.g., based on the position of the sub-scene origin 38a relative to the global scene origin 37), and based on these determinations, determine the relationship between the sound source of the 3D scene and the listener position within the 3D scene.
The audio renderer is configured to spatially render the 3D scene (e.g., one or more sound sources therein) using the audio content 18 and the position data received from the decoder and/or estimator 25. In particular, the renderer is configured to generate a set of one or more spatially rendered audio signals by spatially rendering the audio signals associated with the one or more sound sources according to the sound source positions relative to the listener position. In particular, continuing with the previous example, the audio renderer 23 may use the relationship between the sound sources and the listener position (based on a scene tree structure) to determine one or more spatial filters. For example, the spatial filter may comprise a head-related transfer function (HRTF), which may be selected by the renderer 23 based on the position (and/or orientation) of the source relative to the listener position (and/or orientation). In one aspect, an audio renderer may generate a spatially rendered audio signal by applying one or more determined spatial filters to one or more audio signals of a media program. In one aspect, when the spatial filter is an HRTF, the spatially rendered audio signals may be a set of binaural audio signals. Using the generated spatially rendered audio signals, the renderer 23 may drive one or more speakers (e.g., speakers 26a and 26b) so that the speakers produce the sound of the sound sources at their determined locations in the acoustic space, as perceived by the listener. For example, the audio renderer may use the binaural audio signals to drive one or more speakers of a headset being worn by the user (e.g., where a left binaural signal drives a left speaker of the headset and a right binaural signal drives a right speaker of the headset).
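The rendering step might be sketched as follows, assuming a hypothetical HRTF lookup keyed by azimuth; the table contents and the single-tap filters are placeholders for real HRTF data, and elevation and distance cues are omitted for brevity.

```python
import math

def fir(signal, taps):
    # Naive FIR convolution; a real renderer would use a fast convolution engine.
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out

def render_binaural(audio, source_pos, listener_pos, hrtf_table):
    # Direction of the source relative to the listener (azimuth only, for brevity).
    dx = source_pos[0] - listener_pos[0]
    dz = source_pos[2] - listener_pos[2]
    azimuth = math.degrees(math.atan2(dx, dz))
    # Pick the closest measured HRTF pair; hrtf_table maps azimuth -> (left taps, right taps).
    nearest = min(hrtf_table, key=lambda a: abs(a - azimuth))
    left_taps, right_taps = hrtf_table[nearest]
    return fir(audio, left_taps), fir(audio, right_taps)  # left/right binaural signals

# Toy usage with placeholder single-tap "HRTFs" that only adjust interaural level.
hrtfs = {-90.0: ([1.0], [0.3]), 0.0: ([0.7], [0.7]), 90.0: ([0.3], [1.0])}
left, right = render_binaural([0.0, 0.5, 1.0, 0.5], (2.0, 0.0, 1.0), (0.0, 0.0, 0.0), hrtfs)
```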
Figs. 5-7 and 12 are flowcharts of processes 50, 60, 70 and 80, respectively, for performing one or more digital (e.g., audio) signal processing operations and/or network operations to encode and decode a bitstream having metadata. In one aspect, at least some of these operations may be performed by the encoder side 11 and/or the decoder side 12 of the system 10. For example, at least some of the operations of processes 50, 60, and/or 80 may be performed by an encoder side (e.g., any electronic device performing encoder side operations, such as media content device 42 of system 41); and at least some of the operations of processes 70 and/or 80 may be performed by a decoder side (e.g., any electronic device performing decoder side operations, such as playback device 44 and/or output device 45 of system 41).
Turning to fig. 5, this figure is a flow chart of one aspect of a process 50 performed at the encoder side 11 for encoding scene metadata and audio content into a bitstream for transmission to the decoder side. Process 50 begins with encoder side 11 receiving a media (e.g., audio) program that includes one or more audio signals and/or one or more video signals (at block 51). The encoder side identifies (at block 52) one or more sound sources associated with one or more audio signals within a 3D scene of the media program. As described herein, the identifier 14 may identify one or more sound sources by performing an acoustic analysis (e.g., BSS) on the one or more audio signals. Where the media program includes one or more video signals, the identifier 14 may identify a sound source based on image analysis (e.g., performing an object recognition function) to identify one or more objects associated with the sound source. The encoder side (optionally) identifies one or more passive sound sources associated with the 3D scene (at block 53). For example, the identifier may determine whether the media program indicates that the 3D scene includes a passive sound source (e.g., based on whether the program includes acoustic parameters). On the other hand, the identifier 14 may identify a structure or location within the video content of the media program that is associated with a passive sound source (e.g., a flat surface). In one aspect, upon identifying a passive sound source, the encoder side may be configured to determine one or more acoustic parameters associated with the identified passive source, as described herein.
The encoder side 11 identifies (at block 54) one or more 3D sub-scenes within the 3D scene. In particular, the identifier 14 may identify the 3D sub-scene based on sound sources within the 3D scene. For example, the identifier 14 may identify that a portion of the 3D scene is a 3D sub-scene based on whether (the positions of) sound sources are moving in the same trajectory (e.g., and within a threshold distance of each other within the 3D scene). As another example, the identifier may identify the sub-scene based on whether the sound sources are within a vicinity (e.g., a threshold distance) of each other. As another example, the identifier may identify a sub-scene based on whether the sound source is associated with similar types of audio content. On the other hand, the identification of sub-scenes may be based on image analysis of video content associated with the 3D scene, as described herein. For example, the identifier 14 may determine whether there are any structures (objects) or locations within the video content that will be associated with the 3D sub-scene.
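One possible sketch of the trajectory-based grouping described above is given below; the per-frame position tracks, the displacement comparison, and the tolerance value are assumptions made for illustration only.

```python
def displacement(track):
    # Per-frame displacement vectors of one source's position track [(x, y, z), ...].
    return [tuple(b[i] - a[i] for i in range(3)) for a, b in zip(track, track[1:])]

def same_trajectory(track_a, track_b, tol=0.1):
    # Two sources are considered to move together if their frame-to-frame
    # displacements stay within a small tolerance of each other.
    da, db = displacement(track_a), displacement(track_b)
    return all(max(abs(u[i] - v[i]) for i in range(3)) <= tol for u, v in zip(da, db))

def group_into_sub_scenes(tracks):
    # Greedy grouping: each group of co-moving sources becomes a candidate 3D sub-scene.
    groups = []
    for source_id, track in tracks.items():
        for group in groups:
            if same_trajectory(tracks[group[0]], track):
                group.append(source_id)
                break
        else:
            groups.append([source_id])
    return groups

# Example: sources 0 and 1 drift forward together (one sub-scene); source 2 is static.
tracks = {
    0: [(0, 0, 0), (0, 0, 1), (0, 0, 2)],
    1: [(1, 0, 0), (1, 0, 1), (1, 0, 2)],
    2: [(5, 0, 0), (5, 0, 0), (5, 0, 0)],
}
print(group_into_sub_scenes(tracks))  # [[0, 1], [2]]
```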
The encoder side 11 determines (at block 55) the location of the identified sound source and/or the location (e.g., origin) of the identified 3D sub-scene. Specifically, the encoder side determines the location of the origin of one or more identified 3D sub-scenes and determines the location of one or more identified sound sources relative to the origin of one or more of the identified 3D sub-scenes (e.g., the sound source associated with a particular sub-scene relative to the origin of that sub-scene, as described herein). In one aspect, the encoder side 11 may also identify other position information, such as rotation data indicating the orientation of the origin. The identifier may determine the location of the sound source, as described herein. For example, the media program may include location data associated with one or more sound sources. On the other hand, the identifier 14 may perform a sound localization function to identify the location of the sound source within the 3D scene. In some aspects, the identifier 14 may determine a location (e.g., origin) of the 3D sub-scene to correspond to a location of a structure or position within the 3D scene of the video content of the media program. On the other hand, the identifier may determine the location of the 3D sub-scene based on the location and/or characteristics of the sound sources, as described herein (e.g., whether the sound source locations are within proximity of each other).
In one aspect, the encoder side 11 may determine a relationship between origins of 3D sub-scenes (e.g., to determine a scene tree structure). For example, the encoder side may form a relationship between a 3D sub-scene and a 3D global scene, where the 3D sub-scene location is within a threshold distance from the global scene. On the other hand, the encoder side may determine the position of a sub-scene with respect to the origin of another sub-scene based on various conditions. For example, a sub-scene origin may be defined relative to another sub-scene origin, which is the origin of the sub-scene in which the first sub-scene is located. Alternatively, sub-scene origins may be defined relative to other sub-scene origins where those origins are within (or around) the same structure or location within the 3D scene.
In one aspect, based on this information, the encoder side may determine whether any sound sources will be associated with a particular 3D sub-scene (e.g., based on the sound sources having the same trajectory and/or being within the same vicinity of each other). In one aspect, the identifier may determine the location of the sound source relative to the origin of the 3D sub-scene with which the sound source is associated. For example, referring to fig. 2, the encoder side may determine the sound source location 39e of the chimney relative to the sub-scene origin 38b, since that origin is the origin of the ship 33 of which the chimney is a part. On the other hand, the encoder side may determine the position of the sound source relative to an origin that is located near the sound source (e.g., within a threshold distance of the sound source within the 3D scene).
The encoder side 11 (e.g., its encoder 15) encodes one or more audio signals (and/or one or more video signals) into a bitstream and into metadata (e.g., scene tree structure) of the bitstream that includes the locations of the one or more origins of the identified 3D sub-scene and that includes one or more sound sources of the media program (e.g., their location data and other associated data) (at block 56). Specifically, the encoder encodes the following into metadata of the bitstream: 1) A scene tree structure comprising at least one origin (e.g., origin of a 3D sub-scene) relative to another origin of another 3D scene (e.g., which may be another 3D sub-scene or a 3D global scene) of the audio program; and 2) the position of the sound source (of the audio program) relative to the at least one origin point, the position of the sound source referencing the at least one origin point using an identifier. In one aspect, the encoding metadata at the encoder side may define an initial configuration of the sound source relative to one or more 3D scenes of the audio program to be rendered by the decoder side. More about encoding and rendering a position of a sound source is described herein. The encoder-side 11 sends (at block 57) a bitstream comprising encoded metadata to the decoder-side device.
In one aspect, at least some of the operations described in process 50 may be performed sequentially. In another aspect, at least some of the operations may be performed concurrently with other operations. For example, the encoder side 11 may identify the sound source and determine its position within the 3D scene (e.g., simultaneously or nearly simultaneously). As described herein, at least some operations may be performed by the identifier 14 to identify sound sources (and/or origins) and determine their locations within the 3D scene. On the other hand, at least some of these operations may be performed by encoder 15 while the encoder is encoding the data into the bitstream (as a scene tree structure). On the other hand, the operation to send the bitstream may be optional, and the audio content, the determined locations, and/or the encoded bitstream may instead be stored in memory for later transmission.
Fig. 6 is a flow chart of one aspect of a process 60 performed at the encoder side 11 of the audio codec system 10 (e.g., by the encoder 15) for encoding into a bitstream a scene tree structure (e.g., determined by the encoder side) defining relationships between audio components within a 3D scene as metadata. Specifically, the encoder 15 encodes a scene tree structure (e.g., the location of the scene origin and/or sound source received by the identifier as location data 19) into the bitstream. In particular, this process 60 may be performed (e.g., at least in part) by the encoder 15 at block 56 in the process 50 of fig. 5 to encode the location determined by the identifier 14 into the metadata 21 of the bitstream 20. The process 60 begins with the encoder side 11 determining (at block 61) the location of the (e.g., global) scene origin within the 3D scene of the received media program. In one aspect, a 3D scene comprising one or more sound sources and/or 3D sub-scenes may comprise an origin as a global scene origin from which the positions of the sound sources and origin of the sub-scenes originate. In some aspects, the location may be located at the origin of a coordinate system, such as x=0, y=0, and z=0 of a cartesian coordinate system.
The encoder side 11 encodes a (global scene) origin of the 3D scene, the origin comprising a location of the origin (e.g. within a coordinate system) and an identifier identifying the origin (at block 62). In one aspect, the location may include location data describing the location of the origin, such as Cartesian coordinates having the origin of a coordinate system. In one aspect, the identifier may be a unique identifier of the origin.
The encoder side 11 determines whether any 3D sub-scene origins will be encoded into the scene tree structure with respect to the previously encoded origins (at decision block 63). In particular, encoder 15 may determine whether any 3D sub-scene origin has been defined within the 3D scene relative to the global scene origin (e.g., based on location 19 received from identifier 14). For example, referring to fig. 2, the encoder may identify a sub-scene origin 38a of the sub-scene 35a to be encoded, because the position of this origin within the 3D scene is linked to (relative to) the global scene origin 37. If so, the encoder encodes a new origin of the (identified) 3D sub-scene into the metadata (e.g., its scene tree structure) with respect to the (e.g., previously) encoded origin, the new origin having: 1) a new (e.g., unique) identifier for the new encoded origin, and 2) a location of the new origin relative to the encoded origin, the location of the new encoded origin referencing the (previous) encoded origin (at block 64) using the identifier of the (previous) encoded origin. Specifically, the location of the new encoded origin may include: 1) the location of a (previously) encoded origin (to which the new encoded origin is linked) referenced by its identifier (e.g., the unique identifier of that previously encoded origin), and 2) location data describing the location of the origin of the newly encoded 3D sub-scene relative to the previously encoded origin within the 3D scene. For example, the position data may include the location of the origin of the 3D sub-scene as coordinates within a coordinate system of the 3D scene. For example, the position data may include a set of Cartesian coordinates (e.g., x, y, and z coordinates) indicating the position of the origin relative to the origin of the 3D scene in a Cartesian coordinate system. As another example, the position data may include a set of spherical coordinates (e.g., an azimuth value, an elevation value, and a radius) that indicate the location of the origin relative to the origin of the 3D scene within a spherical coordinate system. In one aspect, the position data may include additional data, such as rotation data (rotation parameters) of the origin being encoded relative to the previously encoded origin. The location data may include a maximum distance parameter describing a maximum distance (e.g., relative to an origin) that may be encoded. In one aspect, at least some of the location data, such as encoded coordinate data (e.g., Cartesian coordinates), may be encoded relative to the maximum distance parameter. Alternatively, the location data may include other data, as described herein. More about how the encoder side 11 encodes metadata is described herein.
The process 60 returns to decision block 63 to determine whether there are any (e.g., additional) 3D sub-scene origins to be encoded within the scene tree structure relative to any previously encoded origins. Thus, the encoder may construct a scene tree structure comprising at least some of the origins of the 3D sub-scenes identified by the encoder side 11, which origins may be linked to global scene origins and/or other 3D sub-scene origins. As a result, each sub-scene origin (and global scene origin) may be assigned a unique identifier that may be used by one or more origins as reference origins.
However, if all (or at least some) of the origins of the 3D sub-scenes have been encoded, process 60 continues with determining whether there are any sound source locations associated with the encoded origins to be encoded (at decision block 65). If so, the encoder side 11 encodes (at block 66) the location of the sound source relative to the origin (e.g., of the 3D sub-scene) into (e.g., location payload) metadata. In one aspect, the encoded location may include data similar to that of an encoded origin, such as 1) an identifier of the origin referenced by the location of the sound source, and 2) location data (e.g., a set of Cartesian coordinates) of the sound source. In one aspect, the location data may also include orientation information about the sound source. For example, the identifier 14 may determine an orientation of the sound source relative to an origin of the 3D scene, and the orientation may be included (e.g., in 3D space) as part of the position 19 provided to the encoder. Thus, the encoder may include within the position data a rotation parameter indicative of the orientation of the sound source (e.g., relative to its origin). On the other hand, the location data may include characteristics of its associated sound source. For example, when the sound source is a passive sound source, the location data may include one or more acoustic parameters associated with the passive sound source, as described herein.
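A sketch of the ordering that blocks 62-66 imply is shown below, under the assumption that a parent origin is always encoded before any origin or sound-source position that references it; the record tuples simply stand in for the encoded payloads and are not the actual bitstream format.

```python
def encode_scene(origins, sources):
    """origins: {origin_id: {"parent": parent_id or None, "pos": (x, y, z)}}
    sources: {source_id: {"origin": origin_id, "pos": (x, y, z)}}
    Returns a flat list of records standing in for the encoded metadata payload."""
    records = []
    encoded = set()

    def encode_origin(origin_id):
        if origin_id in encoded:
            return
        parent = origins[origin_id]["parent"]
        if parent is not None:
            encode_origin(parent)          # a parent must be encoded before its children
        records.append(("origin", origin_id, parent, origins[origin_id]["pos"]))
        encoded.add(origin_id)

    for origin_id in origins:              # blocks 62-64: origins of the scene tree
        encode_origin(origin_id)
    for source_id, src in sources.items(): # blocks 65-66: sound-source positions
        records.append(("source", source_id, src["origin"], src["pos"]))
    return records

# Example: origin 0 = global scene origin, 1 and 2 = sub-scene origins, 3 nested in 2.
origins = {
    0: {"parent": None, "pos": (0.0, 0.0, 0.0)},
    1: {"parent": 0, "pos": (5.0, 0.0, 0.0)},
    2: {"parent": 0, "pos": (-8.0, 0.0, 2.0)},
    3: {"parent": 2, "pos": (1.0, -0.5, 0.0)},
}
sources = {0: {"origin": 2, "pos": (0.0, 1.5, 3.0)}, 1: {"origin": 3, "pos": (0.5, 0.0, 0.2)}}
for record in encode_scene(origins, sources):
    print(record)
```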
The encoder side 11 returns to decision block 65 to determine whether additional sound source positions are to be encoded. If so, the process 60 returns to block 66.
In one aspect, the position data may be normalized to a (e.g., predefined) scaling factor that indicates a maximum distance parameter (e.g., in any direction within the 3D scene) that may be encoded. In particular, when the position data includes the location of a sound source in Cartesian coordinates, each coordinate may be normalized with respect to the maximum distance parameter, wherein the maximum distance may be included within the position data for use by the decoder side 12 in decoding the position data. In one aspect, an encoder may encode a square root of the maximum distance parameter into the bitstream. More about normalization and the maximum distance of location data is described herein.
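A minimal sketch of that normalization follows, assuming each Cartesian coordinate is simply divided by the maximum distance parameter and clamped; the field names and the absence of quantization are simplifications for illustration.

```python
import math

def encode_position(pos, max_distance):
    # Each coordinate is normalized against the maximum distance parameter (clamped to
    # the representable range); the square root of that parameter is carried alongside.
    normalized = tuple(max(-1.0, min(1.0, c / max_distance)) for c in pos)
    return {"sqrtMaxDistance": math.sqrt(max_distance), "coords": normalized}

print(encode_position((12.0, -3.5, 0.25), max_distance=64.0))
# {'sqrtMaxDistance': 8.0, 'coords': (0.1875, -0.0546875, 0.00390625)}
```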
In one aspect, if all sound source positions have been encoded, process 60 may end. On the other hand, the encoder side 11 may perform the process 60 for each identified 3D sub-scene origin. In this case, after encoding the new origin, the encoder side may proceed to decision blocks 65 and 66 to encode the sound source position of the sub-scene. After all sound source positions are encoded, the encoder side 11 may proceed back to decision block 63 to encode a new origin and/or sound source position of another 3D sub-scene. The encoder side 11 may continue to perform these operations until all positions have been encoded, as described herein.
Fig. 7 is a flow chart of one aspect of a process 70 at the decoder side 12 for receiving a bitstream (e.g., bitstream 20) and spatially rendering audio content of the bitstream using scene metadata (e.g., metadata 21) encoded therein. The decoder side 12 receives (or obtains) a bitstream comprising: 1) encoded audio content of a media (e.g., audio) program (e.g., an encoded version of at least one audio signal associated with the program) having one (or at least one) sound source within a 3D scene (e.g., a 3D sub-scene) of the media program (and/or encoded video content comprising the media program), 2) an encoded scene tree structure (e.g., which represents an initial configuration of the 3D scene of the media program), and 3) a location of the sound source (at block 71). In particular, the encoding tree structure may include a location of an origin of a 3D scene (e.g., a 3D global scene) and one or more other origins (e.g., origins of 3D sub-scenes) related to the origin of the 3D global scene or the origin of another 3D scene. For example, the tree structure may include an origin of a first 3D scene (e.g., a 3D sub-scene) relative to an origin of a second 3D scene (e.g., a 3D global scene). As described herein, the location of the sound source may be relative to at least one origin within the tree structure. Continuing with the previous example, the location of the sound source may be within the 3D sub-scene (e.g., chimney sound 46e on the vessel 33) relative to the origin of the 3D sub-scene (e.g., location 39e relative to origin 38 b). As described herein, the location of the sound source may reference the origin of the 3D sub-scene (e.g., by the encoder side) using a (e.g., unique) identifier associated with the 3D sub-scene. In one aspect, a decoder side may receive a plurality of sound sources, where each sound source is associated with different position data indicating a position of the respective sound source within a 3D scene (e.g., relative to an origin of the 3D scene). In one aspect, a bitstream may be received in response to user input to play back a media program (e.g., on a decoder-side device). On the other hand, the obtained bitstream may include an initial (or beginning) portion of the media program being streamed to the decoder side, wherein metadata included within the bitstream defines an initial configuration of one or more origins (e.g., of one or more 3D sub-scenes) within the 3D scene with which the sound source is associated. As described herein, additional metadata may be obtained at the decoder side after receiving metadata associated with an initial configuration for updating playback at the decoder side. More about updates is described herein.
The bitstream is decoded (at block 72), e.g., by decoder 22. In particular, decoder 22 may extract encoded audio content (and/or video content) from within the bitstream and extract the encoded scene tree structure. For example, the decoder may extract the location of one or more origins (e.g., the origin of the 3D scene and/or one or more origins of the 3D sub-scenes) and may extract the location of the sound source (e.g., relative to the origin). In one aspect, the location data extracted from the metadata may include a location of the sound source and/or origin as coordinates (e.g., Cartesian coordinates) within a coordinate system, a maximum distance parameter associated with the coordinates, a rotation parameter indicative of an orientation of the sound source (and/or 3D sub-scene), and/or an acoustic parameter (e.g., associated with a passive sound source). Based on the position data, the decoder may be configured to reproduce the scene tree structure of the 3D scene as it was prior to encoding. For example, as described herein, the coordinates of the encoded scene tree structure may be normalized with respect to the maximum distance parameter. Thus, to determine a position, the decoder uses the maximum distance parameter to scale the decoded coordinates back to the position, relative to its origin, that existed before the encoder side encoded (e.g., normalized) the position. Thus, the decoder may reproduce the position 19 of the sound source (relative to one or more origins of the 3D scene) and provide the position 19 to the audio renderer 23. More about extracting metadata from a bitstream is described herein.
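The corresponding decoder-side step might look like the following sketch, which recovers the maximum distance parameter from its encoded square root and rescales the normalized coordinates; the names are illustrative.

```python
def decode_position(sqrt_max_distance, normalized_coords, rotation=(0.0, 0.0, 0.0)):
    # Undo the encoder-side normalization: square the transmitted value to recover
    # the maximum distance parameter, then rescale each normalized coordinate.
    max_distance = sqrt_max_distance ** 2
    position = tuple(c * max_distance for c in normalized_coords)
    return {"position": position, "rotation": rotation}

# Coordinates that were normalized against a 64 m maximum distance (sqrt = 8.0).
print(decode_position(8.0, (0.1875, -0.0546875, 0.00390625)))
# {'position': (12.0, -3.5, 0.25), 'rotation': (0.0, 0.0, 0.0)}
```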
A position of the listener within the 3D scene is determined (at block 73) (e.g., by the listener position estimator 25). For example, the estimator may determine that the listener is within a 3D sub-scene of the media program within the (e.g., global) 3D scene of the program (or perceives the acoustic scene from within the 3D sub-scene). In this case, the decoder side may determine how to spatially render the media program based on the listener's position within the 3D scene relative to the sound sources within the 3D scene. For example, the audio renderer 23 may output one or more spatial filters (which the audio renderer uses to spatially render the audio data) using a (e.g., previously defined) location model that takes the location data of the sound source (and/or the location data of the origin of one or more 3D scenes) and the listener's location within the 3D scene as inputs. In another aspect, the audio renderer 23 may transform the position of the sound source with respect to its corresponding origin so that the position is related to the position of the listener. For example, referring to fig. 2, the listener position may be on the tour bus 31 and have a position (and orientation) relative to the sub-scene origin 38a. The position of the sub-scene origin 38b of the ship must be transformed so that it is also related to the sub-scene origin 38a of the listener. For example, the audio renderer may translate and rotate the sub-scene origin 38b to map the position of the ship relative to the global scene origin 37. The position of the ship (e.g., its sub-scene origin 38b) may then be mapped to the listener's position (and orientation) by applying a reverse translation and rotation of the sub-scene origin 38a and a reverse translation and rotation of the listener's position on the tour bus. Thus, by relating the listener position to the ship, the audio renderer can accurately render the sound sources of the ship with respect to the listener position.
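A simplified sketch of that mapping is shown below: the source position is first composed up the scene tree into global coordinates, and the listener's translation and (yaw-only) rotation are then inverted. Origin rotations and full 3D rotation handling are omitted, and the coordinate convention and names are assumptions made for illustration.

```python
import math

def to_global(position, origin_id, origins):
    # Walk up the scene tree, adding each origin's offset until the global origin.
    x, y, z = position
    while origin_id is not None:
        parent_id, (ox, oy, oz) = origins[origin_id]
        x, y, z = x + ox, y + oy, z + oz
        origin_id = parent_id
    return (x, y, z)

def relative_to_listener(global_pos, listener_pos, listener_yaw_deg):
    # Inverse translation followed by inverse (yaw-only) rotation of the listener.
    dx = global_pos[0] - listener_pos[0]
    dy = global_pos[1] - listener_pos[1]
    dz = global_pos[2] - listener_pos[2]
    yaw = math.radians(-listener_yaw_deg)
    return (dx * math.cos(yaw) + dz * math.sin(yaw), dy,
            -dx * math.sin(yaw) + dz * math.cos(yaw))

# origins: origin_id -> (parent_id, offset relative to parent); 0 is the global origin.
origins = {0: (None, (0.0, 0.0, 0.0)), 2: (0, (-8.0, 0.0, 2.0))}
source_global = to_global((0.0, 1.5, 3.0), 2, origins)   # a source on the ship (origin 2)
listener_on_bus = (5.0, 0.0, 0.0)
print(relative_to_listener(source_global, listener_on_bus, listener_yaw_deg=90.0))
```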
On the other hand, the audio renderer may determine, based on the listener's position, that only a portion of the received sound sources are to be spatially rendered. For example, referring to fig. 2, upon determining that the listener's position is within the cabin 32, the decoder side may determine that only sound sources within the cabin (e.g., the sound source at position 39f and the sound source at position 40) will be spatially rendered; other sound sources may not be spatially rendered because the listener may not be within hearing distance (e.g., within physical audible range) of those sound sources, such as the sound sources associated with the tour bus.
The decoder side 12 determines an audio filter based on the received set of acoustic parameters associated with a passive sound source of the 3D scene and generates one or more filtered audio signals associated with the passive sound source by applying the audio filter to the one or more audio signals (at block 74). As described herein, a passive sound source reproduces reflected or diffracted sound produced by one or more sound sources within a 3D scene, in accordance with the acoustic parameters associated with the passive sound source. In this case, the audio filter that the decoder side 12 generates and applies to the audio signal accounts for the reflection or diffraction of the sound of the audio signal. In one aspect, the filter may be any type of filter, such as a low-pass filter, a high-pass filter, an all-pass filter, and/or a band-pass filter.
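As one hedged example, a filter could be derived from hypothetical acoustic parameters of a passive source as sketched below, here a one-pole low-pass at a signalled cut-off frequency scaled by a reflectivity value; the parameter names and the choice of filter are illustrative only and are only one of the filter types the text allows.

```python
import math

def passive_source_filter(audio, sample_rate, cutoff_hz, reflectivity):
    # One-pole low-pass driven by the signalled cut-off frequency; the reflectivity
    # value scales the result so the reflection is quieter than the direct sound.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    filtered, state = [], 0.0
    for sample in audio:
        state += alpha * (sample - state)
        filtered.append(reflectivity * state)
    return filtered

# The filtered signal would then be spatially rendered at the passive source's
# position (e.g., the cabin window of fig. 2) like any other source.
reflection = passive_source_filter([0.0, 1.0, 0.0, -1.0], 48000, cutoff_hz=4000.0, reflectivity=0.4)
```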
The decoder side generates a set of spatially rendered audio signals (e.g., binaural audio signals) by spatially rendering at least one received audio signal (and/or filtered audio signal) according to the position of the sound source relative to the position of the listener (at block 75). For example, the audio renderer may be configured to determine one or more spatial filters (e.g., HRTFs) based on the extracted scene tree structure (and the locations of the one or more sound sources within the 3D scene) relative to the determined locations of the listener, as described herein. The decoder side drives one or more speakers using one or more spatially rendered audio signals to produce sound sources of the 3D scene (at block 76). For example, where the listener is wearing a headset (as an output device), the audio renderer may drive the left and right speakers of the headset with one or more spatial rendering signals. Thus, the decoder side is configured to spatially render the 3D scene of the obtained media program to generate one or more sound sources with respect to the position of the listener. For example, in the case of a 3D scene having two sound sources, the bitstream may include (e.g., encode) the audio content and corresponding position data of the two sources, which the decoder side uses to spatially render the sources relative to the listener position. The decoder side (optionally) displays video content (e.g., audibly represented by the 3D scene) on a display (at block 77). In particular, the media program may be an a/V program of a movie, wherein the audio content is a movie soundtrack of video content of the movie. In one aspect, this operation is optional because the decoder side may not receive video content when the media program is an audio program (e.g., a musical composition).
As described herein, encoded audio content received within a bitstream (e.g., audio content of an a/V program) may audibly represent video content displayed on a display. In one aspect, the video content may be an XR environment in which a listener is participating, which is audibly represented by the sound of the encoded audio data. In this case, the structure or positioning within the XR environment may correspond to one or more spatially rendered sound sources, and the location of the listener may also be within the environment. In one aspect, the at least one sound source spatially rendered by the audio renderer 23 of the decoder side 12 may be an active sound source associated with an object or localization within video content displayed on a display. On the other hand, the at least one sound source may be a passive sound source corresponding to a structure or object displayed on a display on the decoder side (e.g., within a visual environment). In this case, the bitstream may comprise the position of the passive sound source relative to the origin, wherein the passive sound source is arranged to produce reflected or diffracted sound from one or more active sound sources exiting from a surface within the XR environment. Thus, when audio content is spatially rendered, a listener may hear the sound of an active sound source from one location within the XR environment, and may hear reflected or diffracted sound of the active sound source from another location within the environment (e.g., the location of a passive sound source).
In one aspect, the audio renderer 23 may adjust the spatial rendering based on changes in the listener's position and/or orientation (e.g., based on head tracking data). For example, the audio renderer may determine that the listener has moved (e.g., moved within an XR environment). The audio renderer may determine how the listener has moved from the original position. For example, the audio renderer may determine a translation and a rotation of the listener based on the listener's movement. The audio renderer may then determine a new position of the sound source (a new position relative to the listener) by applying the inverse of the translation and the inverse of the rotation relative to the position of the listener. In this case, the audio renderer may adjust the rendered 3D scene as if the sound source were moving while the listener position (e.g., within the 3D acoustic space) remains stationary. Thus, the audio renderer adjusts the spatial rendering of the 3D scene based on the new position of the sound source relative to the position of the listener.
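The following minimal sketch illustrates the inverse-translation/inverse-rotation step described above for a single source, assuming a yaw-only listener rotation; the structure names and axis conventions are illustrative assumptions.

```c
/* Minimal sketch (yaw-only): express a world-space source position in
 * listener-relative coordinates by applying the inverse of the listener's
 * translation followed by the inverse of the listener's rotation. */
#include <math.h>

typedef struct { float x, y, z; } Vec3;

static Vec3 source_relative_to_listener(Vec3 source_world, Vec3 listener_pos,
                                         float listener_yaw_rad)
{
    /* Inverse translation: move the coordinate origin to the listener. */
    Vec3 p = { source_world.x - listener_pos.x,
               source_world.y - listener_pos.y,
               source_world.z - listener_pos.z };

    /* Inverse rotation about the vertical axis (rotate by -yaw). */
    float c = cosf(-listener_yaw_rad);
    float s = sinf(-listener_yaw_rad);
    Vec3 r = { c * p.x - s * p.y,
               s * p.x + c * p.y,
               p.z };
    return r;
}
```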
Another aspect of the present disclosure is a manner of adding a scene tree structure to a bitstream following a future Moving Picture Experts Group (MPEG) standard (e.g., the MPEG-D DRC standard), wherein the standard is extended herein to support a scene location payload comprising a scene tree structure, which is added at the encoder side and sent to the decoder side to provide a 3D scene (which may comprise one or more 3D sub-scenes) and the locations of sound sources within the 3D scene, for spatially rendering media content at the decoder side. Accordingly, the present disclosure provides enhancements to the existing MPEG-D DRC standard (e.g., ISO/IEC, "Information technology - MPEG audio technologies - Part 4: Dynamic Range Control," ISO/IEC 23003-4:2020) that allow an audio codec system to efficiently encode locations within a 3D scene as scene metadata within an encoded bitstream, as described herein.
FIGS. 8-10c and 13 illustrate tables of enhancements to the bitstream syntax of MPEG-D DRC in accordance with some aspects. In particular, these figures illustrate tables containing the syntax of the scene tree structure payloads and position payloads. According to the enhanced syntax, the encoder side creates and encodes metadata describing the positions of audio components within a 3D scene as position payloads, and encodes the positions of one or more origins to which the audio components are linked as an encoded scene tree structure payload; the decoder side extracts this metadata (e.g., the scene tree structure and the positions of the audio components) from the enhanced MPEG-D DRC bitstream for spatially rendering the 3D scene. Thus, encoder side 11 may perform at least some of the operations described herein to encode a bitstream according to the syntax, and decoder side 12 may perform at least some of the operations described herein to decode a bitstream according to the syntax.
Turning to fig. 8, this figure shows table 1, which includes the syntax of the ScenePositions() payload that includes a scene tree structure encoded by encoder side 11 (e.g., as the initial configuration for a 3D scene). The payload may be part of a global configuration of an encoded tree structure that includes scene origins (e.g., 3D sub-scene origins), where the root of the origins is located at the global 3D scene origin. The syntax is described as follows. In one aspect, the payload may be generated as described in at least a portion of process 60 of fig. 6, and decoder side 12 may decode (or extract) data from the payload, as described in fig. 7.
In one aspect, the ScenePositions () payload may indicate an initial configuration of an origin within the 3D scene. In one aspect, when the flag forConfig is true (or equal to one), the initial configuration may be indicated by the flag. As described herein, this initial configuration may be provided by the encoder side at the beginning (e.g., for providing a media program once streaming between the encoder side and the decoder side is initiated) for playback of the media program at the decoder side.
For each of a plurality of scene origins, starting from a first origin (e.g., i=0), encoded within ScenePositions() as indicated by numSceneOrigins, the decoder may extract each SceneOrigin(i+1, forConfig=1) of the 3D scene origins during initial configuration. In one aspect, the number of scene origins may be stored as a six-bit data structure (or integer). In one aspect, each SceneOrigin may be associated with an origin of a global 3D scene associated with at least one 3D sub-scene and/or the encoded scene tree structure. As shown, each scene origin encoded within the metadata includes a unique identifier sceneOriginId of the origin and includes a location payload Position(forConfig) that contains the location data of the initially configured scene origin. In one aspect, the identifier value of zero may be reserved for the global scene origin (e.g., origin 37 of fig. 2) of the 3D scene. In one aspect, the location payloads of at least some of the origins may indicate location characteristics of those origins relative to a reference origin of the 3D scene. The references between origins represent the encoded scene tree structure. More about the Position() payload is described herein.
Thus, the decoder may iteratively extract the locations of the origins within the 3D scene until each origin location has been extracted (e.g., when i = numSceneOrigins).
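A minimal sketch of this parse loop is shown below. The six-bit numSceneOrigins field, the per-origin Position(forConfig) payload, and the reservation of identifier zero for the global origin follow the description above, while the bit reader and the stubbed Position() parsing are illustrative assumptions rather than the normative syntax.

```c
/* Minimal sketch of the ScenePositions() parse loop described above.
 * The MSB-first bit reader is an illustrative helper, not part of the
 * standard, and Position() parsing is left as a stub. */
#include <stdint.h>
#include <stddef.h>

typedef struct { const uint8_t *data; size_t bitpos; } BitReader;

static uint32_t read_bits(BitReader *br, unsigned n)
{
    uint32_t v = 0;
    while (n--) {
        uint32_t bit = (br->data[br->bitpos >> 3] >> (7u - (br->bitpos & 7u))) & 1u;
        v = (v << 1) | bit;
        br->bitpos++;
    }
    return v;
}

static void parse_position(BitReader *br, int forConfig) { (void)br; (void)forConfig; /* stub */ }

static void parse_scene_positions(BitReader *br)
{
    const int forConfig = 1;                      /* initial configuration            */
    uint32_t numSceneOrigins = read_bits(br, 6);  /* number of encoded scene origins  */
    for (uint32_t i = 0; i < numSceneOrigins; ++i) {
        uint32_t sceneOriginId = i + 1;           /* id 0 is reserved for the global origin */
        parse_position(br, forConfig);            /* location data of origin sceneOriginId  */
        (void)sceneOriginId;
    }
}
```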
In one aspect, the syntax may include a flag (e.g., as a bit) that may be defined by the encoder and that, when determined by the decoder to have a first value in the bitstream, indicates that the bitstream includes an encoded scene tree. If this is the case, the decoder may be configured to determine that the metadata of the bitstream includes a ScenePositions() payload having the encoded tree structure of scene origins. However, if the bitstream includes a second value of the flag, this may indicate to the decoder that the ScenePositions() payload has already been sent to the decoder side.
Turning now to fig. 9, this figure shows table 2, which includes syntax of ObjectMetadata () payload and WallPhysics () payload, both of which can be added to a bitstream according to the MPEG-D DRC standard. In one aspect, encoder side 11 may encode these payloads based on at least some of the operations as described in process 60 of fig. 6, and decoder side 12 may decode (or extract) data from the payloads, as described in fig. 7. The ObjectMetadata () payload and the WallPhysics () payload may include data encoded by the encoder side 11 that may be associated with one or more sound sources within a 3D scene encoding a media program. In particular, the ObjectMetadata () payload may include (e.g., location) data of active sound sources within the 3D scene, while the WallPhysics () payload may include data of passive sound sources within the 3D scene, as described herein. In another aspect, the at least one payload may include data of one or more sound sources of one or more 3D scenes (e.g., 3D sub-scenes within a 3D scene).
With the ObjectMetadata() payload, the decoder side 12 may determine whether a flag (e.g., as a one-bit integer) having a first value is present in the bitstream, the value indicating that the metadata of the bitstream includes position data (e.g., for the initial configuration) of the associated sound source that references the encoded tree structure, as in if (scenePositionsPresent). If so, the decoder extracts the position data of the active sound source stored in the Position() payload, which is described herein. Otherwise, the decoder may extract other location metadata associated with the sound source. In one aspect, the other location metadata may be conventional metadata that indicates the location of the sound source without referencing the encoded scene tree structure. The ObjectMetadata() payload may also include other object metadata, which may include parameters and/or characteristics of the sound source used by the audio renderer when spatially rendering the sound source. For example, the other object metadata may include parameters for the audio renderer to apply one or more audio signal processing operations, such as a (e.g., desired) loudness level of the sound source. As another example, the other metadata may indicate whether one or more audio filters are to be applied to one or more audio signals associated with the sound source, such as a low-pass filter, a high-pass filter, and the like. As another example, the other metadata may indicate characteristics of the sound source, such as a reverberation level. On the other hand, the other object metadata may include spatial characteristics of the sound source.
With the WallPhysics() payload, the decoder can determine the position data of the associated passive sound source(s) as a Position() payload. In particular, when the passive sound source is encoded within the WallPhysics payload, the encoder may have encoded the position of the passive sound source by combining an identifier (e.g., referenceSceneOriginId, as described herein) that references the scene origin of the passive sound source, acoustic parameters, and position data (relative to its reference scene origin). The WallPhysics() payload also includes other wall metadata, such as one or more acoustic parameters associated with the passive sound source, as described herein.
Turning to fig. 10a-10c, these figures illustrate tables 3-5, which include the syntax of a Position() payload that contains (e.g., position) data describing the location (coordinates and/or rotation) of an active sound source, a passive sound source, and/or an origin within a 3D scene, as described herein. As described herein, a number of payloads may include a Position() payload, such as the SceneOrigin payload, the ObjectMetadata payload, and the WallPhysics payload. Thus, the encoder side 11 may perform at least some of the operations described herein to encode the payload, and the decoder side 12 may perform at least some of the operations described herein with respect to the Position() syntax to extract the location data of a sound source and/or an origin within the encoded scene tree structure. By having a single payload describe the position of each of these elements, complexity is reduced and positions within the 3D scene are encoded efficiently.
In addition, the following Position () payload may be defined by the encoder side 11 based on one or more criteria. For example, the payload may be defined (and/or adjusted) based on a current (or desired) bit rate at which the encoder-side transmits the bitstream 20 over a data connection between the encoder-side device and the decoder-side device. For example, where the bit rate is low (e.g., below a threshold), the size of the payload may be adjusted (e.g., reduced). In particular, the encoder may encode the initial configuration payloads and/or future payloads with fewer bits in order to reduce the size of these payloads for low bit rate conditions. This may be the case when the decoder-side device is communicatively coupled to the encoder-side device through a slow data connection (e.g., due to slow network 43). As described herein, the encoder-side device may adjust (initial and/or future) payloads by adjusting the size (e.g., number of bits) of the payloads of the position data that may be used to define the source and/or origin of the 3D scene. As another example, the payload may be adjusted based on a change in the location of sound sources and/or 3D sub-scenes within the 3D scene. All of these may allow system 10 to transmit payloads in data stream 20 under limited (or reduced) bit rate conditions. Further details regarding this criteria and adjustment payloads are described herein.
The description of the syntax with respect to the encoder side 11 and the decoder side 12 is as follows. In particular, the following description of the location payload is with respect to a sound source location within a 3D scene (e.g., location 39a as shown in fig. 2), such as the sound source of engine 46a in fig. 1. Additionally, the location payload may describe the location of an origin of the scene tree structure, as described herein. First, the decoder side 12 determines whether the Position() payload is being received at the initial configuration by determining whether the flag forConfig has a first value, for example if (forConfig == 1). This may be the case when the payload is the first payload received for initializing the bitstream (or once the bitstream has been established). If so, the decoder side 12 may determine that the payload includes one or more one-bit identifiers (e.g., having particular values) that indicate whether certain parameters associated with the origin of the Position() payload may be adjusted. For example, the payload may include a refOriginIdAdaptation identifier that indicates whether the reference origin of an origin associated with the payload (e.g., the location 39a of the sound source 46a) may be adapted or changed, and a distanceMaxAdaptation identifier that indicates whether a distance parameter (e.g., a maximum distance parameter) associated with the payload may be adapted or changed from a previous payload. Such adaptation may allow the encoder side to adjust the position data associated with the payload and/or may adjust the size of the encoded payload (e.g., reduce the number of bits). In one aspect, since the payload is being received during initial configuration, the two identifiers may be assigned a value (e.g., zero) indicating that no adaptation is needed. This may be the case when the payload is the first payload to be received by the decoder. However, these identifiers may be adjusted by the encoder side 11 during the transmission of later payloads within the bitstream. Further content regarding these identifiers is described herein.
The payload also includes a number of bit stream identifiers that adjust a bit size of at least some of the location data of the payload. As described herein, the payload may include (e.g., as an integer) a number of position values representing the position and/or orientation of the origin/sound source (relative to a reference origin) stored in the payload having a particular bit size. As described herein, each integer value represents a corresponding step value (or position in the coordinate system relative to the reference origin) within a range of values (e.g., a maximum distance parameter). The size (or spatial resolution) of each step may be based on the bit size of the position value, where more bits available to represent these integers may reduce the size. More about spatial resolution is described herein.
However, increasing the bit size may also increase the size (number of bits) of the payload. Thus, the encoder side 11 may be configured to select the number of bits to be added to the position data. In one aspect, the number of bits of one or more of these identifiers may be set by the encoder to a predefined value for the initial configuration. On the other hand, the encoder side 11 may determine the number of bits to be added based on a network analysis of the data connection between the encoder side and the decoder side. For example, under low bit rate conditions, encoder side 11 may set these values to low values (e.g., below a threshold) so as not to oversaturate the connection.
The decoder side 12 may determine the number of bits added to the position data based on one or more four-bit integers that may be received through the bitstream. Specifically, the identifiers may include bsCoordAddedBits and bsRotAddedBits. The identifier bsCoordAddedBits is an encoded bitstream value indicating the number of bits to be added to the coordinate values of the position data, which are used to identify (decode) the (encoded) position of an origin relative to a reference origin in one or more coordinate systems. The identifier bsRotAddedBits is an encoded bitstream value indicating the number of bits to be added to the rotation values of the position data, which may be used to indicate the orientation of the origin with respect to the reference origin. To identify the number of bits added, the decoder side may multiply each identifier by two. In this case, the number of bits added to the coordinate values is coordAddedBits, and the number of bits added to the rotation data is rotAddedBits. More about the added bits is described herein.
The location payload includes an identifier of the scene origin referenced by the location of the Position() payload (e.g., referenced by the sound source location 39a). In particular, referenceSceneOriginId may be an identifier of that origin within the scene tree structure (e.g., as a six-bit integer). Continuing with the previous example, as described herein, the identifier, which may be unique for each origin within the scene tree structure, may be the identifier of the sub-scene origin 38a of the scene tree structure 36. Thus, as described herein, the location defined by the location payload may be relative to the reference origin. In one aspect, when referenceSceneOriginId is zero, it references the global scene origin (e.g., origin 37), which may be the origin of the coordinate system (e.g., x=0, y=0, and z=0 in a Cartesian coordinate system).
During initial configuration (e.g., forConfig == 1), the decoder side 12 may determine whether a flag has been defined within the payload as having a first value (e.g., as a one-bit integer) in the bitstream (e.g., if (referenceSceneOriginIdIsZero == 1)), which may indicate whether the scene origin referenced by the Position() payload (e.g., for the sound source position 39a) is the global scene origin. If not, the payload may include referenceSceneOriginId. However, if referenceSceneOriginIdIsZero is equal to one, the decoder side 12 may determine that referenceSceneOriginId = 0, which indicates that the referenced origin is the global scene origin. In this case, when the origin referenced by the Position() payload is the global scene origin, the encoder side 11 can encode the payload with five fewer bits than otherwise, because in this case the payload will not include a referenceSceneOriginId pointing to another reference origin of the encoded tree structure. Again, this may reduce the size of the payload, allowing the payload to be transmitted during low bit rate conditions.
In addition, during initial configuration, the decoder side 12 may retrieve an encoded maximum distance parameter bsDistanceMax (e.g., based on if ((forConfig == 1) || (distanceMaxAdaptation == 1))), which is a four-bit encoded integer that may be used by the decoder to determine the maximum distance at which the origin (or sound source) may be positioned relative to its reference origin:

maxDistance = 2^bsDistanceMax [meters]
In one aspect, maxDistance may be a distance in meters and may be user-defined (and/or encoder-defined) based on the audio program. For example, when bsDistanceMax is equal to eight, maxDistance may be 256 meters. On the other hand, maxDistance may be a maximum distance in another unit of length (such as centimeters). As another example, when bsDistanceMax is eight, the encoded position data of the Position() payload may not extend beyond 256 meters from its reference origin (e.g., it has a range between -256 meters and 256 meters from the reference origin). In one aspect, the maximum distance parameter may describe or affect the spatial resolution of locations (e.g., of sound sources and origins) within a 3D scene (e.g., within a 3D sub-scene and/or a 3D global scene). In one aspect, the resolution of the 3D scene may be (e.g., inversely) proportional to the maximum distance value. For example, for a large scene (e.g., a marine vessel), the maximum distance may be large, in which case the 3D scene may have low resolution (based on the bit size of the position data), while a small scene (e.g., within a cabin of the vessel) may have a small maximum distance, and in that case the 3D scene may have high resolution. More about how spatial resolution is affected by the maximum distance is described herein. In one aspect, different Position() payloads may have different encoded maximum distance parameters for a given media program. On the other hand, each location of a sound source referencing a particular origin of a sub-scene (and/or scene origin) may have the same maximum distance. More about table 3 is described herein.
Fig. 10b includes table 4, which is a continuation of the syntax of the Position() payload. The decoder determines whether a flag has been defined within the Position() payload as having a first value in the bitstream (e.g., if (translationPresent == 1)), which indicates whether the position associated with the payload is translated relative to the referenced scene origin. In particular, the encoder may have set the flag to the first value if the sound source position is located somewhere within the coordinate system other than the origin of the system.
The decoder side 12 may determine whether the encoder side 11 has encoded at least a portion of the position data using differential (or delta) encoding based on whether the payload is associated with the initial configuration. For the initial configuration (e.g., forConfig == 1), the decoder may determine that delta encoding is not used, since coordDeltaCoding = 0. In one aspect, delta encoding involves transmitting data as the difference or delta between successive data. Specifically, the encoder determines the difference between a previously transmitted piece of data and the current piece of data to be transmitted, and then encodes and transmits the difference. Thus, when delta-encoded data is transmitted, the position data within the payload may require fewer bits because it is stored as a change relative to the previously transmitted position data. However, since the initial configuration does not include previously transmitted data, the encoder will not use delta encoding to encode the first payload. However, if the payload is a subsequent payload (e.g., based on a change to an origin position), the payload may include a one-bit flag coordDeltaCoding that indicates (e.g., based on its value) that the data stored in the payload has been encoded using delta encoding. More about delta encoding is described herein.
If delta encoding is not used to encode the data, for example if (coordDeltaCoding == 0), the decoder side 12 may determine which coordinate system the position data of the Position() payload is defined in (e.g., as defined by the encoder). In one aspect, the coordinate system may be a Cartesian coordinate system or a spherical coordinate system. In one aspect, the encoder side may select either coordinate system (e.g., based on system parameters or requirements). The coordinate system is defined based on the value of the one-bit integer coordinateSystem. For a first value, for example coordinateSystem = 1, the coordinate system of the location within the payload may be a spherical coordinate system, and for a second value, for example coordinateSystem = 0, the coordinate system of the location may be a Cartesian coordinate system.
In the case where the coordinate system is a Cartesian coordinate system, the Position() payload stores the coordinate values of the position data (e.g., the position of the sound source) as a set of Cartesian coordinates indicating the position of the origin (or sound source) within the coordinate system relative to the origin of the 3D scene referenced by referenceSceneOriginId. Specifically, the encoded coordinate values bsX, bsY, and bsZ may be integers representing normalized values (e.g., values normalized with respect to maxDistance), where each of these Cartesian coordinates may be a normalized integer whose bit size depends on the added coordinate bits. In one aspect, for each of the encoded coordinates, decoder side 12 may determine a total number of bits for each integer, which may be six bits plus any additional bits allocated by coordAddedBits. Thus, the spatial resolution of each of these values may fluctuate based on, for example, the added bits and/or changes to maxDistance. More about resolution is described herein. On the other hand, each of the coordinate values may be an integer with a defined number of bits, such as a ten-bit integer.
To decode the encoded Cartesian coordinates within the bitstream into Cartesian coordinates (e.g., as x, y, and z coordinates) that can be used by the audio renderer to spatially render the audio program (e.g., for an origin position or a sound source position relative to the origin of the 3D scene in the coordinate system), the decoder scales the encoded coordinates using maxDistance. In one aspect, the decoder may normalize each of these integer values using a value normalization function V_norm() and multiply the resulting scalar value by maxDistance, such as:

x = maxDistance * V_norm(bsX, 6 + coordAddedBits)
y = maxDistance * V_norm(bsY, 6 + coordAddedBits)
z = maxDistance * V_norm(bsZ, 6 + coordAddedBits)

In one aspect, V_norm normalizes an encoded value V_BS, where V_BS is the encoded Cartesian coordinate (an integer value) and N_bits is the number of bits constituting the encoded coordinate value, such that V_norm has a range of [-1, 1]. In one aspect, a left shift operation is indicated by "<<", where y = x << b means that the value of x is shifted to the left by b bits. Alternatively, the value may be calculated using y = x*(2^b).
Thus, the encoder side 11 may encode the position of the origin (sound source) in a stepwise manner, where the size of the steps may be the spatial resolution of the position, which may be based on maxDistance and/or the total number of bits of the encoded position data. In one aspect, the resolution may be limited by the quantization of the bitstream values of the position data encoded within the Position() payload (e.g., bsX, bsY, and bsZ). For example, when maxDistance is 256 meters (e.g., bsDistanceMax is eight) and the total number of bits of bsX is eight (e.g., coordAddedBits is two), each of the 256 possible values of bsX may correspond to a step of two meters within the range of -256 meters to 256 meters. For example, when bsX is 128, x is two meters (in the positive direction) from its reference origin, and when bsX is 129, x is four meters from its reference origin. Here, the spatial resolution of the decoded x position of the origin is two meters. However, the resolution may vary based on changes to maxDistance and/or the total number of bits. Continuing with the previous example, if maxDistance is increased to 512 (bsDistanceMax is nine) while the total number of bits of bsX remains eight, the spatial resolution of x is now four meters. Reducing the spatial resolution of the encoded position data reduces the granularity at which positions can be spatially rendered. Accordingly, the decoder side 12 can determine the decoded position of the origin (sound source) at a spatial resolution based on the maximum distance parameter and the encoded position data.
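The step-size arithmetic above can be summarized in a short sketch; the uniform quantization over [-maxDistance, +maxDistance] assumed here stands in for the normative V_norm formula, but it reproduces the two-meter and four-meter steps of the example.

```c
/* Illustrative sketch of the relationship between maxDistance, the coordinate
 * bit width, and the spatial resolution (step size). The uniform dequantization
 * assumed here is for illustration only, not the normative V_norm formula. */
#include <stdio.h>

int main(void)
{
    unsigned bsDistanceMax  = 8;                         /* 4-bit encoded exponent    */
    unsigned coordAddedBits = 2;                         /* bits added by the encoder */
    unsigned totalBits      = 6 + coordAddedBits;        /* total bits per coordinate */
    double maxDistance = (double)(1u << bsDistanceMax);  /* 2^8 = 256 meters          */

    /* The coordinate spans [-maxDistance, +maxDistance] in 2^totalBits steps. */
    double step = 2.0 * maxDistance / (double)(1u << totalBits);       /* 2 meters */
    printf("maxDistance = %.0f m, %u-bit coordinate, step = %.2f m\n",
           maxDistance, totalBits, step);

    /* Doubling maxDistance (bsDistanceMax = 9) with the same bit width doubles
     * the step to 4 meters, i.e., the spatial resolution is halved. */
    double coarser = 2.0 * (double)(1u << 9) / (double)(1u << totalBits);
    printf("with bsDistanceMax = 9, step = %.2f m\n", coarser);
    return 0;
}
```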
In one aspect, the encoder side 11 may encode the position data according to a spatial resolution. Specifically, the encoder side 11 may align (e.g., round up or round down) the position of the origin (sound source) to the nearest step size according to the bit size and maxDistance to which the position is to be encoded. Continuing with the previous example, the encoder side 11 may identify the position of the sound source (e.g., position 39 a) as 1.5 meters in the x-direction from the reference origin (e.g., origin 38 a). In this case, the encoder side 11 may encode the position to two meters, since the position data may have a spatial resolution of at least two meters. On the other hand, the identified location 19 (e.g., by the identifier 14) may already correspond to a step size in which the encoder 15 will encode the location.
As described herein, maxDistance may be adjusted (e.g., in a future location payload) such that the maximum distance of the position data may be increased, but the spatial resolution of the position data may then be degraded (e.g., reduced, with an increased step size). Thus, the spatial resolution of a sound source may depend on the number of bits of the encoded value added to the bitstream by the encoder and on maxDistance. This is in contrast to conventional resolution schemes, in which the position is encoded for the largest possible distance of a particular scene and the resolution cannot be changed. In the present disclosure, however, maxDistance may be adjusted by the encoder side 11, providing the audio codec system with the ability to adjust the resulting spatial resolution. On the other hand, the bit quantization may be different and/or maxDistance may be defined differently.
Returning to the syntax, in the case where the coordinate system is a spherical coordinate system, the Position() payload stores the position data of the payload as a set of spherical coordinates that indicate the position (e.g., of the origin or sound source) within the coordinate system relative to the origin of the 3D scene referenced by referenceSceneOriginId. In particular, the spherical coordinates may be encoded values representing normalized spherical coordinates, such as an encoded azimuth value bsAzimuth, an encoded elevation value bsElevation, and an encoded radius bsRadius. In one aspect, each of the encoded values may be a normalized value (integer). The size of each of these values may be based on whether any coordinate bits are added. In this case, each of these values may be stored in the bitstream as an integer having different (or similar) numbers of bits depending on whether bits are added to the position data. For example, bsAzimuth is an integer having a bit size of seven bits plus coordAddedBits, bsElevation is an integer having a bit size of six bits plus coordAddedBits, and bsRadius is an integer having a bit size of five bits plus coordAddedBits. In one aspect, bsRadius may be normalized with respect to maxDistance. Alternatively, the encoded values may be integers having a defined number of bits, such as bsAzimuth being an eleven-bit integer, bsElevation being a ten-bit integer, and bsRadius being an eight-bit integer.
In one aspect, to decode the encoded spherical coordinates into values (e.g., azimuth, elevation, radius) that can be used by the audio renderer to spatially render the sound source, the decoder may apply a value normalization function to each of the encoded values. For example, the decoder may use V_norm to determine the decoded azimuth and elevation. In particular, the decoder may determine these values as:

azimuth = 180° * V_norm(bsAzimuth, 7 + coordAddedBits)
elevation = 90° * V_norm(bsElevation, 6 + coordAddedBits)

However, in one aspect, to determine the radius, the decoder may use a magnitude normalization function M_norm, where M_norm has a range of [0, 1]. Specifically, the radius may be determined based on the following equation:

radius = maxDistance * M_norm(bsRadius, 5 + coordAddedBits)
Similar to the values of the Cartesian coordinate system, the resolution of these spherical coordinate values may be based on the number of bits of the encoded integers and/or on maxDistance. In one aspect, the Position() payload includes a position given in spherical coordinates with an encoded radius. In this case, the decoder side is configured to decode the encoded radius by determining the normalized radius based on the encoded radius and M_norm, and to determine the decoded radius for use by the audio renderer by scaling the result according to maxDistance.
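A sketch of the spherical decoding path follows. The linear forms assumed here for V_norm (range [-1, 1]) and M_norm (range [0, 1]) are illustrative stand-ins for the normative normalization functions; only the scaling by 180 degrees, 90 degrees, and maxDistance follows the equations above.

```c
/* Illustrative decode of encoded spherical coordinates. The linear v_norm and
 * m_norm mappings below are assumptions standing in for the normative functions. */
#include <stdint.h>

static double v_norm(uint32_t v, unsigned nbits)   /* assumed: map to [-1, 1] */
{
    double half = (double)(1u << (nbits - 1));
    return ((double)v - half) / half;
}

static double m_norm(uint32_t v, unsigned nbits)   /* assumed: map to [0, 1] */
{
    return (double)v / (double)((1u << nbits) - 1u);
}

static void decode_spherical(uint32_t bsAzimuth, uint32_t bsElevation, uint32_t bsRadius,
                             unsigned coordAddedBits, double maxDistance,
                             double *azimuth_deg, double *elevation_deg, double *radius_m)
{
    *azimuth_deg   = 180.0 * v_norm(bsAzimuth,   7 + coordAddedBits);
    *elevation_deg =  90.0 * v_norm(bsElevation, 6 + coordAddedBits);
    *radius_m      = maxDistance * m_norm(bsRadius, 5 + coordAddedBits);
}
```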
Turning to table 5 of fig. 10c, and returning to the Position() payload syntax, the decoder side 12 may determine whether the bitstream includes a flag (as a one-bit value within the bitstream) having a first value that indicates whether the position data includes (e.g., as rotation data) a rotation parameter indicating that the sound source (or origin) associated with the Position() payload is rotated (or will rotate) relative to (or has an orientation different from) its reference origin. In one aspect, a Position() payload associated with the global scene origin (e.g., referenceSceneOriginId = 0) may not include a rotation parameter (e.g., because it does not reference another origin in the 3D scene). However, if the flag has the first value (e.g., quaternionsPresent = 1), the bitstream indicates that the payload includes rotation data. The decoder side 12 may determine whether delta encoding has been used to encode the rotation data. Also, in the case of the initial configuration if (forConfig == 1), the bitstream may not include the one-bit flag, or the flag may have a first value, e.g., rotDeltaCoding = 0, so the bitstream includes four rotation quaternions bsQ0, bsQ1, bsQ2, and bsQ3, each of which is an integer representing a corresponding normalized value with a total of eight bits plus the additional rotAddedBits bits added for the rotation data. On the other hand, the four normalized rotation quaternions may have a defined number of bits, such as eleven-bit integers.
The decoder side 12 may be configured to decode (extract) from the bitstream the encoded rotation parameters (encoded quaternions) indicating the orientation of the sound source. In one aspect, each encoded quaternion may be an integer of at least eight bits, depending on whether bits are added according to rotAddedBits. Decoder side 12 may generate the following quaternions q0, q1, q2, and q3, which may be used by the audio renderer to adjust the orientation of positions within the 3D scene, using at least some of the extracted encoded quaternions. Specifically, decoder side 12 may generate the decoded quaternions as follows:

q0 = V_norm(bsQ0, 8 + rotAddedBits)
q1 = V_norm(bsQ1, 8 + rotAddedBits)
q2 = V_norm(bsQ2, 8 + rotAddedBits)
q3 = V_norm(bsQ3, 8 + rotAddedBits)
In one aspect, decoder side 12 may spatially render sound sources according to the four encoded (decoded) quaternions. In particular, the decoder-side audio renderer may use the decoded quaternions to rotate sound sources during spatial rendering. In particular, the audio renderer may spatially render the rotated sound source based on the orientation of the sound source relative to the listener (e.g., the orientation of the listener, which may be determined based on head tracking data, as described herein) as indicated by the rotation parameters, such that the listener may perceive the sound source from different angles (e.g., within the acoustic space). Thus, as described herein, a scene origin is encoded using the SceneOrigin() payload, where the Position() payload syntax includes: 1) the location (and/or rotation) of the origin as the location data, and 2) a reference to its reference origin (the identifier referenceSceneOriginId).
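The following sketch illustrates decoding the four quaternion components and applying the resulting rotation to a position vector; the linear v_norm mapping and the component ordering (bsQ0 as the scalar part) are illustrative assumptions, and the renderer-side use of the quaternion (here a direct vector rotation) is shown only as one possibility.

```c
/* Illustrative decode of the rotation quaternion and its application to a
 * position. The v_norm mapping and component ordering are assumptions; the
 * rotation itself uses the standard v' = v + 2w(u x v) + 2 u x (u x v) identity. */
#include <math.h>
#include <stdint.h>

typedef struct { double x, y, z; } Vec3;
typedef struct { double w, x, y, z; } Quat;

static double v_norm(uint32_t v, unsigned nbits)     /* assumed mapping to [-1, 1] */
{
    double half = (double)(1u << (nbits - 1));
    return ((double)v - half) / half;
}

static Quat decode_quaternion(uint32_t bsQ0, uint32_t bsQ1, uint32_t bsQ2, uint32_t bsQ3,
                              unsigned rotAddedBits)
{
    Quat q = { v_norm(bsQ0, 8 + rotAddedBits), v_norm(bsQ1, 8 + rotAddedBits),
               v_norm(bsQ2, 8 + rotAddedBits), v_norm(bsQ3, 8 + rotAddedBits) };
    double n = sqrt(q.w*q.w + q.x*q.x + q.y*q.y + q.z*q.z);   /* re-normalize */
    if (n > 0.0) { q.w /= n; q.x /= n; q.y /= n; q.z /= n; }
    return q;
}

static Vec3 cross3(Vec3 a, Vec3 b)
{
    Vec3 c = { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
    return c;
}

static Vec3 rotate(Quat q, Vec3 v)                   /* rotate v by unit quaternion q */
{
    Vec3 u = { q.x, q.y, q.z };
    Vec3 t = cross3(u, v);
    t.x *= 2.0; t.y *= 2.0; t.z *= 2.0;              /* t = 2 (u x v) */
    Vec3 c2 = cross3(u, t);
    Vec3 r = { v.x + q.w*t.x + c2.x, v.y + q.w*t.y + c2.y, v.z + q.w*t.z + c2.z };
    return r;
}
```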
As described above, the bitstream may include a scene tree structure (e.g., as a ScenePositions () payload) indicating an initial configuration of a 3D scene origin. On the other hand, the bitstream may include a sound source location (e.g., objectMetadata () payload or WallPhysics payload) along with an initial configuration of the origin. In one aspect, the location of the sound source within the object and the wall payload received with the ScenePositions payload may be the initial location of the sound source (e.g., at a beginning portion along the playback duration of the audio program). Thus, the decoder side may be configured to define an initial configuration of sound sources relative to a listener position for spatial rendering, which may be at a beginning portion of a rendered media program. In one aspect, the location of the origin and/or sound source may change during playback of the media program. For example, the position may change based on the media program (with respect to time) and/or the position may change with respect to the listener position based on the listener's movement within the 3D scene (e.g., within a 3D sub-scene or a 3D global scene).
Fig. 11 shows an example of the location of changes within one or more 3D scenes. For example, the figure shows a 3D global scene 30 with a scene tree structure 36 from fig. 3, wherein elements of the 3D scene have been moved (and/or changed in orientation) according to one aspect. In particular, the figure shows how a 3D sub-scene (origin) and/or sound source (location) moves as its associated visual structure or localization moves within the video content. Alternatively, the location of the origin and/or sound source may be moved in the audio program, as described herein. Specifically, as shown, the sound source position 39c of the speaking person 34 has moved away from its original position (e.g., a white circle shown with a dashed boundary) into the first sub-scene 35a and a new position relative to the sub-scene origin 38a (e.g., away from the bus). In addition to moving sound sources within one or more 3D scenes, the 3D sub-scenes may also be moved. For example, the second sub-scene 35b has moved away from the tour bus 31, which causes the sub-scene origin 38b to move from its original position to a new position relative to the 3D global scene 30. As described herein, the encoder side may provide location (or scene metadata) updates of elements within a 3D scene being encoded and streamed via a bitstream to a decoder device. The decoder side may then use the update to adjust spatial rendering, as described herein. In one aspect, these updates may include less data than the scene tree structure encoded for the initial configuration of the 3D scene, thereby enhancing the efficiency of data transmission by the encoder side (e.g., to accommodate low bit rate scenarios).
Fig. 12 is a system flow diagram of one aspect of a process 80 in which an encoder side 11 sends scene metadata updates that are used by a decoder side 12 to adjust the spatial rendering of a media program. The process 80 begins with the encoder side 11 receiving an audio signal of an audio program (which may be part of an a/V program, for example) for a 3D scene of the audio program (at block 81). For example, the audio signal may be associated with a sound source within a 3D scene. The encoder side 11 determines (at block 82) that there are 3D sub-scenes within the 3D scene. The encoder side 11 determines: 1) A position of the 3D sub-scene (e.g., its origin) within the 3D (e.g., global) scene, and 2) a position of a sound source of the audio signal within the 3D sub-scene (at block 83). The encoder side generates (at block 84) a first bitstream comprising a first set of metadata (e.g., as position data) by encoding the audio signal, wherein the metadata has the position of the 3D sub-scene and the position of the sound source. Specifically, the encoder side may encode the positions of the sub-scenes as ScenePositions () payloads (which may include Position () payloads) and the positions of the sound sources as Position () payloads, and then add these payloads to the bitstream.
The decoder side 12 receives the first bitstream and decodes (at block 85) the audio signal and the first set of metadata stored therein. The decoder side 12 determines the position of the listener (at block 86) and then spatially renders the 3D scene to generate a sound source at a position of the sound source relative to the position of the listener using the audio signal.
Sometime after the first bitstream is sent by the encoder side 11 to the decoder side 12, the encoder side 11 may determine that at least a portion of the position data has (or will) changed (at block 88). In particular, the encoder side may determine whether at least some of the position data (e.g., coordinates and/or rotation data) is to be updated.
In one aspect, an encoder may determine that data is to be updated based on changes or updates within a 3D scene. For example, when an origin of a 3D scene (e.g., a first origin of a first 3D sub-scene) moves relative to another origin (e.g., a 3D global scene origin of a 3D scene) to a different location (and/or with a different orientation) within another 3D scene (e.g., which may be a 3D global scene or another 3D sub-scene where the moving first 3D sub-scene is located), location data of the origin of the 3D scene may need to be updated. For example, identifier 14 may perform at least some of the operations described herein to determine that the sub-scene is moving within the 3D global scene, such as ship 33 moving relative to global scene origin 37, as shown in fig. 11. On the other hand, the encoder side 11 may send updated Position () payloads of at least some of the origins (e.g., of sound sources and/or 3D sub-scenes) of the 3D scene for each audio frame (or frames) of the audio content over a period of time. More about how the encoder side 11 determines to update the payload is described herein.
In this case, the encoder side 11 may generate (at block 89) a second bitstream comprising the encoded audio signal and a second set of metadata (as updated position data) with the changed position of the 3D sub-scene. In particular, the encoder side may encode a new position of the first origin with respect to the second origin into new metadata of the second bitstream. In one aspect, the new location may reference one or more origins (e.g., origins related to the sub-scene) using identifiers of the one or more origins, as described herein. On the other hand, the encoded audio signal may be part of an audio signal that follows the audio signal sent to the decoder side together with the first set of metadata over the playback duration. In one aspect, the second set of metadata may include a scene metadata update, which may include less data than the first set of metadata. In particular, the second set of metadata may include only the locations of the origin of the sound source and/or sub-scene that need to be updated. Other locations (e.g., which are or have been stationary) may be omitted from the second set of metadata. More content about scene metadata updates is described herein.
The decoder side 12 receives the second bitstream and decodes (at block 90) the audio signal and the second set of metadata. In one aspect, the second set of metadata may include a location update payload, as described herein, that includes a new location (and/or an orientation indicated by a rotation parameter, as described herein) of the first origin relative to the second origin. The decoder side adjusts the spatial rendering of the 3D scene such that the position of the sound source changes to correspond to the movement of the 3D sub-scene from the position of the 3D sub-scene to a different position of the 3D sub-scene within the 3D scene (at block 91). For example, the decoder determines that the position of the sound source has moved relative to the movement of the origin from its original position to its new position, as indicated by the second set of metadata. For example, referring to fig. 11, the decoder side determines that the sound source position 39e has moved relative to the movement of the sub-scene origin 38 b. In one aspect, from this movement, the audio renderer 23 of the decoder side 12 can determine a new pan and/or a new rotation of the sound source position relative to the position of the listener. Accordingly, the decoder side adjusts spatial rendering of the audio signal based on the movement of the position of the sound source.
In one aspect, at least one sound source may remain in its position relative to the listener (e.g., its original position as indicated by the sound source's position payload during initial configuration of the media program) even when one or more other sound source positions (and/or origins) are updated. For example, the position of one sound source may remain in its position relative to the listener when 1) the position of a different sound source (and/or origin of the 3D sub-scene) changes or 2) the position of the listener changes. While the sound source may remain in its position, the decoder side may still adjust the spatial rendering of the sound source, as described herein. For example, as the position of the listener changes (e.g., due to head tracking data received at the decoder side), the audio renderer 23 may adjust the spatial rendering of the (e.g., stationary) sound sources such that the listener continues to perceive the sound sources as being in the same position before the listener moves.
Turning to fig. 13, this figure shows table 6, which includes the syntax of a ScenePositionsUpdate() payload that contains scene metadata updates describing updated (or changed) locations within the scene tree structure, as described herein. Thus, to update an origin position, a subsequent bitstream may include a ScenePositionsUpdate() payload rather than an entire scene tree structure (a ScenePositions() payload), in order to reduce the amount of encoded data subsequently sent to the decoder side while maintaining the exact position of the sub-scene within the 3D scene. In one aspect, encoder side 11 may encode these payloads, and decoder side 12 may decode (or extract) data from the payloads (for spatial rendering) based on at least some of the operations described in process 80 of fig. 12.
The syntax is described as follows. The decoder side 12 may receive a ScenePositionsUpdate() payload from the encoder side, and, for each of the plurality of scene origins encoded within the ScenePositions() payload previously received by the decoder, the decoder determines whether the payload indicates that a scene update is present (e.g., as a single-bit flag), as in if (updatePresent). In particular, the metadata may include a single bit having a first value that indicates that the associated origin is not updated. However, if the single bit has a second value, it may indicate that at least one origin is to be updated. If so, the payload may include an updated 3D scene origin SceneOrigin(i+1, forConfig=0) for that particular updated scene, where forConfig may be false, indicating that the scene location is being updated after the initial configuration. The updated SceneOrigin may include a new Position() payload that has been updated by the encoder side (e.g., with updated position data). For example, the new Position() payload may include at least one of an updated 1) bsDistanceMax, 2) bsX, bsY, bsZ or bsAzimuth, bsElevation, bsRadius, and 3) bsQ0, bsQ1, bsQ2, bsQ3. On the other hand, the new payload may include only one of these parameters, such as only bsX when the origin moves only in the x-direction. In some aspects, when the resolution of the 3D scene has changed (e.g., by the content creator), the new payload may include a different maximum distance parameter. Thus, for example, the encoder side encodes a location update of a first origin relative to a second origin into new metadata (e.g., a new Position() payload) associated with the ScenePositionsUpdate(), where the new metadata references the second origin using the identifier of the second origin.
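A minimal sketch of the update loop described above follows; the one-bit updatePresent flag per origin and the re-parsed Position() for updated origins follow the prose, while BitReader, read_bits, and parse_position are the illustrative helpers assumed in the earlier ScenePositions() sketch, not normative elements.

```c
/* Minimal sketch of the ScenePositionsUpdate() parse loop: for each origin from
 * the initial configuration, a one-bit updatePresent flag gates whether an
 * updated Position() follows; stationary origins are simply skipped. */
#include <stdint.h>
#include <stddef.h>

typedef struct { const uint8_t *data; size_t bitpos; } BitReader;
uint32_t read_bits(BitReader *br, unsigned n);        /* assumed helper (see earlier sketch) */
void parse_position(BitReader *br, int forConfig);    /* assumed helper (see earlier sketch) */

static void parse_scene_positions_update(BitReader *br, uint32_t numSceneOrigins)
{
    const int forConfig = 0;                          /* update after initial configuration */
    for (uint32_t i = 0; i < numSceneOrigins; ++i) {
        uint32_t updatePresent = read_bits(br, 1);    /* 1 = this origin is updated         */
        if (updatePresent) {
            parse_position(br, forConfig);            /* updated location of origin i+1     */
        }
    }
}
```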
In one aspect, the encoder side 11 may encode an updated Position() under limited-bit-rate conditions. Referring to figs. 10a-10c, the encoder side 11 may adjust the size of the Position() payload to accommodate a low bit rate of the data connection while maximizing the spatial resolution of the position data. In particular, the identifiers within the Position() payload may be adjusted by the encoder, whereby the decoder side 12 may determine and apply those changes to adapt the position data of one or more origins (sound sources) for spatial rendering.
Referring to fig. 10a, the decoder side 12 may determine whether the reference origin of the updated scene origin is to be changed to the global scene origin. The decoder may determine whether refOriginIdAdaptation has been set by the encoder side 11 to a value (e.g., one) indicating that the reference origin is to be adjusted. The decoder side 12 may determine whether the one-bit identifier adaptRefOriginId is equal to a first value (e.g., one) based on if ((forConfig == 0) && (refOriginIdAdaptation == 1)). The decoder side may then determine whether the one-bit flag referenceSceneOriginIdIsZero is equal to one. If this is the case, the decoder side may set referenceSceneOriginId to zero, which may be the global scene origin. However, if referenceSceneOriginId is not zero, the referenceSceneOriginId of the payload may remain the same or may indicate a different origin identifier. In one aspect, such an identifier may reduce the number of bits of the payload by five bits. For example, where the reference scene origin is the global scene origin, the updated payload may not include a six-bit referenceSceneOriginId.
In one aspect, the encoder side 11 may change the reference scene origin based on a change in the updated scene origin of the updated payload. For example, when the scene origin has moved to within a threshold distance of the global scene origin, the encoder side 11 can change the reference origin of the scene origin (e.g., which may be the origin of a sub-scene) to the global scene origin because it is within very close range. On the other hand, the encoder side 11 may change the reference scene origin based on the movement of a sound source within the 3D scene. For example, during playback, a sound source (e.g., a dog barking) may move from one sub-scene to another, such as from one cabin on a ship to another cabin, where the two cabins may be separate sub-scenes. In this case, the encoder side 11 may determine whether the sound source is within a threshold distance of another origin, and if so, may change the reference origin of the sound source to the origin of the sub-scene in which it is located.
In one aspect, bsDistanceMax may remain constant for the entirety of the audio program content (e.g., during a playback session) streamed via the bitstream 20. For example, the encoder side 11 may determine (estimate) the maximum distance parameter and may set it such that it covers future changes in the position data of the origins and/or sound sources. This may be the case when the encoder side 11 has no a priori knowledge about sound source movement, such as during a live broadcast of an audio program.
On the other hand, the encoder side 11 may adjust bsDistanceMax based on the motion of the origin (or sound source). As described herein, bsDistanceMax is related to the maximum distance at which the location of an origin can be positioned relative to its reference origin, where that location can be used for spatial rendering. As described herein, the spatial resolution of a sound source of an origin may depend on the relationship between bsDistanceMax and the number of bits of position data allocated to the sound source. Thus, as a sound source moves within a sound scene, it may move, or need to move, beyond bsDistanceMax. For example, when the maximum distance is 256 meters (e.g., the value of bsDistanceMax is eight), the encoder side 11 may encode the sound source up to 256 meters. However, over time, the sound source may move more than 256 meters (e.g., the sound source is an automobile traveling on a highway). Thus, the encoder side 11 may determine the movement of the origin of the sound source and then define bsDistanceMax to ensure that the source remains positioned within the range of the maximum distance (e.g., increase the maximum distance to 512 by making the value of bsDistanceMax nine). Thus, when the decoder side 12 receives a new position of the sound source including the increased maximum distance parameter and new encoded position data, the decoder side 12 may determine the new decoded position of the sound source at a lower spatial resolution based on the increased maximum distance parameter and the updated position payload with the new encoded position data. This results in a higher spatial resolution when sound sources are close together (e.g., a smaller step size of the locations within the maximum distance) and works properly as the sound sources move apart, thereby decreasing their spatial resolution.
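One way the encoder side might pick the parameter under these conditions is sketched below, assuming the simple policy of choosing the smallest exponent whose maxDistance = 2^bsDistanceMax covers the source's current distance from its reference origin; the function name and policy are illustrative assumptions.

```c
/* Illustrative encoder-side choice of bsDistanceMax: pick the smallest 4-bit
 * exponent such that maxDistance = 2^bsDistanceMax covers the source's current
 * distance from its reference origin. */
static unsigned choose_bs_distance_max(double distance_m)
{
    unsigned bsDistanceMax = 0;
    while (bsDistanceMax < 15u && (double)(1u << bsDistanceMax) < distance_m) {
        bsDistanceMax++;                 /* grow maxDistance until it covers the source */
    }
    return bsDistanceMax;                /* e.g., 300 m -> 9 (maxDistance = 512 m) */
}
```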
Adjusting bsDistanceMax based on the movement may also provide bit rate savings. For example, in the case of 100 moving sound sources, and assuming that the location of each source is to be updated in each audio frame (e.g., every 20 ms), the encoder side 11 would have to send 5000 Position() payloads per second. Each bit in the payload therefore translates into a bit rate of 5 kbps. When a sound source moves over a larger distance, adjusting bsDistanceMax reduces the size of future payloads, because bsDistanceMax may be set to cover those distances. In particular, the payloads of future updates may not need to include bsDistanceMax as long as their encoded positions remain within this maximum distance.
In one aspect, the encoder may update other location data in addition to the maximum distance parameter. For example, when the origin position of the sound source exceeds the current maximum distance parameter, the encoder side 11 may update the maximum distance parameter, and may encode new position data representing normalized position data with respect to the updated maximum distance parameter. However, as described herein, since the number of bits describing the encoded position data may remain the same, the spatial resolution of the position may change relative to the updated maximum distance parameter.
Returning to the syntax, the decoder side 12 may determine whether bsDistanceMax has been updated. In one aspect, the encoder side 11 may adjust the encoded maximum distance parameter based on how the position of the updated scene origin changes relative to its reference scene origin. For example, the encoder may adjust the maximum distance when the updated scene origin moves beyond a threshold distance (e.g., a ship moves 10 kilometers away), or may adjust the parameter when the origin moves within a threshold distance (e.g., bees fly around or cluster adjacent to the scene origin). Specifically, the decoder determines whether the one-bit flag adaptDistanceMax is set to a value (e.g., one) based on if ((forConfig == 0) && (distanceMaxAdaptation == 1)). If this condition is satisfied, the decoder may determine a new four-bit distance parameter bsDistanceMax, which may differ from this parameter in the previous Position() payload.
As described above, the parameters of the Position() payload may be updated based on changes to the sound scene and/or changes to the bit rate of the bitstream 20. On the other hand, the bitstream may support delta encoding to reduce the required bit rate of the bitstream 20. Referring to table 4 of fig. 10b, the decoder may determine that the one-bit flag coordDeltaCoding has been set to a first value (e.g., one), indicating that the encoded values of the position data have been encoded using delta encoding. In one aspect, the decoder may make this determination for payloads after the initial configuration. In one aspect, the encoder may set the flag to a second value (e.g., zero) in the event that delta encoding is not necessary or does not provide sufficient spatial resolution. For example, where the bit rate of the bitstream is high, the normalized values may be encoded without using delta encoding. On the other hand, delta encoding may not be used when the change between the previous payload and the current payload is above a threshold.
Returning to table 5 of fig. 10c, when coordDeltaCoding == 1 and the coordinate system is a Cartesian coordinate system, the decoder may determine the encoded position data as encoded delta values bsDeltaX, bsDeltaY, and bsDeltaZ, where each of these Cartesian coordinates is an integer with a number of bits of two plus coordAddedBits. Each of these delta values may be the difference between the current (or new) encoded value (or value to be encoded) and the previously transmitted encoded value. For example, the encoder side 11 may determine that the position of the origin will change in the x-direction and determine a new bsX value, bsX_current. In one aspect, if delta coding were not used, this encoded value would instead be sent directly in the updated payload. The previous encoded value of bsX may be bsX_previous. In this case, bsDeltaX = bsX_current − bsX_previous. In one aspect, the encoded delta value may be a signed binary value, where a zero in the most significant bit indicates that the integer is positive and a one indicates that the integer is negative. In one aspect, when bsX_current indicates that the coordinate value has moved closer to the reference origin relative to the previous payload, the delta value may be negative.
To update the coordinates, the decoder adds the delta value to the previous coordinate value. That is, the decoder may determine the current encoded value by adding the delta value to the previous value. Continuing with the previous example, the decoder side 12 may determine bsX_current by combining (adding) bsDeltaX and bsX_previous. In one aspect, the decoder side may keep track of previously received encoded values while the bitstream is being streamed to the decoder side. The decoder may then be configured to determine the coordinates by using the normalization function described herein (e.g., for the x-coordinate, substituting bsX_current for bsX, where bsX_current is based on the sum of bsDeltaX and bsX_previous).
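A short sketch of this reconstruction step follows, assuming the sign-magnitude delta layout described above (most significant bit is the sign, remaining bits the magnitude); the helper names are illustrative assumptions.

```c
/* Illustrative reconstruction of a delta-coded coordinate: decode a
 * sign-magnitude delta and add it to the previously received encoded value. */
#include <stdint.h>

static int32_t decode_sign_magnitude(uint32_t coded, unsigned nbits)
{
    uint32_t magnitude = coded & ((1u << (nbits - 1)) - 1u);   /* low bits      */
    int negative = (coded >> (nbits - 1)) & 1u;                /* MSB = sign    */
    return negative ? -(int32_t)magnitude : (int32_t)magnitude;
}

static uint32_t apply_delta(uint32_t bsX_previous, uint32_t bsDeltaX, unsigned deltaBits)
{
    int32_t delta = decode_sign_magnitude(bsDeltaX, deltaBits);
    return (uint32_t)((int32_t)bsX_previous + delta);  /* bsX_current, fed to V_norm */
}
```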
The decoder may perform similar operations when the coordinate system is a spherical coordinate system. In this case, the decoder may determine the encoded position data as the following delta values: bsDeltaAzimuth, which is an integer whose number of bits is three plus cordAddedBits; bsDeltaElevation, which is an integer whose number of bits is two plus cordAddedBits; and bsDeltaRadius, which is an integer whose number of bits is one plus cordAddedBits. Similar to the Cartesian delta coordinates, each of these deltas may be a difference between the current spherical value and the previous value. Also, each of the encoded delta values may be a signed binary value, where the most significant bit indicates the sign of the value and the remaining bits are the magnitude. In one aspect, when cordAddedBits is zero, bsDeltaRadius is zero, as the latter would otherwise include only one bit representing its sign. Likewise, the decoder adds these delta values to the previous normalized values (by binary addition) and then uses the normalization function described herein to determine the spherical coordinates.
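The sketch below applies the spherical deltas with the bit widths just described, including the special case where cordAddedBits is zero and bsDeltaRadius carries no usable delta. As above, the sign/magnitude decoding mirrors the text, but the helper itself is an assumption rather than the normative syntax, and the subsequent normalization to angles and a radius is omitted.

def update_spherical(prev, deltas, cord_added_bits):
    """Apply sign/magnitude deltas to the previous encoded spherical values."""
    widths = {"azimuth": 3 + cord_added_bits,
              "elevation": 2 + cord_added_bits,
              "radius": 1 + cord_added_bits}
    updated = {}
    for key, width in widths.items():
        if key == "radius" and cord_added_bits == 0:
            updated[key] = prev[key]    # bsDeltaRadius is zero: only a sign bit would remain
            continue
        field = deltas[key]
        sign = (field >> (width - 1)) & 1
        magnitude = field & ((1 << (width - 1)) - 1)
        updated[key] = prev[key] + (-magnitude if sign else magnitude)
    return updated                       # the normalization function is applied afterwards


print(update_spherical({"azimuth": 100, "elevation": 40, "radius": 12},
                       {"azimuth": 0b000011, "elevation": 0b10010, "radius": 0b0001},
                       cord_added_bits=3))
# -> {'azimuth': 103, 'elevation': 38, 'radius': 13}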
When rotation data is present (quaternionPresent == 1) and rotDeltaCoding == 1, the decoder may also decode increments of the rotation data. In this case, the payload includes four rotation quaternion deltas, such as bsDeltaQ0, bsDeltaQ1, bsDeltaQ2, and bsDeltaQ3, where each of the encoded increments is an integer whose number of bits is four plus rotAddedBits. Each increment may be a difference between the current rotation quaternion (determined by the encoder) and the previous rotation quaternion. Each integer may be a signed binary integer, where the most significant bit is the sign and the remaining bits indicate the magnitude of the integer. The decoder may add these delta values to the previous normalized values (used during the previous rotation of the scene origin of the Position() payload) and then may apply these values to the normalization function to determine the rotation parameters q0, q1, q2, and q3.
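Below is a minimal sketch of this quaternion delta update. The delta width of four plus rotAddedBits bits follows the text above; the assumed width of eight plus rotAddedBits bits for each encoded quaternion field, the linear mapping onto [-1, 1], and the final unit-length renormalization are illustrative assumptions, not the normative normalization function.

import math

def decode_quaternion_update(prev_fields, delta_fields, rot_added_bits):
    """Apply sign/magnitude deltas to the four encoded quaternion fields and
    map the results to the rotation parameters q0..q3."""
    delta_width = 4 + rot_added_bits
    full_width = 8 + rot_added_bits           # assumed width of each encoded quaternion field
    q = []
    for prev, field in zip(prev_fields, delta_fields):
        sign = (field >> (delta_width - 1)) & 1
        magnitude = field & ((1 << (delta_width - 1)) - 1)
        current = prev + (-magnitude if sign else magnitude)
        normalized = current / ((1 << full_width) - 1)
        q.append(2.0 * normalized - 1.0)      # assumed mapping onto [-1, 1]
    length = math.sqrt(sum(c * c for c in q)) # renormalize to a unit quaternion (assumption)
    return [c / length for c in q]


print(decode_quaternion_update([200, 10, 10, 10],
                               [0b0010, 0b1001, 0b0000, 0b0000],
                               rot_added_bits=0))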
As described herein, the encoder side may encode a ScenePositionUpdate() payload including a scene origin update into the bitstream for sending to the decoder side. In one aspect, when the location of a sound source is to be updated, the encoder side may be configured to generate a new Position() payload for the sound source and provide it to the decoder side for the update. For example, in response to determining that the position of the sound source is to be moved to a different position within the 3D scene, the encoder may adjust at least some of the position data within the Position() payload of the sound source relative to the origin referenced by the previous payload of the sound source (e.g., by adding a new position). Based on the new (updated) Position() received in the (e.g., updated) ObjectMetadata() payload, the decoder side adjusts the spatial rendering based on the new position of the sound source.
Thus, the ScenePositions() payload provides the decoder with the locations of the origins at rest before rendering begins. However, if a sub-scene moves over time during the presentation of the media program, its position must be dynamically updated by the encoder side. The syntax of the present disclosure supports updating a moving origin by using the ScenePositionUpdate() payload without the need to send the locations of the origins at rest again (or at least until a location of an origin at rest needs to be updated).
In one aspect, the encoder side may send the new Position() payload within the ScenePositionUpdate() and/or updated ObjectMetadata() payload along with the subsequent portions of the media program being encoded and sent to the decoder side for spatial rendering. In some aspects, payloads for sound sources whose positions are at rest are not encoded in future subsequent portions of the bitstream. For example, the WallPhysics() payload of a passive sound source (which includes location data and acoustic parameters, as described herein) may be sent only once (e.g., with the initial configuration) because these sound sources do not move relative to their origin.
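To illustrate the update flow just described, the following Python sketch keeps a small decoder-side scene tree in which only a moving origin is refreshed by a ScenePositionUpdate()-style payload, while the sources and the origins at rest keep their previously received positions. The dictionary layout, the identifiers, and the omission of rotation are assumptions made for this example.

scene_tree = {
    0: {"ref": None, "pos": (0.0, 0.0, 0.0)},    # global origin (at rest)
    1: {"ref": 0,    "pos": (100.0, 0.0, 0.0)},  # e.g., a moving sub-scene such as a ship
}
sources = {"engine": {"scene": 1, "pos": (2.0, 0.0, -5.0)}}


def apply_scene_position_update(tree, scene_id, new_pos):
    """Only the moving origin is updated; sources keep their local positions."""
    tree[scene_id]["pos"] = new_pos


def world_position(tree, srcs, name):
    """Resolve a source position by walking the scene tree up to the global origin."""
    x, y, z = srcs[name]["pos"]
    node = srcs[name]["scene"]
    while node is not None:
        ox, oy, oz = tree[node]["pos"]
        x, y, z = x + ox, y + oy, z + oz
        node = tree[node]["ref"]
    return (x, y, z)


print(world_position(scene_tree, sources, "engine"))        # before the update
apply_scene_position_update(scene_tree, 1, (150.0, 0.0, 0.0))
print(world_position(scene_tree, sources, "engine"))        # after the update

Note that the engine source itself is never re-sent in this sketch; its rendered position changes only because the origin it references has moved.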
Fig. 14 illustrates a block diagram of audio processing system hardware (e.g., media content device 42, playback device 44, and/or output device 45) that may be used in one aspect with any of the aspects described herein. The audio processing system may represent a general-purpose computer system or a special-purpose computer system. It is noted that while fig. 14 illustrates various components of an audio processing system that may be incorporated into one or more of the devices described herein, this is merely one example of a particular implementation and is merely intended to illustrate the types of components that may be present in the system. Fig. 14 is not intended to represent any particular architecture or manner of interconnecting components, as such details are not germane to aspects described herein. It should also be appreciated that other types of audio processing systems having fewer components or more components than shown in fig. 14 may also be used. Thus, the processes described herein are not limited to use with the hardware and software of fig. 14.
As shown in fig. 14, an audio processing system (or system) 120 (e.g., a laptop computer, desktop computer, mobile phone, smart phone, tablet computer, smart speaker, head-mounted display (HMD), headphone device (headset), or infotainment system for an automobile or other vehicle) includes one or more buses 128 for interconnecting the various components of the system. One or more processors 127 are coupled to bus 128 as is known in the art. The one or more processors may be a microprocessor or special purpose processor, a system on a chip (SOC), a central processing unit, a graphics processing unit, a processor created by an Application Specific Integrated Circuit (ASIC), or a combination thereof. Memory 126 may include Read Only Memory (ROM), volatile memory, and nonvolatile memory, or combinations thereof, coupled to the bus using techniques known in the art. A camera 121, a microphone 122, a speaker 123, and a display 124 may be coupled to the bus.
Memory 126 may be connected to the bus and may include DRAM, a hard drive, or flash memory, or a magneto-optical drive or magnetic memory, or an optical drive or other type of memory system that maintains data even after the system is powered down. In one aspect, the processor 127 retrieves computer program instructions stored in a machine-readable storage medium (memory) and executes the instructions to perform the operations described herein.
Although not shown, audio hardware may be coupled to one or more buses 128 for receiving audio signals to be processed and output by speakers 123. The audio hardware may include digital-to-analog converters and/or analog-to-digital converters. The audio hardware may also include audio amplifiers and filters. The audio hardware may also be connected to a microphone 122 (e.g., a microphone array) to receive audio signals (whether analog or digital), digitize them if necessary, and transmit the signals to the bus 128.
The network interface 125 may communicate with one or more remote devices and networks. For example, the interface may communicate via known technologies such as Wi-Fi, 3G, 4G, 5G, Bluetooth, ZigBee, or other equivalent technologies. The interface may include wired or wireless transmitters and receivers that may communicate (e.g., receive and transmit data) with a networked device such as a server (e.g., cloud) and/or other devices such as a remote speaker and remote microphone.
It should be appreciated that aspects disclosed herein may utilize memory that is remote from the system, such as a network storage device coupled to the audio processing system through a network interface, such as a modem or Ethernet interface. The buses 128 may be connected to each other by various bridges, controllers, and/or adapters as is well known in the art. In one aspect, one or more network devices may be coupled to bus 128. The one or more network devices may be wired network devices (e.g., Ethernet) or wireless network devices (e.g., Wi-Fi, Bluetooth). In some aspects, the various aspects described may be performed by a networked server in communication with one or more devices.
Various aspects described herein may be at least partially embodied in software. That is, the techniques may be implemented in an audio processing system in response to its processor executing sequences of instructions contained in a storage medium, such as a non-transitory machine-readable storage medium (e.g., DRAM or flash memory). In various aspects, hard-wired circuitry may be used in combination with software instructions to implement the techniques described herein. Thus, these techniques are not limited to any specific combination of hardware circuitry and software nor to any particular source for the instructions executed by the audio processing system.
In this specification, certain terms are used to describe features of various aspects. For example, in some cases, the terms "analyzer," "identifier," "renderer," "estimator," "controller," "component," "unit," "module," "logic component," "generator," "optimizer," "processor," "mixer," "detector," "encoder," and "decoder" represent hardware and/or software configured to perform one or more processes or functions. For example, examples of "hardware" include, but are not limited to, integrated circuits such as processors (e.g., digital signal processors, microprocessors, application specific integrated circuits, microcontrollers, etc.). Thus, as will be appreciated by those skilled in the art, different combinations of hardware and/or software may be implemented to perform the processes or functions described by the above terms. Of course, the hardware may alternatively be implemented as a finite state machine or even as combinatorial logic elements. Examples of "software" include executable code in the form of an application, applet, routine or even a series of instructions. As described above, the software may be stored in any type of machine-readable medium.
Some portions of the preceding detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the audio processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the following claims, refer to the actions and processes of an audio processing system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the system's registers and memories into other data similarly represented as physical quantities within the system memories or registers or other such information storage, transmission or display devices.
The processes and blocks described herein are not limited to the specific examples described, and are not limited to the specific order used herein as examples. Rather, any of the processing blocks may be reordered, combined, or removed, performed in parallel, or serially, as desired, to achieve the results described above. The processing blocks associated with implementing the audio processing system may be executed by one or more programmable processors executing one or more computer programs stored on a non-transitory computer readable storage medium to perform the functions of the system. All or part of the audio processing system may be implemented as dedicated logic circuits, e.g., FPGAs (field programmable gate arrays) and/or ASICs (application specific integrated circuits). All or part of the audio system may be implemented with electronic hardware circuitry comprising electronic devices such as, for example, at least one of a processor, memory, programmable logic device, or logic gate. Additionally, the processes may be implemented in any combination of hardware devices and software components.
According to one aspect, a method includes: receiving a first bitstream comprising an encoded version of an audio signal of a three-dimensional (3D) scene and a first set of metadata having 1) a position of a 3D sub-scene within the 3D scene and 2) a position of a sound source associated with the audio signal within the 3D sub-scene; determining a position of a listener within the 3D scene; spatially rendering the 3D scene to generate a sound source at a position of the sound source relative to a position of the listener using the audio signal; receiving a second bitstream comprising a second set of metadata having different positions of the 3D sub-scene within the 3D scene; and adjusting the spatial rendering of the 3D scene such that the position of the sound source changes to correspond to a movement of the 3D sub-scene from the position of the 3D sub-scene to a different position of the 3D sub-scene within the 3D scene.
According to one aspect, a method performed by a programmed processor of a first electronic device, the method comprising: receiving an audio signal of an audio program, wherein the audio signal is for a three-dimensional (3D) scene of the audio program; determining that a 3D sub-scene exists within the 3D scene; determining 1) a position of the 3D sub-scene within the 3D scene and 2) a position of a sound source of the audio signal within the 3D sub-scene; generating a first bitstream comprising a first set of metadata by encoding the audio signal, wherein the first set of metadata has a position of a 3D sub-scene and a position of a sound source; transmitting the first bitstream to a second electronic device; determining that the position of the 3D sub-scene has changed; generating a second bitstream comprising the encoded audio signal and a second set of metadata having a changed position of the 3D sub-scene; and transmitting the second bit stream to the second electronic device.
According to one aspect, a method includes: receiving a bit stream, the bit stream comprising: an encoded version of an audio signal associated with a sound source within a three-dimensional (3D) scene, comprising a scene tree structure of an origin of a first 3D scene relative to an origin of a second 3D scene, and a position of the sound source within the first 3D scene relative to the origin of the first 3D scene, wherein the position references the origin of the first 3D scene using an identifier, wherein the scene tree structure defines an initial configuration of the sound source relative to the first and second 3D scenes; determining a position of a listener relative to an origin of the first 3D scene; generating a set of spatially rendered audio signals by spatially rendering the audio signals according to the position of the sound source relative to the position of the listener; and driving one or more speakers using the one or more spatially rendered audio signals to produce a sound source.
According to one aspect, a method includes: receiving an audio program comprising an audio signal associated with a sound source within a first three-dimensional (3D) scene; encoding the audio signal into a bitstream; adding the following to metadata of the bitstream: 1) A scene tree structure comprising an origin of a first 3D scene relative to an origin of a second 3D scene of the audio program, and 2) a position of a sound source relative to the origin of the first 3D scene, the position referencing the origin of the first 3D scene with an identifier, wherein the metadata defines an initial configuration of the sound source relative to the first and second 3D scenes to be rendered by the audio playback device; and transmitting the bitstream to an audio playback device.
According to another aspect of the disclosure, a decoder-side method is included, the method comprising: receiving a first bitstream comprising an encoded version of an audio signal of a three-dimensional (3D) scene and a first set of metadata having 1) a position of a 3D sub-scene within the 3D scene and 2) a position of a sound source associated with the audio signal within the 3D sub-scene; determining a position of a listener within the 3D scene; spatially rendering the 3D scene to generate a sound source at a position of the sound source relative to a position of the listener using the audio signal; receiving a second bitstream comprising a second set of metadata having different positions of the 3D sub-scene within the 3D scene; and adjusting the spatial rendering of the 3D scene such that the position of the sound source changes to correspond to a movement of the 3D sub-scene from the position of the 3D sub-scene to a different position of the 3D sub-scene within the 3D scene.
In one aspect, spatial rendering may include applying at least one spatial filter to the audio signal based on a position of a sound source relative to a listener to generate one or more spatially rendered audio signals, wherein the sound source is generated by driving at least one speaker of an electronic device using the one or more spatially rendered audio signals. In another aspect, the spatial filter is a head-related transfer function and the electronic device is a headset, and the one or more spatially rendered audio signals are a set of binaural audio signals for driving left and right speakers of the headset.
In another aspect, the method further comprises: determining that the listener has moved; determining a translation and a rotation of the listener based on the movement of the listener; determining a new position of the sound source based on an inverse of the translation and an inverse of the rotation with respect to the position of the listener; and adjusting the spatial rendering of the 3D scene based on the new position of the sound source relative to the position of the listener. In some aspects, the sound source is a first sound source and the audio signal is a first audio signal, wherein the first bitstream further comprises an encoded version of a second audio signal, wherein the first set of metadata further has a position of a second sound source associated with the second audio signal within the 3D scene such that the spatial rendering of the 3D scene further utilizes the second audio signal to generate the second sound source at a position of the second sound source relative to a position of the listener. In some aspects, the second sound source remains in its position relative to the listener when either 1) the position of the first sound source changes or 2) the position of the listener changes. In another aspect, the second bitstream further comprises encoded versions of the first and second audio signals. In one aspect, the 3D scene is a 3D scene of an audio program, wherein the first bitstream is a beginning portion of the audio program and the second bitstream is a subsequent portion of the audio program, wherein a future received bitstream comprising a subsequent portion of the audio program does not comprise the location of the second sound source as metadata.
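As one way to picture the adjustment described in this aspect, the short sketch below recomputes a source position relative to the listener by applying the inverse of the listener's translation and rotation; the rotation-matrix representation and the specific numbers are assumptions made for illustration.

import numpy as np

def listener_relative_position(source_pos, listener_translation, listener_rotation):
    """Undo the listener's translation, then its rotation, so the source can be
    spatially rendered relative to the listener."""
    inverse_rotation = listener_rotation.T    # inverse of an orthonormal rotation matrix
    return inverse_rotation @ (np.asarray(source_pos) - np.asarray(listener_translation))


# Listener moved 1 m forward (+x) and turned 90 degrees to the left (about z).
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
print(listener_relative_position([3.0, 0.0, 0.0], [1.0, 0.0, 0.0], rotation))
# -> approximately [0, -2, 0]: the source that was ahead now sits to the listener's side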
According to another aspect of the invention, there is provided an electronic device comprising: at least one processor; and a memory having instructions stored therein, which when executed by the at least one processor, cause the electronic device to perform decoder-side operations to: receiving a first bitstream comprising an encoded version of an audio signal of a three-dimensional (3D) scene and metadata having 1) a position of a 3D sub-scene within the 3D scene and 2) a position of a sound source associated with the audio signal within the 3D sub-scene; determining a listener position within the 3D scene; spatially rendering the 3D scene to generate a sound source at a position of the sound source relative to a listener position using the audio signal; receiving a second bitstream comprising new metadata having a different position of the 3D sub-scene within the 3D scene; and adjusting the spatial rendering of the 3D scene such that the position of the sound source changes to correspond to a movement of the 3D sub-scene from the position of the 3D sub-scene to a different position of the 3D sub-scene within the 3D scene.
In one aspect, the electronic device further comprises a display, wherein the memory has further instructions to display video content that is audibly represented by the 3D scene on the display. In another aspect, the 3D sub-scene represents a structure or location within the video content, and the audio signal includes sound associated with the structure or location. In some aspects, the position of the 3D sub-scene corresponds to the position of the structure or location such that the 3D sub-scene moves as the structure or location moves within the video content. In another aspect, the video content is an extended reality (XR) environment, wherein the position of the listener or the structure or location is within the XR environment. In one aspect, spatially rendering the 3D scene includes generating a set of binaural audio signals by applying a head-related transfer function to the audio signals based on the positions of the sound sources relative to the listener positions.
According to another aspect of the invention, there is included a non-transitory machine-readable medium having instructions stored therein, which when executed by at least one processor of an electronic device, cause the electronic device to perform decoder-side operations to: receiving a first bitstream comprising an encoded version of an audio signal of a three-dimensional (3D) scene and metadata having 1) a position of a 3D sub-scene within the 3D scene and 2) a position of a sound source associated with the audio signal within the 3D sub-scene; determining a listener position within the 3D scene; spatially rendering the 3D scene to generate a sound source at a position of the sound source relative to a listener position using the audio signal; receiving a second bitstream comprising new metadata having a different position of the 3D sub-scene within the 3D scene; and adjusting the spatial rendering of the 3D scene such that the position of the sound source changes to correspond to a movement of the 3D sub-scene from the position of the 3D sub-scene to a different position of the 3D sub-scene within the 3D scene.
In one aspect, the sound source is a first sound source, wherein the metadata further has 1) a location of a second sound source within the 3D scene and 2) a set of acoustic parameters associated with the second sound source, wherein the spatial rendering of the 3D scene includes sound of the audio signal emanating at the location of the second sound source based on the set of acoustic parameters. In another aspect, spatially rendering the 3D scene includes: determining an audio filter based on the set of acoustic parameters; generating a filtered audio signal by applying the audio filter to the audio signal; and generating one or more spatially rendered audio signals by applying one or more spatial filters to the audio signal and the filtered audio signal.
In another aspect, the non-transitory machine-readable medium includes further instructions to display a visual environment that the 3D scene audibly represents on a display, wherein the second sound source is a reflected or diffracted sound source that produces sound of the audio signal as reflected or diffracted away from objects within the visual environment. In one aspect, the set of acoustic parameters includes at least one of a diffusion level, a cut-off frequency, a frequency response, a geometry of the object, an acoustic surface parameter of the object, a reflectance value, an absorbance value, and a material of the object. In some aspects, spatially rendering the 3D scene includes generating a set of binaural audio signals by applying a head-related transfer function to the audio signals based on the positions of the sound sources relative to the listener positions.
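A toy version of that filter chain is sketched below: an audio filter is derived from one acoustic parameter (a cut-off frequency), applied to the signal, and the result is then given per-channel spatial gains standing in for a spatial filter such as an HRTF pair. The one-pole low-pass, the gain values, and the sample rate are assumptions made for illustration, not the renderer's actual filters.

import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate=48000.0):
    """Simple low-pass whose coefficient is derived from the cut-off frequency parameter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    output, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        output.append(y)
    return output


def spatially_render(samples, left_gain, right_gain):
    """Stand-in for spatial filtering: scale the filtered signal per output channel."""
    return ([left_gain * s for s in samples], [right_gain * s for s in samples])


dry = [0.0, 1.0, 0.0, -1.0, 0.0]
filtered = one_pole_lowpass(dry, cutoff_hz=2000.0)     # e.g., a muffled reflection
left, right = spatially_render(filtered, left_gain=0.8, right_gain=0.3)
print([round(s, 3) for s in left])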
According to another aspect of the present disclosure, there is included an encoder-side method comprising: receiving an audio signal of an audio program, wherein the audio signal is for a three-dimensional (3D) scene of the audio program; determining that a 3D sub-scene exists within the 3D scene; determining 1) a position of the 3D sub-scene within the 3D scene and 2) a position of a sound source of the audio signal within the 3D sub-scene; generating a first bitstream comprising a first set of metadata by encoding the audio signal, wherein the first set of metadata has a position of a 3D sub-scene and a position of a sound source; transmitting the first bitstream to a second electronic device; determining that the position of the 3D sub-scene has changed; generating a second bitstream comprising the encoded audio signal and a second set of metadata having a changed position of the 3D sub-scene; and transmitting the second bit stream to the second electronic device.
In one aspect, the sound source is a first sound source, wherein determining that a 3D sub-scene exists includes determining that a location of the first sound source has the same trajectory within the 3D scene as a location of a second sound source within the 3D scene. In another aspect, determining the location of the 3D sub-scene includes assigning a position within the 3D scene as an origin of the 3D sub-scene. In some aspects, a location of a first sound source for a first audio signal is determined relative to an origin of a 3D sub-scene. In another aspect, determining that the position of the 3D sub-scene has changed includes determining that the position of the origin within the 3D scene has moved relative to the origin of the 3D scene.
In one aspect, determining that a 3D sub-scene exists includes determining that a sound source moves in the same trajectory as the 3D sub-scene. In another aspect, the sound source is a first sound source, wherein the method further comprises determining 1) a position of the second sound source within the 3D scene and 2) a set of acoustic parameters associated with the second sound source, wherein the first set of metadata in the first bitstream further comprises the position of the second sound source and the set of acoustic parameters. In another aspect, the second bitstream is transmitted after the first bitstream, and the second set of metadata transmitted with the second bitstream does not include the location of the second sound source and the set of acoustic parameters. In another aspect, the set of acoustic parameters includes at least one of a diffusion level, a cut-off frequency, a frequency response, a geometry of the object, an acoustic surface parameter of the object, a reflectance value, an absorbance value, and a material of the object.
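As a rough illustration of that encoder-side grouping decision, the sketch below flags two sources as belonging to one sub-scene when their sampled positions keep a constant offset (i.e., they follow the same trajectory), and then assigns a position as the sub-scene origin. The tolerance, the sampled tracks, and the grouping rule are assumptions for the example.

def same_trajectory(track_a, track_b, tolerance=1e-3):
    """True if the offset between two position tracks stays (nearly) constant."""
    offsets = [(ax - bx, ay - by, az - bz)
               for (ax, ay, az), (bx, by, bz) in zip(track_a, track_b)]
    first = offsets[0]
    return all(abs(component - ref) <= tolerance
               for offset in offsets
               for component, ref in zip(offset, first))


# Two sources on the same moving ship: their relative offset never changes.
bell = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0), (2.5, 0.0, 3.0)]
engine = [(0.0, 0.0, -5.0), (1.0, 0.0, -5.0), (2.5, 0.0, -5.0)]
if same_trajectory(bell, engine):
    sub_scene_origin = bell[0]    # e.g., assign a position within the 3D scene as the origin
    print("group sources into one 3D sub-scene, origin:", sub_scene_origin)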
While certain aspects have been described and shown in the accompanying drawings, it is to be understood that such aspects are merely illustrative of and not restrictive on the broad invention, and that this invention not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
To assist the patent office and any readers of any patent issued on this application in interpreting the appended claims, the applicant wishes to note that it does not intend any of the appended claims or claim elements to invoke 35 U.S.C. 112(f) unless the words "means for" or "step for" are explicitly used in the particular claim.
It is well known that the use of personally identifiable information should follow privacy policies and practices that are recognized as meeting or exceeding industry or agency requirements for maintaining user privacy. In particular, personally identifiable information data should be managed and processed to minimize the risk of inadvertent or unauthorized access or use, and the nature of authorized use should be specified to the user.
As previously described, one aspect of the present disclosure may be a non-transitory machine-readable medium (such as a microelectronic memory) having instructions stored thereon that program one or more data processing components (generally referred to herein as "processors") to perform encoding, decoding, and spatial rendering operations, network operations, and audio signal processing operations, as described herein. In other aspects, some of these operations may be performed by specific hardware components that contain hardwired logic. Alternatively, those operations may be performed by any combination of programmed data processing components and fixed hardwired circuitry components.
While certain aspects have been described and shown in the accompanying drawings, it is to be understood that such aspects are merely illustrative of and not restrictive on the broad disclosure, and that this disclosure not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
In some aspects, the disclosure may include language such as "at least one of [element A] and [element B]". This language may refer to one or more of these elements. For example, "at least one of A and B" may refer to "A", "B", or "A and B". In particular, "at least one of A and B" may refer to "at least one of A and at least one of B" or "at least one of either A or B". In some aspects, the disclosure may include language such as "[element A], [element B], and/or [element C]". This language may refer to any one of these elements or any combination thereof. For example, "A, B, and/or C" may refer to "A", "B", "C", "A and B", "A and C", "B and C", or "A, B, and C".

Claims (20)

1. A method, comprising:
Receiving a bitstream, the bitstream comprising:
an encoded version of an audio signal associated with a sound source within a first three-dimensional (3D) scene, a scene tree structure comprising an origin of the first 3D scene relative to an origin of a second 3D scene, and
a location of the sound source within the first 3D scene relative to the origin of the first 3D scene, wherein the location references the origin of the first 3D scene using an identifier, wherein the scene tree structure defines an initial configuration of the sound source relative to the first 3D scene and the second 3D scene;
determining a position of a listener relative to the origin of the first 3D scene;
generating a set of spatially rendered audio signals by spatially rendering the audio signals according to the position of the sound source relative to the position of the listener; and
driving a set of speakers using the set of spatially rendered audio signals to produce the sound source.
2. The method of claim 1, wherein the identifier is a first identifier, wherein the origin of the first 3D scene comprises a location of the first identifier and the origin of the first 3D scene relative to the origin of the second 3D scene, wherein the location of the origin of the first 3D scene references the origin of the second 3D scene using a second identifier.
3. The method of claim 2, wherein the bitstream is a first bitstream, wherein the method further comprises:
receiving a second bitstream comprising a location update payload, the location update payload comprising a new location of the origin of the first 3D scene relative to the origin of the second 3D scene, the new location referencing the origin of the second 3D scene using the second identifier;
determining that the position of the sound source has moved in accordance with movement of the origin of the first 3D scene from its original position to its new position; and
adjusting the spatial rendering of the audio signal based on the movement of the position of the sound source.
4. The method of claim 1, wherein the location of the sound source comprises a maximum distance parameter and encoded location data, wherein the method further comprises determining a decoded location of the sound source at a spatial resolution based on the maximum distance parameter and the encoded location data, wherein the decoded location of the sound source relative to the location of the listener is used to spatially render the audio signal.
5. The method of claim 4, wherein the 3D scene is part of an audio program being received through the bitstream, wherein the spatial resolution remains constant as the position of the sound source changes within the 3D scene during a playback session of the audio program.
6. The method of claim 4, wherein the maximum distance parameter is a first maximum distance parameter and the spatial resolution is a first spatial resolution, wherein the method further comprises:
receiving a new position of the sound source, the new position comprising a second maximum distance parameter and new encoded position data; and
determining a new decoded position of the sound source at a second spatial resolution based on the second maximum distance parameter and the new encoded position data, wherein the second spatial resolution is different from the first spatial resolution.
7. The method of claim 1, wherein the bitstream is a first bitstream, and the method further comprises:
obtaining a second bitstream, the second bitstream comprising:
said encoded version of said audio signal, and
a new position of the sound source relative to the origin of the first 3D scene, the new position being different from the position of the sound source, the new position referencing the origin of the first 3D scene using the identifier; and
adjusting the spatial rendering of the audio signal based on the new position.
8. An electronic device, comprising:
At least one processor; and
a memory having instructions stored therein that, when executed by the at least one processor, cause the electronic device to:
receiving a bitstream, the bitstream comprising:
an encoded version of an audio signal associated with a sound source within a first three-dimensional (3D) scene,
a scene tree structure comprising an origin of a second 3D scene relative to an origin of the first 3D scene, and
a location of the sound source within the first 3D scene relative to the origin of the first 3D scene, wherein the location references the origin of the first 3D scene using an identifier, wherein the scene tree structure defines an initial configuration of the sound source relative to the first 3D scene and the second 3D scene;
determining a position of a listener relative to the origin of the first 3D scene;
generating a set of spatially rendered audio signals by spatially rendering the audio signals according to the position of the sound source relative to the position of the listener; and
driving a set of speakers using the set of spatially rendered audio signals to produce the sound source.
9. The electronic device of claim 8, wherein the location of the sound source comprises encoded location data of the sound source, the encoded location data comprising at least one of encoded coordinate data relative to the origin of the first 3D scene and encoded rotation data indicative of an orientation of the sound source relative to the origin of the first 3D scene in a coordinate system.
10. The electronic device of claim 9, wherein the encoded location data comprises a maximum distance parameter and the encoded coordinate data comprises a set of encoded Cartesian coordinates, wherein the instructions further comprise determining a set of Cartesian coordinates of the location of the sound source within the coordinate system relative to the origin of the first 3D scene by scaling a normalized set of the encoded Cartesian coordinates with the maximum distance parameter.
11. The electronic device of claim 10, wherein the memory has further instructions for:
determining a number of added bits of each of the encoded Cartesian coordinates based on a four-bit identifier in the received bitstream; and
determining a total number of bits including the number of added bits for each of the encoded Cartesian coordinates, wherein the total number of bits includes at least six bits, wherein the normalized set of encoded Cartesian coordinates is scaled according to the total number of bits.
12. The electronic device of claim 9, wherein the encoded location data comprises a set of encoded spherical coordinates including an encoded azimuth value, an encoded elevation value, and an encoded radius, wherein the memory has further instructions for determining a set of spherical coordinates of the location of the sound source relative to the origin of the first 3D scene within the coordinate system, the set of spherical coordinates including
an azimuth value and an elevation value based respectively on the encoded azimuth value and the encoded elevation value using a first normalization function,
and a radius based on the encoded radius using a second normalization function.
13. The electronic device of claim 12, wherein the encoded azimuth value is an integer of at least seven bits, the encoded elevation value is an integer of at least six bits, and the encoded radius value is an integer of at least five bits.
14. The electronic device of claim 9, wherein the memory has further instructions for:
determining whether the position of the sound source includes the rotation data based on a one-bit value; and
in response to determining that the location of the sound source includes the rotation data, extracting four encoded quaternions from the bitstream that are indicative of the orientation of the sound source, wherein each of the encoded quaternions is an integer of at least eight bits in size,
wherein the set of spatially rendered audio signals is spatially rendered based on the four encoded quaternions.
15. A non-transitory machine-readable medium having instructions that, when executed by at least one processor of an electronic device, cause the electronic device to:
Receiving a bitstream, the bitstream comprising:
audio content of a first three-dimensional (3D) scene, and
encoded metadata comprising an origin of a second 3D scene relative to an origin of the first 3D scene and a position of a sound source within the first 3D scene relative to the origin of the first 3D scene, wherein the position references the origin of the first 3D scene using an identifier;
determining a listener position relative to the origin of the first 3D scene; and
spatially rendering the audio content according to the position of the sound source relative to the listener position.
16. The non-transitory machine readable medium of claim 15, wherein the identifier is a first identifier, wherein the origin of the first 3D scene comprises a location of the first identifier and the origin of the first 3D scene relative to the origin of the second 3D scene, wherein the location of the origin of the first 3D scene references the origin of the second 3D scene using a second identifier.
17. The non-transitory machine readable medium of claim 16, wherein the encoded metadata comprises:
the second identifier being a one-bit integer indicating that the origin of the second 3D scene is a 3D global origin of a 3D global scene, and
the first identifier being a six-bit integer.
18. The non-transitory machine readable medium of claim 15, comprising further instructions to:
receiving new metadata through the bitstream, the new metadata including a location update of the location of the sound source; and
adjusting the spatially rendered audio content according to the location update.
19. The non-transitory machine readable medium of claim 18, wherein the location update is encoded into the new metadata using delta encoding, wherein a delta between a new location of the sound source and a previous location of the sound source is encoded into the new metadata, wherein the new metadata comprises less data than the metadata.
20. The non-transitory machine readable medium of claim 19, wherein the new metadata comprises a single bit having a value indicating that the location update has been encoded using the delta encoding.