CN116917985A - Apparatus and method for processing multi-channel audio signal - Google Patents

Apparatus and method for processing multi-channel audio signal

Info

Publication number
CN116917985A
CN116917985A
Authority
CN
China
Prior art keywords
audio
channel
audio signal
signal
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280011393.9A
Other languages
Chinese (zh)
Inventor
李泰美
高祥铁
金敬来
金善民
金正奎
南佑铉
孙允宰
郑铉权
黄盛熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020210140579A (KR20220107913A)
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority claimed from PCT/KR2022/001314 (WO2022158943A1)
Publication of CN116917985A


Abstract

According to various embodiments of the present disclosure, an audio processing apparatus includes at least one processor configured to execute one or more instructions to obtain a second audio signal downmixed from at least one first audio signal, obtain information related to error cancellation of the at least one first audio signal, unmix the at least one first audio signal from the downmixed second audio signal, and reconstruct the at least one first audio signal by applying the information related to error cancellation of the at least one first audio signal to the at least one first audio signal unmixed from the second audio signal. The information related to error cancellation is generated using at least one of an original signal power of the at least one first audio signal or a second signal power of the at least one first audio signal after decoding.

Description

Apparatus and method for processing multi-channel audio signal
Technical Field
The present disclosure relates to the field of processing multichannel audio signals. In particular, the present disclosure relates to the field of processing, from a multi-channel audio signal, an audio signal of a three-dimensional (3D) audio channel layout in front of a listener.
Background
Audio signals are typically two-dimensional (2D) audio signals, such as 2-channel, 5.1-channel, 7.1-channel, and 9.1-channel audio signals.
However, because a 2D audio signal carries no definite audio information in the height direction, a three-dimensional (3D) audio signal (e.g., an n-channel audio signal or multi-channel audio signal, where n is an integer greater than 2) may be needed to provide a spatial 3D effect of sound.
In a conventional channel layout for 3D audio signals, channels are arranged omnidirectionally around a listener. However, with the expansion of over-the-top (OTT) services, the increase in television (TV) resolution, and the growth of screens on electronic devices such as tablet computers, there is increasing demand from viewers who want to experience immersive sound, such as theater content, in a home environment. Therefore, considering the sound-image representation of objects (e.g., sound sources) on a screen, it is necessary to process an audio signal of a 3D audio channel layout in which channels are arranged in front of the listener (e.g., a front-of-listener 3D audio channel layout).
Furthermore, a conventional 3D audio signal processing system encodes/decodes an independent audio signal for each independent channel of the 3D audio signal. In particular, to restore a two-dimensional (2D) audio signal such as a conventional stereo audio signal after the 3D audio signal has been reconstructed, the reconstructed 3D audio signal must be downmixed.
Disclosure of Invention
Technical problem
One or more embodiments of the present disclosure provide for processing of multi-channel audio signals for supporting a three-dimensional (3D) audio channel layout in front of a listener.
Technical proposal
To overcome the technical problem, various embodiments of the present disclosure provide an audio processing method including generating a second audio signal by downmixing at least one first audio signal.
The audio processing method further includes generating first information related to error cancellation (error removal) of the at least one first audio signal using at least one of an original signal power of the at least one first audio signal or a second signal power of the at least one first audio signal after decoding.
The audio processing method further comprises transmitting first information related to error cancellation of the at least one first audio signal and a downmixed second audio signal.
In some embodiments, the first information related to error cancellation of the at least one first audio signal may comprise second information about a factor for error cancellation. In such embodiments, generating the first information may include generating the second information when an original signal power of the at least one first audio signal is less than or equal to a first value; in that case, the second information may indicate that the value of the factor for error cancellation is 0.
In other embodiments, generating the first information may include generating the second information based on the original signal power of the at least one first audio signal and the second signal power of the decoded at least one first audio signal, when a first ratio of the original signal power of the at least one first audio signal to the original signal power of the second audio signal is less than a second value. In such embodiments, the second information may indicate that the value of the factor for error cancellation is a second ratio of the original signal power of the at least one first audio signal to the second signal power of the decoded at least one first audio signal.
In other embodiments, when the second ratio of the original signal power of the at least one first audio signal to the second signal power of the decoded at least one first audio signal is greater than 1, the second information may indicate that the value of the factor for error cancellation is 1.
In other embodiments, when the ratio of the original signal power of the at least one first audio signal to the original signal power of the second audio signal is greater than or equal to the second value, the second information may indicate that the value of the factor for error cancellation is 1.
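The rules above can be read as one selection procedure. The following Python snippet is a minimal sketch, assuming per-frame power values are already available; the threshold constants, the names, and the guard against a zero decoded power are illustrative assumptions, not values or syntax from the disclosure.

```python
# Hypothetical sketch of the factor-selection rules described above.
POWER_FLOOR = 1e-9       # the "first value": treat quieter originals as silent
RATIO_THRESHOLD = 0.5    # the "second value": dominance of this channel in the downmix

def error_removal_factor(p_original: float, p_decoded: float,
                         p_downmix: float) -> float:
    """Return the error-removal (scaling) factor for one channel and frame."""
    if p_original <= POWER_FLOOR:
        return 0.0                            # near-silent original: mute the channel
    if p_original / p_downmix >= RATIO_THRESHOLD:
        return 1.0                            # channel dominates the downmix: keep as-is
    if p_decoded <= 0.0:
        return 1.0                            # illustrative guard, not from the disclosure
    return min(p_original / p_decoded, 1.0)   # power ratio, clamped so it never amplifies
```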
In other embodiments, the first information related to error cancellation of the at least one first audio signal may be generated for each frame of the second audio signal.
In other embodiments, the downmixed second audio signal may include a third audio signal of the base channel group and a fourth audio signal of the slave channel group. In such an embodiment, the fourth audio signal of the slave channel group may comprise a fifth audio signal of a first slave channel, the fifth audio signal comprising a sixth audio signal of an independent channel included in a first 3D audio channel in front of the listener. In such an embodiment, a seventh audio signal of a second 3D audio channel at the side and rear of the listener may be obtained by mixing with the fifth audio signal of the first slave channel.
In other embodiments, the third audio signal of the basic channel group may include an eighth audio signal of a second channel and a ninth audio signal of a third channel. In such an embodiment, the eighth audio signal of the second channel may have been generated by mixing the tenth audio signal of the left stereo channel with the decoded audio signal of the center channel in front of the listener. In such an embodiment, the ninth audio signal of the third channel may have been generated by mixing the eleventh audio signal of the right stereo channel with the decoded audio signal of the center channel in front of the listener.
In other embodiments, the downmixed second audio signal may include a third audio signal of the base channel group and a fourth audio signal of the slave channel group. In such an embodiment, transmitting the first information related to the error cancellation of the at least one first audio signal and the downmixed second audio signal may comprise generating a bitstream comprising the first information related to the error cancellation of the at least one first audio signal and the second information related to the downmixed second audio signal. Transmitting the first information related to error cancellation of the at least one first audio signal and the downmixed second audio signal may further comprise transmitting a bitstream.
In such an embodiment, the bitstream may comprise a file stream of a plurality of audio tracks. In such an embodiment, the generation of the bitstream may comprise generating a first audio stream comprising a first audio track of the compressed third audio signal of the basic set of channels. The generating of the bitstream may further include generating a second audio stream including a second audio track of the slave channel audio signal identification information, the second audio track being adjacent to the first audio track. The generating of the bitstream may further include generating, when a fourth audio signal of a slave channel group corresponding to the third audio signal of the basic channel group exists, slave channel audio signal identification information indicating the existence of the fourth audio signal of the slave channel group.
In other embodiments, the second audio stream of the second audio track may comprise compressed fourth audio signals of the slave channel group when the slave channel audio signal identification information indicates that fourth audio signals of the slave channel group are present.
In other embodiments, when the slave channel audio signal identification information indicates that the fourth audio signal of the slave channel group does not exist, the second audio stream of the second audio track may include the fifth audio signal of the next track of the basic channel group.
In other embodiments, the downmixed second audio signal may include a third audio signal of the base channel group and a fourth audio signal of the slave channel group. In such an embodiment, the third audio signal of the base channel group may comprise a fifth audio signal of the stereo channels. In such an embodiment, transmitting the first information related to error cancellation of the at least one first audio signal and the downmixed second audio signal may comprise generating a bitstream comprising the first information related to error cancellation of the at least one first audio signal and the second information of the downmixed second audio signal, and transmitting the bitstream. In such an embodiment, the generation of the bitstream may comprise generating a base channel audio stream comprising the compressed fifth audio signal of the stereo channels. The generating may further include generating a plurality of slave channel audio streams comprising a plurality of audio signals of a plurality of slave channel groups. The plurality of slave channel audio streams may include a first slave channel audio stream and a second slave channel audio stream. In such an embodiment, suppose that, for a first multi-channel audio signal used to generate the base channel audio stream and the first slave channel audio stream, the number of surround channels is S_{n-1}, the number of subwoofer channels is W_{n-1}, and the number of overhead channels is H_{n-1}; and that, for a second multi-channel audio signal used to generate the first and second slave channel audio streams, the number of surround channels is S_n, the number of subwoofer channels is W_n, and the number of overhead channels is H_n. Then S_{n-1} may be less than or equal to S_n, W_{n-1} may be less than or equal to W_n, and H_{n-1} may be less than or equal to H_n, but S_{n-1}, W_{n-1}, and H_{n-1} may not all be equal to S_n, W_n, and H_n, respectively.
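This channel-count constraint reads more plainly as code. A minimal sketch, assuming each layout is summarized as a (surround, subwoofer, overhead) count triple; the function name is a hypothetical illustration.

```python
# Counts may only grow from one channel group to the next, and at least
# one count must actually grow.
def valid_extension(prev: tuple[int, int, int], nxt: tuple[int, int, int]) -> bool:
    """prev/nxt are (surround, subwoofer, overhead) counts, e.g. (3, 1, 2)."""
    s0, w0, h0 = prev
    s1, w1, h1 = nxt
    monotonic = s0 <= s1 and w0 <= w1 and h0 <= h1   # no count shrinks
    grows = (s0, w0, h0) != (s1, w1, h1)             # not all equal
    return monotonic and grows

assert valid_extension((3, 1, 2), (5, 1, 2))      # 3.1.2 -> 5.1.2 is a valid step
assert not valid_extension((5, 1, 2), (3, 1, 2))  # shrinking is not
```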
In other embodiments, the audio processing method may further include generating an audio object signal of the 3D audio channel in front of the listener, the audio object signal indicating at least one of an audio signal, a position or a direction of the audio object. In such embodiments, transmitting the first information related to error cancellation of the at least one first audio signal and the downmixed second audio signal may include generating a bitstream including the first information related to error cancellation of the at least one first audio signal, the audio object signals of the 3D audio channels in front of the listener, and the second information about the downmixed second audio signal.
Transmitting the first information related to error cancellation of the at least one first audio signal and the downmixed second audio signal may further comprise transmitting a bitstream.
To overcome the technical problem, various embodiments of the present disclosure provide an audio processing method that includes obtaining, from a bitstream, a second audio signal downmixed from at least one first audio signal. The audio processing method further includes obtaining first information related to error cancellation of the at least one first audio signal from the bitstream. The audio processing method further includes unmixing the at least one first audio signal from the downmixed second audio signal. The audio processing method further includes reconstructing the at least one first audio signal by applying the first information related to error cancellation of the at least one first audio signal to the unmixed at least one first audio signal. The first information related to error cancellation of the at least one first audio signal is generated using at least one of an original signal power of the at least one first audio signal or a second signal power of the at least one first audio signal after decoding. In some embodiments, the first information related to error cancellation of the at least one first audio signal may comprise second information about a factor for error cancellation. In such embodiments, the factor for error cancellation may be greater than or equal to 0 and less than or equal to 1.
In other embodiments, reconstructing the at least one first audio signal may comprise reconstructing the at least one first audio signal to have a third signal power equal to a product of a fourth signal power of the at least one first audio signal and a factor for error cancellation.
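On the decoder side, this reconstruction amounts to a power rescaling. A minimal sketch in Python/NumPy, assuming the factor multiplies signal power, so each sample is scaled by the factor's square root; the names and frame length are illustrative.

```python
import numpy as np

def apply_error_removal(unmixed: np.ndarray, factor: float) -> np.ndarray:
    """Rescale an unmixed channel so its power equals (unmixed power * factor).

    Scaling samples by sqrt(factor) scales the signal power (the mean of the
    squared samples) by exactly `factor`.
    """
    return unmixed * np.sqrt(factor)

frame = np.random.randn(960)               # one unmixed audio frame (illustrative)
restored = apply_error_removal(frame, 0.8)
assert np.isclose(np.mean(restored**2), 0.8 * np.mean(frame**2))
```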
In other embodiments, the bitstream may include second information about a third audio signal of the basic channel group and third information about a fourth audio signal of the slave channel group. In such an embodiment, the third audio signal of the basic channel group may have been obtained by decoding the second information included in the bitstream, without unmixing with an audio signal of another channel group. The audio processing method may further include reconstructing a fifth audio signal of an upmix channel group including at least one upmix channel, by unmixing the third audio signal of the basic channel group using the fourth audio signal of the slave channel group.
In other embodiments, the fourth audio signal of the slave channel group may comprise the first slave channel audio signal and the second slave channel audio signal. In such embodiments, the first slave channel audio signal may comprise a sixth audio signal of an independent channel in front of the listener and the second slave channel audio signal may comprise a mixed audio signal of audio signals of channels on the side and behind the listener.
In other embodiments, the third audio signal of the basic channel group may include a sixth audio signal of the first channel and a seventh audio signal of the second channel. In such an embodiment, the sixth audio signal of the first channel may have been generated by mixing the eighth audio signal of the left stereo channel with the decoded audio signal of the center channel in front of the listener, and the seventh audio signal of the second channel may have been generated by mixing the ninth audio signal of the right stereo channel with the compressed and decompressed audio signal of the center channel in front of the listener.
In other embodiments, the basic channel group may include a single channel or a stereo channel, and the at least one upmix channel may be a discrete audio channel that is at least one channel other than a channel of the basic channel group among 3D audio channels in front of the listener or 3D audio channels located in all directions around the listener.
In other embodiments, the 3D audio channel in front of the listener may be a 3.1.2 channel layout. The 3.1.2 channels may include three surround channels in front of the listener, one subwoofer channel in front of the listener, and two overhead channels in front of the listener. The 3D audio channels located in all directions around the listener may include at least one of the 5.1.2 channels or the 7.1.4 channels. The 5.1.2 channels may include three surround channels in front of the listener, two surround channels at the side and rear of the listener, one subwoofer channel in front of the listener, and two overhead channels in front of the listener. The 7.1.4 channels may include three surround channels in front of the listener, four surround channels at the side and rear of the listener, one subwoofer channel in front of the listener, two overhead channels in front of the listener, and two overhead channels at the side and rear of the listener.
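For reference, the three layouts just described can be tabulated by position. A minimal illustrative structure; the tuple layout is an assumption made for readability.

```python
# (front surround, side/rear surround, subwoofer, front overhead, side/rear overhead)
LAYOUT_COMPOSITION = {
    "3.1.2": (3, 0, 1, 2, 0),
    "5.1.2": (3, 2, 1, 2, 0),
    "7.1.4": (3, 4, 1, 2, 2),
}
```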
In other embodiments, the unmixed first audio signal may include a sixth audio signal of the at least one upmix channel and a seventh audio signal of the independent channel. In such an embodiment, the seventh audio signal of the independent channel may comprise a first portion of the third audio signal of the basic channel set and a second portion of the fourth audio signal of the dependent channel set.
In other embodiments, the bitstream may include a file stream of a plurality of audio tracks including a first audio track and a second audio track adjacent to each other. In such an embodiment, the third audio signal of the basic channel group may have been obtained from the first audio track and the slave channel audio signal identification information may have been obtained from the second audio track.
In other embodiments, the fourth audio signal of the slave channel group may have been obtained from the second audio track when the obtained slave channel audio signal identification information indicates that the slave channel audio signal is present in the second audio track.
In other embodiments, when the obtained slave channel audio signal identification information indicates that there is no slave channel audio signal in the second audio track, a fourth audio signal of a next track of the basic channel group may have been obtained from the second audio track.
In other embodiments, the bitstream may include a base channel audio stream and a plurality of slave channel audio streams. The plurality of slave channel audio streams may include a first slave channel audio stream and a second slave channel audio stream. The base channel audio stream may comprise audio signals of stereo channels. In such an embodiment, suppose that, for a first multi-channel audio signal reconstructed through the base channel audio stream and the first slave channel audio stream, the number of surround channels is S_{n-1}, the number of subwoofer channels is W_{n-1}, and the number of overhead channels is H_{n-1}; and that, for a second multi-channel audio signal reconstructed through the first and second slave channel audio streams, the number of surround channels is S_n, the number of subwoofer channels is W_n, and the number of overhead channels is H_n. Then S_{n-1} may be less than or equal to S_n, W_{n-1} may be less than or equal to W_n, and H_{n-1} may be less than or equal to H_n, but S_{n-1}, W_{n-1}, and H_{n-1} may not all be equal to S_n, W_n, and H_n, respectively.
In other embodiments, the audio processing method may further comprise obtaining an audio object signal of a 3D audio channel in front of the listener from the bitstream, the audio object signal indicating at least one of the audio signal, a position or a direction of the audio object. The audio signal of the 3D audio channel in front of the listener may have been reconstructed based on a sixth audio signal of the 3D audio channel in front of the listener, the sixth audio signal being generated from the third audio signal of the basic channel group and the fourth audio signal of the dependent channel group, and the audio object signal of the 3D audio channel in front of the listener.
In other embodiments, the audio processing method may further include obtaining multi-channel audio-related additional information from the bitstream, wherein the multi-channel audio-related additional information may include at least one of: information on the total number of audio streams including the base channel audio stream and the slave channel audio streams, downmix gain information, channel mapping information, volume information, low-frequency effect (LFE) gain information, dynamic range control (DRC) information, second information of channel layout rendering information, third information about the number of coupled audio streams, fourth information indicating a multi-channel layout, fifth information about whether a dialogue is present in the audio signal and its dialogue level, sixth information indicating whether the LFE is output, seventh information about whether an audio object is present on a screen, eighth information about whether a continuous channel audio signal or a discrete channel audio signal is present, or downmix information including at least one downmix parameter of a downmix matrix for generating the multi-channel audio signal.
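As a reading aid, the additional information listed above can be gathered into a single structure. A minimal sketch; the field names are illustrative assumptions, not syntax-element names from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class MultichannelAdditionalInfo:
    """Illustrative container for the multi-channel audio-related additional info."""
    num_audio_streams: int            # total of base + slave channel audio streams
    downmix_gain: float
    channel_map: list[int]
    volume: float
    lfe_gain: float
    drc_info: bytes                   # opaque dynamic range control payload
    layout_rendering_info: bytes
    num_coupled_streams: int
    multichannel_layout: str          # e.g. "7.1.4"
    has_dialogue: bool
    dialogue_level: float
    lfe_output: bool
    object_on_screen: bool
    discrete_channels: bool           # discrete vs. continuous channel audio
    downmix_parameters: list[float]   # coefficients of the downmix matrix
```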
To overcome this technical problem, various embodiments of the present disclosure provide an audio processing apparatus comprising a memory storing one or more instructions and at least one processor communicatively coupled to the memory and configured to execute the one or more instructions to obtain a second audio signal from a bitstream that is downmixed from at least one first audio signal.
The at least one processor may be further configured to obtain information related to error cancellation of the at least one first audio signal from the bitstream. The at least one processor may be further configured to unmix the at least one first audio signal from the downmixed second audio signal. The at least one processor may be further configured to reconstruct the at least one first audio signal by applying the information related to error cancellation of the at least one first audio signal to the at least one first audio signal unmixed from the second audio signal. The information related to error cancellation of the at least one first audio signal may have been generated using at least one of an original signal power of the at least one first audio signal or a second signal power of the at least one first audio signal after decoding.
To overcome the technical problem, various embodiments of the present disclosure provide an audio processing method that includes generating a second audio signal by downmixing at least one first audio signal. The audio processing method further includes generating information related to error cancellation of the at least one first audio signal using at least one of an original signal power of the at least one first audio signal or a second signal power of the decoded at least one first audio signal. The audio processing method further includes generating, from the information related to error cancellation, an audio signal of a low-frequency effect (LFE) channel, using a neural network for generating the audio signal of the LFE channel. The audio processing method further includes transmitting the downmixed second audio signal and the audio signal of the LFE channel.
To overcome the technical problem, various embodiments of the present disclosure provide an audio processing method that includes obtaining, from a bitstream, a second audio signal downmixed from at least one first audio signal. The audio processing method further includes obtaining an audio signal of the LFE channel from the bitstream. The audio processing method further includes obtaining information related to error cancellation of the at least one first audio signal, using a neural network for obtaining additional information from the obtained audio signal of the LFE channel. The audio processing method further includes reconstructing the at least one first audio signal by applying the information related to error cancellation to the at least one first audio signal upmixed from the second audio signal. The information related to error cancellation may have been generated using at least one of an original signal power of the at least one first audio signal or a second signal power of the at least one first audio signal after decoding.
To overcome the technical problem, various embodiments of the present disclosure provide a computer-readable storage medium storing instructions that, when executed by at least one processor of an audio processing apparatus, cause the audio processing apparatus to perform an audio processing method.
Advantageous effects
With the method and apparatus for processing a multi-channel audio signal according to various embodiments of the present disclosure, an audio signal of a 3D audio channel layout in front of a listener may be encoded and an audio signal of a 3D audio channel layout in all directions around the listener may be encoded while supporting backward compatibility with a conventional stereo (e.g., 2-channel) audio signal.
With the method and apparatus for processing a multi-channel audio signal according to various embodiments of the present disclosure, an audio signal of a 3D audio channel layout in front of a listener can be decoded and an audio signal of a 3D audio channel layout in all directions around the listener can be decoded while supporting backward compatibility with a conventional stereo (e.g., 2-channel) audio signal.
However, effects achieved by the apparatus and method for processing a multi-channel audio signal according to various embodiments of the present disclosure are not limited to those described above, and other effects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present disclosure pertains.
Drawings
Fig. 1a is a diagram for describing a scalable channel layout structure according to various embodiments of the present disclosure.
Fig. 1b is a view for describing an example of a detailed scalable audio channel layout structure.
Fig. 2a is a block diagram of a structure of an audio encoding apparatus according to various embodiments of the present disclosure.
Fig. 2b is a block diagram of a structure of an audio encoding apparatus according to various embodiments of the present disclosure.
Fig. 2c is a block diagram of a structure of a multi-channel audio signal processor according to various embodiments of the present disclosure.
Fig. 2d is a view for describing an example of detailed operation of an audio signal classifier according to various embodiments of the present disclosure.
Fig. 3a is a block diagram of a structure of a multi-channel audio decoder according to various embodiments of the present disclosure.
Fig. 3b is a block diagram of a structure of a multi-channel audio decoder according to various embodiments of the present disclosure.
Fig. 3c is a block diagram of a structure of a multi-channel audio signal reconstructor in accordance with various embodiments of the present disclosure.
Fig. 3d is a block diagram of the structure of an upmix channel group audio generator in accordance with various embodiments of the present disclosure.
Fig. 4a is a block diagram of an audio encoding apparatus according to various embodiments of the present disclosure.
Fig. 4b is a block diagram of a reconstructor in accordance with various embodiments of the present disclosure.
Fig. 5a is a block diagram of a structure of an audio decoding apparatus according to various embodiments of the present disclosure.
Fig. 5b is a block diagram of the structure of a multi-channel audio signal reconstructor in accordance with various embodiments of the present disclosure.
Fig. 6 is a diagram illustrating a file structure according to various embodiments of the present disclosure.
Fig. 7a is a view for describing a detailed structure of a file according to various embodiments of the present disclosure.
Fig. 7b is a flowchart of a method of reproducing an audio signal by an audio decoding apparatus according to the file structure of fig. 7a.
Fig. 7c is a view for describing a detailed structure of a file according to various embodiments of the present disclosure.
Fig. 7d is a flowchart of a method of reproducing an audio signal by an audio decoding apparatus according to the file structure of fig. 7c.
Fig. 8a is a diagram for describing a file structure according to various embodiments of the present disclosure.
Fig. 8b is a flowchart of a method of reproducing an audio signal by an audio decoding apparatus according to the file structure of fig. 8a.
Fig. 9a is a view for describing an audio track package according to the file structure of fig. 7a.
Fig. 9b is a view for describing an audio track package according to the file structure of fig. 7c.
Fig. 9c is a view for describing an audio track package according to the file structure of fig. 8a.
Fig. 10 is a view for describing additional information of a metadata header/metadata audio packet according to various embodiments of the present disclosure.
Fig. 11 is a view for describing an audio encoding apparatus according to various embodiments of the present disclosure.
Fig. 12 is a diagram for describing a metadata generator according to various embodiments of the present disclosure.
Fig. 13 is a view for describing an audio decoding apparatus according to various embodiments of the present disclosure.
Fig. 14 is a view for describing a 3.1.2 channel audio rendering unit, a 5.1.2 channel audio rendering unit, and a 7.1.4 channel audio rendering unit according to various embodiments of the present disclosure.
Fig. 15a is a flow chart for describing a process of determining a factor for error cancellation by an audio encoding device according to various embodiments of the present disclosure.
Fig. 15b is a flowchart for describing a process of determining a scale factor of an Ls5 signal by an audio encoding apparatus according to various embodiments of the present disclosure.
Fig. 15c is a flowchart for describing a process of generating an ls5_3 signal based on a factor of error cancellation of an audio encoding apparatus according to various embodiments of the present disclosure.
Fig. 16a is a diagram for describing a configuration of a bitstream for channel layout extension according to various embodiments of the present disclosure.
Fig. 16b is a diagram for describing a configuration of a bitstream for channel layout extension according to various embodiments of the present disclosure.
Fig. 16c is a diagram for describing a configuration of a bitstream for channel layout extension according to various embodiments of the present disclosure.
Fig. 17 is a diagram for describing a surround sound audio signal added to an audio signal for a 3.1.2 channel layout for channel layout expansion according to various embodiments of the present disclosure.
Fig. 18 is a view for describing a process of generating an object audio signal on a screen by an audio decoding apparatus based on an audio signal and sound source object information of a 3.1.2 channel layout according to various embodiments of the present disclosure.
Fig. 19 is a view for describing a transmission order and a rule of an audio stream in each channel group of an audio encoding apparatus according to various embodiments of the present disclosure.
Fig. 20a is a flow chart of a first audio processing method according to various embodiments of the present disclosure.
Fig. 20b is a flow chart of a second audio processing method according to various embodiments of the present disclosure.
Fig. 20c is a flow chart of a third audio processing method according to various embodiments of the present disclosure.
Fig. 20d is a flowchart of a fourth audio processing method according to various embodiments of the present disclosure.
Fig. 21 is a diagram for describing a process in which an audio encoding apparatus transmits metadata through a Low Frequency Effect (LFE) signal using a first neural network and obtains the metadata from the LFE signal using a second neural network according to various embodiments of the present disclosure.
Fig. 22a is a flow chart of a fifth audio processing method according to various embodiments of the present disclosure.
Fig. 22b is a flowchart of a sixth audio processing method according to various embodiments of the present disclosure.
Fig. 23 illustrates a mechanism for gradual downmixing of surround channels and overhead channels according to various embodiments of the present disclosure.
Detailed Description
Throughout the disclosure, the expressions "at least one of a, b or c" and "at least one of a, b and c" indicate only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or variations thereof.
The disclosure is capable of various modifications and various embodiments, and therefore specific embodiments of the disclosure are shown in the drawings and will be described in detail herein. It should be understood, however, that there is no intent to limit the disclosure to the particular embodiments of the disclosure, but rather, it should be understood to include all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure.
In describing the embodiments of the present disclosure, when it is determined that a detailed description of related art would unnecessarily obscure the subject matter, that detailed description is omitted. Moreover, the ordinal numbers (e.g., first, second, etc.) used herein are merely identification symbols that distinguish one component from another.
Furthermore, in this document, when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or directly coupled to the other element, but it should be understood that the element can be connected or coupled to the other element via the other element therebetween, unless otherwise indicated.
Further, for a component denoted by "unit", "module", or the like, two or more components may be integrated into one component, or one component may be divided into two or more for each detailed function. In addition to the primary functions of the components, each component to be described below may additionally perform functions responsible for some or all of the functions of other components, and some of the primary functions of the components may be dedicated to and performed by other components.
Here, the "Deep Neural Network (DNN)" may be a representative example of an artificial neural network model that simulates brain nerves, and is not limited to an artificial neural network model using a specific algorithm.
Here, a "parameter" may be a value used in the operation of each layer constituting the neural network, and may include, for example, the weights (and biases) applied when an input value is fed into a predetermined calculation formula. Parameters may be represented in the form of a matrix. A parameter may be a value set as a result of training, and may be updated with separate training data as needed.
Here, a "multi-channel audio signal" may refer to an audio signal of n channels (where n is an integer greater than 2). The "single channel audio signal" may be a one-dimensional (1D) audio signal, the "stereo channel audio signal" may be a two-dimensional (2D) audio signal, and the "multi-channel audio signal" may be a three-dimensional (3D) audio signal.
Here, the "channel (or speaker) layout" may represent a combination of at least one channel, and may specify a spatial arrangement of channels (or speakers). A channel used herein is a channel through which an audio signal is actually output, and thus may be referred to as a presentation channel.
For example, the channel layout may be an "X.Y.Z channel layout". Here, X may be the number of surround channels, Y may be the number of subwoofer channels, and Z may be the number of overhead channels. The channel layout may specify the spatial positions of the surround channels, subwoofer channels, and overhead channels.
Examples of the "channel (or speaker) layout" may include a 1.0.0 channel (or single-channel) layout, a 2.0.0 channel (or stereo-channel) layout, a 5.1.0 channel layout, a 5.1.2 channel layout, a 5.1.4 channel layout, a 7.1.0 channel layout, a 7.1.2 channel layout, and a 3.1.2 channel layout, but the channel layout is not limited thereto, and various other channel layouts are possible.
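As a small illustration of this notation, the counts can be read directly from the layout string. A toy Python sketch; the function name is hypothetical.

```python
def parse_layout(layout: str) -> tuple[int, int, int]:
    """Split an "X.Y.Z" layout string into (surround, subwoofer, overhead) counts."""
    surround, subwoofer, overhead = (int(part) for part in layout.split("."))
    return surround, subwoofer, overhead

assert parse_layout("7.1.4") == (7, 1, 4)  # 7 surround, 1 subwoofer, 4 overhead
```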
Channels specified by a channel (or speaker) layout may be referred to by various names, but may be collectively named for ease of explanation.
Channels constituting a channel (speaker) layout may be named based on the respective spatial locations of the channels.
For example, the first surround channel of a 1.0.0 channel layout may be named single channel. For a 2.0.0 channel layout, the first surround channel may be designated as the L2 channel and the second surround channel may be designated as the R2 channel.
Here, "L" means a channel located on the left side of the listener, "R" means a channel located on the right side of the listener, "2" means the number of surround channels is 2.
For a 5.1.0 channel layout, the first surround channel may be designated as an L5 channel, the second surround channel may be designated as an R5 channel, the third surround channel may be designated as a C channel, the fourth surround channel may be designated as an Ls5 channel, and the fifth surround channel may be designated as an Rs5 channel. Here, "C" means a channel located at the center relative to the listener, and "s" means a channel located at one side. The first subwoofer channel of the 5.1.0 channel layout may be named the low-frequency effects (LFE) channel. Here, LFE refers to a low-frequency effect; in other words, the LFE channel is a channel for outputting low-frequency effect sound.
The surround sound channels of the 5.1.2 channel layout and the 5.1.4 channel layout may be named the same as the surround sound channels of the 5.1.0 channel layout. Similarly, the subwoofer channels of the 5.1.2 channel layout and the 5.1.4 channel layout may be named the same as the subwoofer channels of the 5.1.0 channel layout.
The first overhead channel of the 5.1.2 channel layout may be named the Hl5 channel, and the second overhead channel may be named the Hr5 channel. Here, "H" denotes a height (overhead) channel, "l" denotes a channel located on the left side of the listener, and "r" denotes a channel located on the right side of the listener.
For a 5.1.4 channel layout, the first overhead channel may be named Hfl channel, the second overhead channel may be named Hfr channel, the third overhead channel may be named Hbl channel, and the fourth overhead channel may be named Hbr channel. Here, "f" denotes a front channel with respect to a listener, and "b" denotes a rear channel with respect to a listener.
For the 7.1.0 channel layout, the first surround channel may be named L channel, the second surround channel may be named R channel, the third surround channel may be named C channel, the fourth surround channel may be named Ls channel, the fifth surround channel may be named Rs channel, the sixth surround channel may be named Lb channel, and the seventh surround channel may be named Rb channel.
The surround sound channels of the 7.1.2 channel layout and the 7.1.4 channel layout may be synonymous with the surround sound channels of the 7.1.0 channel layout. Similarly, each of the subwoofer channels of the 7.1.2 channel layout and the 7.1.4 channel layout may be named identically to the subwoofer channels of the 7.1.0 channel layout.
For the 7.1.2 channel layout, the first overhead channel may be named Hl7 channel and the second overhead channel may be named Hr7 channel.
For a 7.1.4 channel layout, the first overhead channel may be named Hfl channel, the second overhead channel may be named Hfr channel, the third overhead channel may be named Hbl channel, and the fourth overhead channel may be named Hbr channel.
For the 3.1.2 channel layout, the first surround channel may be named the L3 channel, the second surround channel may be named the R3 channel, and the third surround channel may be named the C channel. The first subwoofer channel of the 3.1.2 channel layout may be named the LFE channel. For a 3.1.2 channel layout, the first overhead channel may be named the Hfl3 channel (or Tl channel) and the second overhead channel may be named the Hfr3 channel (or Tr channel).
Here, some channels may be named differently according to the channel layout, but may represent the same channel. For example, the Hl5 channel and the Hl7 channel may be the same channel. Likewise, the Hr5 channel and the Hr7 channel may be the same channel.
In some embodiments, the channel is not limited to the channel names described above, and various other channel names may be used.
For example, the L2 channel may be named an L″ channel, the R2 channel may be named an R″ channel, the L3 channel may be named an ML3 (or L′) channel, the R3 channel may be named an MR3 (or R′) channel, the Hfl3 channel may be named an MHL3 channel, the Hfr3 channel may be named an MHR3 channel, the Ls5 channel may be named an MSL5 (or Ls′) channel, the Rs5 channel may be named an MSR5 channel, the Hl5 channel may be named an MHL5 (or Hl′) channel, the Hr5 channel may be named an MHR5 (or Hr′) channel, and the C channel may be named an MC channel.
The channels of the above channel layouts may be named as in Table 1.
TABLE 1
Channel layout Channel name
1.0.0 Single channel
2.0.0 L2/R2
5.1.0 L5/C/R5/Ls5/Rs5/LFE
5.1.2 L5/C/R5/Ls5/Rs5/Hl5/Hr5/LFE
5.1.4 L5/C/R5/Ls5/Rs5/Hfl/Hfr/Hbl/Hbr/LFE
7.1.0 L/C/R/Ls/Rs/Lb/Rb/LFE
7.1.2 L/C/R/Ls/Rs/Lb/Rb/Hl7/Hr7/LFE
7.1.4 L/C/R/Ls/Rs/Lb/Rb/Hfl/Hfr/Hbl/Hbr/LFE
3.1.2 L3/C/R3/Hfl3/Hfr3/LFE
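Table 1 can be kept as a lookup for later sections. The following mapping simply restates the table in Python; it is illustrative, and the single-channel entry is labeled "Mono" here for brevity.

```python
CHANNEL_NAMES = {
    "1.0.0": ["Mono"],  # "single channel" in Table 1
    "2.0.0": ["L2", "R2"],
    "5.1.0": ["L5", "C", "R5", "Ls5", "Rs5", "LFE"],
    "5.1.2": ["L5", "C", "R5", "Ls5", "Rs5", "Hl5", "Hr5", "LFE"],
    "5.1.4": ["L5", "C", "R5", "Ls5", "Rs5", "Hfl", "Hfr", "Hbl", "Hbr", "LFE"],
    "7.1.0": ["L", "C", "R", "Ls", "Rs", "Lb", "Rb", "LFE"],
    "7.1.2": ["L", "C", "R", "Ls", "Rs", "Lb", "Rb", "Hl7", "Hr7", "LFE"],
    "7.1.4": ["L", "C", "R", "Ls", "Rs", "Lb", "Rb", "Hfl", "Hfr", "Hbl", "Hbr", "LFE"],
    "3.1.2": ["L3", "C", "R3", "Hfl3", "Hfr3", "LFE"],
}
```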
The "transmission channel" is a channel for transmitting a compressed audio signal. A portion of the "transmission channels" may be the same as the "presentation channels", but is not limited thereto, and another portion may be mixing channels, i.e., channels whose audio signals are mixtures of the audio signals of presentation channels. In other words, a "transmission channel" may be a channel containing the audio signal of a "presentation channel", but may also be a channel of which a part is the same as a presentation channel and the rest is a mixing channel different from the presentation channels. A "transmission channel" may be named differently from a "presentation channel". For example, when the transmission channels are the A/B channels, the A/B channels may contain the audio signals of the L2/R2 channels. When the transmission channels are the T/P/Q channels, the T/P/Q channels may contain the audio signals of the C/LFE/Hfl3 and Hfr3 channels. When the transmission channels are the S/U/V channels, the S/U/V channels may contain the audio signals of the L/R, Ls/Rs, and Hfl/Hfr channels.
In the present disclosure, a "3D audio signal" may refer to an audio signal for detecting sound distribution and sound source positions in a 3D space.
In this disclosure, "3D audio channels in front of a listener" may refer to 3D audio channels based on the layout of audio channels in front of the listener. The "front-of-listener 3D audio channel" may be referred to as a "front 3D audio channel". In particular, the "front-of-listener 3D audio channel" may be referred to as a "screen center 3D audio channel" because the "front-of-listener 3D audio channel" is a 3D audio channel based on an audio channel layout arranged around a screen located in front of the listener.
In this disclosure, "listener-omnidirectional 3D audio channels" may refer to 3D audio channels based on an audio channel layout arranged around the listener in all directions. The "listener-omnidirectional 3D audio channel" may be referred to as an "all 3D audio channel". Here, omnidirectional refers to all of the front, side, and rear directions together. In particular, the "listener-omnidirectional 3D audio channel" may also be referred to as a "listener-centered 3D audio channel" because it is a 3D audio channel based on an audio channel layout arranged around the listener in all directions.
In the present disclosure, a "channel group" as a kind of data unit may include an audio signal of at least one channel.
In some embodiments, an audio signal of at least one channel included in the channel group may be compressed. For example, a channel group may comprise at least one of a base channel group independent of another channel group or a slave channel group subordinate to at least one channel group. In this case, the target channel group to which the slave channel group belongs may be another slave channel group, and may be a slave channel group related to the lower channel layout. Alternatively or additionally, the channel group to which the slave channel group is subordinate may be a basic channel group. The "channel group" may be referred to as a "code group" because it includes data for the channel group. The slave channel group for further expanding the number of channels from the channels included in the base channel group may be referred to as a scalable channel group or an expansion channel group.
The audio signal of the "basic channel group" may comprise a single channel audio signal or a stereo channel audio signal. Without being limited thereto, the audio signal of the "basic channel group" may include audio signals of 3D audio channels in front of the listener.
For example, the audio signal of the "slave channel group" may include the audio signals of channels other than those of the "basic channel group", from among the audio signals of the front-of-listener 3D audio channels and the listener-omnidirectional 3D audio channels. In this case, some of the audio signals of these other channels may be audio signals of mixing channels (e.g., audio signals in which the audio signals of at least one channel are mixed).
For example, the audio signal of the "basic channel group" may be a single channel audio signal or a stereo channel audio signal. The "multi-channel audio signal" reconstructed based on the audio signals of the "basic channel group" and the "slave channel group" may be an audio signal of a 3D audio channel in front of the listener or an audio signal of a 3D audio channel in all directions of the listener.
In the present disclosure, "upmixing" may refer to an operation that, through unmixing, increases the number of presentation channels of the output audio signal relative to the number of presentation channels of the input audio signal.
In the present disclosure, "unmixed" may refer to an operation of separating an audio signal of a specific channel from an audio signal in which audio signals of various channels are mixed (e.g., an audio signal of a mixing channel), and may refer to one of mixing operations. In this case, "unmixing" may be implemented as a calculation using an "unmixed matrix" (or a "downmix matrix" corresponding thereto), and the "unmixed" matrix may include at least one "unmixed weight parameter" (or a "downmix weight parameter" corresponding thereto) as a coefficient of the unmixed matrix (or a "downmix matrix" corresponding thereto). Alternatively or additionally, "unmixing" may be implemented as an arithmetic calculation based on a portion of the "unmixed matrix" (or a "downmix matrix" corresponding thereto), and may be implemented in various ways, without being limited thereto. As described above, "unmixing" may be associated with "upmixing".
Here, "mixing" may refer to any operation of generating an audio signal of a new channel (e.g., a mixing channel) by adding values obtained by multiplying each audio signal of a plurality of channels by corresponding weights (e.g., by mixing audio signals of a plurality of channels).
Here, "mixing" can be divided into "mixing" performed by the audio encoding apparatus in a narrow sense and "unmixing" performed by the audio decoding apparatus.
Here, the "mixing" performed in the audio encoding apparatus may be implemented as a calculation using a "(lower) mixing matrix", and the "(lower) mixing matrix" may include at least one "(lower) mixing weight parameter" as a coefficient of the (lower) mixing matrix. Alternatively or additionally, the "(lower) mixing" may be implemented as an arithmetic calculation based on a portion of the "(lower) mixing matrix" and may be implemented in various ways, without being limited thereto.
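A minimal sketch of one mixing/unmixing pair in Python/NumPy, assuming a single (down)mix weight parameter w and that the front channel is available at the decoder; the channel names and the weight value are illustrative, not taken from the disclosure.

```python
import numpy as np

w = 0.707                          # hypothetical downmix weight parameter

# Mixing (encoder side): fold a surround channel into a front channel.
front = np.random.randn(960)       # e.g. an L5 frame (illustrative)
surround = np.random.randn(960)    # e.g. an Ls5 frame (illustrative)
mixed = front + w * surround       # transmitted mixing channel

# Unmixing (decoder side): separate the surround channel again, given the
# decoded front channel and the (un)mixing weight.
recovered = (mixed - front) / w
assert np.allclose(recovered, surround)
```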
In the present disclosure, an "upmix channel group" may refer to a group including at least one upmix channel, and an "upmix channel" may refer to a downmix channel separated by unmixing with respect to an audio signal of an encoding/decoding channel. The narrow "upmix channel group" may include "upmix channels". However, the "upmix channel group" in a broad sense may further include "encoding/decoding channels" and "upmix channels". Here, the "encoding/decoding channel" may refer to an independent channel of an audio signal encoded (compressed) and included in a bitstream, or an independent channel of an audio signal obtained by decoding from a bitstream. In this case, separate mixing and/or unmixing operations are not required in order to obtain the audio signal of the encoding/decoding channel.
The audio signal of the "upmix channel group" in a broad sense may be a multi-channel audio signal, and the output multi-channel audio signal may be one of at least one multi-channel audio signal (e.g., an audio signal of at least one upmix channel group) which is an audio signal output by a device such as a speaker.
In this disclosure, "downmixing" may refer to the operation of: by mixing, the number of rendering channels of the output audio signal is reduced compared to the number of rendering channels of the input audio signal.
In the present disclosure, an "error-canceling factor" (or error removal factor (ERF)) may be a factor for canceling errors of an audio signal that occur due to lossy encoding.
Errors of the audio signal that occur due to lossy coding may include, for example, errors caused by encoding (quantization) based on psychoacoustic features. The "error-canceling factor" may also be referred to as a "coding error removal (CER) factor", an "error cancellation rate", or the like. In particular, since the error cancellation operation essentially corresponds to a scaling operation, the "error-canceling factor" may be referred to as a "scale factor".
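Collecting the generation rules stated in the disclosure of invention above, the factor can be summarized compactly. A hedged summary in standard notation, where $P_{\mathrm{orig}}$, $P_{\mathrm{dec}}$, and $P_{\mathrm{dmx}}$ denote the per-frame powers of the original channel signal, its decoded version, and the original downmix signal, and $\delta$ and $\tau$ stand for the otherwise unspecified "first value" and "second value" thresholds:

$$
\mathrm{ERF} =
\begin{cases}
0, & P_{\mathrm{orig}} \le \delta \\
1, & P_{\mathrm{orig}} / P_{\mathrm{dmx}} \ge \tau \\
\min\!\bigl(1,\; P_{\mathrm{orig}} / P_{\mathrm{dec}}\bigr), & \text{otherwise.}
\end{cases}
$$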
Hereinafter, embodiments of the present disclosure according to the technical spirit of the present disclosure are described in detail.
Fig. 1a is a diagram for describing a scalable channel layout structure according to various embodiments of the present disclosure.
A conventional 3D audio decoding apparatus receives compressed audio signals of independent channels of a specific channel layout from a bitstream, and reconstructs an audio signal of a listener-omnidirectional 3D audio channel using those compressed audio signals. In this case, only the audio signal of that specific channel layout can be reconstructed.
Alternatively or additionally, a conventional 3D audio decoding apparatus receives compressed audio signals of independent channels (e.g., a first independent channel group) of a specific channel layout from a bitstream. For example, the specific channel layout may be a 5.1 channel layout, in which case the compressed audio signals of the first independent channel group may be compressed audio signals of five surround channels and one subwoofer channel.
Here, to increase the number of channels, the conventional 3D audio decoding apparatus also receives compressed audio signals of other channels (a second independent channel group) independent of the first independent channel group. For example, the compressed audio signals of the second independent channel group may be compressed audio signals of two height channels.
That is, the conventional 3D audio decoding apparatus reconstructs an audio signal of a listener-omnidirectional 3D audio channel using the compressed audio signals of the second independent channel group received from the bitstream, independently of the compressed audio signals of the first independent channel group received from the bitstream. Thus, an audio signal with an increased number of channels is reconstructed. Here, the audio signal of the listener-omnidirectional 3D audio channel may be an audio signal of 5.1.2 channels.
On the other hand, a conventional audio decoding apparatus supporting reproduction of only audio signals of a stereo channel cannot properly process compressed audio signals included in a bitstream.
A conventional 3D audio decoding apparatus supporting 3D audio signal reproduction first decompresses (e.g., decodes) the compressed audio signals of the first and second independent channel groups, and then upmixes the decompressed audio signals. Consequently, even merely to reproduce an audio signal of stereo channels, the upmixing operation must be performed.
Accordingly, there is a need for a scalable channel layout structure whose compressed audio signals can also be processed by a conventional audio decoding apparatus. Alternatively or additionally, according to various embodiments of the present disclosure, the audio decoding apparatuses 300 and 500 of figs. 3a and 5a, respectively, which support reproduction of 3D audio signals, need a scalable channel layout structure whose compressed audio signals can be processed according to whichever 3D audio channel layout they support for reproduction. Here, a scalable channel layout structure refers to a layout structure in which the number of channels can be freely increased from a basic channel layout.
According to various embodiments of the present disclosure, the audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a scalable channel layout structure from a bitstream. With the scalable channel layout structure according to various embodiments of the present disclosure, the number of channels may be increased from the stereo channel layout 100 to the listener-front 3D audio channel layout 110. Furthermore, with the scalable channel layout structure, the number of channels may be increased from the listener-front 3D audio channel layout 110 to the 3D audio channel layout 120 located in all directions around the listener (the listener omnidirectional 3D audio channel layout 120). For example, the listener-front 3D audio channel layout 110 may be the 3.1.2 channel layout, and the listener omnidirectional 3D audio channel layout 120 may be the 5.1.2 or 7.1.2 channel layout. However, the scalable channel layouts that may be implemented in the present disclosure are not limited thereto.
As a basic channel group, an audio signal of a conventional stereo channel may be compressed. The conventional audio decoding apparatus may decompress the compressed audio signal of the basic channel group from the bitstream, thereby smoothly reproducing the audio signal of the conventional stereo channel.
Alternatively or additionally, as a slave channel group, the audio signals of the channels of the multi-channel audio signal other than the audio signals of the conventional stereo channels may be compressed.
However, when the number of channels is increased, some of the audio signals of a channel group may be signals in which the audio signals of independent channels of a specific channel layout have been mixed.
Accordingly, in the audio decoding apparatuses 300 and 500, a portion of the audio signal of the basic channel group and a portion of the audio signal of the dependent channel group may be unmixed to generate an audio signal of an upmix channel included in a specific channel layout.
In some embodiments, there may be one or more slave channel groups. For example, the audio signals of channels other than the audio signals of the stereo channels in the audio signals of the 3D audio channel layout 110 in front of the listener may be compressed into the audio signals of the first group of slave channels.
Among the audio signals of the listener-omni-directional 3D audio channel layout 120, the audio signals of channels other than the audio signals of the channels reconstructed from the basic channel group and the first slave channel group may be compressed into the audio signals of the second slave channel group.
The audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure may support reproduction of audio signals of the audience-omnidirectional 3D audio channel layout 120.
Accordingly, the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure may reconstruct the audio signal of the listener omnidirectional 3D audio channel layout 120 based on the audio signals of the basic channel group and the audio signals of the first and second slave channel groups.
A conventional audio signal processing apparatus may ignore the compressed audio signals of the slave channel groups, which it cannot reconstruct, and reproduce only the stereo-channel audio signal reconstructed from the bitstream.
Similarly, the audio decoding apparatuses 300 and 500 may process the compressed audio signals of the basic channel group and the slave channel groups to reconstruct an audio signal of a channel layout they support. The audio decoding apparatuses 300 and 500 do not reconstruct the compressed audio signals of unsupported upper channel layouts from the bitstream. Thus, an audio signal of a supported channel layout can be reconstructed from the bitstream while the compressed audio signals related to upper channel layouts not supported by the audio decoding apparatuses 300 and 500 are ignored.
In particular, conventional audio encoding and decoding apparatuses compress and decompress audio signals of independent channels of one specific channel layout; therefore, only compression and decompression of audio signals of a limited set of channel layouts are possible.
However, according to various embodiments of the present disclosure, transmission and reconstruction of audio signals of stereo channels is possible by the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 of fig. 2a and 4a, respectively, which support a scalable channel layout. According to various embodiments of the present disclosure, transmission and reconstruction of audio signals of a 3D channel layout in front of a listener is possible using the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500. Further, with the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure, an audio signal of an audience-omnidirectional 3D channel layout can be transmitted and reconstructed.
That is, according to various embodiments of the present disclosure, the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 may transmit and reconstruct an audio signal according to the stereo channel layout. Further, according to various embodiments of the present disclosure, the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 may freely convert an audio signal of a current channel layout into an audio signal of another channel layout. Conversion between channel layouts is possible by mixing/unmixing the audio signals of the channels included in the different channel layouts. The audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure may thus support conversion between various channel layouts, and may therefore transmit and reproduce audio signals of various 3D channel layouts. That is, a fixed channel dependency is not required between the listener-front channel layout and the omnidirectional channel layout, or between the stereo channel layout and the listener-front channel layout; free conversion through mixing/unmixing of the audio signals is possible.
According to various embodiments of the present disclosure, the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 support processing of audio signals of a front channel layout of a listener and thus transmit and reconstruct audio signals corresponding to speakers disposed around a screen, thereby improving the immersion of the listener.
Detailed operations of the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure are described with reference to fig. 2a to 5 b.
Fig. 1b is a diagram for describing an example of a detailed scalable audio channel layout structure according to various embodiments of the present disclosure.
Referring to fig. 1b, in order to transmit an audio signal of the stereo channel layout 160, the audio encoding apparatuses 200 and 400 may generate the compressed audio signals (A/B signals) of the basic channel group by compressing the L2/R2 signals.
In addition, in order to transmit an audio signal of the 3.1.2 channel layout 170, which is one of the listener-front 3D audio channel layouts, the audio encoding apparatuses 200 and 400 may generate the compressed audio signals of the slave channel group by compressing the C, LFE, Hfl3, and Hfr3 signals. The audio decoding apparatuses 300 and 500 may reconstruct the L2/R2 signals by decompressing the compressed audio signals of the basic channel group, and may reconstruct the C, LFE, Hfl3, and Hfr3 signals by decompressing the compressed audio signals of the slave channel group.
The audio decoding apparatuses 300 and 500 may reconstruct the L3 signal of the 3.1.2 channel layout 170 by unmixing the L2 signal and the C signal (operation 1 of fig. 1 b). The audio decoding apparatuses 300 and 500 may reconstruct the R3 signal of the 3.1.2 channel layout 170 by unmixing the R2 signal and the C signal (operation 2).
Accordingly, the audio decoding apparatuses 300 and 500 may output the L3, R3, C, LFE, Hfl3, and Hfr3 signals as the audio signals of the 3.1.2 channel layout 170.
In some embodiments, to transmit an audio signal of the 5.1.2 channel layout 180, which is one of the listener omnidirectional 3D audio channel layouts, the audio encoding apparatuses 200 and 400 may further compress the L5 and R5 signals to generate the compressed audio signals of the second slave channel group.
As described above, the audio decoding apparatuses 300 and 500 may reconstruct the L2/R2 signals by decompressing the compressed audio signals of the basic channel group, and may reconstruct the C, LFE, Hfl3, and Hfr3 signals by decompressing the compressed audio signals of the first slave channel group. Alternatively or additionally, the audio decoding apparatuses 300 and 500 may reconstruct the L5 and R5 signals by decompressing the compressed audio signals of the second slave channel group. Further, as described above, the audio decoding apparatuses 300 and 500 may reconstruct the L3 and R3 signals by unmixing some of the decompressed audio signals.
Alternatively or additionally, the audio decoding apparatuses 300 and 500 may reconstruct the Ls5 signal by unmixing the L3 and L5 signals (operation 3). The audio decoding apparatuses 300 and 500 may reconstruct the Rs5 signal by unmixing the R3 and R5 signals (operation 4).
The audio decoding apparatuses 300 and 500 may reconstruct the Hl5 signal by unmixing the Hfl3 and Ls5 signals (operation 5).
The audio decoding apparatuses 300 and 500 may reconstruct the Hr5 signal by unmixing the Hfr3 and Rs5 signals (operation 6). Here, Hfr3 and Hr5 are front-right channels among the height channels.
Accordingly, the audio decoding apparatuses 300 and 500 may output the Hl5, Hr5, LFE, L5, R5, C, Ls5, and Rs5 signals as the audio signals of the 5.1.2 channel layout 180.
In some embodiments, to transmit an audio signal of the 7.1.4 channel layout 190, the audio encoding apparatuses 200 and 400 may further compress the Hfl, Hfr, Ls, and Rs signals as the audio signals of the third slave channel group.
As described above, the audio decoding apparatuses 300 and 500 may decompress the compressed audio signals of the basic channel group, the first slave channel group, and the second slave channel group, and may reconstruct the Hl5, Hr5, LFE, L5, R5, C, Ls5, and Rs5 signals by unmixing (operations 1 to 6).
Alternatively or additionally, the audio decoding apparatuses 300 and 500 may reconstruct the Hfl, Hfr, Ls, and Rs signals by decompressing the compressed audio signals of the third slave channel group. The audio decoding apparatuses 300 and 500 may reconstruct the Lb signal of the 7.1.4 channel layout 190 by unmixing the Ls5 signal and the Ls signal (operation 7).
The audio decoding apparatuses 300 and 500 may reconstruct the Rb signal of the 7.1.4 channel layout 190 by unmixing the Rs5 signal and the Rs signal (operation 8).
The audio decoding apparatuses 300 and 500 may reconstruct the Hbl signal of the 7.1.4 channel layout 190 by unmixing the Hfl signal and the Hl5 signal (operation 9).
The audio decoding apparatuses 300 and 500 may reconstruct the Hbr signal of the 7.1.4 channel layout 190 by unmixing the Hfr signal and the Hr5 signal (operation 10).
Accordingly, the audio decoding apparatuses 300 and 500 may output the Hfl, Hfr, LFE, C, L, R, Ls, Rs, Lb, Rb, Hbl, and Hbr signals as the audio signals of the 7.1.4 channel layout 190.
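For illustration only, operations 1 to 10 above can be sketched as follows. The additive downmix model (mixed = base + W * other) and the single weight W are assumptions; the actual (down)mixing weight parameters are signaled as additional information in the bitstream.

```python
W = 0.707  # assumed downmix weight; real weights come from bitstream metadata

def reconstruct_7_1_4(L2, R2, C, LFE, Hfl3, Hfr3, L5, R5, Ls, Rs, Hfl, Hfr):
    # 3.1.2 layout (operations 1-2), assuming L2 = L3 + W * C
    L3 = L2 - W * C
    R3 = R2 - W * C
    # 5.1.2 layout (operations 3-6), assuming L3 = L5 + W * Ls5
    # and Hfl3 = Hl5 + W * Ls5
    Ls5 = (L3 - L5) / W
    Rs5 = (R3 - R5) / W
    Hl5 = Hfl3 - W * Ls5
    Hr5 = Hfr3 - W * Rs5
    # 7.1.4 layout (operations 7-10), assuming Ls5 = Ls + W * Lb
    # and Hl5 = Hfl + W * Hbl
    Lb = (Ls5 - Ls) / W
    Rb = (Rs5 - Rs) / W
    Hbl = (Hl5 - Hfl) / W
    Hbr = (Hr5 - Hfr) / W
    return {"L": L5, "R": R5, "C": C, "LFE": LFE, "Ls": Ls, "Rs": Rs,
            "Lb": Lb, "Rb": Rb, "Hfl": Hfl, "Hfr": Hfr,
            "Hbl": Hbl, "Hbr": Hbr}
```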
Accordingly, by supporting a scalable channel layout in which the number of channels is increased through unmixing operations, the audio decoding apparatuses 300 and 500 may reconstruct not only the audio signal of the conventional stereo channel layout but also the audio signals of the listener-front 3D audio channels and the listener omnidirectional 3D audio channels.
The scalable channel layout structure described in detail above with reference to fig. 1b is only one example, and the channel layout structure may be implemented scalably to include a variety of channel layouts.
Fig. 2a is a block diagram of an audio encoding apparatus according to various embodiments of the present disclosure.
The audio encoding apparatus 200 may include a memory 210 and a processor 230. The audio encoding apparatus 200 may be implemented as an apparatus capable of performing audio processing, such as a server, a Television (TV), a camera, a cellular phone, a tablet Personal Computer (PC), a laptop computer, or the like.
Although the memory 210 and the processor 230 are separately shown in fig. 2a, the memory 210 and the processor 230 may be implemented by one hardware module (e.g., chip).
Processor 230 may be implemented as a dedicated processor for neural-network-based audio processing. Alternatively or additionally, the processor 230 may be implemented by a combination of software and a general-purpose processor, such as an Application Processor (AP), a Central Processing Unit (CPU), or a Graphics Processing Unit (GPU). The dedicated processor may include memory for implementing the various embodiments of the present disclosure, or may include a memory processing unit for using an external memory.
Processor 230 may include multiple processors. In this case, the processor 230 may be implemented as a combination of dedicated processors, or as a combination of software and a plurality of general-purpose processors (e.g., APs, CPUs, or GPUs).
Memory 210 may store one or more instructions for audio processing. In various embodiments of the present disclosure, the memory 210 may store a neural network. When the neural network is implemented in the form of a dedicated hardware chip for artificial intelligence or as part of an existing general-purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU), the neural network may not be stored in the memory 210. The neural network may be implemented by an external device (e.g., a server), in which case the audio encoding apparatus 200 may request and receive result information based on the neural network from the external device.
Processor 230 may sequentially process successive frames according to instructions stored in memory 210 and obtain successive encoded (compressed) frames. Consecutive frames may refer to frames constituting audio.
The processor 230 may perform an audio processing operation with the original audio signal as an input and output a bitstream including the compressed audio signal. In this case, the original audio signal may be a multi-channel audio signal. The compressed audio signal may be a multi-channel audio signal having a channel number less than or equal to the channel number of the original audio signal.
In this case, the bitstream may include a basic channel group, and may further include n slave channel groups (where n is an integer greater than or equal to 1). Therefore, the number of channels can be freely increased according to the number of slave channel groups.
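A minimal sketch of such a scalable bitstream organization is shown below; the container type and field names are hypothetical and do not reflect the actual bitstream syntax.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChannelGroup:
    name: str         # e.g., "base", "slave #1"
    payload: bytes    # compressed audio of this group

@dataclass
class ScalableBitstream:
    base: ChannelGroup                                        # always decodable alone
    slaves: List[ChannelGroup] = field(default_factory=list)  # n >= 1 groups

    def groups_for_layout(self, k: int) -> List[ChannelGroup]:
        """Groups needed for the k-th layout: base plus the first k slaves."""
        return [self.base] + self.slaves[:k]
```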
Fig. 2b is a block diagram of an audio encoding apparatus according to various embodiments of the present disclosure.
Referring to fig. 2b, the audio encoding apparatus 200 may include a multi-channel audio encoder 250, a bitstream generator 280, and an additional information generator 285. The multi-channel audio encoder 250 may include a multi-channel audio signal processor 260 and a compressor 270.
Referring back to fig. 2a, as described above, the audio encoding apparatus 200 may include the memory 210 and the processor 230, and instructions for implementing the components 250, 260, 270, 280, and 285 of fig. 2b may be stored in the memory 210 of fig. 2 a. Processor 230 may execute instructions stored in memory 210.
The multi-channel audio signal processor 260 may obtain at least one audio signal of a basic channel group and at least one audio signal of at least one slave channel group from an original audio signal. For example, when the original audio signal is an audio signal of a 7.1.4-channel layout, the multi-channel audio signal processor 260 may obtain an audio signal of 2 channels (stereo channels) as an audio signal of a basic channel group among the audio signals of the 7.1.4-channel layout.
The multi-channel audio signal processor 260 may obtain audio signals of channels other than the 2-channel audio signal from the audio signal of the 3.1.2-channel layout as the audio signals of the first slave channel group to reconstruct the audio signal of the 3.1.2-channel layout, which is one of the 3D audio channels in front of the listener. In this case, the audio signals of some channels of the first slave channel group may be unmixed to generate an audio signal of the unmixed channel.
The multi-channel audio signal processor 260 may obtain, from the audio signal of the 5.1.2 channel layout, the audio signals of the channels other than the audio signals of the basic channel group and the first slave channel group, as the audio signals of the second slave channel group, in order to reconstruct the audio signal of the 5.1.2 channel layout, which is one of the listener omnidirectional 3D audio channel layouts. In this case, the audio signals of some channels of the second slave channel group may be unmixed to generate audio signals of unmixed channels.
The multi-channel audio signal processor 260 may obtain, from the audio signals of the 7.1.4-channel layout, the audio signals of channels other than the audio signals of the first and second slave channel groups as the audio signals of the third slave channel group to reconstruct the audio signals of the 7.1.4-channel layout, which is one of the listener's omni-directional 3D audio channels. Likewise, the audio signals of some channels of the third slave channel group may be unmixed to obtain the audio signals of the unmixed channels.
Detailed operation of the multi-channel audio signal processor 260 is described with reference to fig. 2 c.
The compressor 270 may compress the audio signal of the basic channel group and the audio signal of the dependent channel group. That is, the compressor 270 may compress at least one audio signal of the basic channel group to obtain at least one compressed audio signal of the basic channel group. Here, compression may refer to compression based on various audio codecs. For example, compression may include transform and quantization processes.
Here, the audio signal of the basic channel group may be a single-channel (mono) signal or a stereo signal. Alternatively or additionally, the audio signals of the basic channel group may include an audio signal of a first channel generated by mixing the audio signal L of the left stereo channel with C_1. Here, C_1 may be the audio signal of the listener-front center channel that has been decompressed after compression. In a signal name of the form "X_Y", "X" represents the name of the channel and "Y" represents the processing applied to it: decoding, upmixing, application of a factor of error cancellation (e.g., scaling), or application of an LFE gain. For example, a decoded signal may be denoted as "X_1", and a signal generated by upmixing the decoded signal (an upmixed signal) may be denoted as "X_2". A signal in which an LFE gain is applied to the decoded LFE signal may also be denoted as "X_2". A signal in which the factor of error cancellation is applied to the upmixed signal (e.g., a scaled signal) may be denoted as "X_3".
The audio signal of the basic channel group may include an audio signal of a second channel generated by mixing the audio signal R of the right stereo channel with C_1.
The compressor 270 may obtain at least one compressed audio signal of the at least one slave channel group by compressing the at least one audio signal of the at least one slave channel group.
The additional information generator 285 may generate additional information based on at least one of the original audio signal, the compressed audio signal of the basic channel group, or the compressed audio signal of the slave channel group. In this case, the additional information may be information related to the multi-channel audio signal and include various information for reconstructing the multi-channel audio signal.
For example, the additional information may include an audio object signal of a listener-front 3D audio channel, which indicates at least one of an audio signal, a position, a shape, an area, or a direction of an audio object (e.g., a sound source). Alternatively or additionally, the additional information may include information about the total number of audio streams, including the base channel audio stream and the slave channel audio streams. The additional information may include downmix gain information. The additional information may include channel map information. The additional information may include volume information. The additional information may include LFE gain information. The additional information may include Dynamic Range Control (DRC) information. The additional information may include channel layout rendering information. The additional information may also include information indicating the number of coupled audio streams, information indicating the multi-channel layout, information about whether a dialogue exists in the audio signal and its dialogue level, information indicating whether the LFE is output, information about whether an audio object exists on the screen, information about the presence or absence of audio signals of continuous audio channels (or scene-based audio signals or surround sound audio signals), and information about the presence or absence of audio signals of discrete audio channels (or object-based audio signals or spatial multi-channel audio signals). The additional information may include information about the unmixing, including at least one (down)mixing weight parameter of a (down)mixing matrix used for reconstructing the multi-channel audio signal. Since the unmixing and the (down)mixing correspond to each other, the information about the unmixing may correspond to the information about the (down)mixing, and/or the information about the unmixing may include the information about the (down)mixing. For example, the information about the unmixing may include at least one (down)mixing weight parameter of the (down)mixing matrix, and the unmixing weight parameters may be obtained based on the (down)mixing weight parameters.
The additional information may be various combinations of the above. In other words, the additional information may include at least one of the foregoing information.
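As a sketch only, such additional information could be modeled as a record like the following; every field name here is illustrative and does not reflect the actual syntax elements.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AdditionalInfo:
    num_audio_streams: int                     # base + slave channel audio streams
    num_coupled_streams: int                   # coupled (channel-pair) streams
    channel_map: List[int] = field(default_factory=list)
    downmix_weights: List[float] = field(default_factory=list)  # (down)mixing matrix
    volume_lkfs: Optional[float] = None        # volume (loudness) information
    lfe_gain: Optional[float] = None
    drc_info: Optional[bytes] = None
    objects_on_screen: bool = False

def unmix_weights(downmix_weights: List[float]) -> List[float]:
    """Illustrative derivation of unmixing weights from (down)mixing weights,
    assuming a simple invertible additive mixing model."""
    return [1.0 / w if w else 0.0 for w in downmix_weights]
```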
For example, when there is an audio signal of a slave channel corresponding to at least one audio signal of the basic channel group, the additional information generator 285 may generate slave channel audio signal identification information indicating that the audio signal of the slave channel exists.
The bitstream generator 280 may generate a bitstream including the compressed audio signal of the basic channel group and the compressed audio signal of the dependent channel group. The bit stream generator 280 may generate a bit stream further including the additional information generated by the additional information generator 285.
For example, the bitstream generator 280 may generate a base channel audio stream and a slave channel audio stream. The base channel audio stream may comprise compressed audio signals of a base channel group and the slave channel audio stream may comprise compressed audio signals of a slave channel group.
The bitstream generator 280 may generate a bitstream including a base channel audio stream and a plurality of slave channel audio streams. The plurality of slave channel audio streams may comprise n slave channel audio streams (where n is an integer greater than 1). In this case, the basic channel audio stream may include a single channel audio signal or a stereo channel compressed audio signal.
For example, among the channels of a first multi-channel layout reconstructed from the base channel audio stream and the first slave channel audio stream, the number of surround channels may be S_{n-1}, the number of subwoofer channels may be W_{n-1}, and the number of overhead channels may be H_{n-1}. In a second multi-channel layout reconstructed from the base channel audio stream, the first slave channel audio stream, and the second slave channel audio stream, the number of surround channels may be S_n, the number of subwoofer channels may be W_n, and the number of overhead channels may be H_n.
In this case, S_{n-1} may be less than or equal to S_n, W_{n-1} may be less than or equal to W_n, and H_{n-1} may be less than or equal to H_n. However, the case where S_{n-1} equals S_n, W_{n-1} equals W_n, and H_{n-1} equals H_n all at the same time is excluded; that is, S_{n-1}, W_{n-1}, and H_{n-1} cannot all be equal to S_n, W_n, and H_n, respectively.
That is, the number of surround channels of the second multi-channel arrangement needs to be greater than the number of surround channels of the first multi-channel arrangement. Alternatively or additionally, the number of subwoofer channels of the second multi-channel arrangement needs to be greater than the number of subwoofer channels of the first multi-channel arrangement. Alternatively or additionally, the number of overhead channels of the second multi-channel arrangement needs to be greater than the number of overhead channels of the first multi-channel arrangement.
Further, the number of surround channels of the second multi-channel arrangement may be not smaller than the number of surround channels of the first multi-channel arrangement. Also, the number of subwoofer channels of the second multi-channel layout may be not less than the number of subwoofer channels of the first multi-channel layout. The number of overhead channels of the second multi-channel arrangement may be no less than the number of overhead channels of the first multi-channel arrangement.
Alternatively or additionally, there is no case in which the number of surround channels of the second multi-channel layout is equal to the number of surround channels of the first multi-channel layout, the number of subwoofer channels of the second multi-channel layout is equal to the number of subwoofer channels of the first multi-channel layout, and the number of overhead channels of the second multi-channel layout is equal to the number of overhead channels of the first multi-channel layout, all at the same time. That is, the second multi-channel layout is never identical to the first multi-channel layout.
Specifically, for example, when the first multi-channel layout is a 5.1.2-channel layout, the second multi-channel layout may be a 7.1.4-channel layout.
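Restated as a small predicate (a sketch only, using the S/W/H channel counts defined above):

```python
def is_valid_upper_layout(s1, w1, h1, s2, w2, h2) -> bool:
    """True if (s2, w2, h2) is a valid upper layout for (s1, w1, h1):
    no count decreases, and the two layouts are not identical."""
    non_decreasing = s1 <= s2 and w1 <= w2 and h1 <= h2
    not_identical = (s1, w1, h1) != (s2, w2, h2)
    return non_decreasing and not_identical

assert is_valid_upper_layout(5, 1, 2, 7, 1, 4)       # 5.1.2 -> 7.1.4
assert not is_valid_upper_layout(5, 1, 2, 5, 1, 2)   # identical layouts
assert not is_valid_upper_layout(7, 1, 4, 5, 1, 2)   # counts decrease
```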
Alternatively or additionally, the bitstream generator 280 may generate metadata including additional information.
Thus, the bitstream generator 280 may generate a bitstream including the base channel audio stream, the slave channel audio stream, and the metadata.
The bit stream generator 280 may generate a bit stream in a form in which the number of channels may freely increase from the basic channel group.
That is, an audio signal of a basic channel group may be reconstructed from a basic channel audio stream, and a multi-channel audio signal in which the number of channels increases from the basic channel group may be reconstructed from the basic channel audio stream and a slave channel audio stream.
In some embodiments, the bitstream generator 280 may generate a file stream having a plurality of audio tracks. The bitstream generator 280 may generate an audio stream of a first audio track including at least one compressed audio signal of the basic channel group. The bitstream generator 280 may generate an audio stream of a second audio track including the slave channel audio signal identification information. In this case, the second audio track follows the first audio track and may be adjacent to it.
In other embodiments, when there is a slave channel audio signal corresponding to at least one audio signal of the base channel group, the bitstream generator 280 may generate an audio stream of the second audio track including at least one compressed audio signal of the at least one slave channel group.
In other embodiments, when there is no slave channel audio signal corresponding to the at least one audio signal of the basic channel group, the bitstream generator 280 may generate an audio stream of the second audio track including the next audio signal of the basic channel group, following the audio signal of the basic channel group carried in the first audio track.
Fig. 2c is a block diagram of the structure of a multi-channel audio signal processor 260 of the audio encoding apparatus 200 according to various embodiments of the present disclosure.
Referring to fig. 2c, the multi-channel audio signal processor 260 may include a channel layout identifier 261, a downmix channel audio generator 262, and an audio signal classifier 266.
The channel layout identifier 261 may identify at least one channel layout from the original audio signal. In this case, the at least one channel layout may include a plurality of hierarchical channel layouts. The channel layout identifier 261 may identify the channel layout of the original audio signal, and may also identify channel layouts lower than the channel layout of the original audio signal. For example, when the original audio signal is an audio signal of the 7.1.4 channel layout, the channel layout identifier 261 may identify the 7.1.4 channel layout and also identify the 5.1.2 channel layout, the 3.1.2 channel layout, the 2-channel layout, and so on, which are lower than the 7.1.4 channel layout. An upper channel layout may refer to a layout in which the number of at least one of the surround channels, subwoofer channels, or height channels is greater than in the lower channel layout. Whether a layout is upper or lower may be determined first by the number of surround channels; for the same number of surround channels, by the number of subwoofer channels; and for the same number of surround and subwoofer channels, by the number of overhead channels.
Alternatively or additionally, the identified channel layout may include a target channel layout. The target channel layout may refer to the highest channel layout of the audio signal included in the final output bitstream. The target channel layout may be a channel layout of the original audio signal or a channel layout lower than the channel layout of the original audio signal.
For example, the channel layouts identified from the original audio signal may be hierarchically determined from the channel layout of the original audio signal. In this case, the channel layout identifier 261 may identify at least one channel layout from among predetermined channel layouts. For example, the channel layout identifier 261 may identify, from the layout of the original audio signal, some of the predetermined channel layouts: the 7.1.4 channel layout, the 5.1.4 channel layout, the 5.1.2 channel layout, the 3.1.2 channel layout, and the 2-channel layout.
Based on the identified channel layouts, the channel layout identifier 261 may transmit a control signal to the downmix channel audio generators, among the first through (n-1)-th downmix channel audio generators 263, 264, ..., 265, that correspond to the identified at least one channel layout; the downmix channel audio generator 262 may then generate downmix channel audio from the original audio signal based on the at least one channel layout identified by the channel layout identifier 261. The downmix channel audio generator 262 may generate the downmix channel audio from the original audio signal using a downmix matrix including at least one downmix weight parameter.
For example, when the channel layout of the original audio signal is the n-th channel layout in ascending order among the predetermined channel layouts, the downmix channel audio generator 262 may generate, from the original audio signal, the downmix channel audio of the (n-1)-th channel layout, which is directly lower than the channel layout of the original audio signal. By repeating this process, the downmix channel audio generator 262 may generate downmix channel audio of channel layouts successively lower than the current channel layout.
For example, the downmix channel audio generator 262 may include the first downmix channel audio generator 263 and the second downmix channel audio generator 264, up through an (n-1)-th downmix channel audio generator (not shown).
In this case, the (n-1) -th downmix channel audio generator (not shown) may generate an audio signal of the (n-1) -th channel layout from the original audio signal. Alternatively or additionally, an (n-2) -th downmix channel audio generator (not shown) may generate an audio signal of an (n-2) -th channel layout from the original audio signal. In this way, the first downmix channel audio generator 263 can generate an audio signal of the first channel layout from the original audio signal. In this case, the audio signal of the first channel arrangement may be an audio signal of a basic channel group.
In some embodiments, the downmix channel audio generators 263 and 264 through 265 may be connected in a cascade manner. That is, the downmix channel audio generators 263 and 264 through 265 may be connected such that the output of an upper downmix channel audio generator becomes the input of the next lower downmix channel audio generator. For example, with the original audio signal as an input, the (n-1)-th downmix channel audio generator (not shown) may output the audio signal of the (n-1)-th channel layout; the audio signal of the (n-1)-th channel layout may then be input to the (n-2)-th downmix channel audio generator (not shown), which generates the audio signal of the (n-2)-th channel layout. In this way, the downmix channel audio generators 263 and 264 through 265 may be connected so as to output the audio signal of each channel layout.
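For illustration, the cascade can be sketched as below; the generator functions are placeholders for the per-layout downmixing rules, which are not reproduced here.

```python
from typing import Callable, Dict, List, Sequence

AudioByChannel = Dict[str, list]                  # channel name -> samples
DownmixFn = Callable[[AudioByChannel], AudioByChannel]

def cascade_downmix(original: AudioByChannel,
                    generators: Sequence[DownmixFn]) -> List[AudioByChannel]:
    """Run the (n-1)-th down to the first downmix channel audio generator in
    cascade: each generator's output feeds the next lower generator."""
    layouts: List[AudioByChannel] = []
    signal = original
    for generate in generators:       # ordered from upper layout to lower layout
        signal = generate(signal)
        layouts.append(signal)
    return layouts                    # audio signals of every lower layout
```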
Based on the audio signals of the at least one channel layout, the audio signal classifier 266 may obtain the audio signals of the basic channel group and the audio signals of the slave channel groups. In this case, the audio signal classifier 266 may mix the audio signals of at least one channel included in the audio signals of the at least one channel layout through the mixing unit 267. The audio signal classifier 266 may classify the mixed audio signals as at least one of the audio signals of the basic channel group or the audio signals of a slave channel group.
Fig. 2d is a view for describing an example of detailed operation of an audio signal classifier according to various embodiments of the present disclosure.
Referring to fig. 2d, the downmix channel audio generator 262 of fig. 2c may obtain, from the original audio signal of the 7.1.4 channel layout 290, the audio signal of the 5.1.2 channel layout 291, the audio signal of the 3.1.2 channel layout 292, the audio signal of the 2-channel layout 293, and the audio signal of the single channel layout 294, which are the audio signals of the lower channel layouts. The downmix channel audio generators 263, 264, ..., 265 within the downmix channel audio generator 262 are connected in a cascade manner, so that the audio signals may be obtained sequentially from the current channel layout down to the lower channel layouts.
The audio signal classifier 266 of fig. 2c may classify the audio signal of the single channel layout 294 as the audio signal of the basic channel group.
The audio signal classifier 266 may classify the audio signal of the L2 channel, which is part of the audio signal of the 2-channel layout 293, as the audio signal of the slave channel group #1 296. In some embodiments, the audio signal of the L2 channel and the audio signal of the R2 channel are mixed to generate the audio signal of the single channel layout 294, so that, in turn, the audio decoding apparatuses 300 and 500 may unmix the audio signal of the single channel layout 294 and the audio signal of the L2 channel to reconstruct the audio signal of the R2 channel. Therefore, the audio signal of the R2 channel is not classified into any channel group.
The audio signal classifier 266 may classify the audio signals of the Hfl3, C, LFE, and Hfr3 channels among the audio signals of the 3.1.2 channel layout 292 as the audio signals of the slave channel group #2 297. Since the audio signal of the L2 channel is generated by mixing the audio signal of the L3 channel and the audio signal of the Hfl3 channel, the audio decoding apparatuses 300 and 500 can reconstruct the audio signal of the L3 channel by unmixing the audio signal of the L2 channel of the slave channel group #1 296 and the audio signal of the Hfl3 channel of the slave channel group #2 297.
Therefore, the audio signal of the L3 channel among the audio signals of the 3.1.2 channel layout 292 is not classified into any channel group.
For the same reason, the audio signal of the R3 channel is not classified into any channel group.
The audio signal classifier 266 may classify the audio signal of the L channel and the audio signal of the R channel, which are audio signals of some channels of the 5.1.2 channel layout 291, as the audio signals of the slave channel group #3 298, in order to transmit the audio signals of the 5.1.2 channel layout 291. In some embodiments, the audio signals of the Ls5, Hl5, Rs5, and Hr5 channels are also audio signals of the 5.1.2 channel layout 291, but are not classified into a separate slave channel group. This is because the signals of the Ls5, Hl5, Rs5, and Hr5 channels are not listener-front channel audio signals, but signals in which audio signals of channels in front of, beside, and behind the listener among the audio signals of the 7.1.4 channel layout 290 are mixed. By compressing the listener-front channel audio signals of the original audio signal, instead of classifying such mixed signals into a slave channel group and compressing them, the sound quality of the listener-front audio channels can be improved. Accordingly, the listener can perceive improved sound quality in the reproduced audio signal.
However, ls5 or Hl5 instead of L may be classified as the audio signal of the slave channel group #3 298, and Rs5 or Hr5 instead of R may be classified as the audio signal of the slave channel group #3 298, as the case may be.
The audio signal classifier 266 may classify the audio signals of the Ls, Hfl, Rs, and Hfr channels among the audio signals of the 7.1.4 channel layout 290 as the audio signals of the slave channel group #4 299. In this case, Lb, Hbl, Rb, and Hbr are not classified into the slave channel group #4 299 in place of Ls, Hfl, Rs, and Hfr, respectively. By compressing the audio signals of the side channels close to the front of the listener, instead of classifying the audio signals of the channels behind the listener among the audio signals of the 7.1.4 channel layout 290 into the channel group and compressing them, the sound quality of the side audio channels close to the front of the listener can be improved. Accordingly, the listener can perceive improved sound quality in the reproduced audio signal. However, depending on circumstances, Lb, Hbl, Rb, and Hbr may be classified into the slave channel group #4 299 instead of Ls, Hfl, Rs, and Hfr.
Thus, the downmix channel audio generator 262 of fig. 2c may generate a plurality of audio signals of lower layouts (downmix channel audio) based on the plurality of lower layouts identified from the layout of the original audio signal. The audio signal classifier 266 of fig. 2c may classify the audio signals into the basic channel group and the slave channel groups #1, #2, #3, and #4. That is, for each channel layout, only some of the audio signals of the individual channels are classified as the audio signals of a channel group. The audio decoding apparatuses 300 and 500 may reconstruct the audio signals that are not classified by the audio signal classifier 266 through unmixing. In some embodiments, when the audio signal of a left channel with respect to the listener is classified into a specific channel group, the audio signal of the right channel corresponding to the left channel is classified into the same channel group. That is, the audio signals of coupled channels are classified into the audio signals of one channel group.
When the audio signal of the stereo channel layout is classified as the audio signal of the basic channel group, the audio signals of the coupled channels are all classified as the audio signals of one channel group. However, as described above with reference to fig. 2d, when the audio signal of the single channel layout is classified as the audio signal of the basic channel group, one of the audio signals of the stereo channels may additionally be classified as the audio signal of the slave channel group #1. The method of classifying the audio signals into channel groups may vary and is not limited to the description with reference to fig. 2d. That is, as long as the classified audio signals of the channel groups can be unmixed and the audio signals of the channels that are not classified into any channel group can be reconstructed from the unmixed audio signals, the audio signals may be classified into channel groups in various forms.
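For illustration, the grouping of fig. 2d (the mono base channel group case) can be written out as follows; the mapping mirrors the example above, while the dictionary keys and channel-name strings are merely labels.

```python
# Channels actually compressed per group for the fig. 2d example; channels
# not listed (R2, L3, R3, Lb, Rb, Hbl, Hbr, ...) are reconstructed by unmixing.
CHANNEL_GROUPS = {
    "base":      ["Mono"],
    "slave #1":  ["L2"],
    "slave #2":  ["Hfl3", "C", "LFE", "Hfr3"],
    "slave #3":  ["L", "R"],
    "slave #4":  ["Ls", "Hfl", "Rs", "Hfr"],
}

def transmitted_channels() -> list:
    """Every channel that is compressed and transmitted."""
    return [ch for group in CHANNEL_GROUPS.values() for ch in group]
```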
Fig. 3a is a block diagram of a structure of a multi-channel audio decoding apparatus according to various embodiments of the present disclosure.
The audio decoding apparatus 300 may include a memory 310 and a processor 330. The audio decoding apparatus 300 may be implemented as an apparatus capable of audio processing, such as a server, a television, a camera, a mobile phone, a computer, a digital broadcasting terminal, a tablet PC, a laptop computer, or the like.
Although the memory 310 and the processor 330 are separately shown in fig. 3a, the memory 310 and the processor 330 may be implemented by one hardware module (e.g., chip).
The processor 330 may be implemented as a dedicated processor for neural-network-based audio processing. Alternatively or additionally, the processor 330 may be implemented by a combination of software and a general-purpose processor, such as an AP, CPU, or GPU. The dedicated processor may include memory for implementing the various embodiments of the present disclosure, or may include a memory processing unit for using an external memory.
Processor 330 may include a plurality of processors. In this case, the processor 330 may be implemented as a combination of dedicated processors or may be implemented by a combination of software and a plurality of general-purpose processors (e.g., AP, CPU, or GPU).
Memory 310 may store one or more instructions for audio processing. According to various embodiments of the present disclosure, the memory 310 may store a neural network. When the neural network is implemented in the form of a dedicated hardware chip for Artificial Intelligence (AI) or as part of an existing general-purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU), the neural network may not be stored in the memory 310. The neural network may be implemented as an external device (e.g., a server). In this case, the audio decoding apparatus 300 may request the neural network-based result information from the external apparatus and receive the neural network-based result information from the external apparatus.
Processor 330 may sequentially process successive frames according to instructions stored in memory 310 to obtain successive reconstructed frames. Consecutive frames may refer to frames constituting audio.
The processor 330 may output a multi-channel audio signal by performing an audio processing operation on an input bitstream. The bit stream may be implemented in a scalable form to increase the number of channels from the basic channel set. For example, the processor 330 may obtain the compressed audio signals of the basic channel sets from the bitstream, and may reconstruct the audio signals of the basic channel sets (e.g., stereo channel audio signals) by decompressing the compressed audio signals of the basic channel sets. Alternatively or additionally, the processor 330 may reconstruct the audio signals of the slave channel groups by decompressing the compressed audio signals of the slave channel groups from the bitstream. The processor 330 may reconstruct the multi-channel audio signal based on the audio signal of the basic channel group and the audio signal of the slave channel group.
In some embodiments, the processor 330 may reconstruct the audio signals of the first slave channel group by decompressing the compressed audio signals of the first slave channel group from the bitstream. The processor 330 may reconstruct the audio signals of the second slave channel group by decompressing the compressed audio signals of the second slave channel group.
The processor 330 may reconstruct a multi-channel audio signal with an increased number of channels based on the audio signals of the basic channel group and the corresponding audio signals of the first and second slave channel groups. Similarly, the processor 330 may decompress the compressed audio signals of n slave channel groups (where n is an integer greater than 2), and may reconstruct a multi-channel audio signal with a further increased number of channels based on the audio signals of the basic channel group and the audio signals of the n slave channel groups.
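A minimal sketch of this scalable decoding loop follows; the decompress callable stands in for the codec-specific decoder, and everything here is illustrative rather than the actual decoder structure.

```python
from typing import Callable, List, Sequence

def decode_scalable(groups: Sequence[bytes],
                    supported_slave_groups: int,
                    decompress: Callable[[bytes], List[float]]) -> List[List[float]]:
    """Decode the base channel group plus as many slave channel groups as this
    decoder supports; compressed upper groups beyond that are ignored."""
    decoded = [decompress(groups[0])]                       # base channel group
    for payload in groups[1:1 + supported_slave_groups]:    # supported slaves
        decoded.append(decompress(payload))
    return decoded          # passed on to the unmixing / reconstruction stage
```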
Fig. 3b is a block diagram of a structure of a multi-channel audio decoding apparatus according to various embodiments of the present disclosure.
Referring to fig. 3b, the audio decoding apparatus 300 may include an information acquirer 350 and a multi-channel audio decoder 360. The multi-channel audio decoder 360 may include a decompressor 370 and a multi-channel audio signal reconstructor 380.
The audio decoding apparatus 300 may include the memory 310 and the processor 330 of fig. 3a, and instructions for implementing the components 350, 360, 370, and 380 of fig. 3a may be stored in the memory 310. Processor 330 may execute instructions stored in memory 310.
The information acquirer 350 may acquire the compressed audio signal of the basic channel group from the bitstream. That is, the information acquirer 350 may classify a basic channel audio stream including at least one compressed audio signal from a basic channel group of a bitstream.
The information acquirer 350 may also acquire at least one compressed audio signal of at least one slave channel group from the bitstream. That is, the information acquirer 350 may classify at least one sub-channel audio stream including at least one compressed audio signal of a sub-channel group from the bitstream.
In some embodiments, the bitstream may include a base channel audio stream and a plurality of slave channel streams. The plurality of slave channel audio streams may include a first slave channel audio stream and a second slave channel audio stream.
In this case, a restriction on the channels of the first multi-channel layout, reconstructed from the base channel audio stream and the first slave channel audio stream, relative to the second multi-channel layout, reconstructed from the base channel audio stream, the first slave channel audio stream, and the second slave channel audio stream, is described below.
For example, among the channels of the first multi-channel layout reconstructed from the base channel audio stream and the first slave channel audio stream, the number of surround channels may be S_{n-1}, the number of subwoofer channels may be W_{n-1}, and the number of overhead channels may be H_{n-1}. In the second multi-channel layout reconstructed from the base channel audio stream, the first slave channel audio stream, and the second slave channel audio stream, the number of surround channels may be S_n, the number of subwoofer channels may be W_n, and the number of overhead channels may be H_n. In this case, S_{n-1} may be less than or equal to S_n, W_{n-1} may be less than or equal to W_n, and H_{n-1} may be less than or equal to H_n. However, the case where S_{n-1} equals S_n, W_{n-1} equals W_n, and H_{n-1} equals H_n all at the same time is excluded; that is, S_{n-1}, W_{n-1}, and H_{n-1} cannot all be equal to S_n, W_n, and H_n, respectively.
That is, the number of surround channels of the second multi-channel arrangement needs to be greater than the number of surround channels of the first multi-channel arrangement. Alternatively or additionally, the number of subwoofer channels of the second multi-channel arrangement needs to be greater than the number of subwoofer channels of the first multi-channel arrangement. Alternatively or additionally, the number of overhead channels of the second multi-channel arrangement needs to be greater than the number of overhead channels of the first multi-channel arrangement.
Further, the number of surround channels of the second multi-channel arrangement may be not smaller than the number of surround channels of the first multi-channel arrangement. Also, the number of subwoofer channels of the second multi-channel layout may be not less than the number of subwoofer channels of the first multi-channel layout. The number of overhead channels of the second multi-channel arrangement may be no less than the number of overhead channels of the first multi-channel arrangement.
Alternatively or additionally, there is no case in which the number of surround channels of the second multi-channel layout is equal to the number of surround channels of the first multi-channel layout, the number of subwoofer channels of the second multi-channel layout is equal to the number of subwoofer channels of the first multi-channel layout, and the number of overhead channels of the second multi-channel layout is equal to the number of overhead channels of the first multi-channel layout, all at the same time. That is, the second multi-channel layout is never identical to the first multi-channel layout.
Specifically, for example, when the first multi-channel layout is a 5.1.2-channel layout, the second multi-channel layout may be a 7.1.4-channel layout.
In some embodiments, the bitstream may include a file stream having a plurality of audio tracks including a first audio track and a second audio track. The process of the information acquirer 350 acquiring at least one compressed audio signal of at least one slave channel group according to additional information included in the audio track is described below.
The information acquirer 350 may acquire at least one compressed audio signal of the basic channel group from the first audio track.
The information acquirer 350 may acquire the slave channel audio signal identification information from the second audio track adjacent to the first audio track.
When the slave channel audio signal identification information indicates that the slave channel audio signal is present in the second audio track, the information acquirer 350 may acquire at least one audio signal of at least one slave channel group from the second audio track.
When the slave channel audio signal identification information indicates that there is no slave channel audio signal in the second audio track, the information acquirer 350 may acquire the next audio signal of the basic channel group from the second audio track.
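The branch on the slave channel audio signal identification information can be sketched as follows; ParsedTrack is a hypothetical parsed-track object, not an actual API.

```python
from dataclasses import dataclass

@dataclass
class ParsedTrack:
    has_slave_channel_audio: bool   # slave channel audio signal identification info
    payload: bytes

def read_second_track(track2: ParsedTrack):
    """Return the kind of payload carried by the second audio track."""
    if track2.has_slave_channel_audio:   # identification info: signal present
        return ("slave_group", track2.payload)
    # Otherwise the track carries the next audio signal of the basic channel
    # group, following the one carried in the first audio track.
    return ("next_base", track2.payload)
```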
The information acquirer 350 may acquire additional information related to reconstruction of multi-channel audio from a bitstream. That is, the information acquirer 350 may classify metadata including additional information from a bitstream and acquire the additional information from the classified metadata.
The decompressor 370 may reconstruct the audio signal of the basic channel set by decompressing at least one compressed audio signal of the basic channel set.
The decompressor 370 may reconstruct the at least one audio signal of the at least one slave channel group by decompressing the at least one compressed audio signal of the at least one slave channel group.
In this case, the decompressor 370 may include separate first through n-th decompressors (not shown) for decoding the compressed audio signals of the respective n channel groups. The first through n-th decompressors (not shown) may operate in parallel with each other.
The multi-channel audio signal reconstructor 380 may reconstruct the multi-channel audio signal based on the at least one audio signal of the basic channel group and the at least one audio signal of the at least one slave channel group.
For example, when the audio signal of the basic channel group is an audio signal of a stereo channel, the multi-channel audio signal reconstructor 380 may reconstruct an audio signal of a 3D audio channel in front of the listener based on the audio signal of the basic channel group and the audio signal of the first dependent channel group. For example, the 3D audio channel in front of the listener may be a 3.1.2 channel.
Alternatively or additionally, the multi-channel audio signal reconstructor 380 may reconstruct the audio signal of the listener omnidirectional audio channel based on the audio signal of the basic channel group, the audio signal of the first slave channel group, and the audio signal of the second slave channel group. For example, the listener omnidirectional 3D audio channel may be the 5.1.2 channel or the 7.1.4 channel.
The multi-channel audio signal reconstructor 380 may reconstruct the multi-channel audio signal based not only on the audio signal of the basic channel group and the audio signal of the slave channel group, but also on the additional information. In this case, the additional information may be additional information for reconstructing the multi-channel audio signal. The multi-channel audio signal reconstructor 380 may output the reconstructed at least one multi-channel audio signal.
The multi-channel audio signal reconstructor 380 according to various embodiments of the present disclosure may generate a first audio signal of a listener-front 3D audio channel from at least one audio signal of the basic channel group and at least one audio signal of at least one slave channel group. The multi-channel audio signal reconstructor 380 may reconstruct a multi-channel audio signal including a second audio signal of the listener-front 3D audio channel based on the first audio signal and an audio object signal of the listener-front 3D audio channel. In this case, the audio object signal may indicate at least one of an audio signal, a shape, an area, a position, or a direction of an audio object (sound source), and may be obtained from the information acquirer 350.
Detailed operation of the multi-channel audio signal reconstructor 380 is described with reference to fig. 3 c.
Fig. 3c is a block diagram of a structure of a multi-channel audio signal reconstructor in accordance with various embodiments of the present disclosure.
Referring to fig. 3c, the multi-channel audio signal reconstructor 380 may include an upmix channel group audio generator 381 and a rendering unit 386.
The upmix channel group audio generator 381 may generate an audio signal of the upmix channel group based on the audio signal of the basic channel group and the audio signal of the dependent channel group. In this case, the audio signal of the upmix channel group may be a multi-channel audio signal. Alternatively or additionally, the multi-channel audio signal may be generated based on additional information (e.g., information about dynamic unmixed weight parameters).
The upmix channel group audio generator 381 may generate the audio signals of upmix channels by unmixing the audio signals of the basic channel group together with some of the audio signals of the slave channel group. For example, the audio signals L3 and R3 of unmixed channels (or upmix channels) may be generated by unmixing the audio signals L and R of the basic channel group with the audio signal C, which is a part of the slave channel group.
The upmix channel group audio generator 381 may generate the audio signals of some channels of the multi-channel audio signal by bypassing the unmixing operation for some audio signals of the slave channel group. For example, the upmix channel group audio generator 381 may generate the audio signals of the C, LFE, Hfl3, and Hfr3 channels of the multi-channel audio signal by bypassing the unmixing operation for the audio signals of the C, LFE, Hfl3, and Hfr3 channels among the audio signals of the slave channel group.
Accordingly, the upmix channel group audio generator 381 may generate the audio signal of the upmix channel group based on the audio signals of the upmix channels generated through unmixing and the audio signals of the slave channel group for which the unmixing operation is bypassed. For example, the upmix channel group audio generator 381 may generate the audio signals of the L3, R3, C, LFE, Hfl3, and Hfr3 channels, i.e., the audio signals of the 3.1.2 channels, based on the audio signals of the L3 and R3 channels, which are the audio signals of the unmixed channels, and the audio signals of the C, LFE, Hfl3, and Hfr3 channels, which are the audio signals of the slave channel group.
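As an illustrative sketch, and not part of the disclosed embodiments, the unmixing step described above can be written as follows in Python. The downmix weight w = 0.5 and the function shape are assumptions made for illustration only.

```python
def unmix_l3_r3(l, r, c, w=0.5):
    """Recover L3/R3 from the basic channel pair (L, R) and the decoded
    center signal C, given equal-length numpy arrays (or floats).

    Assumes the encoder produced L = L3 + w*C and R = R3 + w*C; the
    weight w is a hypothetical downmix weight, not a value taken from
    this disclosure.
    """
    l3 = l - w * c  # subtract the mixed-in center contribution
    r3 = r - w * c
    return l3, r3

# The C, LFE, Hfl3, and Hfr3 signals of the slave channel group bypass
# unmixing and pass through to the 3.1.2 output unchanged.
```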
The detailed operation of the upmix channel group audio generator 381 is described with reference to fig. 3 d.
The rendering unit 386 may include a volume controller 388 and a limiter 389. The multi-channel audio signal input to the rendering unit 386 may be a multi-channel audio signal of at least one channel layout. The multi-channel audio signal input to the rendering unit 386 may be a Pulse Code Modulation (PCM) signal.
In some embodiments, the volume (loudness) of the audio signal of each channel may be measured based on ITU-R BS.1770 and may be signaled through the additional information of the bitstream.
The volume controller 388 may control the volume of the audio signal of each channel to a target volume (e.g., -24 LKFS) based on the volume information signaled by the bitstream.
In some embodiments, the true peak may be measured based on ITU-R BS.1770.
The limiter 389 may limit the true-peak level of the volume-controlled audio signal (e.g., to -1 dBTP).
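For illustration, a simplified rendering post-process might look as follows. Real loudness measurement uses ITU-R BS.1770 K-weighting and gating, which this sketch replaces with a signaled loudness value, and the true-peak stage here is a plain sample clamp rather than an oversampled true-peak detector; all names are assumptions.

```python
import numpy as np

def render_postprocess(x, signalled_loudness_lkfs, target_lkfs=-24.0,
                       true_peak_ceiling_db=-1.0):
    """Volume control followed by a limiter.

    signalled_loudness_lkfs stands in for the per-channel loudness
    carried in the bitstream's additional information.
    """
    # Volume control: gain that moves the signaled loudness to target.
    gain = 10.0 ** ((target_lkfs - signalled_loudness_lkfs) / 20.0)
    y = x * gain

    # Limiter: clamp samples so the peak stays at or below the ceiling.
    ceiling = 10.0 ** (true_peak_ceiling_db / 20.0)
    return np.clip(y, -ceiling, ceiling)
```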
Although the post-processing components 388 and 389 included in the rendering unit 386 have been described, the present disclosure is not limited thereto; at least one component may be omitted, and the order of the components may be changed depending on circumstances.
The multi-channel audio signal output unit 390 may output the post-processed at least one multi-channel audio signal. For example, the multi-channel audio signal output unit 390 may output an audio signal of each channel of the multi-channel audio signal to an audio output device corresponding to each channel according to a target channel layout, with the post-processed multi-channel audio signal as an input. The audio output device may include various types of speakers.
Fig. 3d is a block diagram of the structure of an upmix channel group audio generator in accordance with various embodiments of the present disclosure.
Referring to fig. 3d, the upmix channel group audio generator 381 may include a unmixing unit 382. The unmixing unit 382 may include a first unmixing unit 383 and second through nth unmixing units 384 through 385.
The unmixing unit 382 may obtain an audio signal of a new channel (e.g., an upmix channel or unmixed channel) from the audio signal of the basic channel group and the audio signals of some channels (e.g., decoded channels) of the slave channel group. That is, the unmixing unit 382 may obtain the audio signal of one upmix channel from at least one audio signal in which several channels are mixed. The unmixing unit 382 may output an audio signal of a particular layout, including the audio signal of the upmix channel and the audio signals of the decoded channels.
For example, the unmixing operation may be bypassed in the unmixing unit 382 such that the audio signals of the basic channel group may be output as audio signals of the first channel arrangement.
The first unmixing unit 383 may unmix the audio signals of some channels, taking as input the audio signals of the basic channel group and the audio signals of the first slave channel group. In this case, audio signals of unmixed channels (or upmix channels) may be generated. The first unmixing unit 383 may generate the audio signals of independent channels by bypassing the unmixing operation for the audio signals of the other channels. The first unmixing unit 383 may output an audio signal of the second channel layout, including the audio signals of the upmix channels and the audio signals of the independent channels.
The second unmixing unit 384 may generate the audio signals of unmixed channels (or upmix channels) by unmixing some of the audio signals of the second channel layout together with the audio signals of the second slave channel group. The second unmixing unit 384 may generate the audio signals of independent channels by bypassing the unmixing operation for the audio signals of the other channels. The second unmixing unit 384 may output an audio signal of the third channel layout, including the audio signals of the upmix channels and the audio signals of the independent channels.
Similar to the operation of the second unmixing unit 384, an n-th unmixing unit (not shown) may output an audio signal of the n-th channel layout based on the audio signal of the (n-1)-th channel layout and the audio signal of the (n-1)-th slave channel group, where n is less than or equal to N.
The N-th unmixing unit 385 may output an audio signal of the N-th channel layout based on the audio signal of the (N-1)-th channel layout and the audio signal of the (N-1)-th slave channel group.
Although the audio signals of the lower channel layouts are shown as being directly input to the respective unmixing units 383 through 385, the audio signals of the channel layouts output through the rendering unit 386 of fig. 3c may instead be input to each of the unmixing units 383 through 385. That is, the post-processed audio signal of a lower channel layout may be input to each of the unmixing units 383 through 385.
Referring to fig. 3d, the unmixing units 383 through 385 are described as being connected in a cascade manner to output the audio signal of each channel layout.
However, without connecting the unmixing units 383 through 385 in a cascade manner, an audio signal of a particular layout may also be output directly from the audio signal of the basic channel group and the audio signal of the at least one slave channel group.
In some embodiments, an audio signal generated by mixing the signals of several channels in the audio encoding apparatuses 200 and 400 may have its level reduced using a downmix gain to prevent clipping. The audio decoding apparatuses 300 and 500 may match the level of such an audio signal to the level of the original audio signal based on the corresponding downmix gain.
In other embodiments, the operations based on the downmix gains described above may be performed for each channel or group of channels. The audio encoding apparatuses 200 and 400 may signal information about the downmix gain through additional information of the bitstream of each channel or each channel group. Accordingly, the audio decoding apparatuses 300 and 500 may obtain information on a downmix gain from additional information of a bitstream of each channel or each channel group and perform the above-described operations based on the downmix gain.
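A minimal sketch of this downmix-gain round trip, assuming the gain is carried in decibels in the additional information; the function names, the dictionary, and the dB values are illustrative.

```python
def apply_downmix_gain(mixed, gain_db):
    """Encoder side: attenuate a mixed signal so that it does not clip."""
    return mixed * 10.0 ** (-gain_db / 20.0)

def restore_level(decoded, gain_db):
    """Decoder side: undo the downmix gain signaled in the additional
    information for this channel (or channel group), matching the level
    of the original audio signal."""
    return decoded * 10.0 ** (gain_db / 20.0)

# Hypothetical per-channel-group gains, as if parsed from the bitstream;
# the dB values are illustrative.
downmix_gain_db = {"base": 3.0, "dependent_1": 1.5}
```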
In other embodiments, the unmixing unit 382 may perform the unmixing operation based on the dynamic unmixing weight parameters of an unmixing matrix (corresponding to the downmix weight parameters of a downmix matrix). In this case, the audio encoding apparatuses 200 and 400 may signal the dynamic unmixing weight parameters, or the dynamic downmix weight parameters corresponding thereto, through the additional information of the bitstream. Some of the unmixing weight parameters may not be signaled and may instead have fixed values.
Accordingly, the audio decoding apparatuses 300 and 500 may obtain information about the dynamic unmixing weight parameters (or information about the dynamic downmix weight parameters) from the additional information of the bitstream and perform the unmixing operation based on the obtained information.
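The unmixing-matrix view can be sketched as below; the 2x3 matrix shape, the fixed unit entries, and the single dynamic weight are illustrative assumptions, not the disclosure's actual matrix.

```python
import numpy as np

def unmix_frame(mixed_frame, dynamic_w):
    """Un-mix one frame with a 2x3 un-mixing matrix.

    mixed_frame is a (3, n_samples) array holding L, R, and C.
    dynamic_w is a per-frame weight as recovered from the bitstream's
    additional information; the fixed unit entries stand in for the
    non-signaled, fixed parameters.
    """
    unmix = np.array([
        [1.0, 0.0, -dynamic_w],  # L3 = L - w*C
        [0.0, 1.0, -dynamic_w],  # R3 = R - w*C
    ])
    return unmix @ mixed_frame   # shape (2, n_samples): L3, R3
```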
Fig. 4a is a block diagram of an audio encoding apparatus according to various embodiments of the present disclosure.
Referring to fig. 4a, the audio encoding apparatus 400 may include a multi-channel audio encoder 450, a bitstream generator 480, and an error-cancellation-related information generator 490. The multi-channel audio encoder 450 may include a multi-channel audio signal processor 460 and a compressor 470.
The components 450, 460, 470, 480, and 490 of fig. 4a may be implemented by the memory 210 and the processor 230 of fig. 2 a.
The operations of the multi-channel audio encoder 450, the multi-channel audio signal processor 460, the compressor 470, and the bitstream generator 480 of fig. 4a correspond to the operations of the multi-channel audio encoder 250, the multi-channel audio signal processor 260, the compressor 270, and the bitstream generator 280, respectively; thus, their detailed description is replaced by the description of fig. 2b.
The error-cancellation-related information generator 490 may be included in the additional information generator 285 of fig. 2b, but may also exist separately; the present disclosure is not limited thereto.
The error-cancellation-related information generator 490 may determine a factor (e.g., a scaling factor) for error cancellation based on a first power value and a second power value. In this case, the first power value may be a power value of one channel of the original audio signal, or of an audio signal of one channel obtained by downmixing the original audio signal. The second power value may be a power value of the audio signal of an upmix channel, which is one of the audio signals of the upmix channel group. The audio signal of the upmix channel group may be an audio signal obtained by unmixing the basic channel reconstruction signal and the slave channel reconstruction signal.
The error-cancellation related information generator 490 may determine a factor of error cancellation for each channel.
The error-cancellation-related information generator 490 may generate error-cancellation-related information including information about the determined factor for error cancellation. The bitstream generator 480 may generate a bitstream further including the error-cancellation-related information. The detailed operation of the error-cancellation-related information generator 490 is described with reference to fig. 4b.
Fig. 4b is a block diagram of a structure of an error-cancellation-related information generator in accordance with various embodiments of the present disclosure.
Referring to fig. 4b, the error-cancellation-related information generator 490 may include a decompressor 492, an unmixing unit 494, a root mean square (RMS) value determination unit 496, and an error-canceled factor determination unit 498.
The decompressor 492 may generate a basic channel reconstruction signal by decompressing the compressed audio signals of the basic channel group. Alternatively or additionally, the decompressor 492 may generate the slave channel reconstruction signal by decompressing the compressed audio signals of the slave channel groups.
The unmixing unit 494 may unmix the basic channel reconstruction signal and the slave channel reconstruction signal to generate the audio signal of the upmix channel group. For example, the unmixing unit 494 may generate the audio signal of an upmix channel (or unmixed channel) by unmixing the audio signals of some channels of the basic channel group and the slave channel group. The unmixing unit 494 may bypass the unmixing operation for some of the audio signals of the basic channel group and the slave channel group.
The unmixing unit 494 may obtain an audio signal of the upmix channel group including the audio signal of the upmix channel and the audio signals for which the unmixing operation is bypassed.
The RMS value determining unit 496 may determine the RMS value of the first audio signal of one upmix channel of the upmix channel group. The RMS value determining unit 496 may determine the RMS value of the second audio signal of one channel of the original audio signal or the RMS value of the second audio signal of one channel of the audio signal downmixed from the original audio signal. In this case, the channels of the first audio signal and the channels of the second audio signal may indicate the same channels in the channel layout.
The error-canceled factor determination unit 498 may determine the factor for error cancellation based on the RMS value of the first audio signal and the RMS value of the second audio signal. For example, a value generated by dividing the RMS value of the first audio signal by the RMS value of the second audio signal may be obtained as the factor for error cancellation. The error-canceled factor determination unit 498 may generate information about the determined factor and output error-cancellation-related information including that information.
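A sketch of this ratio, following the division direction stated above (RMS of the reconstructed up-mix channel over RMS of the matching original channel); the eps guard against a silent reference channel is an implementation choice, not part of this disclosure.

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def error_cancellation_factor(upmix_recon, original, eps=1e-12):
    """Factor for one channel: RMS of the reconstructed up-mix channel
    divided by the RMS of the same channel of the original (or
    down-mixed original) signal, as stated above."""
    return rms(upmix_recon) / max(rms(original), eps)
```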
Fig. 5a is a block diagram of a structure of an audio decoding apparatus according to various embodiments of the present disclosure.
Referring to fig. 5a, the audio decoding apparatus 500 may include an information acquirer 550, a multi-channel audio decoder 560, a decompressor 570, a multi-channel audio signal reconstructor 580, and an error cancellation related information acquirer 555. The components 550, 555, 560, 570, and 580 of fig. 5a may be implemented by the memory 310 and the processor 330 of fig. 3 a.
Instructions for implementing components 550, 555, 560, 570, and 580 of fig. 5a may be stored in memory 310 of fig. 3 a. Processor 330 may execute instructions stored in memory 310.
The operations of the information acquirer 550, the decompressor 570, and the multi-channel audio signal reconstructor 580 of fig. 5a include the operations of the information acquirer 350, the decompressor 370, and the multi-channel audio signal reconstructor 380 of fig. 3b, respectively; redundant descriptions are therefore omitted, and only the differences from fig. 3b are described below.
The information acquirer 550 may acquire metadata from a bitstream.
The error-cancellation-related information acquirer 555 may acquire the error-cancellation-related information from metadata included in the bit stream. Here, the information on the factor of error cancellation included in the error cancellation related information may be a factor of error cancellation of an audio signal of one upmix channel of the upmix channel group. An error cancellation related information acquirer 555 may be included in the information acquirer 550.
The multi-channel audio signal reconstructor 580 may generate the audio signal of the upmix channel group based on the at least one audio signal of the basic channel group and the at least one audio signal of the at least one slave channel group. The audio signal of the upmix channel group may be a multi-channel audio signal. The multi-channel audio signal reconstructor 580 may reconstruct the audio signal of one upmix channel by applying the error-cancellation factor to the audio signal of that upmix channel included in the upmix channel group.
The multi-channel audio signal reconstructor 580 may output a multi-channel audio signal including the reconstructed audio signal of one upmix channel.
Fig. 5b is a block diagram of the structure of a multi-channel audio signal reconstructor in accordance with various embodiments of the present disclosure.
The multi-channel audio signal reconstructor 580 may include an upmix channel group audio generator 581 and a rendering unit 583. The rendering unit 583 may include an error cancellation unit 584, a fader 585, a limiter 586, and a multi-channel audio signal output unit 587.
The upmix channel group audio generator 581, the volume controller (fader) 585, the limiter 586, and the multi-channel audio signal output unit 587 of fig. 5b may include the operations of the upmix channel group audio generator 381, the volume controller 388, the limiter 389, and the multi-channel audio signal output unit 390 of fig. 3c, respectively (the error cancellation unit 584 has no counterpart in fig. 3c); redundant description is therefore replaced by the description given with reference to fig. 3c. Hereinafter, only the portions that differ from fig. 3c are described.
The error cancellation unit 584 may reconstruct an error-canceled audio signal of the first upmix channel based on the audio signal of the first upmix channel of the upmix channel group of the multi-channel audio signal and the error-cancellation factor of the first upmix channel. In this case, the error-cancellation factor may be a value based on the RMS value of the audio signal of the first channel of the original audio signal (or of the audio signal downmixed from the original audio signal) and the RMS value of the audio signal of the first upmix channel of the upmix channel group. The first channel and the first upmix channel may indicate the same channel of the channel layout. The error cancellation unit 584 may remove the error caused by encoding by making the RMS value of the audio signal of the first upmix channel of the current upmix channel group equal to the RMS value of the audio signal of the first channel of the original audio signal or of the audio signal downmixed from the original audio signal.
In some embodiments, the error-cancellation factor may differ between adjacent audio frames. In this case, at the end portion of the previous frame and the beginning portion of the next frame, the audio signal may jump (bounce) due to the discontinuity in the error-cancellation factor.
Accordingly, the error cancellation unit 584 may determine the error-cancellation factors used in the sections adjacent to a frame boundary by smoothing the error-cancellation factor. The sections adjacent to the frame boundary refer to the end section of the previous frame and the beginning section of the next frame with respect to the boundary, and each section may include a plurality of samples.
Here, smoothing may refer to the operation of converting error-cancellation factors that are discontinuous between adjacent audio frames into continuous factors within the sections adjacent to the frame boundary.
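A sketch of such smoothing, using a linear ramp across a fixed number of boundary samples; the section length and the linear shape are illustrative choices, since the text only requires that the factor vary continuously across the boundary.

```python
import numpy as np

def smooth_factor_at_boundary(prev_factor, next_factor, n_boundary=64):
    """Linearly interpolate the error-cancellation factor across the
    sections adjacent to a frame boundary. Returns per-sample factors
    for the end section of the previous frame and the beginning section
    of the next frame."""
    ramp = np.linspace(prev_factor, next_factor, 2 * n_boundary)
    return ramp[:n_boundary], ramp[n_boundary:]

def apply_factor(samples, factor):
    # With the factor defined as reconstructed RMS over original RMS,
    # dividing restores the original level; if the inverse convention
    # is used, multiply instead.
    return samples / factor
```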
The multi-channel audio signal output unit 587 may output a multi-channel audio signal including the error-canceled audio signal of one channel.
In some embodiments, at least one of the post-processing components 585 and 586 included in the rendering unit 583 may be omitted, and the order of the post-processing components 584, 585 and 586 including the error canceling unit 584 may be changed according to circumstances.
As described above, the audio encoding apparatuses 200 and 400 may generate a bitstream and transmit the generated bitstream.
In this case, the bitstream may be generated in the form of a file stream. The audio decoding apparatuses 300 and 500 may receive the bitstream and reconstruct the multi-channel audio signal based on information obtained from the received bitstream. In this case, the bitstream may be included in a specific file container. For example, the file container may be a Moving Picture Experts Group (MPEG)-4 media container for storing various compressed multimedia digital data, such as MPEG-4 Part 14 (MP4).
Hereinafter, referring to fig. 6, a file structure according to various embodiments of the present disclosure is described.
Referring to fig. 6, a file 600 may include a metadata box 610 and a media data box 620.
For example, metadata box 610 may be a moov box of an MP4 file container and media data box 620 may be an mdat box of an MP4 file container.
A metadata box 610 may be located in the header portion of the file 600. The metadata box 610 may be a data box that stores metadata about the media data. For example, the metadata box 610 may include the additional information 615 described above.
The media data box 620 may be a data box that stores media data. For example, the media data box 620 may include a base channel audio stream or a slave channel audio stream 625.
In the base channel audio stream or the slave channel audio stream 625, the base channel audio stream may comprise compressed audio signals of a base channel group.
In addition to the base channel audio stream or the slave channel audio stream 625, the slave channel audio stream may comprise compressed audio signals of a slave channel group.
The media data box 620 may include additional information 630. The additional information 630 may be included in the header portion of the media data box 620. Without being limited thereto, the additional information 630 may instead be included in the header portion of the base channel audio stream or the slave channel audio stream 625; in particular, the additional information 630 may be included in the header portion of the slave channel audio stream.
The audio decoding apparatuses 300 and 500 may obtain additional information 615 and 630 included in respective portions of the file 600. The audio decoding apparatuses 300 and 500 may reconstruct a multi-channel audio signal based on the audio signal of the basic channel group, the audio signal of the slave channel group, and the additional information 615 and 630. Here, the audio signal of the basic channel group may be obtained from the basic channel audio stream, and the audio signal of the slave channel group may be obtained from the slave channel audio stream.
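Locating the metadata box (moov) and the media data box (mdat) needs only the generic ISO base media (MP4) box walk sketched below; parsing the additional information inside the boxes is out of scope for this sketch.

```python
import struct

def iter_top_level_boxes(path):
    """Walk the top-level boxes of an ISO base media (MP4) file,
    yielding (box type, offset, size)."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            if size == 1:  # 64-bit size stored after the type field
                size = struct.unpack(">Q", f.read(8))[0]
            yield box_type.decode("ascii", "replace"), offset, size
            if size == 0:  # box runs to the end of the file
                break
            offset += size
            f.seek(offset)
```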
Fig. 7a is a view for describing a detailed structure of a file according to various embodiments of the present disclosure.
Referring to fig. 7a, a file 700 may include a metadata box 710 and a media data box 730.
The file 700 may include a metadata box 710 and a media data box 730. The metadata box 710 may include a metadata box of at least one audio track.
For example, the metadata box 710 may include a metadata box 715 for audio track #n (where n is an integer greater than or equal to 1). For example, the metadata box 715 of audio track #n may be a track box of an MP4 container.
The metadata box 715 of the audio track #n may include additional information 720.
In some embodiments, the media data box 730 may include a media data box of at least one audio track. For example, the media data box 730 may include the media data box 735 of audio track #n (where n is an integer greater than or equal to 1). The location information included in the metadata box 715 of audio track #n may indicate the location of the media data box 735 of audio track #n within the media data box 730, so the media data box 735 of audio track #n may be identified based on that location information.
The media data box 735 of audio track #n may include a base channel audio stream and a dependent channel audio stream 740 and additional information 745. The additional information 745 may be located in the header portion of the media data box of audio track #n. Alternatively or additionally, the additional information 745 may be included in a header portion of at least one of the base channel audio stream or the slave channel audio stream 740.
Fig. 7b is a flowchart of a method of reproducing an audio signal by an audio decoding apparatus according to the file structure of fig. 7 a.
In operation S700, the audio decoding apparatuses 300 and 500 may obtain identification information of the audio track #n from the additional information included in the metadata.
In operation S705, the audio decoding apparatuses 300 and 500 may identify whether the identification information of the audio track #n indicates an audio signal of a basic channel group or whether the identification information of the audio track #n indicates an audio signal of a basic/subordinate channel group.
For example, the identification information of the audio track #n included in an OPUS audio format file may be a channel mapping family (CMF). When the CMF is 1, the audio decoding apparatuses 300 and 500 may identify that the audio signal of the basic channel group is included in the current audio track. For example, the audio signal of the basic channel group may be an audio signal of a stereo channel layout. When the CMF is 4, the audio decoding apparatuses 300 and 500 may identify that the audio signal of the basic channel group and the audio signal of the slave channel group are both included in the current audio track.
In operation S710, when the identification information of the audio track #n indicates an audio signal of a basic channel group, the audio decoding apparatus 300 and 500 may obtain a compressed audio signal of the basic channel group included in the media data box of the audio track #n. The audio decoding apparatuses 300 and 500 may decompress the compressed audio signals of the basic channel sets.
The audio decoding apparatus 300 and 500 may reproduce the audio signals of the basic channel group in operation S720.
In operation S730, when the identification information of the audio track #n indicates the audio signal of the basic/sub channel group, the audio decoding apparatus 300 and 500 may obtain the compressed audio signal of the basic channel group included in the media data box of the audio track #n. The audio decoding apparatuses 300 and 500 may decompress the obtained compressed audio signals of the basic channel sets.
In operation S735, the audio decoding apparatuses 300 and 500 may obtain the compressed audio signals of the slave channel groups included in the media data box of the audio track #n.
The audio decoding apparatuses 300 and 500 may decompress the obtained compressed audio signals of the slave channel groups.
In operation S740, the audio decoding apparatus 300 and 500 may generate an audio signal of at least one upmix channel group based on the audio signal of the basic channel group and the audio signal of the slave channel group.
The audio decoding apparatuses 300 and 500 may generate audio signals of at least one independent channel by bypassing a unmixed operation with respect to some of the audio signals of the base channel group and the audio signals of the dependent channel group. The audio decoding apparatuses 300 and 500 may generate an audio signal of an upmix channel group including an audio signal of at least one upmix channel and an audio signal of at least one independent channel.
In operation S745, the audio decoding apparatus 300 and 500 may reproduce the multi-channel audio signal. In this case, the multi-channel audio signal may be one of the audio signals of the at least one upmix channel group.
In operation S750, the audio decoding apparatuses 300 and 500 may identify whether the next audio track needs to be processed. When the audio decoding apparatuses 300 and 500 identify that the next audio track needs to be processed, the audio decoding apparatuses 300 and 500 may obtain identification information of the next audio track #n+1 and perform the operations S705 to S750 described above. That is, the audio decoding apparatuses 300 and 500 may increment the variable n by 1 to determine a new n, obtain identification information of the audio track #n, and perform the above-described operations S705 to S750.
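The per-track loop of fig. 7b can be sketched as follows; the track objects, the cmf attribute, and the decoder methods are hypothetical stand-ins for the operations S700 to S750, and only the CMF values 1 and 4 described above are handled.

```python
def play_tracks(tracks, decoder):
    """Sketch of the per-track loop of fig. 7b (operations S700-S750)."""
    for track in tracks:                                  # S700 / S750
        if track.cmf == 1:        # S705: basic channel group only
            base = decoder.decompress_base(track)         # S710
            decoder.play(base)                            # S720
        elif track.cmf == 4:      # S705: basic + slave channel groups
            base = decoder.decompress_base(track)         # S730
            dep = decoder.decompress_dependent(track)     # S735
            upmix = decoder.unmix(base, dep)              # S740
            decoder.play(upmix)                           # S745
```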
As described above with reference to fig. 7a and 7b, one audio track including both the compressed audio signal of the basic channel group and the compressed audio signal of the slave channel group may be generated. However, when the identification information of the audio track indicates the audio signal of the basic and slave channel groups, a conventional audio decoding apparatus cannot obtain the compressed audio signal of the basic channel group from the corresponding audio track. That is, with the structure of fig. 7a and 7b, backward compatibility with the audio signal of the basic channel group, such as a stereo audio signal, is not supported.
Fig. 7c is a view for describing a detailed structure of a file according to various embodiments of the present disclosure.
Referring to fig. 7c, file 750 may include metadata box 760 and media data box 780. The metadata box 760 may include a metadata box of at least one audio track. For example, metadata box 760 may include metadata box 765 for audio track #n (where n is an integer greater than or equal to 1) and metadata box 770 for audio track #n+1. The metadata box 770 of the audio track #n may include additional information 775.
Media data box 780 may include media data box 782 of audio track #n. The media data box 782 of audio track #n may include a base channel audio stream 784.
Media data box 780 may include media data box 786 of audio track #n+1. The media data box 786 of audio track #n+1 may include a slave channel audio stream 788. The media data box 786 of audio track #n+1 may include the additional information 790 described above. In this case, the additional information 790 may be included in the header portion of the media data box 786 of the audio track #n+1, but is not limited thereto.
The location information included in metadata box 765 for audio track #n may indicate the location of media data box 782 for audio track #n in media data box 780. The media data box 782 of audio track #n may be identified based on the location information included in the metadata box 765 of audio track #n. Likewise, the media data box 786 of audio track #n+1 may be identified based on the location information included in the metadata box 770 of audio track #n+1.
Fig. 7d is a flowchart of a method of reproducing an audio signal by an audio decoding apparatus according to the file structure of fig. 7c.
Referring to fig. 7d, the audio decoding apparatuses 300 and 500 may obtain the identification information of the audio track #n from the additional information included in the metadata box in operation S750.
In operation S755, the audio decoding apparatuses 300 and 500 may identify whether the obtained identification information of the audio track #n indicates an audio signal of the basic channel group or an audio signal of the dependent channel group.
In operation S760, when the identification information of the audio track #n indicates the audio signal of the basic channel group, the audio decoding apparatus 300 and 500 may decompress the compressed audio signal of the basic channel group included in the audio track #n.
The audio decoding apparatus 300 and 500 may reproduce the audio signals of the basic channel group in operation S765.
In operation S770, when the identification information of the audio track #n indicates the audio signal of the slave channel group, the audio decoding apparatuses 300 and 500 may obtain the compressed audio signal of the slave channel group from the audio track #n and decompress it. The audio track containing the audio signal of the basic channel group corresponding to this audio signal of the slave channel group may be the audio track #n-1. That is, the compressed audio signal of the basic channel group may be included in an audio track preceding the audio track that includes the compressed audio signal of the slave channel group; for example, in the immediately preceding, adjacent audio track. Accordingly, the audio decoding apparatuses 300 and 500 may obtain the compressed audio signal of the basic channel group from the audio track #n-1 before operation S770. Alternatively or additionally, the audio decoding apparatuses 300 and 500 may decompress the obtained compressed audio signal of the basic channel group.
The audio decoding apparatus 300 and 500 may generate the audio signal of at least one upmix channel group based on the audio signal of the basic channel group and the audio signal of the slave channel group in operation S775.
In operation S780, the audio decoding apparatuses 300 and 500 may reproduce a multi-channel audio signal that is one of the audio signals of the at least one upmix channel group.
In operation S785, the audio decoding apparatus 300 and 500 may identify whether the next audio track needs to be processed. When the audio decoding apparatuses 300 and 500 identify that the next audio track needs to be processed, the audio decoding apparatuses 300 and 500 may obtain identification information of the next audio track #n+1 and perform the above-described operations S755 to S785. That is, the audio decoding apparatuses 300 and 500 may increment the variable n by 1 to determine a new n, obtain the identification information of the audio track #n, and perform the above-described operations S755 to S785.
As described above with reference to fig. 7c and 7d, the audio track including the compressed audio signal of the slave channel group may be generated separately from the audio track including the compressed audio signal of the basic channel group. When the identification information of an audio track indicates the audio signal of the slave channel group, a conventional audio decoding apparatus may not obtain the compressed audio signal of the slave channel group from the corresponding audio track. However, unlike the case described with reference to fig. 7a and 7b, the conventional audio decoding apparatus may still decompress the compressed audio signal of the basic channel group included in the preceding audio track and reproduce the audio signal of the basic channel group.
Accordingly, referring to fig. 7c and 7d, backward compatibility with a stereo audio signal (e.g., the audio signal of the basic channel group) may be supported.
The audio decoding apparatuses 300 and 500 may obtain the compressed audio signal of the basic channel group and the compressed audio signal of the dependent channel group included in the separate audio tracks. The audio decoding apparatuses 300 and 500 may decompress the compressed audio signals of the basic channel group obtained from the first audio track. The audio decoding apparatuses 300 and 500 may decompress the compressed audio signals of the slave channel groups obtained from the second audio track. The audio decoding apparatuses 300 and 500 may reproduce multi-channel audio signals based on the audio signals of the basic channel group and the audio signals of the slave channel group.
In some embodiments, there may be a plurality of slave channel groups corresponding to the basic channel group. In this case, a plurality of audio tracks each including the audio signal of at least one slave channel group may be generated. For example, an audio track #n including the audio signal of at least one slave channel group #1 may be generated, an audio track #n+1 including the audio signal of at least one slave channel group #2 may be generated, and, like the audio track #n+1, an audio track #n+2 including the audio signal of at least one slave channel group #3 may be generated. Similarly, an audio track #n+m-1 including the audio signal of at least one slave channel group #m may be generated. The audio decoding apparatuses 300 and 500 may obtain the compressed audio signals of the slave channel groups #1, #2, ..., #m included in the audio tracks #n, #n+1, ..., #n+m-1 and decompress them. The audio decoding apparatuses 300 and 500 may then reconstruct the multi-channel audio signal based on the audio signal of the basic channel group and the audio signals of the slave channel groups #1, #2, ..., #m.
The audio decoding apparatuses 300 and 500 may obtain, according to the supported channel layout, the compressed audio signals of those audio tracks that include audio signals of the supported channel layout, and may not obtain the compressed audio signals of audio tracks that include audio signals of an unsupported channel layout. That is, the audio decoding apparatuses 300 and 500 may obtain the compressed audio signals of only some of the total audio tracks and decompress the compressed audio signal of at least one slave channel included in those audio tracks. Accordingly, the audio decoding apparatuses 300 and 500 may reconstruct the multi-channel audio signal according to the supported channel layout.
Fig. 8a is a diagram for describing a file structure according to various embodiments of the present disclosure.
Referring to fig. 8a, the additional information 820 may be included in the metadata box 810 of the metadata container track #n+1 instead of in the metadata box of the audio track #n+1 of fig. 7c. Alternatively or additionally, the slave channel audio stream 840 may be included in the media data box 830 of the metadata container track #n+1 instead of in the media data box of the audio track #n+1. That is, the additional information 820 may be included in a metadata container track instead of an audio track. However, the metadata container track and the audio track may be managed in the same track group, such that when the audio track of the basic channel audio stream is #n, the metadata container track of the slave channel audio stream is #n+1.
Fig. 8b is a flowchart of a method of reproducing an audio signal by an audio decoding apparatus according to the file structure of fig. 8 a.
The audio decoding apparatuses 300 and 500 may identify the type of each track.
In operation S800, the audio decoding apparatuses 300 and 500 may identify whether there is a metadata container track #n+1 corresponding to the audio track #n. That is, the audio decoding apparatuses 300 and 500 may identify that the track #n is one of the audio tracks, examine the track #n+1, and identify whether the track #n+1 is a metadata container track corresponding to the audio track #n.
In operation S810, when the audio decoding apparatuses 300 and 500 identify that a metadata container track #n+1 corresponding to the audio track #n does not exist, the audio decoding apparatuses 300 and 500 may decompress the compressed audio signal of the basic channel group.
The audio decoding apparatus 300 and 500 may reproduce the decompressed audio signals of the basic channel group in operation S820.
When the audio decoding apparatuses 300 and 500 identify that a metadata container track #n+1 corresponding to the audio track #n exists, the audio decoding apparatuses 300 and 500 may decompress the compressed audio signal of the basic channel group in operation S830.
In operation S840, the audio decoding apparatus 300 and 500 may decompress the compressed audio signals of the dependent channel groups of the metadata container track.
In operation S850, the audio decoding apparatuses 300 and 500 may generate the audio signal of at least one upmix channel group based on the decompressed audio signal of the basic channel group and the decompressed audio signal of the at least one slave channel group.
In operation S860, the audio decoding apparatus 300 and 500 may reproduce a multi-channel audio signal that is one of the audio signals of the at least one upmix channel group.
In operation S870, the audio decoding apparatus 300 and 500 may identify whether the next audio track needs to be processed. When there is a metadata container track #n+1 corresponding to the audio track #n, the audio decoding apparatuses 300 and 500 may identify whether the track #n+2 exists as a next track, and when the track #n+2 exists, the audio decoding apparatuses 300 and 500 may obtain identification information of the tracks #n+2 and #n+3 and perform the above-described operations S800 to S870. That is, the audio decoding apparatuses 300 and 500 may increment the variable n by 2 to determine new n, obtain identification information of the tracks #n and #n+1, and perform the above-described operations S800 to S870.
When the metadata container audio track #n+1 corresponding to the audio track #n does not exist, the audio decoding apparatuses 300 and 500 may identify whether the audio track #n+1 exists as a next audio track, and when the audio track #n+1 exists, the audio decoding apparatuses 300 and 500 may obtain identification information of the audio tracks #n+1 and #n+2 and perform the above-described operations S800 to S870. That is, the audio decoding apparatuses 300 and 500 may increment the variable n by 1 to determine new n, obtain identification information of the tracks #n+1 and #n+2, and perform the above-described operations S800 to S870.
Fig. 9a is a view for describing an audio track package according to the file structure of fig. 7 a.
As described above with reference to fig. 7a, the media data box 735 of audio track #n may include a base channel audio stream or a slave channel audio stream 740.
Referring to fig. 9a, an audio track #n packet 900 may include a metadata header 910, a base channel audio packet 920, and a slave channel audio packet 930. The base channel audio packet 920 may include a portion of a base channel audio stream and the slave channel audio packet 930 may include a portion of a slave channel audio stream. The metadata header 910 may be located in the header portion of the audio track #n packet 900. The metadata header 910 may include additional information. However, without being limited thereto, the additional information may be located in the header portion of the slave channel audio packet 930.
Fig. 9b is a view for describing audio track grouping according to the file structure of fig. 7 c.
As described with reference to fig. 7c, the media data box 782 of audio track #n may include the base channel audio stream 784, and the media data box 786 of audio track #n+1 may include the slave channel audio stream 788.
Referring to fig. 9b, the audio track #n packet 940 may include a basic channel audio packet 945. The audio track #n+1 packet 950 may include a metadata header 955 and a slave channel audio packet 960. The metadata header 955 may be located at the header portion of the audio track #n+1 packet 950. The metadata header 955 may include additional information.
However, without limitation, there may be one or more slave channel audio packets 960. Additional information may be included in the header portion of one or more slave channel audio packets 960.
Fig. 9c is a view for describing an audio track package according to the file structure of fig. 8 a.
As described with reference to fig. 8a, the media data box 850 of audio track #n may include a base channel audio stream 860 and the media data box 830 of metadata container audio track #n+1 may include a dependent channel audio stream 840.
Fig. 9b and 9c are identical to each other except that the audio track #n+1 packet 950 of fig. 9b is replaced by the metadata container track #n+1 packet 980 in fig. 9c; accordingly, the description of fig. 9c is replaced by the description of fig. 9b.
Fig. 10 is a view for describing additional information of a metadata header/metadata audio packet according to various embodiments of the present disclosure.
Referring to fig. 10, the metadata header/metadata audio packet 1000 may include at least one of coding type information 1005, voice presence information 1010, voice specification information 1015, LFE presence information 1020, LFE gain information 1025, top audio presence information 1030, scale factor presence information 1035, scale factor information 1040, on-screen audio object presence information 1050, discrete channel audio stream presence information 1055, or continuous channel audio stream presence information 1060.
The encoding type information 1005 may be information for identifying an encoded audio signal in the media data related to the metadata header/metadata audio packet 1000. That is, the coding type information 1005 may be information for identifying the coding structure of the basic channel group and the coding structure of the dependent channel group.
For example, when the value of the encoding type information 1005 is 0x00, it may indicate that the encoded audio signal is an audio signal of a 3.1.2 channel layout. When the value of the encoding type information 1005 is 0x00, the audio decoding apparatuses 300 and 500 may identify that the compressed audio signal of the basic channel group included in the encoded audio signal is the audio signal A/B of the 2-channel layout, and that the compressed audio signals of the other slave channel groups are the T, P, and Q signals. When the value of the encoding type information 1005 is 0x01, it may indicate that the encoded audio signal is an audio signal of a 5.1.2 channel layout. When the value of the encoding type information 1005 is 0x01, the audio decoding apparatuses 300 and 500 may identify that the compressed audio signal of the basic channel group included in the encoded audio signal is the audio signal A/B of the 2-channel layout, and that the compressed audio signals of the other slave channel groups are the T, P, Q, and S signals.
When the value of the encoding type information 1005 is 0x02, it may indicate that the encoded audio signal is an audio signal of a 7.1.4 channel layout. When the value of the encoding type information 1005 is 0x02, the audio decoding apparatuses 300 and 500 may identify that the compressed audio signal of the basic channel group included in the encoded audio signal is the audio signal A/B of the 2-channel layout, and that the compressed audio signals of the other slave channel groups are the T, P, Q, S, U, and V signals.
When the value of the encoding type information 1005 is 0x03, it may indicate that the encoded audio signal includes an audio signal of a 3.1.2 channel layout and a surround sound audio signal. When the value of the encoding type information 1005 is 0x03, the audio decoding apparatuses 300 and 500 may identify that the compressed audio signal of the basic channel group included in the encoded audio signal is the audio signal A/B of the 2-channel layout, and that the compressed audio signals of the other slave channel groups are the T, P, and Q signals and the W, X, Y, and Z signals.
When the value of the encoding type information 1005 is 0x04, it may indicate that the encoded audio signal includes an audio signal of a 7.1.4 channel layout and a surround sound audio signal. When the value of the encoding type information 1005 is 0x04, the audio decoding apparatuses 300 and 500 may identify that the compressed audio signal of the basic channel group included in the encoded audio signal is the audio signal A/B of the 2-channel layout, and that the compressed audio signals of the other slave channel groups are the T, P, Q, S, U, and V signals and the W, X, Y, and Z signals.
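The coding-type values listed above map naturally onto a lookup table, sketched below; the tuple layout and the dictionary itself are illustrative, not a normative structure from this disclosure.

```python
# Illustrative lookup for the coding_type values described above:
# value -> (reconstructed layout, basic channel group signals,
#           slave channel group signals).
CODING_TYPE = {
    0x00: ("3.1.2", ["A", "B"], ["T", "P", "Q"]),
    0x01: ("5.1.2", ["A", "B"], ["T", "P", "Q", "S"]),
    0x02: ("7.1.4", ["A", "B"], ["T", "P", "Q", "S", "U", "V"]),
    0x03: ("3.1.2 + WXYZ", ["A", "B"],
           ["T", "P", "Q", "W", "X", "Y", "Z"]),
    0x04: ("7.1.4 + WXYZ", ["A", "B"],
           ["T", "P", "Q", "S", "U", "V", "W", "X", "Y", "Z"]),
}

layout, base_signals, dependent_signals = CODING_TYPE[0x01]
```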
The voice presence information 1010 may be information for identifying whether dialogue information exists in the audio signal of the center channel included in the media data related to the metadata header/metadata audio packet 1000. The voice specification information 1015 may indicate a specification value of the dialogue included in the audio signal of the center channel. The audio decoding apparatuses 300 and 500 may control the volume of the voice signal based on the voice specification information 1015. That is, the audio decoding apparatuses 300 and 500 may control the volume level of the ambient sound and the volume level of the dialogue sound differently, so that a clearer dialogue sound can be reconstructed. Alternatively or additionally, the audio decoding apparatuses 300 and 500 may uniformly set the volume levels of the voices included in several audio signals to a target volume based on the voice specification information 1015 and sequentially reproduce the several audio signals.
LFE presence information 1020 may be information for identifying whether LFE is present in media data associated with the metadata header/metadata audio packet 1000.
Depending on the intention of the content producer, an audio signal exhibiting the LFE may be included in a designated audio signal portion without being allocated to the center channel. Thus, when the LFE presence information is on, the audio signal of the LFE channel can be reconstructed.
When the LFE presence information is on, the LFE gain information 1025 may be information indicating the gain of the audio signal of the LFE channel. The audio decoding apparatuses 300 and 500 may output the audio signal of the LFE according to the LFE gain based on the LFE gain information 1025.
The top audio presence information 1030 may indicate whether an audio signal of a top front channel is present in media data associated with the metadata header/metadata audio packet 1000. Here, the top front channel may be a Hfl3 channel (top front left (TFL) channel) and a Hfr3 channel (top front right (TFR) channel) of a 3.1.2 channel layout.
The scale factor presence information 1035 and the scale factor information 1040 may be included in the information about the scale factors of fig. 5 a. The scale factor presence information 1035 may be information indicating whether an RMS scale factor of the audio signal of a particular channel is present. When the scale factor presence information 1035 indicates that the RMS scale factor of the audio signal of the particular channel is present, the scale factor information 1040 may be information indicating a value of the RMS scale factor of the particular channel.
The on-screen audio object presence information 1050 may be information indicating whether an audio object is present on the screen. When the on-screen audio object presence information 1050 is on, the audio decoding apparatuses 300 and 500 may identify that an audio object is present on the screen, convert the multi-channel audio signal reconstructed based on the audio signal of the basic channel group and the audio signal of the slave channel group into an audio signal of the 3D audio channel in front of the listener, and output the converted signal.
The discrete channel audio stream presence information 1055 may be information indicating whether the audio stream of the discrete channel is included in media data associated with the metadata header/metadata audio packet 1000. In this case, the discrete channels may be 5.1.2 channels or 7.1.4 channels.
The continuous channel audio stream presence information 1060 may be information indicating whether an audio stream of an audio signal (WXYZ values) of a continuous channel is included in media data related to the metadata header/metadata audio packet 1000. In this case, the audio decoding apparatuses 300 and 500 may convert the audio signal into audio signals of various channel layouts based on the audio signals of surround sound channels such as WXYZ values, regardless of the channel layouts.
Alternatively or additionally, when the on-screen audio object presence information 1050 is on, the audio decoding apparatuses 300 and 500 may convert the WXYZ values so as to emphasize the on-screen audio signal among the audio signals of the 3.1.2 channel layout.
Hereinafter, table 7 includes pseudo code (pseudo code 1) regarding the audio data structure.
TABLE 7
metadata_version [4 bits], metadata_header_length [9 bits], and the like may be sequentially included in the structure of the metadata header of pseudo code 1. metadata_version may represent the version of the metadata, and metadata_header_length may indicate the length of the metadata header. speech_exists may indicate whether dialogue audio is present in the center channel. speech_norm may indicate a specification value obtained by measuring the volume of the dialogue audio. lfe_exists may indicate whether the audio signal of the LFE channel exists. lfe_gain may indicate the gain of the audio signal of the LFE channel.
on_screen_audio_object_exists may indicate whether an audio object exists on the screen. object_s may indicate the mixing level of the channels among the 3.1.2 audio channels for the on-screen audio object. object_g may indicate the area and shape of the on-screen object relative to the center of the on-screen audio object. object_v may indicate the motion vector (dx, dy) of the on-screen object within one audio frame. object_l may indicate the position coordinates (x, y) of the on-screen object within an audio frame.
The audio_meta_data_exist may be information indicating whether basic metadata exists, whether audio metadata of a discrete channel exists, and whether audio metadata of a continuous channel exists.
The discrete_audio_metadata_offset may indicate an address of audio metadata of a discrete channel when the audio metadata of the discrete channel exists.
The continuous_audio_metadata_offset may indicate an address of audio metadata of a continuous channel when the audio metadata of the continuous channel exists.
coding_type [8 bits] and the like may be sequentially included in the structure of the metadata audio packet of pseudo code 1.
The coding type (coding type) may indicate a type of coding structure of the audio signal.
Information such as the existence of a cancellation error ratio may be sequentially included.
The cancellation error ratio presence information (3.1.2 channels) may indicate whether a cancellation error ratio (CER) of the audio signal of the 3.1.2 channel layout is present. The cancellation error ratio (3.1.2 channels) may indicate the CER of the audio signal of the 3.1.2 channel layout. Likewise, cancellation error ratio presence information (5.1.2 channels), a cancellation error ratio (5.1.2 channels), cancellation error ratio presence information (7.1.4 channels), and a cancellation error ratio (7.1.4 channels) may be included.
The discrete_audio_channel_data may indicate audio channel data of a discrete channel. The audio channel data of the discrete channels may include base_audio_channel_data and dependent_audio_channel_data.
When the value of discrete_audio_level_audio_exists is 1, base_audio_channel_data_length, dependent_audio_channel_data_length, and the like may be sequentially included in the metadata audio packet.
base_audio_channel_data_length may indicate the length of the basic audio channel data, and dependent_audio_channel_data_length may indicate the length of the dependent audio channel data.
Alternatively or additionally, base_audio_channel_data may indicate the basic audio channel data.
dependent_audio_channel_data may indicate the dependent audio channel data.
continuous_audio_channel_data may indicate the audio channel data of a continuous channel.
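The metadata header fields of pseudo code 1 can be collected into a structure such as the following sketch; Python integers carry no bit widths, so the widths from the pseudo code are kept only as comments, and the bit-level packing and unpacking is omitted. The dataclass grouping is an illustrative choice.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MetadataHeader:
    """Field-for-field sketch of the metadata header of pseudo code 1."""
    metadata_version: int            # 4 bits
    metadata_header_length: int      # 9 bits
    speech_exists: bool              # dialogue audio in the center channel
    speech_norm: int                 # measured dialogue loudness value
    lfe_exists: bool
    lfe_gain: int
    on_screen_audio_object_exists: bool
    object_s: int                    # mixing level for the on-screen object
    object_g: int                    # area/shape around the object center
    object_v: Tuple[int, int]        # motion vector (dx, dy) per frame
    object_l: Tuple[int, int]        # position coordinates (x, y)
    audio_meta_data_exist: int       # basic / discrete / continuous flags
    discrete_audio_metadata_offset: int    # valid when discrete metadata exists
    continuous_audio_metadata_offset: int  # valid when continuous metadata exists
```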
Fig. 11 is a view for describing an audio encoding apparatus according to various embodiments of the present disclosure.
The audio encoding apparatuses 200 and 400 may include a downmixing unit 1105, an audio signal classifier 1110, a compressor 1115, a decompressor 1120, and a metadata generator 1130.
The downmixing unit 1105 may obtain the audio signal of a lower channel layout by downmixing the original audio signal. In this case, the original audio signal may be an audio signal of a 7.1.4 channel layout, and the audio signal of the lower channel layout may be an audio signal of a 3.1.2 channel layout.
The audio signal classifier 1110 may classify an audio signal to be used for compression from among the audio signals of the at least one channel layout. The mixing unit 1113 may generate a mixed channel audio signal by mixing audio signals of some channels. The audio signal classifier 1110 may output a mixed channel audio signal.
For example, the mixing unit 1113 may mix the audio signals L3 and R3 of the 3.1.2 channel layout with the center channel signal C_1 of the 3.1.2 channel layout. In this case, the audio signals A and B of new mixed channels may be generated. C_1 may be a signal obtained by decompressing the compressed signal C of the center channel among the audio signals of the 3.1.2 channel layout.
That is, the signal C of the center channel among the audio signals of the 3.1.2 channel layout may be classified as the T signal. The second compressor 1117 of the compressor 1115 may obtain a compressed T audio signal by compressing the T signal, and the decompressor 1120 may obtain C_1 by decompressing the compressed T audio signal.
The compressor 1115 may compress at least one audio signal classified by the audio signal classifier 1110. The compressors 1115 may include a first compressor 1116, a second compressor 1117, and a third compressor 1118. The first compressor 1116 may compress audio signals a and B of the basic channel group and generate a basic channel audio stream 1142 including the compressed audio signals a and B. The second compressor 1117 may compress the audio signals T, P, Q and Q2 of the first slave channel group to generate a slave channel audio stream 1144 comprising the compressed audio signals T, P, Q and Q2.
The third compressor 1118 may compress the audio signals S1, S2, U1, U2, V1 and V2 of the second dependent channel group to generate a dependent channel audio stream 1144 comprising the compressed audio signals S1, S2, U1, U2, V1 and V2.
In this case, by classifying the audio signals of the L, R, C, LFE, Ls, Rs, Hfl, and Hfr channels close to the screen among the audio signals of the 7.1.4 channel layout into the audio signals S1, S2, U1, U2, V1, and V2 and compressing them, the sound quality of the audio channels in the center of the screen can be improved.
The metadata generator 1130 may generate metadata including additional information based on at least one of the audio signal or the compressed audio signal. The audio signal may include an original audio signal and an audio signal of a lower channel layout generated by downmixing the original audio signal. Metadata may be included in the metadata header 1146 of the bitstream 1140.
The mixing unit 1113 may mix the uncompressed audio signal C with L3 and R3 to generate the audio signal A and the audio signal B. However, when the audio decoding apparatuses 300 and 500 obtain L3_1 and R3_1 by unmixing the audio signals A and B mixed with the uncompressed audio signal C, the sound quality is degraded compared with the original audio signals L3 and R3.
Instead, the mixing unit 1113 may generate the audio signal A and the audio signal B by mixing C_1, obtained by decompressing the compressed C, instead of C. In this case, when the audio decoding apparatuses 300 and 500 generate L3_1 and R3_1 by unmixing the audio signals A and B mixed with the audio signal C_1, L3_1 and R3_1 may have improved sound quality compared with L3_1 and R3_1 obtained when the uncompressed audio signal C is mixed.
Fig. 12 is a diagram for describing a metadata generator according to various embodiments of the present disclosure.
Referring to fig. 12, the metadata generator 1200 may generate metadata 1250 (e.g., factor information for error cancellation, etc.) with the original audio signal, the compressed audio signals A/B, and the compressed audio signals T/P/Q and S/U/V as inputs.
The decompressor 1210 may decompress the compressed audio signals A/B, T/P/Q, and S/U/V. The upmixing unit 1215 may reconstruct an audio signal of a lower channel layout of the original channel audio signal by unmixing some of the audio signals A/B, T/P/Q, and S/U/V.
The down-mixing unit 1220 may generate an audio signal of a lower channel layout by downmixing the original audio signal. In this case, an audio signal having the same channel layout as the audio signal reconstructed by the upmixing unit 1215 may be generated.
The RMS measurement unit 1230 may measure the RMS value of the audio signal of each upmix channel reconstructed by the upmix unit 1215. The RMS measurement unit 1230 may measure the RMS value of the audio signal of each channel generated from the downmix unit 1220.
The RMS comparator 1235 may one-to-one compare the RMS value of the audio signal of the upmix channel reconstructed by the upmix unit 1215 with the RMS value of the audio signal of the channel generated by the downmix unit 1220 for each channel to generate a factor of error cancellation for each upmix channel.
The metadata generator 1200 may generate metadata 1250 that includes information about factors of error cancellation for each upmix channel.
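As a concrete illustration of the one-to-one RMS comparison performed by the RMS measurement unit 1230 and the RMS comparator 1235, the sketch below computes a per-upmix-channel factor of error cancellation; the function names are hypothetical, and the clamping of the ratio to 1 follows the discussion of fig. 15a below.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root mean square of one frame of samples (see equation 1 below)."""
    return float(np.sqrt(np.mean(np.square(x))))

def error_cancellation_factor(downmixed_ref: np.ndarray,
                              upmixed_recon: np.ndarray) -> float:
    """downmixed_ref: channel generated by the down-mixing unit 1220;
    upmixed_recon: the same channel reconstructed by the upmixing unit 1215."""
    denom = rms(upmixed_recon)
    if denom == 0.0:
        return 1.0  # assumption: keep the decoded signal unchanged
    return min(1.0, rms(downmixed_ref) / denom)
```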
The voice detector 1240 may identify whether voice is present from the audio signal C of the center channel included in the original audio signal. The metadata generator 1200 may generate metadata 1250 including voice presence information based on the identification result of the voice detector 1240.
The voice measurement unit 1242 may measure a specification value of voice from the audio signal C of the center channel included in the original audio signal. The metadata generator 1200 may generate metadata 1250 including voice specification information based on the measurement result of the voice measurement unit 1242.
LFE detector 1244 may detect LFE from the audio signal of the LFE channel included in the original audio signal. The metadata generator 1200 may generate metadata 1250 including LFE presence information based on the detection result of the LFE detector 1244.
The LFE amplitude measurement unit 1246 may measure the amplitude of the audio signal of the LFE channel included in the original audio signal. The metadata generator 1200 may generate metadata 1250 including LFE gain information based on the measurement result of the LFE amplitude measurement unit 1246.
Fig. 13 is a view for describing an audio decoding apparatus according to various embodiments of the present disclosure.
Referring to fig. 13, the audio decoding apparatuses 300 and 500 may reconstruct an audio signal of at least one channel layout with a bitstream 1300 as an input.
The first decompressor 1305 may reconstruct the A_1 (L2_1) and B_1 (R2_1) signals by decompressing the compressed audio signal of the basic channel audio 1301 included in the bitstream. The 2-channel audio rendering unit 1320 may reconstruct the audio signals L2_1 and R2_1 of a 2-channel (stereo) layout based on the reconstructed A_1 and B_1 signals.
The second decompressor 1310 may reconstruct the C_1, LFE_1, Hfl3_1, and Hfr3_1 signals by decompressing the compressed audio signal of the dependent channel audio 1302 included in the bitstream.
The audio decoding apparatuses 300 and 500 may generate the L3_2 signal by unmixing the C_1 and A_1 signals. The audio decoding apparatuses 300 and 500 may generate the R3_2 signal by unmixing the C_1 and B_1 signals.
The 3.1.2 channel audio rendering unit 1325 may output an audio signal of the 3.1.2 channel layout having the L3_2, R3_2, C_1, LFE_1, Hfl3_1, and Hfr3_1 signals as inputs. The 3.1.2 channel audio rendering unit 1325 may reconstruct the audio signal of the 3.1.2 channel layout based on the metadata included in the metadata header 1303.
The third decompressor 1315 may reconstruct the L_1 and R_1 signals by decompressing the compressed audio signal of the dependent channel audio 1302 included in the bitstream 1300.
The audio decoding apparatuses 300 and 500 may generate the Ls5_2 signal by unmixing the L3_2 and L_1 signals.
The audio decoding apparatuses 300 and 500 may generate the Rs5_2 signal by unmixing the R3_2 and R_1 signals.
The audio decoding apparatuses 300 and 500 may generate the Hl5_2 signal by unmixing the Hfl3_1 and Ls5_2 signals. The audio decoding apparatuses 300 and 500 may generate the Hr5_2 signal by unmixing the Hfr3_1 and Rs5_2 signals.
The 5.1.2 channel audio rendering unit 1330 may output an audio signal of the 5.1.2 channel layout having the C_1, LFE_1, L_1, R_1, Ls5_2, Rs5_2, Hl5_2, and Hr5_2 signals as inputs. The 5.1.2 channel audio rendering unit 1330 may reconstruct the audio signal of the 5.1.2 channel layout based on the metadata included in the metadata header 1303.
The third decompressor 1315 may reconstruct the Ls_1, Rs_1, Hfl_1, and Hfr_1 signals by decompressing the compressed audio signal of the dependent channel audio 1302 included in the bitstream 1300.
The audio decoding apparatuses 300 and 500 may generate the Lb_2 signal by unmixing the Ls5_2 and Ls_1 signals. The audio decoding apparatuses 300 and 500 may generate the Rb_2 signal by unmixing the Rs5_2 and Rs_1 signals. The audio decoding apparatuses 300 and 500 may generate the Hbl_2 signal by unmixing the Hl5_2 and Hfl_1 signals. The audio decoding apparatuses 300 and 500 may generate the Hbr_2 signal by unmixing the Hr5_2 and Hfr_1 signals.
The 7.1.4 channel audio rendering unit 1335 may output the audio signal of the 7.1.4 channel layout using the L_1, R_1, C_1, LFE_1, Ls_1, Rs_1, Hfl_1, Hfr_1, Lb_2, Rb_2, Hbl_2, and Hbr_2 signals as inputs.
The 7.1.4 channel audio rendering unit 1335 may reconstruct the audio signal of the 7.1.4 channel layout based on the metadata included in the metadata header 1303.
Fig. 14 is a view for describing a 3.1.2 channel audio rendering unit 1410, a 5.1.2 channel audio rendering unit 1420, and a 7.1.4 channel audio rendering unit 1430 according to various embodiments of the disclosure.
Referring to fig. 14, the 3.1.2 channel audio rendering unit 1410 may generate the L3_3 signal using the L3_2 signal and the L3_2 error removal factor (ERF) included in the metadata. The 3.1.2 channel audio rendering unit 1410 may generate the R3_3 signal using the R3_2 signal and the R3_2 ERF included in the metadata.
The 3.1.2 channel audio rendering unit 1410 may generate the LFE_2 signal using the LFE_1 signal and the LFE gain included in the metadata.
The 3.1.2 channel audio rendering unit 1410 may reconstruct the 3.1.2 channel audio signal including the L3_3, R3_3, C_1, LFE_2, Hfl3_1, and Hfr3_1 signals.
The 5.1.2 channel audio rendering unit 1420 may generate Ls5_3 using the Ls5_2 signal and the Ls5_2 ERF included in the metadata.
The 5.1.2 channel audio rendering unit 1420 may generate Rs5_3 using the Rs5_2 signal and the Rs5_2 ERF included in the metadata. The 5.1.2 channel audio rendering unit 1420 may generate Hl5_3 using the Hl5_2 signal and the Hl5_2 ERF included in the metadata. The 5.1.2 channel audio rendering unit 1420 may generate Hr5_3 using the Hr5_2 signal and the Hr5_2 ERF.
The 5.1.2 channel audio rendering unit 1420 may reconstruct the 5.1.2 channel audio signal including the Ls5_3, Rs5_3, Hl5_3, Hr5_3, L_1, R_1, C_1, and LFE_2 signals.
The 7.1.4 channel audio rendering unit 1430 may generate Lb_3 using the Lb_2 signal and the Lb_2 ERF.
The 7.1.4 channel audio rendering unit 1430 may generate Rb_3 using the Rb_2 signal and the Rb_2 ERF.
The 7.1.4 channel audio rendering unit 1430 may generate Hbl_3 using the Hbl_2 signal and the Hbl_2 ERF.
The 7.1.4 channel audio rendering unit 1430 may generate Hbr_3 using the Hbr_2 signal and the Hbr_2 ERF.
The 7.1.4 channel audio rendering unit 1430 may reconstruct the 7.1.4 channel audio signal including the Lb_3, Rb_3, Hbl_3, Hbr_3, L_1, R_1, C_1, LFE_2, Ls_1, Rs_1, Hfl_1, and Hfr_1 signals.
Fig. 15a is a flowchart for describing a process of determining a factor for error cancellation by an audio encoding device 400 according to various embodiments of the present disclosure.
In operation S1502, the audio encoding apparatus 400 may determine whether the original signal power of the first audio signal is less than a first value. Here, the original signal power may refer to a signal power of an original audio signal or a signal power of an audio signal downmixed from the original audio signal. That is, the first audio signal may be an audio signal of at least some channels of the original audio signal or an audio signal downmixed from the original audio signal.
In operation S1504, when the original signal power of the first audio signal is less than the first value (yes), the audio encoding apparatus 400 may determine the value of the factor of error cancellation of the first audio signal to be 0.
In operation S1506, when the original signal power of the first audio signal is equal to or greater than the first value (no), the audio encoding apparatus 400 may determine whether the original signal power ratio of the first audio signal to the second audio signal is less than the second value.
In operation S1508, when the ratio of the original signal power of the first audio signal to that of the second audio signal is less than the second value (yes), the audio encoding apparatus 400 may determine a factor of error cancellation based on the original signal power of the first audio signal and the signal power of the decoded first audio signal.
In operation S1510, the audio encoding apparatus 400 may determine whether the value of the factor of error cancellation is greater than 1.
In operation S1512, when the signal power ratio of the first audio signal and the second audio signal is equal to or greater than the second value (no), the audio encoding apparatus 400 may determine the value of the error-canceling factor of the first audio signal as 1.
Alternatively or additionally, in operation S1510, when the value of the error-canceled factor is greater than 1 (yes), the audio encoding apparatus 400 may determine the value of the error-canceled factor of the first audio signal as 1.
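A minimal sketch of the decision flow of fig. 15a follows; the threshold constants are taken from the Ls5 example of fig. 15b (-80 dB and -6 dB) and are otherwise assumptions.

```python
import math

def determine_erf(rms_orig: float, rms_ref: float, rms_decoded: float,
                  first_value_db: float = -80.0,
                  second_value_db: float = -6.0) -> float:
    """rms_orig: RMS of the first audio signal (e.g., Ls5); rms_ref: RMS of
    the second audio signal (e.g., L3); rms_decoded: RMS of the decoded
    first audio signal (e.g., Ls5_2)."""
    # S1502/S1504: the original signal is too weak to be audible
    if rms_orig == 0.0 or 20 * math.log10(rms_orig) < first_value_db:
        return 0.0
    # S1506/S1508: the original signal is much weaker than the comparison signal
    if rms_ref > 0.0 and 20 * math.log10(rms_orig / rms_ref) < second_value_db:
        factor = rms_orig / rms_decoded if rms_decoded > 0.0 else 1.0
        return min(factor, 1.0)  # S1510: values greater than 1 are clamped to 1
    # S1512: otherwise, keep the current energy of the decoded signal
    return 1.0
```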
Fig. 15b is a flowchart for describing a process of determining a scale factor of an Ls5 signal by the audio encoding apparatus 400 according to various embodiments of the present disclosure.
Referring to fig. 15b, in operation S1514, the audio encoding apparatus 400 may determine whether the power 20log(RMS(Ls5)) of the Ls5 signal is less than -80 dB. Here, the RMS value may be calculated in units of frames. For example, a frame may include, but is not limited to, 960 samples of an audio signal, and a frame may include a plurality of samples of an audio signal. The root mean square value RMS(X) of X may be calculated by equation 1, where N represents the number of samples.

RMS(X) = sqrt((x_1^2 + x_2^2 + … + x_N^2)/N) [equation 1]
In operation S1516, when the power of the Ls5 signal is less than -80 dB, the audio encoding apparatus 400 may determine the factor of error cancellation of the Ls5_2 signal as 0.
In operation S1518, the audio encoding apparatus 400 may determine whether the ratio of the power of the Ls5 signal to the power of the L3 signal for one frame, 20log(RMS(Ls5)/RMS(L3)), is less than -6 dB.
In operation S1520, when the ratio 20log(RMS(Ls5)/RMS(L3)) of the power of the Ls5 signal to the power of the L3 signal for one frame is less than -6 dB (yes), the audio encoding apparatus 400 may generate the L3_2 signal. For example, the audio encoding apparatus 400 may obtain the C signal and the L2 signal by downmixing the original audio signal, compress the C signal and the L2 signal, and obtain the C_1 signal and the L2_1 signal by decompressing the compressed C signal and L2 signal. The audio encoding apparatus 400 may generate the L3_2 signal by unmixing the C_1 and L2_1 signals.
In operation S1522, the audio encoding apparatus 400 may obtain an L_1 signal by decompressing the compressed L signal.
In operation S1524, the audio encoding apparatus 400 may generate an Ls5_2 signal based on the L3_2 signal and the L_1 signal.
In operation S1526, the audio encoding apparatus 400 may determine the factor of error cancellation, RMS(Ls5)/RMS(Ls5_2), based on the power value RMS(Ls5) of the Ls5 signal and the power value RMS(Ls5_2) of the Ls5_2 signal.
In operation S1528, the audio encoding apparatus 400 may determine whether the value of the factor of error cancellation is greater than 1.
In operation S1530, when the value of the error-canceled factor is greater than 1 (yes), the audio encoding apparatus 400 may determine the value of the error-canceled factor to be 1.
In operation S1532, the audio encoding apparatus 400 may store and output the factor of error cancellation of the Ls5_2 signal. The audio encoding apparatus 400 may generate error-cancellation-related information including information about the factor of error cancellation, and generate additional information including the error-cancellation-related information. The audio encoding apparatus 400 may generate and output a bitstream including the additional information.
Fig. 15c is a flowchart for describing a process of generating an Ls5_3 signal by the audio decoding apparatus 500 based on a factor of error cancellation according to various embodiments of the present disclosure.
In operation S1535, the audio decoding apparatus 500 may generate the L3_2 signal.
For example, the audio decoding apparatus 500 may obtain the C_1 signal and the L2_1 signal by decompressing the compressed C signal and L2 signal. The audio decoding apparatus 500 may generate the L3_2 signal by unmixing the C_1 and L2_1 signals.
In operation S1540, the audio decoding apparatus 500 may obtain the L_1 signal by decompressing the compressed L signal.
In operation S1545, the audio decoding apparatus 500 may generate an Ls5_2 signal based on the L3_2 signal and the L_1 signal. That is, the audio decoding apparatus 500 may generate the Ls5_2 signal by unmixing the L3_2 signal and the L_1 signal.
In operation S1550, the audio decoding apparatus 500 may obtain the factor of error cancellation of the Ls5_2 signal.
In operation S1555, the audio decoding apparatus 500 may generate an Ls5_3 signal by applying the factor of error cancellation to the Ls5_2 signal. An Ls5_3 signal may be generated having an RMS value (e.g., an RMS value nearly equal to that of Ls5) that is the product of the RMS value of Ls5_2 and the factor of error cancellation.
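A sketch of operations S1545 to S1555 follows, assuming the unmixing form implied by equation 9 below (L3 = L5 + δ×Ls5, hence Ls5_2 = (L3_2 - L_1)/δ); δ = 0.707 is one of the example unmixing weight parameter values, and the function name is hypothetical.

```python
import numpy as np

def reconstruct_ls5_3(l3_2: np.ndarray, l_1: np.ndarray,
                      erf: float, delta: float = 0.707) -> np.ndarray:
    ls5_2 = (l3_2 - l_1) / delta  # S1545: unmix Ls5_2 from L3_2 and L_1
    return erf * ls5_2            # S1555: scale so the RMS approaches that of Ls5
```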
In performing lossy encoding on a mixed channel audio signal obtained by mixing audio signals of a plurality of audio channels, errors may occur in the audio signal. For example, in a quantization process with respect to an audio signal, coding errors may occur in the audio signal.
In particular, when a model based on psychoacoustic characteristics is used, coding errors may occur in an encoding process (e.g., quantization) with respect to the audio signal. For example, when a strong sound and a weak sound are generated simultaneously on adjacent frequencies, a masking effect may occur, that is, a phenomenon in which a listener cannot hear the weak sound. That is, the minimum hearing limit of the weak target sound increases due to the strong interrupting sound of the adjacent frequency.
Therefore, when the audio encoding apparatus 400 performs quantization on the frequency band of weak sounds using the psychoacoustic model, the audio signal in the frequency band of weak sounds may not be encoded.
For example, when there is a masked sound (e.g., a weak sound) in the Ls5 signal and a masking sound (e.g., a strong sound) in the L signal, the L3_2 signal may be a signal in which the masked sound is substantially eliminated from the signal (L3 signal) in which the masked sound and the masking sound are mixed, due to the masking characteristics.
In some embodiments, when Ls5_2 is generated by unmixing the L3_2 signal and the L_1 signal, the Ls5_2 signal may include a masking sound having very little energy in the form of noise due to a coding error based on the masking characteristics.
The masking sound included in the Ls5_2 signal may have very small energy compared with the existing masking sound, but may have more energy than the sound to be masked. In this case, in the Ls5_2 channel where the masked sound is to be output, the masking sound having larger energy may be output. Thus, in order to reduce noise in the Ls5_2 channel, the Ls5_2 signal may be scaled to have the same signal power as the Ls5 signal including the masked sound, and errors caused by lossy encoding may be eliminated. In this case, the factor (e.g., scaling factor) of the scaling operation may be the factor of error cancellation. The factor of error cancellation may be expressed as the ratio of the original signal power of the audio signal to the signal power of the audio signal after decoding, and the audio decoding apparatus 500 may reconstruct an audio signal having the same signal power as the original signal power by performing the scaling operation on the decoded signal based on the scaling factor.
Accordingly, as the energy of the masking sound output in the form of noise in a specific channel is reduced, the listener can expect improvement in sound quality.
In some embodiments, by comparing the original signal powers of the masked sound and the masking sound, when the signal power of the masked sound is smaller than the signal power of the masking sound by more than a certain value, it may be identified that an encoding error caused by the masking phenomenon has occurred, and the factor of error cancellation may be determined to be a value between 0 and 1. For example, the ratio of the original signal power to the decoded signal power may be determined as the value of the factor of error cancellation. However, depending on the case, when the ratio is greater than 1, the value of the factor of error cancellation may be determined to be 1. That is, for a value of the factor of error cancellation greater than 1, the energy of the decoded signal would increase, but when the energy of the decoded signal, in which the masking sound is inserted in the form of noise, increases, the noise may further increase.
Thus, in this case, the value of the factor of error cancellation may be determined to be 1 to maintain the current energy of the decoded signal.
When the ratio of the signal power of the masked sound to the signal power of the masking sound is greater than or equal to a certain value, it may be identified that no encoding error caused by the masking phenomenon occurs, and the value of the error removal factor may be determined to be 1 to maintain the current energy of the decoded signal.
Accordingly, the audio encoding apparatus 200 may generate an error-canceled factor based on the signal power of the audio signal and transmit information about the error-canceled factor to the audio decoding apparatus 300. The audio decoding apparatus 300 may reduce the energy of the masking sound in the form of noise by applying the error-cancelled factor to the audio signal of the upmix channel based on the information on the error-cancelled factor to match the energy of the masked sound of the target sound.
Fig. 16a is a diagram for describing a configuration of a bitstream for channel layout extension according to various embodiments of the present disclosure.
Referring to fig. 16a, the bitstream 1600 may include a base channel audio stream 1605, a slave channel audio stream #1 1610, and a slave channel audio stream #2 1615. The base channel audio stream 1605 may include an A signal and a B signal. The audio decoding apparatuses 300 and 500 may decompress the A signal and the B signal included in the base channel audio stream 1605, and reconstruct the audio signals (L2 signal and R2 signal) of the 2-channel layout based on the decompressed A signal and B signal.
The slave channel audio stream #1 1610 may include the audio signals T, P, Q1, and Q2 of the other 4 channels of the 3.1.2 channel layout, apart from the already reconstructed 2 channels. The audio decoding apparatuses 300 and 500 may decompress the audio signals T, P, Q1, and Q2 included in the slave channel audio stream #1 1610 and reconstruct the audio signals (L3, R3, C, LFE, Hfl3, and Hfr3 signals) of the 3.1.2 channel layout based on the decompressed audio signals T, P, Q1, and Q2 and the previously decompressed A and B signals.
Alternatively or additionally, the slave channel audio stream #2 1615 may include the audio signals S1, S2, U1, U2, V1, and V2 of the 6 channels of the 7.1.4 channel layout other than the reconstructed 3.1.2 channels. The audio decoding apparatuses 300 and 500 may reconstruct the audio signals (L5, R5, Ls5, Rs5, C, LFE, Hl5, and Hr5 signals) of the 5.1.2 channel layout based on the audio signals S1, S2, U1, U2, V1, and V2 included in the slave channel audio stream #2 1615 and the previously reconstructed audio signals of the 3.1.2 channel layout.
As described above, the slave channel audio stream #2 1615 may include the audio signals of discrete channels. To expand the number of channels, as many audio signals as the number of added channels may be compressed and included in the slave channel audio stream #2 1615. Thus, as the number of channels increases, the amount of data included in the slave channel audio stream #2 1615 increases.
Fig. 16b is a diagram for describing a configuration of a bitstream for channel layout extension according to various embodiments of the present disclosure.
Referring to fig. 16b, the bitstream 1620 may include a base channel audio stream 1625, a slave channel audio stream #1 1630, and a slave channel audio stream #2 1635.
Unlike the slave channel audio stream #2 1615 of fig. 16a, the slave channel audio stream #2 1635 of fig. 16b may include an audio signal of WXYZ channels, which is a surround sound audio signal. The surround sound audio signal is an audio stream of continuous channels, and can be expressed as an audio signal of WXYZ channels even when the number of expanded channels is large. Accordingly, as the number of extended channels increases or audio signals of various channel layouts are reconstructed, the slave channel audio stream #2 1635 may include a surround sound audio signal. As described above, the audio encoding apparatuses 200 and 400 may generate additional information including information indicating whether an audio stream of discrete channels (e.g., the slave channel audio stream #2 1615 of fig. 16a) exists and information indicating whether an audio stream of continuous channels (e.g., the slave channel audio stream #2 1635 of fig. 16b) exists. Accordingly, the audio encoding apparatuses 200 and 400 can selectively generate various forms of bitstreams by considering the degree of expansion of the number of channels.
Fig. 16c is a diagram for describing a configuration of a bitstream for channel layout extension according to various embodiments of the present disclosure.
Referring to fig. 16c, the bit stream 1640 may include a base channel audio stream 1645, a slave channel audio stream #1 1650, a slave channel audio stream #2 1655, and a slave channel audio stream #3 1660. The configuration of the basic channel audio stream 1645, the slave channel audio stream #1 1650, and the slave channel audio stream #2 1655 of fig. 16c may be the same as the configuration of the basic channel audio stream 1605, the slave channel audio stream #1 1610, and the slave channel audio stream #2 1615 of fig. 16 a. Accordingly, the audio decoding apparatuses 300 and 500 may reconstruct the audio signal of the 7.1.4 channel layout based on the base channel audio stream 1645, the slave channel audio stream #1 1650, and the slave channel audio stream #2 1655.
Alternatively or additionally, the audio encoding apparatuses 200 and 400 may generate a bitstream 1640 including the slave channel audio stream #3 1660, the slave channel audio stream #3 1660 including the surround sound audio signal. Accordingly, the audio decoding apparatuses 300 and 500 can reconstruct an audio signal of a free channel layout independent of the channel layout. The audio decoding apparatuses 300 and 500 may convert the reconstructed audio signal of the free channel layout into audio signals of various discrete channel layouts.
That is, by generating a bitstream including the slave channel audio stream #3 1660, the slave channel audio stream #3 1660 further including a surround sound audio signal, the audio encoding apparatuses 200 and 400 may allow audio signals of various channel layouts to be freely reconstructed.
Fig. 17 is a diagram for describing a surround sound audio signal added to an audio signal for a 3.1.2 channel layout for channel layout expansion according to various embodiments of the present disclosure.
The audio encoding apparatuses 200 and 400 may compress the surround sound audio signal and generate a bitstream including the compressed surround sound audio signal. Thus, the channel layout may be extended from the 3.1.2 channel layout according to the surround sound audio signal.
For example, referring to fig. 17, the audio signal of the 3.1.2 channel layout may be the audio signal of channels located in front of the listener 1700. The audio encoding apparatuses 200 and 400 may obtain a surround sound audio signal as an audio signal behind the listener 1700 using a surround sound audio signal capturing apparatus such as a surround sound microphone. Alternatively or additionally, the audio encoding apparatuses 200 and 400 may obtain a surround sound audio signal as an audio signal behind the listener 1700 based on an audio signal of a channel behind the listener 1700.
For example, the Ls signal, the Rs signal, the Lb signal, the Rb signal, the Hbl signal, and the Hbr signal may be defined by theta, phi, and the audio signal S as shown in equation 2 provided below; theta and phi are shown in fig. 17.

Ls(theta, phi, S) = (100, 0, S_Ls) [equation 2]
Rs(theta, phi, S) = (250, 0, S_Rs)
Lb(theta, phi, S) = (150, 0, S_Lb)
Rb(theta, phi, S) = (210, 0, S_Rb)
Hbl(theta, phi, S) = (140, 45, S_Hbl)
Hbr(theta, phi, S) = (220, 45, S_Hbr)
The audio encoding apparatuses 200 and 400 may generate the signals W, X, Y, and Z based on equation 3 provided below. Here, N1, N2, N3, and N4 may be normalization factors, and S_x = cos(theta)*cos(phi)*S, S_y = sin(theta)*cos(phi)*S, and S_z = sin(phi)*S.
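Equation 3 itself is not reproduced in this text. The sketch below assumes a standard first-order (B-format) encoding that is consistent with the S_x, S_y, and S_z terms above and with the decoding of equation 4; the exact normative form and the normalization factors N1 through N4 remain assumptions.

```python
import numpy as np

def encode_ambisonic(channels, n1=1.0, n2=1.0, n3=1.0, n4=1.0):
    """channels: iterable of (theta_deg, phi_deg, samples) tuples per
    equation 2. Returns the W, X, Y, and Z signals. A sketch only."""
    w = x = y = z = 0.0
    for theta_deg, phi_deg, s in channels:
        t, p = np.radians(theta_deg), np.radians(phi_deg)
        w = w + s                          # omnidirectional component
        x = x + np.cos(t) * np.cos(p) * s  # S_x contribution
        y = y + np.sin(t) * np.cos(p) * s  # S_y contribution
        z = z + np.sin(p) * s              # S_z contribution
    return n1 * w, n2 * x, n3 * y, n4 * z
```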
The audio encoding apparatuses 200 and 400 may compress the surround sound audio signals W, X, Y and Z and generate bitstreams including the compressed surround sound audio signals W, X, Y and Z.
The audio decoding apparatuses 300 and 500 may obtain a bitstream including a compressed audio signal and a compressed surround sound audio signal of a 3.1.2 channel layout. The audio decoding apparatuses 300 and 500 may generate an audio signal of a 5.1.2 channel layout based on the compressed audio signal of the 3.1.2 channel layout and the compressed surround sound audio signal.
The audio decoding apparatuses 300 and 500 may generate audio signals of channels behind a listener according to equation 4 provided below based on the compressed surround sound audio signals.
Ls_1 = cos(100)*cos(0)*X + sin(100)*cos(0)*Y + sin(0)*Z + W [equation 4]
Rs_1 = cos(250)*cos(0)*X + sin(250)*cos(0)*Y + sin(0)*Z + W
Lb_1 = cos(150)*cos(0)*X + sin(150)*cos(0)*Y + sin(0)*Z + W
Rb_1 = cos(210)*cos(0)*X + sin(210)*cos(0)*Y + sin(0)*Z + W
Hbl_1 = cos(140)*cos(45)*X + sin(140)*cos(45)*Y + sin(45)*Z + W
Hbr_1 = cos(220)*cos(45)*X + sin(220)*cos(45)*Y + sin(45)*Z + W
The audio decoding apparatuses 300 and 500 may generate the C and LFE signals in the audio signal of the 5.1.2 channel layout using the C and LFE signals of the 3.1.2 channel layout.
The audio decoding apparatuses 300 and 500 may generate the Hl5, Hr5, L, R, Ls5, and Rs5 signals among the audio signals of the 5.1.2 channel layout according to equation 5.
Hl5 = Hfl3 - 0.649*(Ls_1 + 0.866*Lb_1) [equation 5]
Hr5 = Hfr3 - 0.649*(Rs_1 + 0.866*Rb_1)
L = L3 - 0.866*(Ls_1 + 0.866*Lb_1)
R = R3 - 0.866*(Rs_1 + 0.866*Rb_1)
Ls5 = Ls_1 + 0.866*Lb_1
Rs5 = Rs_1 + 0.866*Rb_1
The audio decoding apparatuses 300 and 500 may generate the C and LFE signals in the audio signal of the 7.1.4 channel layout using the C and LFE signals of the 3.1.2 channel layout.
In addition to the compressed audio signal of the 3.1.2 channel layout, the audio decoding apparatuses 300 and 500 may generate the Ls, Rs, Lb, Rb, Hbl, and Hbr signals among the audio signals of the 7.1.4 channel layout using the Ls_1, Rs_1, Lb_1, Rb_1, Hbl_1, and Hbr_1 signals obtained from the compressed surround sound audio signal.
The audio decoding apparatuses 300 and 500 may generate the Hfl, Hfr, L, and R signals among the audio signals of the 7.1.4 channel layout according to equation 6.
Hfl = Hl5 - Hbl_1 [equation 6]
Hfr = Hr5 - Hbr_1
L = L3 - 0.866*(Ls_1 + 0.866*Lb_1)
R = R3 - 0.866*(Rs_1 + 0.866*Rb_1)
In addition to the compressed audio signal of the 3.1.2 channel layout, the audio decoding apparatuses 300 and 500 may use the compressed surround sound audio signal to reconstruct an audio signal of a channel layout extended from the 3.1.2 channel layout.
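A sketch of the decoding of equation 4, reused by equations 5 and 6, is given below; the (theta, phi) pairs come from equation 2, and the helper name is hypothetical.

```python
import numpy as np

# (theta, phi) in degrees, from equation 2
ANGLES = {"Ls": (100, 0), "Rs": (250, 0), "Lb": (150, 0),
          "Rb": (210, 0), "Hbl": (140, 45), "Hbr": (220, 45)}

def decode_channel(w, x, y, z, theta_deg, phi_deg):
    """Equation 4: one loudspeaker signal from the W/X/Y/Z signals."""
    t, p = np.radians(theta_deg), np.radians(phi_deg)
    return (np.cos(t) * np.cos(p) * x + np.sin(t) * np.cos(p) * y
            + np.sin(p) * z + w)

# Example (equation 5): Ls5 = Ls_1 + 0.866*Lb_1
# ls_1 = decode_channel(W, X, Y, Z, *ANGLES["Ls"])
# lb_1 = decode_channel(W, X, Y, Z, *ANGLES["Lb"])
# ls5 = ls_1 + 0.866 * lb_1
```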
Fig. 18 is a view for describing a process of generating an object audio signal on a screen by the audio decoding apparatus 1800 based on the audio signal of the 3.1.2 channel layout and the sound source object information.
The audio decoding apparatus 1800 may convert a spatial audio signal into an on-screen audio signal based on sound source object information. Here, the sound source object information may include a mixing level signal object_s of an object on the screen, a size/shape object_g of the object, a position object_l of the object, and a direction object_v of the object.
The sound source object signal generator 1810 may generate the S, G, V, and L signals from the audio signals W, X, Y, Z, L3, R3, C, LFE, Hfl3, and Hfr3.
The sound source object signal generator 1810 may generate a signal regarding a sound source object reproduced on the screen based on the S, G, V, and L signals of the 3.1.2 channel layout and the sound source object information.
The remixing unit 1820 may generate remixed object audio signals (on-screen audio signals) S11 to Snm based on the audio signals L3, R3, C, LFE, Hfl3, and Hfr3 of the 3.1.2 channel layout and the signals regarding the sound source objects reproduced on the screen.
That is, the sound source object signal generator 1810 and the remixing unit 1820 may generate an audio signal on a screen based on sound source object information according to equation 8 provided below.
The audio decoding apparatus 1800 may improve the sound image of the on-screen sound source object by remixing the signal regarding the on-screen reproduced sound source object with the reconstructed audio signal of the 3.1.2-channel layout based on the sound source object information and the S, G, V and L signals.
Fig. 19 is a view for describing a transmission order and a rule of audio streams in each channel group of the audio encoding apparatuses 200 and 400 according to various embodiments of the present disclosure.
In a scalable format, the transmission order and rules of the audio streams in each channel group may be as follows.
The audio encoding apparatuses 200 and 400 may first transmit the coupled stream and then transmit the uncoupled stream.
The audio encoding apparatuses 200 and 400 may first transmit the coupled stream of the surround channels and then transmit the coupled stream of the overhead channels.
The audio encoding apparatuses 200 and 400 may first transmit the coupled stream of the front channel and then transmit the coupled stream of the side channel or the rear channel.
For uncoupled streams, the audio encoding apparatuses 200 and 400 may first transmit the stream of the center channel and then transmit the streams of the LFE channel and the other channel. Here, when the basic channel group includes a single channel signal, another channel may exist. In this case, the other channel may be one of the left channel L2 or the right channel R2 of the stereo channels.
The audio encoding apparatuses 200 and 400 may compress the audio signals of the coupled channels into a pair. The audio encoding apparatuses 200 and 400 may first transmit a coupled stream including audio signals compressed into a pair. For example, the coupled channels may refer to bilaterally symmetric channels, such as the L/R, Ls/Rs, Lb/Rb, Hfl/Hfr, and Hbl/Hbr channels.
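The transmission rules above can be summarized as a simple ordering; the sketch below encodes the priorities as a sort over the bilaterally symmetric pairs listed in this paragraph and is an illustration rather than the normative packing procedure.

```python
# Coupled streams first (front before side/rear, surround before overhead),
# then uncoupled streams (center first, then LFE and the other channel).
COUPLED_ORDER = ["L/R", "Ls/Rs", "Lb/Rb", "Hfl/Hfr", "Hbl/Hbr"]

def transmission_order(coupled, uncoupled):
    coupled_sorted = sorted(coupled, key=COUPLED_ORDER.index)
    def uncoupled_key(ch):
        return {"C": 0, "LFE": 1}.get(ch, 2)  # the other channel (e.g., L2 or R2) last
    return coupled_sorted + sorted(uncoupled, key=uncoupled_key)

# transmission_order(["Hfl/Hfr", "L/R"], ["LFE", "C"])
# -> ['L/R', 'Hfl/Hfr', 'C', 'LFE']
```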
Hereinafter, the stream configuration of each channel group in the bit stream 1910 of case 1 is described according to the above-described transmission order and rule of the streams in each channel group.
Referring to fig. 19, for example, the audio encoding apparatuses 200 and 400 may compress L1 and R1 signals as 2-channel audio signals, and the compressed L1 and R1 signals may be included in a C1 bitstream of a Basic Channel Group (BCG).
Immediately following the basic channel group, the audio encoding apparatuses 200 and 400 may compress the 4-channel audio signal as the audio signal of the slave channel group #1.
The audio encoding apparatuses 200 and 400 may compress the Hfl3 signal and the Hfr3 signal, and the compressed Hfl3 signal and Hfr3 signal may be included in the C2 bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the C signal, and the compressed C signal may be included in the M1 bit stream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the LFE signal, and the compressed LFE signal may be included in the M2 bit stream of the slave channel group # 1.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 3.1.2 channel layout based on compressed audio signals of the basic channel group and the slave channel group # 1.
Immediately after the slave channel group #1, the audio encoding apparatuses 200 and 400 may compress the 6-channel audio signal as the audio signal of the slave channel group #2.
The audio encoding apparatuses 200 and 400 may first compress the L signal and the R signal, and the compressed L signal and R signal may be included in the C3 bitstream of the slave channel group # 2.
Next to the C3 bitstream, the audio encoding apparatuses 200 and 400 may compress the Ls signal and the Rs signal, which may be included in the C4 bitstream of the slave channel group # 2.
Next to the C4 bitstream, the audio encoding apparatuses 200 and 400 may compress the Hfl signal and the Hfr signal, and the compressed Hfl and Hfr signals may be included in the C5 bitstream of the slave channel group # 2.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 7.1.4 channel layout based on compressed audio signals of the basic channel group, the slave channel group #1, and the slave channel group # 2.
Hereinafter, the stream configuration of each channel group in the bit stream 1920 of case 2 is described according to the above-described transmission order and rule of the streams in each channel group.
The audio encoding apparatuses 200 and 400 may compress an L2 signal and an R2 signal, which are 2-channel audio signals, and the compressed L2 signal and R2 signal may be included in a C1 bitstream of a basic channel group.
Immediately following the basic channel group, the audio encoding apparatuses 200 and 400 may compress the 6-channel audio signal as the audio signal of the slave channel group #1.
The audio encoding apparatuses 200 and 400 may first compress the L signal and the R signal, and the compressed L signal and R signal may be included in the C2 bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Ls signal and the Rs signal, and the compressed Ls signal and Rs signal may be included in the C3 bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the C signal, and the compressed C signal may be included in the M1 bit stream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the LFE signal, and the compressed LFE signal may be included in the M2 bit stream of the slave channel group # 1.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 7.1.0 channel layout based on the compressed audio signals of the basic channel group and the slave channel group #1.
Immediately after the slave channel group #1, the audio encoding apparatuses 200 and 400 may compress the 4-channel audio signal into the audio signal of the slave channel group # 2.
The audio encoding apparatuses 200 and 400 may compress the Hfl signal and the Hfr signal, and the compressed Hfl signal and Hfr signal may be included in the C4 bitstream of the slave channel group # 2.
The audio encoding apparatuses 200 and 400 may compress the Hbl signal and the Hbr signal, and the compressed Hbl and Hbr signals may be included in the C5 bitstream of the slave channel group #2.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 7.1.4 channel layout based on compressed audio signals of the basic channel group, the slave channel group #1, and the slave channel group # 2.
Hereinafter, the stream configuration of each channel group in the bit stream 1930 of case 3 is described according to the above-described transmission order and rule of the streams in each channel group.
The audio encoding apparatuses 200 and 400 may compress an L2 signal and an R2 signal, which are 2-channel audio signals, and the compressed L2 signal and R2 signal may be included in a C1 bitstream of a basic channel group.
Immediately following the basic channel group, the audio encoding apparatuses 200 and 400 may compress the 10-channel audio signal as the audio signal of the slave channel group #1.
The audio encoding apparatuses 200 and 400 may first compress the L signal and the R signal, and the compressed L signal and R signal may be included in the C2 bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Ls signal and the Rs signal, and the compressed Ls signal and Rs signal may be included in the C3 bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Hfl signal and the Hfr signal, and the compressed Hfl signal and Hfr signal may be included in the C4 bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Hbl signal and the Hbr signal, and the compressed Hbl and Hbr signals may be included in the C5 bitstream of the slave channel group #1.
The audio encoding apparatuses 200 and 400 may compress the C signal, and the compressed C signal may be included in the M1 bit stream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the LFE signal, and the compressed LFE signal may be included in the M2 bit stream of the slave channel group # 1.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 7.1.4 channel layout based on the compressed audio signals of the basic channel group and the slave channel group #1.
In some embodiments, the audio decoding apparatuses 300 and 500 may perform the unmixing in a stepwise manner using at least one upmixing unit. The unmixing may be performed based on audio signals of channels included in the at least one channel group.
For example, a 1.X to 2.X upmixing unit (first upmixing unit) may unmix the audio signal of the right channel from the audio signal of a single channel into which the right channel has been mixed.
Alternatively or additionally, the 2.X to 3.X upmixing unit (second upmixing unit) may unmix the audio signal of the center channel from the audio signals of the L2 and R2 channels corresponding to the mixed center channel. Alternatively or additionally, the 2.X to 3.X upmixing unit (second upmixing unit) may unmix the audio signal of the L3 channel and the audio signal of the R3 channel from the audio signals of the L2 and R2 channels, in which the L3 and R3 channels are mixed, and the audio signal of the C channel.
The 3.X to 5.X upmixing unit (third upmixing unit) may unmix the audio signals of the Ls5 channel and the Rs5 channel from the audio signals of the L3, R3, L(5), and R(5) channels corresponding to the mixed Ls5/Rs5 channels.
The 5.X to 7.X upmixing unit (fourth upmixing unit) may unmix the audio signal of the Lb channel and the audio signal of the Rb channel from the audio signals of the Ls5, Rs5, Ls7, and Rs7 channels corresponding to the mixed Lb/Rb channels.
The x.x.2(FH) to x.x.2(H) upmixing unit (fourth upmixing unit) may unmix the audio signals of the Hl channel and the Hr channel from the audio signals of the Hfl3, Hfr3, L3, L5, R3, and R5 channels corresponding to the mixed Hl/Hr channels.
The x.x.2(H) to x.x.4 upmixing unit (fifth upmixing unit) may unmix the audio signals of the Hbl channel and the Hbr channel from the audio signals of the Hl, Hr, Hfl, and Hfr channels corresponding to the mixed Hbl/Hbr channels.
For example, the audio decoding apparatuses 300 and 500 may perform the unmixing on the 3.2.1 channel layout using the first upmixing unit.
The audio decoding apparatuses 300 and 500 may perform the unmixing on the 7.1.4-channel layout using the second and third upmixing units for surround channels and the fourth and fifth upmixing units for overhead channels.
Alternatively or additionally, the audio decoding apparatuses 300 and 500 may perform unmixing for the 7.1.0 channel layout using the first, second, and third upmixing units. The audio decoding apparatuses 300 and 500 may not perform the unmixing from the 7.1.0 channel layout to the 7.1.4 channel layout.
Alternatively or additionally, the audio decoding apparatuses 300 and 500 may perform unmixing for the 7.1.4 channel layout using the first, second, and third upmixing units. The audio decoding apparatuses 300 and 500 may not perform the unmixing on the overhead channels.
Hereinafter, a rule for generating a channel group by the audio encoding apparatuses 200 and 400 is described. For a channel layout CLi in a scalable format (where i is an integer from 0 to n, and CLi is expressed as Si.Wi.Hi), Si+Wi+Hi may refer to the number of channels of channel group #i. The number of channels of channel group #i may be greater than the number of channels of channel group #(i-1).
Channel group #i may include as many original channels of CLi (display channels) as possible. The original channels may follow the priorities described below.
When H(i-1) is 0, the priority of the height channels may be higher than the priorities of the other channels. The priorities of the center channel and the LFE channel may precede those of the other channels.
The priority of the high front channel may precede the priority of the side channel and the high rear channel.
The priority of the side channel may be before the priority of the rear channel. Further, the priority of the left channel may precede the priority of the right channel.
For example, when n is 4, CL0 is a stereo channel, CL1 is a 3.1.2 channel, CL2 is a 5.1.2 channel, and CL3 is a 7.1.4 channel, a channel group can be generated as described below.
The audio encoding apparatuses 200 and 400 may generate a basic channel group including the A (L2) and B (R2) signals. The audio encoding apparatuses 200 and 400 may generate the slave channel group #1 including the Q1 (Hfl3), Q2 (Hfr3), T (=C), and P (=LFE) signals. The audio encoding apparatuses 200 and 400 may generate the slave channel group #2 including the S1 (=L) and S2 (=R) signals.
The audio encoding apparatuses 200 and 400 may generate the slave channel group #3 including V1 (Hfl), V2 (Hfr), U1 (Ls), and U2 (Rs) signals.
In some embodiments, the audio decoding apparatuses 300 and 500 may reconstruct a 7.1.4 channel audio signal from the decompressed audio signal using a downmix matrix. In this case, the downmix matrix may include the downmix weight parameters in table 2 as provided below.
TABLE 2
Here, cw denotes a center weight, which may be 0 when the channel layout of the basic channel group is a 3.1.2 channel layout and 1 when the channel layout of the basic channel group is a 2-channel layout. w may denote the surround-to-height mixing weight. α, β, γ, and δ may indicate the downmix weight parameters and may be variable. The audio encoding apparatuses 200 and 400 may generate bitstreams including downmix weight parameter information such as α, β, γ, δ, and w, and the audio decoding apparatuses 300 and 500 may obtain the downmix weight parameter information from the bitstreams.

On the other hand, the weight parameter information of the downmix (or unmixing) matrix may be in the form of an index. For example, the weight parameter information of the downmix (or unmixing) matrix may be index information indicating one of a plurality of downmix (or unmixing) weight parameter sets, and at least one downmix (or unmixing) weight parameter corresponding to the indicated weight parameter set may exist in the form of a lookup table (LUT). For example, the weight parameter information of the downmix (or unmixing) matrix may be information indicating one downmix (or unmixing) weight parameter set among a plurality of downmix (or unmixing) weight parameter sets, and at least one of α, β, γ, δ, or w may be predefined in the LUT corresponding to the indicated weight parameter set. Accordingly, the audio decoding apparatuses 300 and 500 may obtain α, β, γ, δ, and w corresponding to the indicated downmix (or unmixing) weight parameter set.

The matrix for downmixing from a first channel layout to a second channel layout may include a plurality of matrices. For example, it may include a first matrix for downmixing from the first channel layout to a third channel layout and a second matrix for downmixing from the third channel layout to the second channel layout. Specifically, for example, the matrix for downmixing from the audio signal of the 7.1.4 channel layout to the audio signal of the 3.1.2 channel layout may include a first matrix for downmixing from the audio signal of the 7.1.4 channel layout to the audio signal of the 5.1.4 channel layout and a second matrix for downmixing from the audio signal of the 5.1.4 channel layout to the audio signal of the 3.1.2 channel layout.
Tables 3 and 4 show a first matrix and a second matrix for downmixing from an audio signal of a 7.1.4 channel layout to an audio signal of a 3.1.2 channel layout based on content-based downmixing parameters and weights based on a surround height.
TABLE 3
TABLE 4
Here, α, β, γ, or δ represents one of the downmix parameters, and w represents the surround height weight. Likewise, A, B, or C represents one of the downmix parameters, and w represents the surround height weight. For upmixing (or unmixing) from a 5.X channel to a 7.X channel, the unmixing weight parameters α and β may be used. For upmixing from the x.x.2(H) channel to the x.x.4 channel, the unmixing weight parameter γ may be used.
For upmixing from 3.X channels to 5.X channels, a unmixed weight parameter δ may be used.
For upmixing from x.x.2 (FH) channel to x.x.2 (H) channel, the unmixed weight parameters w and δ may be used.
For upmixing from a 2.X channel to a 3.X channel, an unmixing weight parameter of -3 dB may be used. That is, the unmixing weight parameter may be a fixed value and may not be signaled.
Furthermore, for upmixing from a 1.X channel to a 2.X channel, an unmixing weight parameter of -6 dB may be used. That is, the unmixing weight parameter may be a fixed value and may not be signaled. In some embodiments, the unmixing weight parameters used for the unmixing may be parameters included in one of a plurality of types. For example, the type 1 unmixing weight parameters α, β, γ, and δ may be 0 dB, 0 dB, -3 dB, and -3 dB, respectively. The type 2 unmixing weight parameters α, β, γ, and δ may be -3 dB, -3 dB, -3 dB, and -3 dB, respectively. The type 3 unmixing weight parameters α, β, γ, and δ may be 0 dB, -1.25 dB, -1.25 dB, and -1.25 dB, respectively.
The type 1 may be a type indicating a case where the audio signal is a normal audio signal, the type 2 may be a type indicating a case where a dialog is included in the audio signal (dialog type), and the type 3 may be a type indicating a case where a sound effect exists in the audio signal (sound effect type).
The audio encoding apparatuses 200 and 400 may analyze an audio signal and determine one of the plurality of types of the audio signal according to the analysis. The audio encoding apparatuses 200 and 400 may perform downmixing on the original audio using the downmix weight parameters of the determined type to generate an audio signal of a lower channel layout.
The audio encoding apparatuses 200 and 400 may generate a bitstream including index information indicating one of the plurality of types. The audio decoding apparatuses 300 and 500 may obtain the index information from the bitstream and identify one of the plurality of types based on the obtained index information. The audio decoding apparatuses 300 and 500 may upmix the decompressed audio signals of the channel groups using the unmixing weight parameters of the identified type to reconstruct an audio signal of a specific channel layout.
Alternatively or additionally, the audio signals generated by the downmixing may be represented as equation 9 provided below. That is, the downmixing may be performed based on operations using equations of a first-degree polynomial form, and each of the downmixed audio signals may be generated.
Ls5 = α×Ls7 + β×Lb7 [equation 9]
Rs5 = α×Rs7 + β×Rb7
L3 = L5 + δ×Ls5
R3 = R5 + δ×Rs5
L2 = L3 + p2×C
R2 = R3 + p2×C
Mono = p1×(L2 + R2)
Hl = Hfl + γ×Hbl
Hr = Hfr + γ×Hbr
Hfl3 = Hl + w′×δ×Ls5
Hfr3 = Hr + w′×δ×Rs5
Here, p1 may be about 0.5 (e.g., -6 dB), and p2 may be about 0.707 (e.g., -3 dB). α and β may be values for downmixing the number of surround channels from 7 channels to 5 channels. For example, α or β may be one of 1 (e.g., 0 dB), 0.866 (e.g., -1.25 dB), or 0.707 (e.g., -3 dB). γ may be a value for downmixing the number of overhead channels from 4 channels to 2 channels. For example, γ may be one of 0.866 or 0.707. δ may be a value for downmixing the number of surround channels from 5 channels to 3 channels. δ may be one of 0.866 or 0.707. w′ may be a value for downmixing from H2 (e.g., the overhead channels of a 5.1.2 channel layout or a 7.1.2 channel layout) to Hf2 (the overhead channels of a 3.1.2 channel layout).
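A sketch of the stepwise downmix of equation 9 follows; the dictionary keys follow the symbols of equation 9, the function name is hypothetical, and the parameter defaults are example values from the text (w′ is in practice time-varying per Table 5 below).

```python
def downmix_eq9(ch, alpha=1.0, beta=1.0, gamma=0.707, delta=0.707,
                w_prime=0.25, p1=0.5, p2=0.707):
    """ch maps channel symbols (L5, R5, C, Ls7, Rs7, Lb7, Rb7, Hfl, Hfr,
    Hbl, Hbr) to sample arrays; each assignment mirrors one line of
    equation 9."""
    out = {}
    out["Ls5"] = alpha * ch["Ls7"] + beta * ch["Lb7"]
    out["Rs5"] = alpha * ch["Rs7"] + beta * ch["Rb7"]
    out["L3"] = ch["L5"] + delta * out["Ls5"]
    out["R3"] = ch["R5"] + delta * out["Rs5"]
    out["L2"] = out["L3"] + p2 * ch["C"]
    out["R2"] = out["R3"] + p2 * ch["C"]
    out["Mono"] = p1 * (out["L2"] + out["R2"])
    out["Hl"] = ch["Hfl"] + gamma * ch["Hbl"]
    out["Hr"] = ch["Hfr"] + gamma * ch["Hbr"]
    out["Hfl3"] = out["Hl"] + w_prime * delta * out["Ls5"]
    out["Hfr3"] = out["Hr"] + w_prime * delta * out["Rs5"]
    return out
```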
Also, the audio signals generated by the unmixing may be represented as equation 10. That is, the unmixing may be performed in a stepwise manner (the operation process of each equation corresponds to one unmixing process) based on operations using equations of a first-degree polynomial form, not being limited to an operation using an unmixing matrix, and each of the unmixed audio signals may be generated.
R3 = R2 - p2×C
Hl = Hfl3 - w′×(L3 - L5)
Hr = Hfr3 - w′×(R3 - R5)
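A sketch of the stepwise unmixing follows. Only three lines of equation 10 are reproduced above, so the L3, Ls5, and Rs5 steps below are inferred by inverting equation 9 and should be treated as assumptions.

```python
def unmix_step(ch, delta=0.707, w_prime=0.25, p2=0.707):
    """ch holds the decoded/transmitted signals; each assignment
    corresponds to one unmixing process."""
    ch["L3"] = ch["L2"] - p2 * ch["C"]                       # inferred counterpart of R3
    ch["R3"] = ch["R2"] - p2 * ch["C"]                       # shown above
    ch["Ls5"] = (ch["L3"] - ch["L5"]) / delta                # inferred from equation 9
    ch["Rs5"] = (ch["R3"] - ch["R5"]) / delta                # inferred from equation 9
    ch["Hl"] = ch["Hfl3"] - w_prime * (ch["L3"] - ch["L5"])  # shown above
    ch["Hr"] = ch["Hfr3"] - w_prime * (ch["R3"] - ch["R5"])  # shown above
    return ch
```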
w' may be a value for downmixing from H2 (e.g., an overhead channel of a 5.1.2 channel layout or a 7.1.2 channel layout) to Hf2 (an overhead channel of a 3.1.2 channel layout) or for unmixing from Hf2 (an overhead channel of a 3.1.2 channel layout) to H2 (e.g., an overhead channel of a 5.1.2 channel layout or a 7.1.2 channel layout).
A corresponding value sum_w and the value of w′ may be updated according to w. w may be -1 or 1 and may be transmitted for each frame.
For example, the initial value of sum_w may be 0. When w is 1 for a frame, sum_w may be increased by 1, and when w is -1 for a frame, sum_w may be decreased by 1. When increasing or decreasing sum_w by 1 would take its value outside the range of 0 to 10, the value of sum_w may be kept at 0 or 10. Table 5, showing the relationship between w′ and sum_w, may be as follows. That is, w′ may be updated gradually for each frame and thus may be used for the unmixing from Hf2 to H2.
TABLE 5

sum_w    0    1        2        3        4        5        6        7        8        9        10
w′       0    0.0179   0.0391   0.0658   0.1038   0.25     0.3962   0.4342   0.4609   0.4821   0.5
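The per-frame update of sum_w and w′ can be sketched as follows; the lookup table reproduces Table 5, where the value for sum_w = 7 (0.4342) is inferred from the symmetry of the table because it is garbled in the source text.

```python
W_PRIME_LUT = [0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
               0.3962, 0.4342, 0.4609, 0.4821, 0.5]  # indexed by sum_w (0..10)

def update_w_prime(sum_w: int, w: int):
    """w is -1 or 1, transmitted for each frame; sum_w is kept in [0, 10]."""
    sum_w = max(0, min(10, sum_w + w))
    return sum_w, W_PRIME_LUT[sum_w]
```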
Without being limited thereto, the unmixing may be performed by integrating a plurality of unmixing processes. For example, the signal of the Ls5 channel or the Rs5 channel, unmixed from the 2 surround channels of L2 and R2, may be expressed as equation 11, which combines the second to fifth equations of equation 10.
The signal of the Hl channel or the Hr channel, unmixed from the 2 surround channels of L2 and R2, may be expressed as equation 12, which combines the second and third equations and the eighth and ninth equations of equation 10.
In some embodiments, the gradual downmixing of the surround channel and the overhead channel may have a mechanism as shown in fig. 23.
The downmix-related information (or unmixing-related information) may be index information indicating one of a plurality of modes based on combinations of preset downmix weight parameters (or unmixing weight parameters). For example, as shown in Table 7, the downmix weight parameters corresponding to the plurality of modes may be predetermined.
TABLE 7

Mode    Downmix weight parameters (α, β, γ, δ, w) (or unmixing weight parameters)
1       (1, 1, 0.707, 0.707, -1)
2       (0.707, 0.707, 0.707, 0.707, -1)
3       (1, 0.866, 0.866, 0.866, -1)
4       (1, 1, 0.707, 0.707, 1)
5       (0.707, 0.707, 0.707, 0.707, 1)
6       (1, 0.866, 0.866, 0.866, 1)
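Signaling a mode index instead of the individual parameters keeps the bitstream overhead small; a sketch of the corresponding demapping per Table 7:

```python
# Table 7: mode index -> (alpha, beta, gamma, delta, w)
DOWNMIX_PARAM_MODES = {
    1: (1, 1, 0.707, 0.707, -1),
    2: (0.707, 0.707, 0.707, 0.707, -1),
    3: (1, 0.866, 0.866, 0.866, -1),
    4: (1, 1, 0.707, 0.707, 1),
    5: (0.707, 0.707, 0.707, 0.707, 1),
    6: (1, 0.866, 0.866, 0.866, 1),
}

alpha, beta, gamma, delta, w = DOWNMIX_PARAM_MODES[3]  # e.g., a signaled mode of 3
```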
Fig. 20a is a flow chart of an audio processing method according to various embodiments of the present disclosure.
In operation S2002, the audio decoding apparatus 500 may obtain at least one compressed audio signal of a basic channel group from a bitstream. In operation S2004, the audio decoding apparatus 500 may obtain at least one compressed audio signal of at least one slave channel group from the bitstream.
In operation S2006, the audio decoding apparatus 500 may obtain information on factors of error cancellation of one upmix channel of the upmix channel group from the bitstream.
In operation S2008, the audio decoding apparatus 500 may reconstruct an audio signal of a basic channel group by decompressing at least one compressed audio signal of the basic channel group.
In operation S2010, the audio decoding apparatus 500 may reconstruct at least one audio signal of at least one slave channel group by decompressing at least one compressed audio signal of at least one slave channel group.
In operation S2012, the audio decoding apparatus 500 may generate an audio signal of an upmix channel group based on at least one audio signal of the basic channel group and at least one audio signal of the at least one slave channel group.
In operation S2014, the audio decoding apparatus 500 may reconstruct an audio signal of one upmix channel based on the audio signal of one upmix channel of the upmix channel group and the error-canceled factor.
The audio decoding apparatus 500 may reconstruct a multi-channel audio signal including at least one audio signal of one upmix channel of the upmix channel group reconstructed by applying the error-canceling factor, and audio signals of other channels of the upmix channel group. That is, the factor of error cancellation may not be applicable to some audio signals of other channels.
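For illustration only, applying the factor to one upmixed channel could be a per-frame scaling. The sketch below assumes, as one reading of this disclosure, that the reconstructed signal power equals the product of the unmixed signal power and the factor; all names are hypothetical:

```python
import numpy as np

def apply_error_cancellation(upmixed_frame: np.ndarray, factor: float) -> np.ndarray:
    """Scale one frame of an upmixed channel so that its power becomes
    factor * (power of the upmixed frame), attenuating coding noise."""
    return np.sqrt(factor) * upmixed_frame

# Usage: only the designated upmix channel is scaled; the other channels of
# the upmix channel group are used as reconstructed.
frame = np.random.randn(960).astype(np.float32)   # one decoded frame (example)
reconstructed = apply_error_cancellation(frame, factor=0.8)
```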
Fig. 20b is a flow chart of an audio processing method according to various embodiments of the present disclosure.
In operation S2022, the audio decoding apparatus 500 may obtain a second audio signal downmixed from the at least one first audio signal from the bitstream.
In operation S2024, the audio decoding apparatus 500 may obtain error cancellation-related information of the first audio signal from the bitstream.

In operation S2026, the audio decoding apparatus 500 may reconstruct the first audio signal by applying the error cancellation-related information to the upmixed first audio signal.
Fig. 20c is a flow chart of an audio processing method according to various embodiments of the present disclosure.
In operation S2052, the audio encoding apparatus 400 may obtain at least one audio signal of a basic channel group and an audio signal of at least one slave channel group by downmixing an original audio signal based on a specific channel layout.
In operation S2054, the audio encoding apparatus 400 may generate at least one compressed audio signal of the basic channel group by compressing at least one audio signal of the basic channel group.
In operation S2056, the audio encoding apparatus 400 may generate at least one compressed audio signal of at least one slave channel group by compressing at least one audio signal of at least one slave channel group.
In operation S2058, the audio encoding apparatus 400 may generate a basic channel reconstruction signal by decompressing at least one compressed audio signal of the basic channel group.
In operation S2060, the audio encoding apparatus 400 may generate a slave channel reconstruction signal by decompressing the at least one compressed audio signal of the at least one slave channel group.
In operation S2062, the audio encoding apparatus 400 may obtain a first audio signal of one upmix channel of the upmix channel group by upmixing the base channel reconstruction signal and the slave channel reconstruction signal.
In operation S2064, the audio encoding apparatus 400 may obtain a second audio signal of one channel from the original audio signal, or may obtain it by downmixing the original audio signal.
In operation S2066, the audio encoding apparatus 400 may obtain a scale factor of an upmix channel based on the power value of the first audio signal and the power value of the second audio signal. Here, the upmix channel of the first audio signal and the channel of the second audio signal may indicate the same channel in a particular channel layout.
In operation S2068, the audio encoding apparatus 400 may generate a bitstream including at least one compressed audio signal of a basic channel group, at least one compressed audio signal of at least one slave channel group, and error cancellation-related information of one upmix channel.
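As a hedged sketch of operation S2066, the scale factor could be derived from the power values along the lines described for the factor for error cancellation elsewhere in this disclosure; the thresholds below (eps, ratio_threshold) are hypothetical design parameters:

```python
def error_cancellation_factor(p_original: float, p_decoded: float,
                              p_downmix: float,
                              eps: float = 1e-9,
                              ratio_threshold: float = 0.5) -> float:
    """Compute a scale factor in [0, 1] for one upmix channel.

    p_original: power of the original channel signal
    p_decoded:  power of the channel after downmixing, coding, and upmixing
    p_downmix:  power of the downmixed signal containing the channel
    """
    if p_original <= eps:
        return 0.0                        # near-silent original: mute channel
    if p_downmix <= eps or p_original / p_downmix >= ratio_threshold:
        return 1.0                        # channel dominates the downmix
    if p_decoded <= eps:
        return 1.0                        # avoid division by a vanishing power
    return min(1.0, p_original / p_decoded)  # clamp the power ratio to <= 1
```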
Fig. 20d is a flow chart of an audio processing method according to various embodiments of the present disclosure.
In operation S2072, the audio encoding apparatus 400 may generate a second audio signal by downmixing at least one first audio signal.
In operation S2074, the audio encoding apparatus 400 may generate error cancellation-related information of the first audio signal using at least one of an original signal power of the first audio signal or a signal power of the first audio signal after decoding.

In operation S2076, the audio encoding apparatus 400 may transmit the error cancellation-related information of the first audio signal and the downmixed second audio signal.
Fig. 21 is a diagram for describing a process in which an audio encoding apparatus transmits metadata through an LFE signal using a first neural network and an audio decoding apparatus obtains the metadata from the LFE signal using a second neural network, according to various embodiments of the present disclosure.
Referring to fig. 21, the audio encoding apparatus 2100 may obtain an a/B/T/Q/S/U/V audio signal by downmixing a channel signal L/R/C/Ls/Rs/Lb/Rb/Hfl/Hfr/Hbl/Hbr/W/X/Y/Z based on mixing related information (downmix related information) using a downmix unit 2105.
The audio encoding apparatus 2100 may obtain a P signal using the first neural network 2110, which takes the LFE signal and metadata as inputs. That is, the metadata may be embedded in the LFE signal using the first neural network. Here, the metadata may include speech specification information, information about the factor for error cancellation (e.g., CER), information about objects on a screen, and mixing-related information.
The audio encoding apparatus 2100 may generate a compressed A/B/T/Q/S/U/V signal using the first compressor 2115 with the A/B/T/Q/S/U/V signal as an input.

The audio encoding apparatus 2100 may generate a compressed P signal using the second compressor 2115 with the P signal as an input.
The audio encoding apparatus 2100 may generate a bitstream including a compressed a/B/T/Q/S/U/V signal and a compressed P signal using the packetizer 2120. In this case, the bitstream may be packetized. The audio encoding apparatus 2100 may transmit the packetized bitstream to the audio decoding apparatus 2150.
The audio decoding apparatus 2150 may receive the packetized bitstream from the audio encoding apparatus 2100.
The audio decoding apparatus 2150 may obtain the compressed A/B/T/Q/S/U/V signal and the compressed P signal from the packetized bitstream using the unpacker 2155.

The audio decoding apparatus 2150 may obtain the A/B/T/Q/S/U/V signal from the compressed A/B/T/Q/S/U/V signal using the first decompressor 2160.

The audio decoding apparatus 2150 may obtain the P signal from the compressed P signal using the second decompressor 2165.

The audio decoding apparatus 2150 may reconstruct the channel signals from the A/B/T/Q/S/U/V signal based on the (de)mixing-related information using the upmixing unit 2170. The channel signals may be at least one of the L/R/C/Ls/Rs/Lb/Rb/Hfl/Hfr/Hbl/Hbr/W/X/Y/Z signals. The second neural network 2180 may be used to obtain the (de)mixing-related information.
The audio decoding apparatus 2150 may obtain the LFE signal from the P signal using the low pass filter 2175.
The audio decoding apparatus 2150 may obtain an enable signal from the P signal using the high frequency detector 2185.
The audio decoding apparatus 2150 may determine whether to use the second neural network 2180 based on the enable signal.
When it is determined that the second neural network 2180 is to be used, the audio decoding apparatus 2150 may obtain the metadata from the P signal using the second neural network 2180. The metadata may include speech specification information, information about the factor for error cancellation (e.g., CER), information about objects on a screen, and (de)mixing-related information.
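A minimal sketch of this decoder-side gating might look as follows, assuming the LFE band ends near 120 Hz and treating any significant energy above that band as the enable condition (all names, the cutoff, and the threshold are hypothetical):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_p_signal(p: np.ndarray, fs: int = 48000, cutoff: float = 120.0,
                   enable_threshold: float = 1e-4):
    """Recover the LFE signal from the P signal and decide whether the
    second neural network should be run to extract embedded metadata."""
    sos_lp = butter(4, cutoff, btype="lowpass", fs=fs, output="sos")
    sos_hp = butter(4, cutoff, btype="highpass", fs=fs, output="sos")
    lfe = sosfilt(sos_lp, p)              # role of the low pass filter 2175
    residual = sosfilt(sos_hp, p)         # input to high frequency detector 2185
    enable = float(np.mean(residual ** 2)) > enable_threshold
    return lfe, enable
```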
The parameters of the first neural network 2110 and the second neural network 2180 may be obtained through independent training or may be obtained through joint training, but are not limited thereto. The parameter information of the pre-trained first and second neural networks 2110 and 2180 may be received from separate training devices, and the first and second neural networks 2110 and 2180 may be set based on the parameter information, respectively.
Each of the first neural network 2110 and the second neural network 2180 may select one of a plurality of training parameter sets. For example, the first neural network 2110 may be set based on a parameter set selected from among a plurality of training parameter sets. The audio encoding apparatus 2100 may transmit index information indicating one parameter set selected from among the plurality of parameter sets of the first neural network 2110 to the audio decoding apparatus 2150. The audio decoding apparatus 2150 may select one parameter set from among a plurality of parameter sets of the second neural network 2180 based on the index information. The set of parameters selected by the audio decoding apparatus 2150 for the second neural network 2180 may correspond to the set of parameters selected by the audio encoding apparatus 2100 for the first neural network 2110. The plurality of parameter sets of the first neural network and the plurality of parameter sets of the second neural network 2180 may have a one-to-one correspondence, but may also have a one-to-many or many-to-one correspondence, not limited thereto. In case of one-to-many correspondence, additional index information may be transmitted from the audio encoding apparatus 2100. Alternatively or additionally, the audio encoding apparatus 2100 may transmit index information indicating one of the plurality of parameter sets of the second neural network 2180 instead of index information indicating one of the plurality of parameter sets of the first neural network 2110.
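A sketch of this index-based pairing (the registry structure and names are hypothetical; in practice each entry would hold trained network weights):

```python
# Hypothetical registries of trained parameter sets, paired by index.
ENCODER_PARAM_SETS = {0: "enc_params_set_0", 1: "enc_params_set_1"}
DECODER_PARAM_SETS = {0: "dec_params_set_0", 1: "dec_params_set_1"}

def configure_decoder_network(index_from_bitstream: int) -> str:
    """Select the decoder-side parameter set matching the encoder's choice,
    assuming a one-to-one correspondence between the two registries."""
    return DECODER_PARAM_SETS[index_from_bitstream]
```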
Fig. 22a is a flow chart of an audio processing method according to various embodiments of the present disclosure.
In operation S2205, the audio decoding apparatus 2150 may obtain a second audio signal downmixed from the at least one first audio signal from the bitstream.
In operation S2210, the audio decoding apparatus 2150 may obtain an audio signal of the LFE channel from the bitstream.
In operation S2215, from the obtained audio signal of the LFE channel, the audio decoding apparatus 2150 may obtain information related to error cancellation of the first audio signal using a neural network (e.g., the second neural network 2180) for obtaining additional information.

In operation S2220, the audio decoding apparatus 2150 may reconstruct the first audio signal by applying the error cancellation-related information to the first audio signal upmixed from the second audio signal.
Fig. 22b is a flow chart of an audio processing method according to various embodiments of the present disclosure.
In operation S2255, the audio encoding apparatus 2100 may generate the second audio signal by downmixing the at least one first audio signal.
In operation S2260, the audio encoding apparatus 2100 may generate error cancellation-related information of the first audio signal using at least one of an original signal power of the first audio signal or a signal power of the first audio signal after decoding.

In operation S2265, the audio encoding apparatus 2100 may generate an audio signal of the LFE channel from the error cancellation-related information using a neural network (e.g., the first neural network 2110) for generating the audio signal of the LFE channel.

In operation S2270, the audio encoding apparatus 2100 may transmit the downmixed second audio signal and the audio signal of the LFE channel.
According to various embodiments of the present disclosure, an audio encoding apparatus may generate a factor for error cancellation based on signal powers of audio signals and transmit information about the factor to an audio decoding apparatus. By applying the factor to the audio signal of an upmix channel based on this information, the audio decoding apparatus can reduce the energy of the noise-like masking sound to match the energy of the sound masked by the target sound.
In some embodiments, the above-described embodiments of the present disclosure may be written as programs or instructions executable on a computer, and the programs or instructions may be stored in a storage medium.
The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, the term "non-transitory storage medium" simply means that the storage medium is a tangible device and does not include signals (e.g., electromagnetic waves), but the term does not distinguish between a case where data is semi-permanently stored in the storage medium and a case where data is temporarily stored in the storage medium. For example, a "non-transitory storage medium" may include a buffer that temporarily stores data.
According to various embodiments of the present disclosure, methods according to the embodiments disclosed herein may be included in a computer program product and provided therein. The computer program product may be traded as a product between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium, such as a compact disc read-only memory (CD-ROM), distributed online (e.g., downloaded or uploaded) via an application store (e.g., PlayStore™), or distributed directly between two user devices (e.g., smartphones). At least a portion of the computer program product (e.g., a downloadable application) may be at least temporarily stored or generated during online distribution in a machine-readable storage medium, such as a memory of a manufacturer's server, a server of an application store, or a relay server.
In some embodiments, the model associated with the neural network described above may be implemented as a software module. When implemented as software modules (e.g., program modules including instructions), the neural network model may be stored on a computer-readable recording medium.
Alternatively or additionally, the neural network model may be integrated in the form of a hardware chip and may be part of the apparatus and display device described above. For example, the neural network model may be made in the form of a dedicated hardware chip for artificial intelligence, or as part of a conventional general purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU).
Alternatively or additionally, the neural network model may be provided in the form of downloadable software. The computer program product may comprise a product in the form of a software program (e.g., a downloadable application) that is distributed electronically through a manufacturer or electronic marketplace. For electronic distribution, at least a portion of the software program may be stored in a storage medium or temporarily generated. In this case, the storage medium may be a server of a manufacturer or an electronic market, or a storage medium of a relay server.
The technical spirit of the present disclosure has been described in detail with reference to exemplary embodiments, but the present disclosure is not limited to the above-described embodiments, and various changes and modifications may be made by one of ordinary skill in the art within the technical spirit of the present disclosure.

Claims (15)

1. An audio processing method, comprising:
generating a second audio signal by downmixing the at least one first audio signal;
generating first information related to error cancellation of the at least one first audio signal using at least one of an original signal power of the at least one first audio signal or a second signal power of the at least one first audio signal after decoding; and
transmitting the first information related to error cancellation of the at least one first audio signal and the downmixed second audio signal.
2. The audio processing method of claim 1, wherein the first information related to error cancellation of the at least one first audio signal includes second information about a factor for error cancellation, and
wherein generating first information related to error cancellation of the at least one first audio signal comprises generating second information about a factor for error cancellation, the second information indicating that the factor for error cancellation has a value of 0, when an original signal power of the at least one first audio signal is less than or equal to a first value.
3. The audio processing method of claim 1, wherein the first information related to error cancellation of the at least one first audio signal includes second information about a factor for error cancellation, and
wherein generating first information related to error cancellation of the at least one first audio signal comprises generating second information about a factor for error cancellation based on the original signal power of the at least one first audio signal and the decoded second signal power of the at least one first audio signal when a first ratio of the original signal power of the at least one first audio signal to the original signal power of the second audio signal is smaller than a second value.
4. The audio processing method of claim 3, wherein generating second information about the factor for error cancellation comprises generating second information about the factor for error cancellation, the second information indicating that the value of the factor for error cancellation is a second ratio of an original signal power of the at least one first audio signal to a second signal power of the at least one first audio signal after decoding.
5. The audio processing method of claim 4, wherein generating the second information about the factor for error cancellation comprises generating the second information about the factor for error cancellation when a second ratio of an original signal power of the at least one first audio signal to a second signal power of the at least one first audio signal after decoding is greater than 1, the second information indicating that the factor for error cancellation has a value of 1.
6. The audio processing method of claim 1, wherein the first information related to error cancellation of the at least one first audio signal includes second information about a factor for error cancellation, and
wherein generating the first information related to error cancellation of the at least one first audio signal comprises generating second information about a factor for error cancellation when a ratio of an original signal power of the at least one first audio signal to an original signal power of a second audio signal is greater than or equal to a second value, the second information indicating that the factor for error cancellation has a value of 1.
6. The audio processing method of claim 1, wherein the first information related to error cancellation of the at least one first audio signal includes second information about a factor for error cancellation, and
wherein generating the first information related to error cancellation of the at least one first audio signal comprises generating second information about a factor for error cancellation when a ratio of an original signal power of the at least one first audio signal to an original signal power of the second audio signal is greater than or equal to a second value, the second information indicating that the factor for error cancellation has a value of 1.
7. The audio processing method of claim 1, wherein generating the first information related to error cancellation of the at least one first audio signal comprises
generating the first information related to error cancellation of the at least one first audio signal for each frame of the second audio signal.
8. The audio processing method of claim 1, wherein the downmixed second audio signal includes a third audio signal of a basic channel group and a fourth audio signal of a slave channel group,
wherein the fourth audio signal of the slave channel group comprises a fifth audio signal of a first slave channel, comprising a sixth audio signal of an independent channel included in a first three-dimensional (3D) audio channel in front of a listener, and
wherein a seventh audio signal of a second 3D audio channel at the side of and behind the listener has been obtained by mixing the fifth audio signal of the first slave channel.
9. The audio processing method of claim 8, wherein the third audio signal of the basic channel group includes an eighth audio signal of a second channel and a ninth audio signal of a third channel,
wherein the third audio signal of the basic channel group comprises a fifth audio signal of the stereo channel,
wherein transmitting the first information related to error cancellation of the at least one first audio signal and the downmixed second audio signal comprises:
generating a bitstream comprising first information related to error cancellation of the at least one first audio signal and second information related to a downmixed second audio signal, and
transmitting the bit stream, and
wherein the generating of the bit stream comprises:
generating a base channel audio stream comprising a compressed fifth audio signal of a stereo channel, and
generating a plurality of slave channel audio streams comprising a plurality of audio signals of a plurality of slave channel groups, and
wherein the plurality of slave channel audio streams includes a first slave channel audio stream and a second slave channel audio stream, and
Wherein when for a first multi-channel audio signal for generating a base channel audio stream and a first slave channel audio stream, the first number of surround channels is S n-1 The second number of subwoofer channels is W n-1 The third number of overhead channels is H n-1 And for a second multi-channel audio signal for generating a first and a second slave channel audio stream, the fourth number of surround channels is S n The fifth number of subwoofer channels is W n The sixth number of overhead channels is H n
S n-1 Less than or equal to S n ,W n-1 Less than or equal to W n ,H n-1 Less than or equal to H n But S is n-1 、W n-1 And H n-1 All respectively not equal to S n 、W n And H n
11. An audio processing method, comprising:
obtaining a second audio signal downmixed from the at least one first audio signal from the bitstream;
obtaining first information related to error cancellation of the at least one first audio signal from the bitstream;
unmixed the at least one first audio signal from the downmix second audio signal; and
reconstructing the at least one first audio signal by mixing first information related to error cancellation of the at least one first audio signal to the at least one first audio signal which is unmixed,
Wherein first information related to error cancellation of the at least one first audio signal has been generated using at least one of an original signal power of the at least one first audio signal or a second signal power of the at least one first audio signal after decoding.
12. The audio processing method of claim 11, wherein the first information related to error cancellation of the at least one first audio signal includes second information about a factor for error cancellation, and
wherein the factor for error cancellation is greater than or equal to 0 and less than or equal to 1.
13. The audio processing method of claim 11, wherein the reconstructing of the at least one first audio signal includes reconstructing the at least one first audio signal to have a third signal power that is equal to a product of a fourth signal power of the unmixed at least one first audio signal and an error cancellation factor.
14. The audio processing method of claim 11, wherein the bitstream includes second information regarding a third audio signal of the basic channel group and third information regarding a fourth audio signal of the dependent channel group,
Wherein the third audio signal of the basic channel set is obtained by decoding the second information on the third audio signal of the basic channel set included in the bitstream without being unmixed with the other audio signal of the other channel set, and
wherein the audio processing method further comprises reconstructing a fifth audio signal of an upmix channel group comprising at least one upmix channel by unmixing with a third audio signal of a base channel group using a fourth audio signal of a dependent channel group.
15. An audio processing apparatus comprising:
a memory storing one or more instructions; and
at least one processor communicatively coupled to the memory and configured to execute the one or more instructions to:
a second audio signal downmixed from the at least one first audio signal is obtained from the bitstream,
information relating to error cancellation of the at least one first audio signal is obtained from the bitstream,
unmixed the at least one first audio signal from the downmix second audio signal, and
reconstructing the at least one first audio signal by applying information related to error cancellation of the at least one first audio signal to the at least one first audio signal unmixed from a second audio signal, and
Wherein information related to error cancellation of the at least one first audio signal has been generated using at least one of an original signal power of the at least one first audio signal or a second signal power of the at least one first audio signal after decoding.
CN202280011393.9A 2021-01-25 2022-01-25 Apparatus and method for processing multi-channel audio signal Pending CN116917985A (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
KR10-2021-0010435 2021-01-25
KR10-2021-0011914 2021-01-27
KR10-2021-0069531 2021-05-28
KR10-2021-0072326 2021-06-03
KR1020210140579A KR20220107913A (en) 2021-01-25 2021-10-20 Apparatus and method of processing multi-channel audio signal
KR10-2021-0140579 2021-10-20
PCT/KR2022/001314 WO2022158943A1 (en) 2021-01-25 2022-01-25 Apparatus and method for processing multichannel audio signal

Publications (1)

Publication Number Publication Date
CN116917985A true CN116917985A (en) 2023-10-20

Family

ID=88361356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280011393.9A Pending CN116917985A (en) 2021-01-25 2022-01-25 Apparatus and method for processing multi-channel audio signal

Country Status (1)

Country Link
CN (1) CN116917985A (en)

Similar Documents

Publication Publication Date Title
JP6778781B2 (en) Dynamic range control of encoded audio extended metadatabase
US9761229B2 (en) Systems, methods, apparatus, and computer-readable media for audio object clustering
JP6239110B2 (en) Apparatus and method for efficient object metadata encoding
US9479886B2 (en) Scalable downmix design with feedback for object-based surround codec
US7848931B2 (en) Audio encoder
US10075802B1 (en) Bitrate allocation for higher order ambisonic audio data
JP2008517339A (en) Energy-adaptive quantization for efficient coding of spatial speech parameters
KR20100024477A (en) A method and an apparatus for processing an audio signal
US20220286799A1 (en) Apparatus and method for processing multi-channel audio signal
CN108780647B (en) Method and apparatus for audio signal decoding
US20200120438A1 (en) Recursively defined audio metadata
US20190392846A1 (en) Demixing data for backward compatible rendering of higher order ambisonic audio
CN114945982A (en) Spatial audio parametric coding and associated decoding
CN115580822A (en) Spatial audio capture, transmission and reproduction
JP6686015B2 (en) Parametric mixing of audio signals
US20240087580A1 (en) Three-dimensional audio signal coding method and apparatus, and encoder
US20230360665A1 (en) Method and apparatus for processing audio for scene classification
US10224043B2 (en) Audio signal processing apparatuses and methods
CN114005454A (en) Internal sound channel processing method and device for realizing low-complexity format conversion
KR20220107913A (en) Apparatus and method of processing multi-channel audio signal
CN112823534B (en) Signal processing device and method, and program
CN114008704A (en) Encoding scaled spatial components
US20240153512A1 (en) Audio codec with adaptive gain control of downmixed signals
CN116917985A (en) Apparatus and method for processing multi-channel audio signal
JP7208385B2 (en) Apparatus, method and computer program for encoding spatial metadata

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination