CN116762128A - Audio processing apparatus and method - Google Patents

Audio processing apparatus and method

Info

Publication number
CN116762128A
Authority
CN
China
Prior art keywords
channel
audio signal
audio
channels
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280011465.XA
Other languages
Chinese (zh)
Inventor
南佑铉
高祥铁
金敬来
金正奎
孙允宰
李泰美
郑铉权
黄盛熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020210138834A (KR20220108704A)
Application filed by Samsung Electronics Co Ltd
Priority claimed from PCT/KR2022/001496 (WO2022164229A1)
Publication of CN116762128A

Landscapes

  • Stereophonic System (AREA)

Abstract

An audio processing apparatus may obtain a second audio signal corresponding to a channel included in a second channel group from a first audio signal corresponding to a channel included in a first channel group, downsample, by using an artificial intelligence (AI) model, at least one third audio signal corresponding to at least one channel identified from among the channels included in the first channel group based on a correlation with the second channel group, and generate a bitstream including the second audio signal corresponding to the channel included in the second channel group and the downsampled at least one third audio signal. The first channel group is the channel group of the original audio signal, and the second channel group is constructed by combining at least two of the channels included in the first channel group.

Description

Audio processing apparatus and method
Technical Field
The present disclosure relates to an apparatus and method for processing audio. More particularly, the present disclosure relates to an apparatus and method for encoding an audio signal or an apparatus and method for decoding an audio signal.
Background
As technology advances, audio-visual devices including larger, clearer displays and a plurality of speakers have come into wide use. Meanwhile, research into video encoding techniques for transmitting and receiving more vivid images, and into audio encoding techniques for transmitting and receiving more realistic and immersive audio signals, has been actively conducted. For example, an immersive audio signal may be encoded by a codec conforming to a predetermined compression standard, such as the Advanced Audio Coding (AAC) standard or the Opus standard, and then stored in a recording medium or transmitted in the form of a bitstream via a communication channel. A playback device may reproduce the immersive audio signal by decoding a bitstream generated according to the predetermined compression standard.
Audio content may be reproduced through various channel layouts depending on the environment in which the content is consumed. For example, audio content may be reproduced through a 2-channel, 3.1-channel, or 3.1.2-channel layout implemented with speakers mounted on a display device such as a television; through a 5.1-channel, 5.1.2-channel, 7.1-channel, or 7.1.4-channel layout implemented with a plurality of speakers arranged around a listener; or via a sound output device such as headphones.
In particular, as over-the-top (OTT) services expand, television resolution increases, and the screens of electronic devices such as tablet computers grow larger, demand from viewers who want to experience theater-like immersive sound is increasing. Accordingly, an audio processing apparatus is required to support a three-dimensional (3D) channel layout in which a sound image is formed with reference to a display screen.
Disclosure of Invention
Technical problem
There is a need for an encoding and decoding method capable of improving transmission efficiency while supporting conversion between various channel layouts. In particular, there is a need for an audio encoding and decoding method capable of accurately reproducing an original audio signal even when audio content of a predetermined channel layout is converted into, and output in, another channel layout having a different sound image.
Solution to the problem
According to an embodiment of the present disclosure, an audio processing apparatus may include: a memory storing one or more instructions; and a processor configured to execute the one or more instructions to: obtaining a second audio signal corresponding to a second channel included in a second channel group from among first audio signals corresponding to a first channel included in a first channel group of an original audio signal, the second channel being obtained by combining at least two channels of the first channel included in the first channel group, downsampling at least one third audio signal corresponding to at least one channel identified from among channels included in the first channel group based on correlation with the second channel group by using an Artificial Intelligence (AI) model, and generating a bitstream including the second audio signal corresponding to the second channel included in the second channel group and the downsampled at least one third audio signal.
The processor may be further configured to identify at least one channel having a correlation with the second channel group below a predetermined value from among the first channels included in the first channel group.
The processor may be further configured to: assigning a weight value to the first channels included in the first channel group based on a correlation of each of the first channels included in the first channel group with the second channel group; obtaining a second audio signal from the first audio signals by calculating a weighted sum of at least two first audio signals from among the first audio signals based on weight values assigned to the first channels included in the first channel group; and identifying at least one channel from the first channels included in the first channel group based on the weight values assigned to the first channels included in the first channel group.
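The weight-based downmix and side-channel identification described above can be sketched as follows; the channel names, weight values, threshold, and function names are illustrative assumptions, not values taken from the disclosure:

```python
import numpy as np

def downmix(first_signals: dict, weights: dict, mix_rules: dict) -> dict:
    """Build each second-group channel as a weighted sum of first-group channels."""
    out = {}
    for target, sources in mix_rules.items():
        out[target] = sum(weights[src] * first_signals[src] for src in sources)
    return out

def identify_side_channels(weights: dict, threshold: float) -> list:
    """Channels whose weight (a proxy for correlation with the second channel
    group) falls below the threshold are kept as side channels to be
    downsampled separately."""
    return sorted(ch for ch, w in weights.items() if w < threshold)

# Toy example: fold the L and Ls channels into a single L3 channel.
sig = {"L": np.ones(4), "Ls": 2 * np.ones(4)}
w = {"L": 1.0, "Ls": 0.5}
mixed = downmix(sig, w, {"L3": ["L", "Ls"]})   # 1.0*L + 0.5*Ls
side = identify_side_channels(w, threshold=0.8)
```

The channel with the largest weight acts as the main contributor; the remaining, lightly weighted channels are the candidates for the downsampled side path.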
The processor may be further configured to: obtaining an audio signal corresponding to one channel among the second channels included in the second channel group by summing audio signals corresponding to at least two channels among the first channels included in the first channel group, based on weight values assigned to the at least two channels; identifying, from among the at least two channels, the channel having the largest assigned weight value as a channel corresponding to a first sub-group of the first channel group; and identifying the remaining channels of the at least two channels as corresponding to a second sub-group of the first channel group.
The processor may be further configured to: extracting a first set of audio samples and a second set of audio samples from at least one third audio signal; obtaining downsampling-related information about the first and second sets of audio samples by using the AI model; and downsampling the at least one third audio signal by applying the downsampling-related information to the first and second groups of audio samples.
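The two-group downsampling step can be sketched as below, under the assumptions that the two sample groups are the even- and odd-indexed samples and that the "downsampling-related information" is a pair of blend weights; the trained AI model is replaced by a fixed stub:

```python
import numpy as np

def split_sample_groups(x: np.ndarray):
    """First group: even-indexed samples; second group: odd-indexed samples.
    (The grouping rule is an illustrative assumption.)"""
    return x[0::2], x[1::2]

def ai_model_stub(g1: np.ndarray, g2: np.ndarray):
    """Stand-in for the trained AI model: return fixed blend weights as the
    'downsampling-related information'."""
    return 0.5, 0.5

def downsample(x: np.ndarray) -> np.ndarray:
    """Apply the model-provided information to both sample groups,
    producing a half-rate signal."""
    g1, g2 = split_sample_groups(x)
    w1, w2 = ai_model_stub(g1, g2)
    return w1 * g1 + w2 * g2

y = downsample(np.array([1.0, 3.0, 5.0, 7.0]))
```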
The AI model may be trained to obtain downsampling-related information by minimizing an error between a reconstructed audio signal reconstructed based on the second audio signal and the downsampled at least one third audio signal.
The processor may be further configured to: obtaining an audio signal of a base channel group and an audio signal of a dependent channel group from the second audio signal corresponding to the second channel included in the second channel group; obtaining a first compressed signal by compressing the audio signal of the base channel group; obtaining a second compressed signal by compressing the audio signal of the dependent channel group; obtaining a third compressed signal by compressing the downsampled at least one third audio signal; and generating a bitstream including the first compressed signal, the second compressed signal, and the third compressed signal.
The base channel group may include two channels for stereo reproduction, and the dependent channel group may include, from among the second channels included in the second channel group, the channels other than the two channels having relatively high correlation with the two channels for stereo reproduction.
The first audio signal corresponding to the first channel included in the first channel group may include a multi-channel audio signal centered around a listener, and the second audio signal corresponding to the second channel included in the second channel group may include a multi-channel audio signal centered in front of the listener.
The first channel group may include 7.1.4 channels, and the 7.1.4 channels include a front left channel, a front right channel, a center channel, a left channel, a right channel, a rear left channel, a rear right channel, a subwoofer channel, a front left-high channel, a front right-high channel, a rear left-high channel, and a rear right-high channel. The second channel group may include 3.1.2 channels, and the 3.1.2 channels include a front left channel, a front right channel, a center channel, a subwoofer channel, a front left-high channel, and a front right-high channel. The processor may be further configured to identify, from among the first channels included in the first channel group, the left channel, the right channel, the rear left channel, the rear right channel, the rear left-high channel, and the rear right-high channel, which have relatively low correlation with the second channel group, as the channels of the second sub-group.
According to another embodiment of the present disclosure, an audio processing apparatus may include: a memory storing one or more instructions; and a processor configured to execute the one or more instructions to obtain a first audio signal corresponding to a first channel included in a first channel group and a downsampled second audio signal from a bitstream, obtain at least one second audio signal corresponding to at least one channel included in a second channel group by upsampling the downsampled second audio signal using an Artificial Intelligence (AI) model, and reconstruct a third audio signal corresponding to the second channel included in the second channel group from the first audio signal and the at least one second audio signal, wherein the first channel group may include a smaller number of channels than the second channel group.
Based on the correlation of each of the second channels included in the second channel group with the first channel group, the second channels included in the second channel group may be classified into channels of the first sub-group and channels of the second sub-group, and the processor may be further configured to obtain a second audio signal corresponding to the channels of the second sub-group.
The processor may be further configured to: obtaining a fourth audio signal corresponding to the channels of the first sub-group from the first audio signal and the second audio signal corresponding to the channels of the second sub-group according to a transformation rule from the second channel included in the second channel group to the first channel included in the first channel group; refining the second audio signal and the fourth audio signal by using the AI model; and obtaining a third audio signal from the refined second audio signal and the refined fourth audio signal.
The fourth audio signal may be refined by a first neural network layer within the AI model, the second audio signal may be refined by a second neural network layer within the AI model, the refined fourth audio signal may be obtained by inputting the first audio signal, the second audio signal, and the fourth audio signal to the first neural network layer, and the refined second audio signal may be obtained by inputting the first audio signal, the second audio signal, the refined fourth audio signal, and the value output by the first neural network layer to the second neural network layer.
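The wiring of the two cascaded refinement layers described above can be sketched structurally as follows; the layers are stand-in random linear maps, and the frame length and function names are illustrative assumptions rather than the disclosed network:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_layer(n_in: int, n_out: int):
    """A single random linear map standing in for a neural network layer."""
    W = rng.standard_normal((n_out, n_in)) * 0.1
    return lambda x: W @ x

T = 8  # samples per frame (illustrative)
# First layer: takes (first, second, fourth) signals, emits the refined
# fourth signal plus an intermediate feature passed to the second layer.
layer1 = linear_layer(3 * T, 2 * T)
# Second layer: takes (first, second, refined fourth, layer-1 feature),
# emits the refined second signal.
layer2 = linear_layer(4 * T, T)

def refine(first: np.ndarray, second: np.ndarray, fourth: np.ndarray):
    out1 = layer1(np.concatenate([first, second, fourth]))
    refined_fourth, feat = out1[:T], out1[T:]
    refined_second = layer2(
        np.concatenate([first, second, refined_fourth, feat]))
    return refined_second, refined_fourth

s2, s4 = refine(np.ones(T), np.ones(T), np.ones(T))
```

The point of the sketch is the data flow: the second layer sees both the already refined fourth signal and the first layer's output, matching the cascade described above.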
The processor may be further configured to: obtaining an audio signal corresponding to a channel included in the base channel group and an audio signal corresponding to a channel included in the dependent channel group by decompressing the bit stream; and obtaining the first audio signal based on the additional information obtained from the bitstream, the audio signal corresponding to the channels included in the base channel group, and the audio signal corresponding to the channels included in the dependent channel group.
The base channel group may include two channels for stereo reproduction, and the dependent channel group may include, from among the channels included in the first channel group, the channels other than the two channels having relatively high correlation with the two channels for stereo reproduction. The processor may be further configured to obtain a mixed audio signal corresponding to the first channel included in the first channel group by mixing an audio signal corresponding to a channel included in the base channel group with an audio signal corresponding to a channel included in the dependent channel group, and obtain the first audio signal by rendering the mixed audio signal based on the additional information.
The first audio signal corresponding to the first channel included in the first channel group may include a multi-channel audio signal centered in front of the listener, and the third audio signal corresponding to the second channel included in the second channel group may include a multi-channel audio signal centered around the listener.
The first channel group may include 3.1.2 channels, and the 3.1.2 channels include a front left channel, a front right channel, a center channel, a subwoofer channel, a front left-high channel, and a front right-high channel. The second channel group may include 7.1.4 channels, and the 7.1.4 channels include a front left channel, a front right channel, a center channel, a left channel, a right channel, a rear left channel, a rear right channel, a subwoofer channel, a front left-high channel, a front right-high channel, a rear left-high channel, and a rear right-high channel. The channels of the second sub-group may include the left channel, the right channel, the rear left channel, the rear right channel, the rear left-high channel, and the rear right-high channel, which have relatively low correlation with the first channel group, among the second channels included in the second channel group.
According to another embodiment of the present disclosure, an audio processing method may include: converting an original audio signal into a converted audio signal by combining at least two channels among a plurality of channels included in the original audio signal; identifying at least one side channel signal and a plurality of base channel signals of the original audio signal based on a correlation of each of the plurality of channels of the original audio signal with the converted audio signal; downsampling the at least one side channel signal by using an Artificial Intelligence (AI) model; and generating a bitstream including the converted audio signal and the downsampled at least one side channel signal.
According to another embodiment of the present disclosure, an audio processing method may include: obtaining a first audio signal corresponding to a first channel included in a first channel group and a downsampled second audio signal from a bitstream; obtaining at least one second audio signal corresponding to at least one channel of a second channel included in a second channel group by upsampling the downsampled second audio signal using an Artificial Intelligence (AI) model; and reconstructing a third audio signal corresponding to a second channel included in the second channel group from the first audio signal and the at least one second audio signal, wherein the first channel group includes a smaller number of channels than the second channel group.
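The overall shape of this decoding method can be sketched as follows, with the AI upsampler replaced by plain linear interpolation and a simple subtractive de-mixing rule; both substitutions, and all names, are illustrative assumptions rather than the disclosed model:

```python
import numpy as np

def upsample(x: np.ndarray) -> np.ndarray:
    """Double the sample rate of a downsampled side-channel signal
    (linear interpolation stands in for the AI upsampler)."""
    n = len(x)
    return np.interp(np.arange(2 * n) / 2.0, np.arange(n), x)

def reconstruct(first: dict, side_upsampled: dict, demix_rules: dict) -> dict:
    """Recover second-group channels: side channels pass through, while the
    remaining channels are de-mixed from the first-group signals
    (illustrative rule: subtract the side signal from the downmix)."""
    out = dict(side_upsampled)
    for target, (mix_ch, side_ch) in demix_rules.items():
        out[target] = first[mix_ch] - side_upsampled[side_ch]
    return out

# Toy example: L3 was produced as L + Ls, so L is recovered as L3 - Ls.
first = {"L3": np.full(4, 5.0)}
side = {"Ls": np.full(4, 2.0)}
rec = reconstruct(first, side, {"L": ("L3", "Ls")})
```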
Drawings
The foregoing and/or other aspects will become more apparent by describing certain example embodiments with reference to the accompanying drawings in which:
FIG. 1A illustrates an example of an audio processing system in which sound images are transformed according to an audio content consumption environment, according to an embodiment of the present disclosure;
fig. 1B illustrates a method, performed by an audio encoding apparatus and an audio decoding apparatus according to an embodiment of the present disclosure, of processing a multi-channel audio signal by dividing the multi-channel audio signal into an audio signal of a base channel group and an audio signal of a dependent channel group;
fig. 2A is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 2B is a detailed block diagram of an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 2C is a block diagram of a multi-channel audio signal processor included in an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 2D is a view for explaining an operation of a multi-channel audio signal processor according to an embodiment of the present disclosure;
fig. 3A is a block diagram of an audio decoding apparatus according to an embodiment of the present disclosure;
fig. 3B is a detailed block diagram of an audio decoding apparatus according to an embodiment of the present disclosure;
fig. 3C is a block diagram of a multi-channel audio signal reconstructor included in an audio decoding apparatus according to an embodiment of the present disclosure;
Fig. 3D is a block diagram for explaining an operation of a mixer of a multi-channel audio signal reconstructor according to an embodiment of the present disclosure;
fig. 4A is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 4B is a block diagram of an audio decoding apparatus according to an embodiment of the present disclosure;
fig. 5 illustrates an example of a transformation between channel groups performed in an audio processing system according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 7A illustrates an example of a transformation rule between channel groups performed in an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 7B illustrates an example of a transformation rule between channel groups performed in an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 8A is a view for explaining downsampling of a side channel audio signal performed by an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 8B is a block diagram for explaining an operation of a downsampler of an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 9 is a block diagram of an audio decoding apparatus according to an embodiment of the present disclosure;
fig. 10A is a block diagram for explaining an operation of a sound image reconstructor of an audio decoding apparatus according to an embodiment of the present disclosure;
Fig. 10B is a block diagram for explaining respective operations of an upsampler and a refiner of an audio decoding apparatus according to an embodiment of the present disclosure;
fig. 10C is a block diagram for explaining the operation of a refiner of an audio decoding apparatus according to an embodiment of the present disclosure;
fig. 11 is a flowchart of an audio signal encoding method performed by an audio encoding apparatus according to an embodiment of the present disclosure;
fig. 12 is a flowchart of an audio signal decoding method performed by an audio decoding apparatus according to an embodiment of the present disclosure;
fig. 13 illustrates an example of a transformation between channel groups performed based on sound image characteristic analysis in an audio processing system according to an embodiment of the present disclosure;
fig. 14 illustrates an example in which an audio processing system downsamples an audio signal of a side channel based on characteristics of the channel according to an embodiment of the present disclosure; and
fig. 15 illustrates an audio processing method according to an embodiment of the present disclosure.
Detailed Description
Throughout this disclosure, the expression "at least one of a, b, or c" indicates only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or variations thereof.
Embodiments of the present disclosure are described in detail herein with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains may readily carry them out. The present disclosure may, however, be embodied in many different forms and should not be construed as being limited to the examples set forth herein. In the drawings, parts irrelevant to the description are omitted for clarity, and like reference numerals refer to like elements throughout.
Embodiments of the present disclosure may be described in terms of functional block components and various processing steps. Some or all of these functional blocks may be implemented by any number of hardware and/or software components that perform the specified functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors or by circuit configurations for predetermined functions. The functional blocks may also be implemented in various programming or scripting languages, or as algorithms executed by one or more processors. Furthermore, the present disclosure may employ conventional techniques for electronic configuration, signal processing, and/or data processing.
Furthermore, the connection lines or connection members between the components shown in the figures merely illustrate functional connections and/or physical or circuit connections. Connections between components may be represented by many alternative or additional functional relationships, physical connections, or logical connections in a practical device.
In the present disclosure, "Deep Neural Network (DNN)" is a representative example of an artificial neural network model that simulates brain nerves, and is not limited to an artificial neural network model using a specific algorithm.
In the present disclosure, a "multi-channel audio signal" may refer to an audio signal of n channels (where n is an integer greater than 2). A "mono audio signal" or a "stereo audio signal", as distinguished from a "multi-channel audio signal", may be a one-dimensional (1D) or two-dimensional (2D) audio signal, whereas a "multi-channel audio signal" may be a three-dimensional (3D) audio signal, but is not limited thereto and may also be a 2D audio signal.
In the present disclosure, a "channel layout", "speaker layout", or "speaker channel layout" may represent a combination of at least one channel, and may specify the spatial layout of the speakers outputting the channels or audio signals. A channel used herein is a channel through which an audio signal is actually output, and thus may be referred to as a presentation channel.
For example, a predetermined channel layout may be designated as an "X.Y.Z channel layout". In the "X.Y.Z channel layout", X may be the number of surround channels, Y may be the number of subwoofer channels, and Z may be the number of height channels. The predetermined channel layout may specify the spatial positions of the surround channels, subwoofer channels, and height channels.
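The "X.Y.Z" naming convention above can be expressed as a small helper; the function name is illustrative:

```python
def parse_channel_layout(name: str) -> dict:
    """Split an 'X.Y.Z' layout name into surround / subwoofer / height
    channel counts, per the convention described above."""
    surround, subwoofer, height = (int(p) for p in name.split("."))
    return {"surround": surround, "subwoofer": subwoofer, "height": height}

layout = parse_channel_layout("7.1.4")
total = sum(layout.values())  # total number of presentation channels
```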
Examples of the "channel layout" include a 1.0.0 channel (or mono) layout, a 2.0.0 channel (or stereo) layout, a 3.1.2 channel layout, a 3.1.4 channel layout, a 5.1.0 channel layout, a 5.1.2 channel layout, a 5.1.4 channel layout, a 7.1.0 channel layout, a 7.1.2 channel layout, and a 7.1.4 channel layout, but the "channel layout" is not limited thereto, and various other channel layouts are possible.
A "channel layout" may be referred to as a "channel group". The channels constituting a "channel layout" may be referred to by various names, but are named uniformly herein for convenience of explanation. The plurality of channels constituting the "channel layout" may be named based on their respective spatial positions.
For example, the first surround channel of a 1.0.0 channel layout may be named mono. The first surround channel of the 2.0.0 channel layout may be named L2 channel and the second surround channel of the 2.0.0 channel layout may be named R2 channel. "L" represents a channel located on the left side of the listener, and "R" represents a channel located on the right side of the listener. "2" means that the number of surround channels is 2.
The first surround channel of the 3.1.2 channel layout may be named the L3 channel, the second surround channel may be named the R3 channel, and the third surround channel may be named the C channel. The first subwoofer channel of the 3.1.2 channel layout may be named the LFE channel. The first height channel of the 3.1.2 channel layout may be named the Hfl3 channel (or Tl channel), and the second height channel may be named the Hfr3 channel (or Tr channel).
The first surround channel of the 5.1.0 channel layout may be named the L5 channel, the second surround channel may be named the R5 channel, the third surround channel may be named the C channel, the fourth surround channel may be named the Ls5 channel, and the fifth surround channel may be named the Rs5 channel. "C" represents a channel located at the center relative to the listener, and "s" refers to a channel located on the listener's side. The first subwoofer channel of the 5.1.0 channel layout may be named the LFE channel. LFE stands for low-frequency effects; in other words, the LFE channel may be a channel for outputting low-frequency effect sound.
The individual surround channels of the 5.1.2 channel layout and the 5.1.4 channel layout may be named the same as the surround channels of the 5.1.0 channel layout. Similarly, the individual subwoofer channels of the 5.1.2 channel layout and the 5.1.4 channel layout may be named identically to the subwoofer channels of the 5.1.0 channel layout.
The first height channel of the 5.1.2 channel layout may be named the Hl5 channel. Here, "H" denotes a height channel. The second height channel of the 5.1.2 channel layout may be named the Hr5 channel.
The first height channel of the 5.1.4 channel layout may be named the Hfl channel, the second height channel may be named the Hfr channel, the third height channel may be named the Hbl channel, and the fourth height channel may be named the Hbr channel. Here, "f" denotes a front channel and "b" denotes a rear channel, relative to the listener.
The first surround channel of the 7.1.0 channel layout may be named the L channel, the second surround channel may be named the R channel, the third surround channel may be named the C channel, the fourth surround channel may be named the Ls channel, the fifth surround channel may be named the Rs channel, the sixth surround channel may be named the Lb channel, and the seventh surround channel may be named the Rb channel.
The individual surround channels of the 7.1.2 channel layout and the 7.1.4 channel layout may be named the same as the surround channels of the 7.1.0 channel layout. Similarly, the individual subwoofer channels of the 7.1.2 channel layout and the 7.1.4 channel layout may be named identically to the subwoofer channel of the 7.1.0 channel layout. The first height channel of the 7.1.2 channel layout may be named the Hl7 channel, and the second height channel of the 7.1.2 channel layout may be named the Hr7 channel.
The first height channel of the 7.1.4 channel layout may be named the Hfl channel, the second height channel may be named the Hfr channel, the third height channel may be named the Hbl channel, and the fourth height channel may be named the Hbr channel.
Depending on the channel layout, some channels may have different names, but may represent the same channel. For example, the Hl5 channel and the Hl7 channel may be the same channel. Also, the Hr5 channel and the Hr7 channel may be the same channel.
The channels are not limited to the channel names described above, and various other channel names may be used. For example, the L2 channel may be named the L″ channel, the R2 channel may be named the R″ channel, the L3 channel may be named the ML3 or L′ channel, the R3 channel may be named the MR3 or R′ channel, the Hfl3 channel may be named the MHL3 channel, the Hfr3 channel may be named the MHR3 channel, the Ls5 channel may be named the MSL5 or Ls′ channel, the Rs5 channel may be named the MSR5 channel, the Hl5 channel may be named the MHL5 or Hl′ channel, the Hr5 channel may be named the MHR5 or Hr′ channel, and the C channel may be named the MC channel.
As described above, at least one channel constituting a channel layout may be named as shown in Table 1.

[Table 1]
Channel layout   Channel names
1.0.0            Mono
2.0.0            L2/R2
3.1.2            L3/C/R3/Hfl3/Hfr3/LFE
5.1.0            L5/C/R5/Ls5/Rs5/LFE
5.1.2            L5/C/R5/Ls5/Rs5/Hl5/Hr5/LFE
5.1.4            L5/C/R5/Ls5/Rs5/Hfl/Hfr/Hbl/Hbr/LFE
7.1.0            L/C/R/Ls/Rs/Lb/Rb/LFE
7.1.2            L/C/R/Ls/Rs/Lb/Rb/Hl7/Hr7/LFE
7.1.4            L/C/R/Ls/Rs/Lb/Rb/Hfl/Hfr/Hbl/Hbr/LFE
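Table 1 can equivalently be expressed as a lookup from layout name to channel names:

```python
# Channel names per layout, transcribed directly from Table 1.
CHANNEL_NAMES = {
    "1.0.0": ["Mono"],
    "2.0.0": ["L2", "R2"],
    "3.1.2": ["L3", "C", "R3", "Hfl3", "Hfr3", "LFE"],
    "5.1.0": ["L5", "C", "R5", "Ls5", "Rs5", "LFE"],
    "5.1.2": ["L5", "C", "R5", "Ls5", "Rs5", "Hl5", "Hr5", "LFE"],
    "5.1.4": ["L5", "C", "R5", "Ls5", "Rs5", "Hfl", "Hfr", "Hbl", "Hbr", "LFE"],
    "7.1.0": ["L", "C", "R", "Ls", "Rs", "Lb", "Rb", "LFE"],
    "7.1.2": ["L", "C", "R", "Ls", "Rs", "Lb", "Rb", "Hl7", "Hr7", "LFE"],
    "7.1.4": ["L", "C", "R", "Ls", "Rs", "Lb", "Rb",
              "Hfl", "Hfr", "Hbl", "Hbr", "LFE"],
}
```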
A "transmission channel" is a channel for transmitting a compressed audio signal. A portion of the "transmission channels" may be the same as the "presentation channels", but the "transmission channels" are not limited thereto: another portion may carry audio signals in which the audio signals of the presentation channels are mixed. In other words, a "transmission channel" may be a channel containing the audio signal of a "presentation channel", or a mixed channel that is partly the same as a presentation channel and partly different. A "transmission channel" may be named differently from a "presentation channel". For example, when the transmission channels are the A/B channels, the A/B channels may contain audio signals corresponding to the L2/R2 channels. When the transmission channels are the T/P/Q channels, the T/P/Q channels may contain audio signals corresponding to the C/LFE/Hfl3/Hfr3 channels. When the transmission channels are the S/U/V channels, the S/U/V channels may contain audio signals corresponding to the L/R/Ls/Rs/Hfl/Hfr channels.
In the present disclosure, a "3D audio signal" may refer to an audio signal that enables a listener to perceive the height of sound around the listener and thus experience the audio more immersively.
In the present disclosure, the "multi-channel audio signal centered in front of a listener" may refer to an audio signal based on a channel layout in which sound images are formed around the front of the listener (e.g., a display device). When a multi-channel audio signal centered in front of a listener is arranged around a screen of a display device located in front of the listener, the multi-channel audio signal centered in front of the listener may be referred to as a "screen-centered audio signal" or a "front 3D audio signal".
In the present disclosure, "listener-centered multi-channel audio signal" may refer to an audio signal based on a channel layout in which sound images are formed around the listener. Because the listener-centered multi-channel audio signal is based on a channel layout in which channels are arranged omnidirectionally around the listener, it may be referred to as a "full 3D audio signal".
In this disclosure, a "basic channel group" may refer to a group including at least one "base channel". The audio signal of a "base channel" may be an audio signal that can be independently decoded, without information about the audio signal of another channel (e.g., a dependent channel), to constitute a predetermined channel layout. For example, the audio signal of the "basic channel group" may be a mono audio signal, or may include the left-channel and right-channel audio signals constituting a stereo audio signal.

In this disclosure, a "dependent channel group" may refer to a group including at least one "dependent channel". The audio signal of a "dependent channel" may be an audio signal of at least one channel that is mixed with the audio signal of a "base channel" to constitute a predetermined channel layout.
When the encoding apparatus according to an embodiment of the present disclosure encodes and outputs a multi-channel audio signal of a predetermined channel layout, it may obtain an audio signal of a basic channel group and an audio signal of a dependent channel group by mixing the multi-channel audio signal, and may compress and transmit the obtained audio signals. For example, when the basic channel group includes the left and right channels constituting a stereo channel, the dependent channel group may include the channels of the predetermined channel layout other than the two channels corresponding to the basic channel group.
The decoding apparatus according to an embodiment of the present disclosure may decode an audio signal of a base channel group and an audio signal of a dependent channel group from a received bitstream and mix the audio signals of the base channel group and the dependent channel group, thereby reconstructing a multi-channel audio signal of a predetermined channel layout.
In the present disclosure, "side channel information" is information for reconstructing an audio signal of a first channel layout, which has more channels than an audio signal of a second channel layout, and may refer to an audio signal of at least one side channel included in the first channel layout. For example, the side channels may include the channels of the first channel layout whose positional correlation with the channels of the second channel layout is relatively low. The present disclosure is not limited to this example. For example, a channel satisfying a certain criterion among the channels of the first channel layout may be a side channel, or a channel chosen by the producer of the audio signal may be a side channel.
Embodiments according to the technical spirit of the present disclosure will be sequentially described in detail.
When a display device such as a TV reproduces immersive audio content, an audio codec for reproducing sound images based on a screen of the display device may be used. However, depending on the installation method, the display device may be used alone or with additional speakers such as a sound bar. For example, when a plurality of speakers for home theater are included in addition to speakers mounted on a display device, a method of restoring audio content having a screen-centered sound image to audio content having a listener-centered sound image is required.
Fig. 1A illustrates an example of an audio processing system in which sound images are transformed according to an audio content consumption environment.
As shown in image 10 of fig. 1A, a content manufacturer may make immersive audio content (e.g., audio content of a 7.1.4 channel layout) having a listener-centric sound image. The manufactured audio content may be converted into audio content having a screen-centered sound image and transmitted to a user. As shown in image 20 of fig. 1A, audio content having a screen-centered sound image (e.g., audio content of a 3.1.2 channel layout) may be consumed by a display device such as a television. As shown in image 30 of fig. 1A, in order to consume audio content in an environment where additional speakers are used in addition to a display device, audio content having a screen-centered sound image needs to be reconstructed into audio content having a listener-centered sound image (e.g., audio content of a 7.1.4-channel layout).
In order to transmit audio content that is transformed and output according to various channel layouts without degrading sound quality, a method of transmitting the audio signals of all channel layouts included in the audio content may be used. However, this method has the problems that transmission efficiency is low and that it is difficult to maintain backward compatibility with a conventional channel layout such as a mono or stereo channel.
Accordingly, the audio encoding apparatus according to the embodiment of the present disclosure may divide a multi-channel audio signal into an audio signal of a base channel group and an audio signal of a dependent channel group, and encode and output the audio signals such that the multi-channel audio signal is suitable for a channel layout of a screen-centered sound image, and backward compatibility is possible.
Fig. 1B illustrates a method of processing a multi-channel audio signal by dividing the multi-channel audio signal into an audio signal of a base channel group and an audio signal of a sub-channel group, which is performed by an audio encoding apparatus and an audio decoding apparatus according to an embodiment of the present disclosure.
The audio encoding apparatus 200 according to the embodiment may transmit the compressed audio signal of the basic channel group generated by compressing the stereo channel audio signal and the compressed audio signal of the first sub channel group generated by compressing the audio signals of some channels of the 3.1.2 channel layout so as to transmit information about the audio signal 170 of the 3.1.2 channel layout. The audio encoding apparatus 200 may generate the compressed audio signal of the basic channel group by compressing the L2 channel audio signal and the R2 channel audio signal 160 of the stereo channel layout. For example, the L2 channel audio signal may be a signal of a left channel of a stereo audio signal, and the R2 channel audio signal may be a signal of a right channel of the stereo audio signal.
The audio encoding apparatus 200 may generate the compressed audio signals of the first dependent channel group by compressing the respective audio signals of the Hfl3, Hfr3, LFE, and C channels in the audio signal 170 of the 3.1.2 channel layout. The 3.1.2 channel layout may be a channel layout including six channels, in which a sound image is formed around the front of the listener. In the 3.1.2 channel layout, the C channel may refer to the center channel, the LFE channel may refer to the subwoofer (low-frequency effects) channel, the Hfl3 channel may refer to the upper left channel, and the Hfr3 channel may refer to the upper right channel.
According to an embodiment, the audio encoding apparatus 200 may transmit the compressed audio signal of the base channel group and the compressed audio signal of the first dependent channel group to the audio decoding apparatus 300.
The audio decoding apparatus 300 according to the embodiment may reconstruct an audio signal of a 3.1.2 channel layout from the compressed audio signals of the base channel group and the compressed audio signals of the first dependent channel group.
First, the audio decoding apparatus 300 may obtain the L2 channel audio signal and the R2 channel audio signal by decompressing the compressed audio signals of the base channel group. The audio decoding apparatus 300 may obtain the respective audio signals of the C, LFE, Hfl3, and Hfr3 channels by decompressing the compressed audio signals of the first dependent channel group.
As shown by arrow (1) in fig. 1B, the audio decoding apparatus 300 may reconstruct an L3 channel audio signal of a 3.1.2 channel layout by mixing an L2 channel audio signal and a C channel audio signal. As shown by arrow (2) in fig. 1B, the audio decoding apparatus 300 may reconstruct an R3 channel audio signal of a 3.1.2 channel layout by mixing an R2 channel audio signal and a C channel audio signal. In a 3.1.2 channel layout, the L3 channel may refer to the left channel and the R3 channel may refer to the right channel.
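The de-mixing behind arrows (1) and (2) can be sketched as follows. The passage does not state the mixing weights, so a -3 dB center weight (sqrt(1/2)) is assumed purely for illustration, i.e. the encoder is assumed to have mixed L2 = L3 + w*C and R2 = R3 + w*C:

```python
import math

# Assumed center mixing weight; NOT specified in this passage.
W = math.sqrt(0.5)

def demix_l3_r3(l2, r2, c):
    # Arrows (1) and (2): recover L3/R3 from the decoded stereo pair and the
    # decoded center channel, under the assumed mixing relation above.
    l3 = [x - W * ci for x, ci in zip(l2, c)]
    r3 = [x - W * ci for x, ci in zip(r2, c)]
    return l3, r3
```

Under that assumption the operation is an exact inverse: mixing L3 with C and then de-mixing with the same weight returns the original L3 samples.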
In addition to the compressed audio signals of the base channel group and the compressed audio signals of the first dependent channel group, the audio encoding apparatus 200 according to the embodiment may generate compressed audio signals of a second dependent channel group by compressing the audio signals of the L5 and R5 channels in the audio signal 180 of the 5.1.2 channel layout, so as to transmit information about the audio signal 180 of the 5.1.2 channel layout. The 5.1.2 channel layout may be a channel layout including eight channels, in which an acoustic image is formed around the front of the listener. In the 5.1.2 channel layout, the L5 channel may refer to the front left channel and the R5 channel may refer to the front right channel.
The audio encoding apparatus 200 may transmit the compressed audio signal of the base channel group, the compressed audio signal of the first sub-channel group, and the compressed audio signal of the second sub-channel group to the audio decoding apparatus 300.
The audio decoding apparatus 300 according to the embodiment may reconstruct an audio signal of a 5.1.2 channel layout from the compressed audio signal of the base channel group, the compressed audio signal of the first sub channel group, and the compressed audio signal of the second sub channel group.
First, the audio decoding apparatus 300 according to the embodiment may reconstruct the audio signal 170 of the 3.1.2 channel layout from the compressed audio signals of the base channel group and the compressed audio signals of the first dependent channel group.
Next, the audio decoding apparatus 300 may obtain the corresponding audio signals of the L5 and R5 channels by decompressing the compressed audio signals of the second sub-channel group.
As shown by arrow (3) in fig. 1B, the audio decoding apparatus 300 may reconstruct the Ls5 channel audio signal of the 5.1.2 channel layout by mixing the L3 channel audio signal and the L5 channel audio signal. As shown by arrow (4) in fig. 1B, the audio decoding apparatus 300 may reconstruct the Rs5 channel audio signal of the 5.1.2 channel layout by mixing the R3 channel audio signal and the R5 channel audio signal. In the 5.1.2 channel layout, the Ls5 channel may refer to the side left channel and the Rs5 channel may refer to the side right channel. As shown by arrow (5) in fig. 1B, the audio decoding apparatus 300 may reconstruct the Hl5 channel audio signal of the 5.1.2 channel layout by mixing the Hfl3 channel audio signal and the Ls5 channel audio signal. As shown by arrow (6) in fig. 1B, the audio decoding apparatus 300 may reconstruct the Hr5 channel audio signal of the 5.1.2 channel layout by mixing the Hfr3 channel audio signal and the Rs5 channel audio signal. In the 5.1.2 channel layout, the Hl5 channel may refer to the front upper left channel and the Hr5 channel may refer to the front upper right channel.
In addition to the compressed audio signals of the basic channel group, the first dependent channel group, and the second dependent channel group, the audio encoding apparatus 200 according to the embodiment may generate compressed audio signals of a third dependent channel group by compressing the audio signals of the Hfl, Hfr, Ls, and Rs channels in the audio signal 190 of the 7.1.4 channel layout, so as to transmit information about the audio signal 190 of the 7.1.4 channel layout. The 7.1.4 channel layout may be a channel layout including twelve channels, in which sound images are formed around the listener. In the 7.1.4 channel layout, the Hfl channel may refer to the front upper left channel, the Hfr channel may refer to the front upper right channel, the Ls channel may refer to the side left channel, and the Rs channel may refer to the side right channel.
The audio encoding apparatus 200 may transmit the compressed audio signal of the base channel group, the compressed audio signal of the first sub channel group, the compressed audio signal of the second sub channel group, and the compressed audio signal of the third sub channel group to the audio decoding apparatus 300.
The audio decoding apparatus 300 according to the embodiment may reconstruct an audio signal of a 7.1.4 channel layout from the compressed audio signal of the base channel group, the compressed audio signal of the first sub channel group, the compressed audio signal of the second sub channel group, and the compressed audio signal of the third sub channel group.
First, the audio decoding apparatus 300 according to the embodiment may reconstruct the audio signal 180 of the 5.1.2 channel layout from the compressed audio signal of the base channel group, the compressed audio signal of the first sub channel group, and the compressed audio signal of the second sub channel group.
Next, the audio decoding apparatus 300 may obtain respective audio signals of the Hfl channel, the Hfr channel, the Ls channel, and the Rs channel by decompressing the compressed audio signals of the third sub-channel group.
As shown by arrow (7) in fig. 1B, the audio decoding apparatus 300 may reconstruct an Lb channel audio signal of a 7.1.4 channel layout by mixing an Ls5 channel audio signal and an Ls channel audio signal. As shown by arrow (8) in fig. 1B, the audio decoding apparatus 300 may reconstruct the Rb channel audio signal of the 7.1.4 channel layout by mixing the Rs5 channel audio signal and the Rs channel audio signal. In the 7.1.4 channel layout, the Lb channel may refer to the rear left channel and the Rb channel may refer to the rear right channel.
As shown by arrow (9) in fig. 1B, the audio decoding apparatus 300 may reconstruct the Hbl channel audio signal of the 7.1.4 channel layout by mixing the Hfl channel audio signal and the Hl5 channel audio signal. As shown by arrow (10) in fig. 1B, the audio decoding apparatus 300 may reconstruct the Hbr channel audio signal of the 7.1.4 channel layout by mixing the Hfr channel audio signal and the Hr5 channel audio signal. In the 7.1.4 channel layout, the Hbl channel may refer to the rear upper left channel and the Hbr channel may refer to the rear upper right channel.
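The ten mixing operations of fig. 1B can be summarized as a dependency table. This is only a data-flow sketch: mixing weights are omitted, and channel availability is tracked rather than actual sample arithmetic. Note that later arrows consume the outputs of earlier ones, e.g. arrow (5) uses the Ls5 signal produced by arrow (3):

```python
# Each entry: (reconstructed channel, first input channel, second input channel).
DEMIX_STEPS = [
    ("L3",  "L2",   "C"),    # (1)  3.1.2
    ("R3",  "R2",   "C"),    # (2)
    ("Ls5", "L3",   "L5"),   # (3)  5.1.2
    ("Rs5", "R3",   "R5"),   # (4)
    ("Hl5", "Hfl3", "Ls5"),  # (5)
    ("Hr5", "Hfr3", "Rs5"),  # (6)
    ("Lb",  "Ls5",  "Ls"),   # (7)  7.1.4
    ("Rb",  "Rs5",  "Rs"),   # (8)
    ("Hbl", "Hfl",  "Hl5"),  # (9)
    ("Hbr", "Hfr",  "Hr5"),  # (10)
]

# Channels carried directly by the base group and dependent groups 1-3.
TRANSMITTED = {"L2", "R2", "C", "LFE", "Hfl3", "Hfr3",
               "L5", "R5", "Hfl", "Hfr", "Ls", "Rs"}

def reconstructable_channels():
    # Walk the steps in order and confirm every input is available when it
    # is needed; return the full set of channels after all ten steps.
    have = set(TRANSMITTED)
    for target, a, b in DEMIX_STEPS:
        assert a in have and b in have, (target, a, b)
        have.add(target)
    return have
```

Running the check confirms that, given only the transmitted channels, the ten steps are well-ordered and yield Lb, Rb, Hbl, and Hbr among others.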
As described above, the audio decoding apparatus 300 according to the embodiment may extend the reconstructed output multi-channel audio signal from the stereo channel layout to the 3.1.2 channel layout, the 5.1.2 channel layout, or the 7.1.4 channel layout. However, the embodiments of the present disclosure are not limited to the example shown in fig. 1B, and the audio signals processed by the audio encoding apparatus 200 and the audio decoding apparatus 300 may be implemented to be expandable to various channel layouts other than the stereo channel layout, the 3.1.2 channel layout, the 5.1.2 channel layout, and the 7.1.4 channel layout.
The audio encoding apparatus 200 according to the embodiment, which processes a multi-channel audio signal so that it is suitable for a channel layout having a screen-centered sound image while remaining backward compatible, will now be described in detail.
Fig. 2A is a block diagram of a structure of an audio encoding apparatus 200 according to an embodiment of the present disclosure.
The audio encoding apparatus 200 includes a memory 210 and a processor 230. The audio encoding apparatus 200 may be implemented as an apparatus capable of audio processing, such as a server, a television, a camera, a mobile phone, a computer, a digital broadcasting terminal, a tablet PC, and a notebook computer.
Although the memory 210 and the processor 230 are shown separately in fig. 2A, the memory 210 and the processor 230 may be implemented by one hardware module (e.g., chip).
Processor 230 may be implemented as a dedicated processor for neural network-based audio processing. Alternatively, the processor 230 may be implemented by a combination of a general-purpose processor, such as an Application Processor (AP), a Central Processing Unit (CPU), or a Graphics Processing Unit (GPU), and software. The special purpose processor may include a memory to implement embodiments of the present disclosure or a memory processing unit to use external memory.
Processor 230 may include multiple processors. In this case, the processor 230 may be implemented as a combination of dedicated processors or may be implemented by a combination of software and a plurality of general-purpose processors such as an AP, a CPU, or a GPU.
Memory 210 may store one or more instructions for audio processing. According to an embodiment, the memory 210 may store a neural network. When the neural network is implemented in the form of a dedicated hardware chip for Artificial Intelligence (AI) or as part of an existing general-purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU), the neural network may not be stored in the memory 210. The neural network may be implemented as an external device (e.g., a server). In this case, the audio encoding apparatus 200 may request the neural network-based result information from the external apparatus and receive the neural network-based result information from the external apparatus.
The processor 230 may output a bitstream including the compressed audio signal by sequentially processing the audio frames included in the original audio signal according to instructions stored in the memory 210. The compressed audio signal may be an audio signal having the same or fewer channels than the original audio signal.
The bitstream may comprise a compressed audio signal of the base channel group and a compressed audio signal of the at least one dependent channel group. The processor 230 may change the number of the dependent channel groups included in the bitstream according to the number of channels to be transmitted.
Fig. 2B is a block diagram of a structure of the audio encoding apparatus 200 according to an embodiment of the present disclosure.
Referring to fig. 2B, the audio encoding apparatus 200 may include a multi-channel audio encoder 250, a bitstream generator 280, and an additional information generator 285.
The audio encoding apparatus 200 may include the memory 210 and the processor 230 of fig. 2A, and instructions for implementing the components 250, 260, 270, 280, and 285 of fig. 2B may be stored in the memory 210 of fig. 2A. Processor 230 may execute instructions stored in memory 210.
The multi-channel audio encoder 250 of the audio encoding apparatus 200 according to the embodiment may obtain the compressed audio signal of the base channel group, the compressed audio signal of the at least one sub-channel group, and the additional information by processing the original audio signal. The multi-channel audio encoder 250 may include a multi-channel audio signal processor 260 and a compressor 270.
The multi-channel audio signal processor 260 may obtain at least one audio signal of a base channel group and at least one audio signal of at least one dependent channel group from an original audio signal. For example, the audio signal of the basic channel group may be a mono audio signal or an audio signal of a stereo channel arrangement. The audio signal of the at least one sub-channel group may include at least one audio signal of remaining channels among multi-channel audio signals included in the original audio signal except for at least one channel corresponding to the base channel group.
For example, when the original audio signal is an audio signal of a 7.1.4 channel layout, the multi-channel audio signal processor 260 may obtain an audio signal of a stereo channel layout from the audio signal of the 7.1.4 channel layout as an audio signal of a basic channel group.
The multi-channel audio signal processor 260 may obtain the audio signals of a first dependent channel group, which are used by the decoding stage to reconstruct the audio signal of the 3.1.2 channel layout. The multi-channel audio signal processor 260 may determine, as the first dependent channel group, the channels of the 3.1.2 channel layout other than the two channels corresponding to the channels of the base channel group. The multi-channel audio signal processor 260 may obtain the audio signals of the first dependent channel group from the audio signal of the 3.1.2 channel layout.

The multi-channel audio signal processor 260 may obtain the audio signals of a second dependent channel group, which are used by the decoding stage to reconstruct the audio signal of the 5.1.2 channel layout. The multi-channel audio signal processor 260 may determine, as the second dependent channel group, the channels of the 5.1.2 channel layout other than the six channels corresponding to the channels of the base channel group and the first dependent channel group. The multi-channel audio signal processor 260 may obtain the audio signals of the second dependent channel group from the audio signal of the 5.1.2 channel layout.

The multi-channel audio signal processor 260 may obtain the audio signals of a third dependent channel group, which are used by the decoding stage to reconstruct the audio signal of the 7.1.4 channel layout. The multi-channel audio signal processor 260 may determine, as the third dependent channel group, the channels of the 7.1.4 channel layout other than the eight channels corresponding to the channels of the base channel group, the first dependent channel group, and the second dependent channel group. The multi-channel audio signal processor 260 may obtain the audio signals of the third dependent channel group from the audio signal of the 7.1.4 channel layout.
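The group partition just described, for a 7.1.4 original signal, can be sketched as below. The group names ("base", "dep1", ...) are placeholders, but the channel contents follow the example of fig. 1B:

```python
# Each group carries the channels of the next-larger layout that are not
# already derivable from the groups before it.
CHANNEL_GROUPS = [
    ("base", ["L2", "R2"]),                  # stereo layout (2 channels)
    ("dep1", ["C", "LFE", "Hfl3", "Hfr3"]),  # completes 3.1.2 (6 channels)
    ("dep2", ["L5", "R5"]),                  # completes 5.1.2 (8 channels)
    ("dep3", ["Ls", "Rs", "Hfl", "Hfr"]),    # completes 7.1.4 (12 channels)
]

def cumulative_count(group_name):
    # Number of channels of the target layout available once this group
    # (and all groups before it) has been decoded.
    total = 0
    for name, channels in CHANNEL_GROUPS:
        total += len(channels)
        if name == group_name:
            return total
    raise KeyError(group_name)
```

The cumulative counts 2, 6, 8, and 12 line up with the stereo, 3.1.2, 5.1.2, and 7.1.4 layouts, which is what makes the bitstream scalable.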
The audio signal of the basic channel group may be a mono or stereo signal. Alternatively, the audio signal of the basic channel group may include an audio signal of a first channel generated by mixing the audio signal L of the left stereo channel with C_1. Here, C_1 may be the audio signal of the center channel in front of the listener. In a signal name of the form "X_Y", "X" represents the name of the channel, and "Y" indicates how the signal was obtained: decoded, up-mixed, gain-applied, or scaled by a factor for error cancellation. For example, a decoded signal may be denoted "X_1", and a signal generated by up-mixing the decoded signal may be denoted "X_2". A signal obtained by applying a gain to the decoded signal may also be denoted "X_2". A signal obtained by applying the factor for error cancellation (i.e., scaling) to the up-mixed signal may be denoted "X_3".
The audio signals of the base channel group may include the audio signal of the second channel generated by mixing the audio signal R of the right stereo channel with c_1.
The compressor 270 may obtain the compressed audio signal of the base channel group by compressing the at least one audio signal of the base channel group, and may obtain the compressed audio signal of the at least one dependent channel group by compressing the audio signals of the at least one dependent channel group. The compressor 270 may compress an audio signal through processes such as frequency transformation, quantization, and entropy coding. For example, an audio signal compression method such as the AAC standard or the OPUS standard may be used.
The additional information generator 285 may generate the additional information based on at least one of the original audio signal, the compressed audio signal of the base channel group, or the compressed audio signal of the sub channel group. The additional information may include various information for reconstructing the multi-channel audio signal in the decoding terminal.
For example, the additional information may include an audio object signal indicating at least one of the audio signal, position, or direction of an audio object (sound source). Alternatively, the additional information may include information about the total number of audio streams, including the base channel audio stream and the dependent (or auxiliary) channel audio streams. The additional information may include downmix gain information, channel map information, volume information, LFE gain information, dynamic range control (DRC) information, and channel layout rendering information. The additional information may also include information indicating the number of other coupled audio streams, information indicating the multi-channel layout, information about whether dialogue exists in the audio signal and its dialogue level, information indicating whether low-frequency effects are output, information about whether audio objects exist on a screen, information about the presence or absence of continuous channel audio signals, and information about the presence or absence of discontinuous channel audio signals.
The additional information may include information about de-mixing, including at least one de-mixing weight parameter of a de-mixing matrix for reconstructing the multi-channel audio signal. Because de-mixing and (down)mixing correspond to each other, the information about de-mixing may correspond to information about (down)mixing, and the information about de-mixing may include the information about (down)mixing. For example, the information about de-mixing may include at least one (down)mixing weight parameter of a (down)mixing matrix. The de-mixing weight parameters may be obtained based on the (down)mixing weight parameters.
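The correspondence between (down)mixing and de-mixing weight parameters can be sketched for the simplest case of one mixed channel. The concrete weight value is an assumption here; in practice it would come from the information about de-mixing in the additional information:

```python
# If the encoder transmits a mixed channel  M = P + a * A  (a: downmix weight
# parameter) together with A, the decoder recovers P with the derived
# de-mixing weight -a:  P = M + (-a) * A.

def demix_weight(downmix_weight):
    return -downmix_weight

def mix_pair(primary, aux, weight):
    # primary + weight * aux, sample by sample
    return [p + weight * x for p, x in zip(primary, aux)]
```

Applying `mix_pair` with a weight and then with the derived de-mixing weight returns the original samples, which is the sense in which the two parameter sets "correspond to each other".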
The additional information may be various combinations of the above. In other words, the additional information may include at least one of the foregoing information.
When an audio signal of a dependent channel corresponding to at least one audio signal of the basic channel group exists, the additional information generator 285 may generate information indicating that the audio signal of the dependent channel exists.
The bitstream generator 280 may generate a bitstream including the compressed audio signal of the base channel group and the compressed audio signal of the dependent channel group. The bit stream generator 280 may generate a bit stream further including the additional information generated by the additional information generator 285.
For example, the bitstream generator 280 may generate a bitstream by performing encapsulation (encapsulation) such that the compressed audio signals of the base channel group are included in the base channel audio stream and the compressed audio signals of the dependent channel group are included in the dependent channel audio stream. The bitstream generator 280 may generate a bitstream including a base channel audio stream and a plurality of dependent channel audio streams.
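A hypothetical container sketch (not the actual bitstream syntax of this disclosure) illustrates the encapsulation idea: the base channel audio stream is written first, then each dependent channel audio stream, each prefixed with a length, so that a legacy decoder can read only the base stream and skip the rest:

```python
import struct

def encapsulate(base, dependents):
    # Base stream first, then each dependent stream, each with a big-endian
    # 32-bit length prefix (illustrative framing only).
    out = struct.pack(">I", len(base)) + base
    for dep in dependents:
        out += struct.pack(">I", len(dep)) + dep
    return out

def read_base(bitstream):
    # A backward-compatible decoder reads only the first frame and stops.
    (n,) = struct.unpack_from(">I", bitstream, 0)
    return bitstream[4:4 + n]
```

This length-prefixed framing is what makes it possible to reconstruct just the basic channel group while ignoring the dependent channel audio streams.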
The channel layout of the multi-channel audio signal reconstructed from the bitstream by the audio decoding apparatus 300 according to the embodiment may follow the following rule.
For example, the first channel layout of the multi-channel audio signal reconstructed by the audio decoding apparatus 300 from the compressed audio signals of the base channel group and the compressed audio signals of the first dependent channel group may include Sn-1 surround channels, Wn-1 subwoofer channels, and Hn-1 height channels. The second channel layout of the multi-channel audio signal reconstructed by the audio decoding apparatus 300 from the compressed audio signals of the base channel group, the first dependent channel group, and the second dependent channel group may include Sn surround channels, Wn subwoofer channels, and Hn height channels.
The second channel layout, reconstructed by the audio decoding apparatus 300 by further considering the second dependent channel group in addition to the base channel group and the first dependent channel group, may include more channels than the first channel layout, reconstructed by considering only the base channel group and the first dependent channel group. In other words, the first channel layout may be a lower channel layout of the second channel layout.
In detail, Sn-1 may be less than or equal to Sn, Wn-1 may be less than or equal to Wn, and Hn-1 may be less than or equal to Hn, excluding the case where Sn-1 equals Sn, Wn-1 equals Wn, and Hn-1 equals Hn all at once. For example, when the first channel layout is a 5.1.2 channel layout, the second channel layout may be a 5.1.4 channel layout or a 7.1.2 channel layout.
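The rule just stated can be written as a small predicate over "S.W.H" layout names (a sketch of the ordering, not part of the disclosure): one layout is a lower channel layout of another when each count is less than or equal to its counterpart and the two layouts are not identical.

```python
def is_lower_layout(a, b):
    # True when layout a ("Sa.Wa.Ha") is a lower channel layout of b:
    # every count <= its counterpart, and the layouts are not identical.
    ca = tuple(int(n) for n in a.split("."))
    cb = tuple(int(n) for n in b.split("."))
    return all(x <= y for x, y in zip(ca, cb)) and ca != cb
```

For example, 5.1.2 is a lower layout of both 5.1.4 and 7.1.2, but 7.1.0 is not a lower layout of 5.1.2 (it has more surround channels), and no layout is a lower layout of itself.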
In response to the bit stream generated and transmitted by the audio encoding apparatus 200, the audio decoding apparatus 300 may reconstruct an audio signal of a basic channel group from the basic channel audio stream, and may reconstruct a multi-channel audio signal including various channel layouts of more channels than the basic channel group by further considering the sub-channel audio streams.
Fig. 2C is a block diagram of the structure of the multi-channel audio signal processor 260 of the audio encoding apparatus 200 according to an embodiment of the present disclosure.
Referring to fig. 2C, the multi-channel audio signal processor 260 includes a channel layout identifier 261, a multi-channel audio transformer 262, and a mixer 266.
The channel layout identifier 261 may identify at least one channel layout from the original audio signal. The at least one channel layout may comprise a plurality of hierarchical levels of channel layouts.
First, the channel layout identifier 261 may identify a channel layout of an original audio signal, and may identify a channel layout lower than the channel layout of the original audio signal. For example, when the original audio signal is an audio signal of a 7.1.4 channel layout, the channel layout identifier 261 may identify a 7.1.4 channel layout, and may identify a 5.1.2 channel layout, a 3.1.2 channel layout, and a 2 channel layout that are lower than the 7.1.4 channel layout.
The channel layout identifier 261 may identify a channel layout of an audio signal included in a bitstream to be output by the audio encoding apparatus 200 as a target channel layout. The target channel layout may be a channel layout of the original audio signal or a channel layout lower than the channel layout of the original audio signal. The channel layout identifier 261 may identify a target channel layout from among predetermined channel layouts.
Based on the identified target channel layout, the channel layout identifier 261 may determine a downmix channel audio generator corresponding to the identified target channel layout from among the first downmix channel audio generator 263, the second downmix channel audio generator 264 to the nth downmix channel audio generator 265. The multi-channel audio transformer 262 may generate an audio signal of the target channel layout by using the determined down-mix channel audio generator.
The first, second, and nth down-mix channel audio generators 263, 264, 265 of the multi-channel audio transformer 262 may generate an audio signal of the second channel arrangement, an audio signal of the third channel arrangement, or an audio signal of the fourth channel arrangement, respectively, from the original audio signal of the first channel arrangement by using a down-mix matrix including down-mix weight parameters.
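As a minimal sketch of the down-mix matrix idea, each output channel is a weighted sum of the input channels. The 4-to-2 channel mapping and the 0.707 surround weight below are illustrative assumptions; the actual down-mix weight parameters are defined by the codec and signaled in the bitstream:

```python
def downmix_sample(frame, matrix):
    """Apply a down-mix matrix to one multichannel sample.

    Each row of the matrix produces one output channel as a weighted
    sum of the input channels.
    """
    return [sum(w * x for w, x in zip(row, frame)) for row in matrix]

# Hypothetical 4-channel (L, R, Ls, Rs) -> 2-channel (L2, R2) down-mix,
# with an assumed surround weight of 0.707.
M = [
    [1.0, 0.0, 0.707, 0.0],   # L2 = L + 0.707 * Ls
    [0.0, 1.0, 0.0, 0.707],   # R2 = R + 0.707 * Rs
]
frame = [0.5, 0.25, 0.2, -0.4]
print(downmix_sample(frame, M))
```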
Fig. 2C illustrates a multi-channel audio transformer 262 including a plurality of down-mix channel audio generators, i.e., a first down-mix channel audio generator 263, a second down-mix channel audio generator 264, and an nth down-mix channel audio generator 265, but embodiments of the present disclosure are not limited thereto. The multi-channel audio transformer 262 may transform the original audio signal into at least one other channel arrangement and may output the at least one other channel arrangement. For example, the multi-channel audio transformer 262 may transform an original audio signal of a first channel layout into an audio signal of a second channel layout that is a lower channel layout than the first channel layout, and the first channel layout and the second channel layout may include various multi-channel layouts depending on the implementation.
The mixer 266 may obtain the audio signal of the base channel group and the audio signal of the sub channel group by mixing the audio signals whose channel layout has been transformed by the multi-channel audio transformer 262, and may output the audio signal of the base channel group and the audio signal of the sub channel group. Accordingly, depending on the audio reproduction environment of the decoding terminal, only the audio signal of the basic channel group may be output, or the multi-channel audio signal may be reconstructed and output based on the audio signal of the basic channel group and the audio signal of the subordinate channel group.
According to an embodiment, a "basic channel group" may refer to a group comprising at least one "basic channel". The audio signal of the "base channel" may include an audio signal capable of constituting a predetermined channel layout by being independently decoded without information on the audio signal of another channel (e.g., a sub-channel).
A "slave channel group" may refer to a group comprising at least one "slave channel". The audio signal of the "slave channel" may include an audio signal mixed with the audio signal of the "base channel" to constitute at least one channel of the predetermined channel layout.
The mixer 266 according to the embodiment may obtain the audio signal of the base channel by mixing the audio signals of at least two channels of the transformed channel layout. The mixer 266 may obtain, as the audio signal of the dependent channel group, an audio signal of a channel other than at least one channel corresponding to the base channel group from channels included in the transformed channel layout.
Fig. 2D is a view for explaining the operation of the multi-channel audio signal processor 260 according to an embodiment of the present disclosure.
Referring to fig. 2D, the multi-channel audio transformer 262 of fig. 2C may obtain an audio signal of the 5.1.2 channel layout 291, an audio signal of the 3.1.2 channel layout 292, an audio signal of the 2 channel layout 293, and an audio signal of the mono layout 294, which are audio signals of lower channel layouts, from the original audio signal of the 7.1.4 channel layout 290. Since the first down-mix channel audio generator 263 through the n-th down-mix channel audio generator 265 of the multi-channel audio transformer 262 are connected in a cascade, they may sequentially obtain the audio signal of each channel layout, from the current channel layout down to the channel layout immediately below it.
Fig. 2D shows a case where the mixer 266 classifies the audio signal of the mono layout 294 into the audio signal of the basic channel group 295 and outputs the audio signal.
The mixer 266 according to the embodiment may classify the audio signal of the L2 channel included in the audio signal of the 2-channel layout 293 as the audio signal of the sub-channel group #1 296. As shown in fig. 2D, the audio signal of the L2 channel and the audio signal of the R2 channel are mixed, thereby generating an audio signal of the mono layout 294. The audio decoding apparatus 300 may reconstruct the audio signal of the R2 channel by mixing the audio signal of the mono layout 294 (i.e., the audio signal of the base channel group 295) with the audio signal of the L2 channel of the dependent channel group # 1. Accordingly, even when the audio encoding apparatus 200 transmits only the audio signal of the mono layout 294 and the audio signal of the L2 channel of the sub-channel group #1 296 without transmitting the audio signal of the R2 channel, the audio decoding apparatus 300 can reconstruct the audio signal with respect to the mono layout 294 or the stereo channel layout 293.
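The mono/stereo example above can be sketched as a round trip. An equal-weight mono down-mix is assumed here for illustration; the patent's actual mixing weights (α, β, ...) are variable and signaled as additional information:

```python
def downmix_to_mono(l2, r2, w=0.5):
    """Assumed mono down-mix: mono = w*L2 + (1-w)*R2 (equal weights here)."""
    return [w * l + (1 - w) * r for l, r in zip(l2, r2)]

def demix_r2(mono, l2, w=0.5):
    """Decoder side: recover R2 from the mono base channel and the transmitted L2."""
    return [(m - w * l) / (1 - w) for m, l in zip(mono, l2)]

l2 = [0.1, -0.2, 0.3]
r2 = [0.4, 0.0, -0.1]
mono = downmix_to_mono(l2, r2)   # base channel group (mono layout)
rec = demix_r2(mono, l2)         # decoder-side reconstruction of R2
print([round(x, 6) for x in rec])  # matches r2 up to floating-point error
```

This is why transmitting only the mono signal and the L2 channel suffices to reconstruct both the mono and the stereo layouts.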
The mixer 266 according to the embodiment may classify the audio signals of the Hfl3 channel, the C channel, the LFE channel, and the Hfr3 channel included in the audio signal of the 3.1.2 channel layout 292 as the audio signals of the subordinate channel group #2 297. As shown in fig. 2D, the audio signal of the L3 channel and the audio signal of the C channel are mixed, thereby generating the audio signal of the L2 channel of the stereo channel layout 293. The audio signal of the R3 channel and the audio signal of the C channel are mixed, thereby generating the audio signal of the R2 channel of the stereo channel layout 293.
The audio decoding apparatus 300 may reconstruct the audio signal of the L3 channel of the 3.1.2 channel layout by mixing the audio signal of the L2 channel of the stereo channel layout 293 with the audio signal of the C channel of the sub channel group #2 297. The audio decoding apparatus 300 may reconstruct the audio signal of the R3 channel of the 3.1.2 channel layout by mixing the audio signal of the R2 channel of the stereo channel layout 293 with the audio signal of the C channel of the sub channel group #2 297. Therefore, even when the audio encoding apparatus 200 transmits only the audio signal of the mono layout 294, the audio signal of the sub-channel group #1 296, and the audio signal of the sub-channel group #2 297 without transmitting the audio signals of the L3 channel and the R3 channel of the 3.1.2 channel layout, the audio decoding apparatus 300 can reconstruct the audio signal with respect to the mono layout 294, the stereo channel layout 293, or the 3.1.2 channel layout 292.
The mixer 266 according to the embodiment may classify the audio signal of the L channel and the audio signal of the R channel, which are audio signals of some channels of the 5.1.2 channel layout 291, as the audio signals of the slave channel group #3 298, in order to transmit the audio signals of the 5.1.2 channel layout 291. Even when the audio encoding apparatus 200 does not transmit the audio signals of the Ls5 channel, the Hl5 channel, the Rs5 channel, and the Hr5 channel of the 5.1.2 channel layout, the audio decoding apparatus 300 may reconstruct the audio signal up to the 5.1.2 channel layout 291 by mixing at least two of the audio signal of the mono layout 294, the audio signal of the sub-channel group #1 296, the audio signal of the sub-channel group #2 297, or the audio signal of the sub-channel group #3 298. The audio decoding apparatus 300 may reconstruct the audio signal into the mono layout 294, the stereo channel layout 293, the 3.1.2 channel layout 292, or the 5.1.2 channel layout 291 based on at least one of the audio signal of the mono layout 294, the audio signal of the sub channel group #1 296, the audio signal of the sub channel group #2 297, or the audio signal of the sub channel group #3 298.
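The scalable grouping of fig. 2D can be summarized by channel counts alone. The bookkeeping sketch below (counts only, no signal math) tracks which layout each additional channel group unlocks:

```python
# Channel groups from the Fig. 2D example and their channel counts.
groups = [
    ("base channel group (mono)", 1),
    ("dependent group #1 (L2)", 1),
    ("dependent group #2 (Hfl3, C, LFE, Hfr3)", 4),
    ("dependent group #3 (L, R)", 2),
]
running, totals = 0, []
for _, count in groups:
    running += count
    totals.append(running)
print(totals)  # [1, 2, 6, 8]: mono, stereo, 3.1.2 (6 ch), 5.1.2 (8 ch)
```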
As shown in fig. 2D, when mixing at least two audio signals, α, β, γ, δ, and w are used. Here, α, β, γ, and δ may indicate mixing weight parameters, and may be variable. w may indicate a surround height mixing weight and may be variable.
When audio signals of some channels of a predetermined multi-channel layout are determined as audio signals of a subordinate channel group, the audio encoding apparatus 200 according to the embodiment may preferentially determine channels arranged in front of a listener as subordinate channels. The audio encoding apparatus 200 may compress audio signals of channels arranged in front of a listener without change and transmit the compression result as compressed audio signals of a sub-channel group, thereby improving sound quality of audio signals of audio channels in front of the listener in a decoding terminal. Accordingly, the listener can feel that the sound quality of the audio content reproduced through the display device is improved.
However, the present disclosure is not limited to this embodiment, and a channel satisfying a predetermined criterion among channels of a predetermined multi-channel layout or a channel set by a user may be determined as a channel included in the dependent channel group. The channels determined to be included in the dependent channel group may be determined differently according to the implementation.
Fig. 2D shows a case where the multi-channel audio transformer 262 obtains all of the audio signals of the 5.1.2 channel arrangement 291, the audio signals of the 3.1.2 channel arrangement 292, the audio signals of the 2 channel arrangement 293, and the audio signals of the mono arrangement 294, which are the audio signals of the lower channel arrangement, from the original audio signals of the 7.1.4 channel arrangement 290. However, embodiments of the present disclosure are not limited to this case.
The multi-channel audio transformer 262 may transform an original audio signal of a first channel layout into an audio signal of a second channel layout, which is a lower channel layout than the first channel layout, and the first channel layout and the second channel layout may include various multi-channel layouts according to an implementation. For example, the multi-channel audio transformer 262 may transform an original audio signal of a 7.1.4 channel layout into an audio signal of a 3.1.2 channel layout.
The audio decoding apparatus 300 of an embodiment of reconstructing an audio signal from a bitstream received from the audio encoding apparatus 200 by processing will now be described in detail.
Fig. 3A is a block diagram of a structure of a multi-channel audio decoding apparatus according to an embodiment of the present disclosure.
The audio decoding apparatus 300 includes a memory 310 and a processor 330. The audio decoding apparatus 300 may be implemented as an apparatus capable of audio processing, such as a server, a television, a camera, a mobile phone, a computer, a digital broadcasting terminal, a tablet PC, and a notebook computer.
Although the memory 310 and the processor 330 are separately shown in fig. 3A, the memory 310 and the processor 330 may be implemented by one hardware module (e.g., chip).
The processor 330 may be implemented as a dedicated processor for neural network-based audio processing. Alternatively, the processor 330 may be implemented by a combination of a general-purpose processor (e.g., an AP, a CPU, or a GPU) and software. The dedicated processor may include a memory for implementing embodiments of the present disclosure or a memory processor for using an external memory.
Processor 330 may include a plurality of processors. In this case, the processor 330 may be implemented as a combination of dedicated processors or may be implemented by a combination of software and a plurality of general-purpose processors such as an AP, a CPU, or a GPU.
Memory 310 may store one or more instructions for audio processing. According to one embodiment, the memory 310 may store a neural network. When the neural network is implemented in the form of a dedicated hardware chip for AI or as part of an existing general-purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU), the neural network may not be stored in the memory 310. The neural network may be implemented as an external device (e.g., a server). In this case, the audio decoding apparatus 300 may request the neural network-based result information from the external apparatus and receive the neural network-based result information from the external apparatus.
The processor 330 sequentially processes successive frames according to instructions stored in the memory 310 to obtain successive reconstructed frames. Consecutive frames may refer to frames constituting audio.
The processor 330 may receive the bitstream and may output a multi-channel audio signal by performing an audio processing operation on the received bitstream. The bitstream may be implemented in a scalable form to increase the number of channels from the base channel group.
The processor 330 may obtain the compressed audio signals of the basic channel sets from the bitstream and may reconstruct the audio signals of the basic channel sets (e.g., mono audio signals or stereo audio signals) by decompressing the compressed audio signals of the basic channel sets. In addition, the processor 330 may reconstruct the audio signals of the at least one slave channel group by decompressing the compressed audio signals of the at least one slave channel group from the bitstream. Based on the audio signals of the base channel group and the audio signals of the at least one dependent channel group, the processor 330 may reconstruct a multi-channel audio signal that is increased in number of channels from the base channel group.
The processor 330 according to an embodiment may reconstruct the audio signals of the first set of slave channels by decompressing the compressed audio signals of the first set of slave channels from the bitstream. The processor 330 may reconstruct the audio signals of the second set of slave channels by decompressing the compressed audio signals of the second set of slave channels from the bitstream. The processor 330 may reconstruct a multi-channel audio signal that is increased in number of channels from the base channel group based on the audio signal of the base channel group and the corresponding audio signals of the first and second dependent channel groups.
The processor 330 according to an embodiment may decompress the compressed audio signals of up to n number of the sub-channel groups (where n is an integer greater than 2), and may reconstruct the multi-channel audio signals of the increased number of channels based on the audio signals of the base channel group and the corresponding audio signals of the n number of sub-channel groups.
Fig. 3B is a block diagram of a structure of a multi-channel audio decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 3B, the audio decoding apparatus 300 may include an information acquirer 350 and a multi-channel audio decoder 360. The multi-channel audio decoder 360 may include a decompressor 370 and a multi-channel audio signal reconstructor 380.
The audio decoding apparatus 300 may include the memory 310 and the processor 330 of fig. 3A, and instructions for implementing the components 350, 360, 370, and 380 of fig. 3A may be stored in the memory 310. Processor 330 may execute instructions stored in memory 310.
The information acquirer 350 of the audio decoding apparatus 300 according to the embodiment may acquire the base channel audio stream, the sub channel audio stream, and the metadata from the bitstream. The information acquirer 350 may acquire the base channel audio stream, the sub channel audio stream, and metadata encapsulated in the bitstream.
The information acquirer 350 may classify a basic channel audio stream including compressed audio signals from a basic channel group of a bitstream. The information acquirer 350 may acquire a compressed audio signal of a base channel group from the base channel audio stream.
The information acquirer 350 may classify at least one sub-channel audio stream including the compressed audio signals of the sub-channel groups from the bitstream. The information acquirer 350 may acquire compressed audio signals of the sub-channel groups from the sub-channel audio streams.
The information acquirer 350 may acquire additional information related to reconstruction of multi-channel audio from metadata of a bitstream. The information acquirer 350 may classify metadata including additional information from a bitstream and may acquire the additional information from the classified metadata.
The multi-channel audio decoder 360 of the audio decoding apparatus 300 according to the embodiment may reconstruct the output multi-channel audio signal by decoding the compressed audio signal included in the bitstream. The multi-channel audio decoder 360 may include a decompressor 370 and a multi-channel audio signal reconstructor 380.
The decompressor 370 may obtain at least one audio signal of the base channel group and an audio signal of the dependent channel group by performing decompression processing, such as entropy decoding, inverse quantization and inverse frequency transformation, on the compressed audio signals of the base channel group and the compressed audio signals of the dependent channel group. For example, an audio signal reconstruction method corresponding to an audio signal compression method such as the AAC standard or the OPUS standard may be used.
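Of the decompression steps named above, inverse quantization is the simplest to illustrate. The uniform quantizer below is a generic sketch and not the actual AAC or OPUS quantizer; the step size of 0.25 is an arbitrary assumption:

```python
def quantize(values, step):
    """Encoder-side uniform quantizer: value -> nearest integer index."""
    return [round(v / step) for v in values]

def dequantize(indices, step):
    """Decoder-side inverse quantization: map each index back to a value."""
    return [i * step for i in indices]

step = 0.25
x = [0.9, -0.4, 0.1]
q = quantize(x, step)
print(dequantize(q, step))  # reconstruction error is bounded by step / 2
```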
The decompressor 370 may reconstruct at least one audio signal of the basic channel set by decompressing at least one compressed audio signal of the basic channel set. The decompressor 370 may reconstruct at least one audio signal of the at least one slave channel group by decompressing the compressed audio signal of the at least one slave channel group.
The decompressor 370 may include first to nth decompressors for decoding the compressed audio signal of each of the plurality of channel groups (N channel groups). The first to nth decompressors may operate in parallel with each other.
The multi-channel audio signal reconstructor 380 according to an embodiment may reconstruct an output multi-channel audio signal based on at least one audio signal of a base channel group, at least one audio signal of at least one dependent channel group, and additional information.
For example, when the audio signal of the base channel group is an audio signal of a stereo channel, the multi-channel audio signal reconstructor 380 may reconstruct a multi-channel audio signal centered in front of the listener based on the audio signal of the base channel group and the audio signal of the first sub-channel group. For example, the reconstructed multi-channel audio signal centered in front of the listener may be an audio signal of a 3.1.2 channel layout.
Alternatively, the multi-channel audio signal reconstructor 380 may reconstruct a multi-channel audio signal centered on the listener based on the audio signal of the base channel group, the audio signal of the first dependent channel group, and the audio signal of the second dependent channel group. For example, the listener centered multi-channel audio signal may be an audio signal of a 5.1.2 channel layout or a 7.1.4 channel layout.
The multi-channel audio signal reconstructor 380 may reconstruct the multi-channel audio signal based not only on the audio signal of the base channel group and the audio signal of the dependent channel group, but also on the additional information. The additional information may be additional information for reconstructing the multi-channel audio signal. The multi-channel audio signal reconstructor 380 may output the reconstructed multi-channel audio signal.
Fig. 3C is a block diagram of the structure of a multi-channel audio signal reconstructor 380 according to an embodiment of the present disclosure.
Referring to fig. 3C, the multi-channel audio signal reconstructor 380 may include a mixer 383 and a renderer 381.
The mixer 383 of the multi-channel audio signal reconstructor 380 may obtain a mixed audio signal of a predetermined channel layout by mixing at least one audio signal of a base channel group with an audio signal of at least one dependent channel group. The mixer 383 may obtain a weighted sum of the at least one audio signal of the base channel and the audio signal of the at least one sub-channel as the audio signal of the at least one channel of the predetermined channel layout.
The mixer 383 may generate an audio signal of a predetermined channel layout based on the audio signal of the base channel group and the audio signal of the dependent channel group. The audio signal of the predetermined channel arrangement may be a multi-channel audio signal. The mixer 383 may generate a multi-channel audio signal by further considering additional information (e.g., information on dynamic unmixed weight parameters).
The mixer 383 may generate an audio signal of a predetermined channel layout by mixing at least one audio signal of a base channel group and at least one audio signal of a dependent channel group. For example, the mixer 383 may generate the audio signals of the L3 channel and the R3 channel of the 3.1.2 channel layout by mixing the audio signals of the L2 channel and the R2 channel included in the base channel group with the audio signals of the C channel included in the dependent channel group.
The mixer 383 may bypass the above-described mixing operation with respect to some audio signals of the slave channel groups. For example, the mixer 383 may obtain the audio signals of the C channel, LFE channel, Hfl3 channel, and Hfr3 channel of the 3.1.2 channel layout without performing a mixing operation with at least one audio signal of the basic channel group.
The mixer 383 may generate a multi-channel audio signal of a predetermined channel layout from the audio signals of dependent channels that have not undergone mixing and the audio signal of at least one channel obtained by mixing an audio signal of a base channel with an audio signal of a dependent channel. For example, the mixer 383 may obtain an audio signal of a 3.1.2-channel layout from the audio signals of the L3 channel and the R3 channel obtained by mixing and the audio signals of the C channel, LFE channel, Hfl3 channel, and Hfr3 channel included in the sub-channel group.
The renderer 381 may render and output the multi-channel audio signals obtained by the mixer 383. The renderer 381 may include a fader and limiter.
For example, the renderer 381 may control the volume of the audio signal of each channel to a target volume (e.g., -24 LKFS) based on volume information signaled through the bitstream. Renderer 381 may limit (e.g., to-1 dBTP) the true peak level of the audio signal after volume control.
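The volume-normalization and peak-limiting step can be sketched as follows. The hard clip stands in for a real true-peak limiter (which would oversample and apply gain smoothing), and the gain computation assumes a measured input loudness is available:

```python
def normalization_gain_db(measured_lkfs, target_lkfs=-24.0):
    """Gain (in dB) that moves the measured loudness to the target loudness."""
    return target_lkfs - measured_lkfs

def apply_gain(samples, gain_db):
    g = 10 ** (gain_db / 20)
    return [s * g for s in samples]

def limit_peaks(samples, ceiling_dbtp=-1.0):
    """Crude peak limiter: hard-clip at the ceiling (about 0.891 for -1 dBTP)."""
    c = 10 ** (ceiling_dbtp / 20)
    return [max(-c, min(c, s)) for s in samples]

gain = normalization_gain_db(-30.0)          # 6 dB of gain toward -24 LKFS
out = limit_peaks(apply_gain([0.5, -0.9], gain))
print(gain, out)
```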
Fig. 3D is a block diagram for explaining the operation of the mixer 383 of the multi-channel audio signal reconstructor 380 according to an embodiment of the present disclosure.
The mixer 383 may obtain an audio signal of a predetermined channel layout by mixing at least one audio signal of a base channel group with an audio signal of at least one sub channel group. The mixer 383 may obtain a weighted sum of the at least one audio signal of the base channel and the at least one audio signal of the dependent channel as an audio signal of the at least one channel of the predetermined channel layout.
Referring to fig. 3D, the mixer 383 may include a first unmixer 384, a second unmixer 385, and up to an nth unmixer 386.
The mixer 383 may obtain at least one audio signal of the basic channel group as an audio signal of the first channel arrangement. The mixer 383 may bypass the mixing operation of at least one audio signal of the base channel group.
The first de-mixer 384 may obtain an audio signal of the second channel arrangement from the at least one audio signal of the base channel group and the audio signal of the first slave channel group. The first de-mixer 384 may obtain the audio signals of the channels included in the second channel arrangement by mixing at least one base channel audio signal and at least one first slave channel audio signal.
For example, the second channel layout may be a 3.1.2 channel layout, the base channel group may include L2 channels and R2 channels constituting a stereo channel, and the first sub channel group may include Hfl3 channels, hfr3 channels, LFE channels, and C channels. In this case, the first de-mixer 384 may obtain a weighted sum of the audio signal of the L2 channel included in the base channel group and the audio signal of the C channel included in the dependent channel group as an L3 channel audio signal of the 3.1.2 channel layout. The first de-mixer 384 may obtain a weighted sum of the audio signal of the R2 channel included in the base channel group and the audio signal of the C channel included in the dependent channel group as an R3 channel audio signal of the 3.1.2 channel layout. The first de-mixer 384 may obtain an audio signal of a 3.1.2 channel layout from the audio signals of the Hfl3 channel, the Hfr3 channel, the LFE channel, and the C channel of the dependent channel group, and the mixed L3 channel and R3 channel audio signals.
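A minimal sketch of the first de-mixer's weighted sums, assuming the encoder formed L2 = L3 + w·C and R2 = R3 + w·C with a single weight w. Both the relation and the weight 0.707 are illustrative assumptions; the actual weights are signaled as additional information:

```python
W = 0.707  # assumed down-mix weight applied to the C channel

def demix_l3_r3(l2, r2, c, w=W):
    """Recover L3/R3 of the 3.1.2 layout from the base L2/R2 and the C channel."""
    l3 = [a - w * cc for a, cc in zip(l2, c)]
    r3 = [a - w * cc for a, cc in zip(r2, c)]
    return l3, r3

# Encoder-side mix (for the round trip): L2 = L3 + w*C, R2 = R3 + w*C
l3, r3, c = [0.2, -0.1], [0.05, 0.3], [0.5, 0.25]
l2 = [a + W * cc for a, cc in zip(l3, c)]
r2 = [a + W * cc for a, cc in zip(r3, c)]
rl3, rr3 = demix_l3_r3(l2, r2, c)
print(rl3, rr3)  # recovers l3 and r3 up to floating-point error
```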
The second unmixer 385 may obtain an audio signal of the third channel arrangement from at least one audio signal of the base channel group, an audio signal of the first dependent channel group, and an audio signal of the second dependent channel group. The second unmixer 385 may obtain the audio signals of the channels included in the third channel arrangement by mixing at least one audio signal of the second channel arrangement obtained by the first unmixer 384 with at least one audio signal of the second dependent channel group.
For example, the third channel layout may be a 5.1.2 channel layout, and the second slave channel group may include an L channel and an R channel. In this case, the second unmixer 385 may obtain the Ls5 channel audio signal of the 5.1.2 channel layout by mixing the L3 channel audio signal included in the 3.1.2 channel layout with the L channel audio signal included in the dependent channel group. The second unmixer 385 may obtain an audio signal of Rs5 channel of the 5.1.2 channel layout by mixing an audio signal of R3 channel included in the 3.1.2 channel layout with an audio signal of R channel included in the dependent channel group. The second unmixer 385 may obtain the Hl5 channel audio signal of the 5.1.2 channel layout by mixing the Hfl3 channel audio signal included in the 3.1.2 channel layout with the newly obtained Ls5 channel audio signal. The second unmixer 385 may obtain the audio signal of the Hr5 channel of the 5.1.2 channel layout by mixing the audio signal of the Hfr3 channel included in the 3.1.2 channel layout with the newly obtained audio signal of the Rs5 channel.
The second unmixer 385 may obtain an audio signal of a 5.1.2-channel arrangement from the LFE and C audio signals of the first sub-channel group, the L and R audio signals of the second sub-channel group, and the mixed Ls5 channel, the mixed Rs5 channel, the mixed Hl5 channel, and the mixed Hr5 channel.
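The second unmixer's chained reconstruction (Ls5 from L3 and L, then Hl5 from Hfl3 and the newly obtained Ls5) can be sketched per sample. The mixing relations and both weights below are assumptions for illustration only; the patent defines the real weight parameters:

```python
# Assumed relations (illustrative, not from the patent text):
#   L3   = L   + a * Ls5   ->  Ls5 = (L3 - L) / a
#   Hfl3 = Hl5 + w * Ls5   ->  Hl5 = Hfl3 - w * Ls5
A, WH = 0.707, 0.5  # assumed surround and surround-height mixing weights

def unmix_left_512(l3, l, hfl3, a=A, w=WH):
    """Recover the left-side Ls5 and Hl5 channels of the 5.1.2 layout."""
    ls5 = (l3 - l) / a
    hl5 = hfl3 - w * ls5
    return ls5, hl5

# Round trip on one sample:
ls5, hl5 = 0.3, -0.2
l = 0.1
l3 = l + A * ls5
hfl3 = hl5 + WH * ls5
print(unmix_left_512(l3, l, hfl3))  # recovers (ls5, hl5)
```

The right-side Rs5 and Hr5 channels follow the same pattern with R3, R, and Hfr3.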
Fig. 3D shows a case where the mixer 383 obtains all of the audio signal of the first channel layout, the audio signal of the second channel layout, and the audio signal of the third channel layout through a plurality of unmixers (i.e., the first and second unmixers 384 and 385). However, embodiments of the present disclosure are not limited thereto.
The mixer 383 may obtain an audio signal of a predetermined channel layout by mixing at least one audio signal of a base channel group with an audio signal of at least one sub channel group. According to an embodiment, the predetermined channel layout of the audio signal obtained by the mixer 383 may include various multi-channel layouts.
As described above, the audio decoding apparatus 300 according to the embodiment of the present disclosure may reconstruct not only an audio signal into a lower channel layout such as a mono layout or a stereo channel layout, but also into various channel layouts having a 3D sound image centered on a screen by reconstructing a multi-channel audio signal from an audio signal of a base channel group and an audio signal of at least one sub channel group obtained from a bitstream.
In order to improve transmission efficiency, the audio encoding apparatus according to the embodiment may downsample, and separately transmit, the audio signals of unused or less-used side channels when transforming a multi-channel audio signal into a channel layout having a screen-centered sound image. An audio decoding apparatus according to an embodiment of the present disclosure may perform AI-based decoding in order to compensate for the data loss caused by downsampling at the encoding terminal.
Fig. 4A is a block diagram of an audio encoding apparatus 400 according to an embodiment of the present disclosure, and fig. 4B is a block diagram of an audio decoding apparatus 500 according to an embodiment of the present disclosure.
The audio encoding apparatus 400 according to the embodiment may output a bitstream by encoding an audio signal. Referring to fig. 4A, an audio encoding apparatus 400 according to an embodiment may include a multi-channel audio encoder 410, a channel information generator 420, and a bitstream generator 430.
The multi-channel audio encoder 410 of the audio encoding apparatus 400 may obtain a second audio signal 415 (hereinafter, referred to as "second audio signal corresponding to a second channel group") including a small number of channels by down-mixing a first audio signal 405 (hereinafter, referred to as "first audio signal corresponding to a first channel group") corresponding to channels included in a first channel group. The first channel group may include a channel group of the original audio signal, and the second channel group may be constructed by combining at least two channels among channels included in the first channel group. The multi-channel audio encoder 410 according to an embodiment may obtain a second audio signal 415 (which is a listener centered (or screen centered) multi-channel audio signal) corresponding to a second channel group from a first audio signal 405 (which is a listener centered multi-channel audio signal) corresponding to a first channel group.
The number of channels included in the second channel group of the second audio signal 415 obtained by the multi-channel audio encoder 410 needs to be less than the number of channels included in the first channel group. In other words, the second channel group needs to be a lower channel group of the first channel group.
For example, the first channel group may include S_n surround channels, W_n subwoofer channels, and H_n height channels, and the second channel group may include S_{n-1} surround channels, W_{n-1} subwoofer channels, and H_{n-1} height channels. Here, n may be an integer equal to or greater than 1. S_{n-1} needs to be less than or equal to S_n, W_{n-1} needs to be less than or equal to W_n, and H_{n-1} needs to be less than or equal to H_n, excluding the case where S_{n-1} equals S_n, W_{n-1} equals W_n, and H_{n-1} equals H_n. For example, when the first audio signal 405 is a 7.1.4 channel audio signal, the second audio signal 415 may be a 2 channel, 3.1.2 channel, 3.1.4 channel, 5.1.2 channel, 5.1.4 channel, or 7.1.2 channel audio signal. However, various embodiments of the present disclosure are not limited thereto, and audio signals of various channel groups may be used. For example, the first audio signal 405 may be a 5.1.4 channel, 5.1.2 channel, 3.1.4 channel, or 3.1.2 channel audio signal, and the second audio signal 415 may be an audio signal of a lower channel group of the first audio signal 405.
The multi-channel audio encoder 410 may mix and compress the second audio signal 415 corresponding to the second channel group and output the mixed and compressed result to the bitstream generator 430.
The channel information generator 420 of the audio encoding apparatus 400 may obtain information on at least one channel from the first audio signal, which may be used by the audio decoding apparatus 500 to up-mix the audio signals of the second channel group to the first channel group. The channel information generator 420 may identify at least one channel from among channels of the first channel group and downsample a third audio signal corresponding to the identified at least one channel to obtain at least one downsampled third audio signal. The channel information generator 420 may compress the at least one downsampled third audio signal and output the compressed result to the bitstream generator 430.
The bitstream generator 430 may generate a bitstream including information on the second audio signal 415 corresponding to the second channel group and information on the at least one downsampled third audio signal, and may output the bitstream to the audio decoding apparatus 500 of fig. 4B.
Referring to fig. 4B, the audio decoding apparatus 500 may include an information acquirer 510, a multi-channel audio decoder 520, and a sound image reconstructor 530.
The audio decoding apparatus 500 may reconstruct the multi-channel audio signal from the bitstream received from the audio encoding apparatus 400.
The information acquirer 510 of the audio decoding apparatus 500 may acquire information on a first audio signal corresponding to a first channel group and information on a downsampled second audio signal from a bitstream. The multi-channel audio decoder 520 may decompress and mix the compressed audio signals to obtain a first audio signal 505 corresponding to the first channel group.
The sound image reconstructor 530 may decompress and upsample information on the downsampled second audio signal to obtain at least one second audio signal corresponding to at least one channel of the channels included in the second channel group. The sound image reconstructor 530 may reconstruct the third audio signal 515 corresponding to the second channel group from the first audio signal 505 corresponding to the first channel group and the at least one second audio signal, wherein the number of channels of the second channel group is greater than the number of channels of the first channel group.
Fig. 5 illustrates an example of a transformation between channel groups performed in an audio processing system according to an embodiment of the present disclosure.
The audio encoding apparatus 400 according to an embodiment may receive a first audio signal of a first channel group as an original audio signal. For example, the audio encoding apparatus 400 may receive a 7.1.4 channel audio signal including an Ls channel, an Lb channel, an HBL channel, an L channel, an HFL channel, a C channel, an LFE channel, an HFR channel, an R channel, an HBR channel, an Rb channel, and an Rs channel as the original audio signal.
The audio encoding apparatus 400 may transform the first channel group of the original audio signal into a second channel group in which the sound image is implemented around the screen of a display device. For example, the audio encoding apparatus 400 may transform the 7.1.4 channel original audio signal into a 3.1.2 channel audio signal O_tv. The audio encoding apparatus 400 may include the audio signal transformed into the second channel group in a bitstream and transmit the bitstream to the audio decoding apparatus 500.
When the channel group is transformed so as to realize a sound image around the screen, the audio encoding apparatus 400 may determine, as a side channel, at least one channel from among the channels of the first channel group whose related information is not used or is least used. For example, the audio encoding apparatus 400 may determine the Ls channel, the Lb channel, the HBL channel, the HBR channel, the Rb channel, and the Rs channel from among the channels of the 7.1.4 channel group as side channels. The audio encoding apparatus 400 may determine the channels other than the at least one channel determined as a side channel from among the channels of the first channel group as main channels.
The audio encoding apparatus 400 may downsample the audio signals determined as side channels along the time axis and transmit the downsampled audio signals to the audio decoding apparatus 500. The audio encoding apparatus 400 may further include in the bitstream a signal (or audio signal) M_adv obtained by downsampling, by a factor of 1/s, the N channels determined as side channels from among the channels of the first channel group, and may transmit the bitstream to the audio decoding apparatus 500 (where N is an integer greater than 1 and s is a rational number greater than 1).
The audio decoding apparatus 500 may upsample the downsampled and transmitted audio signal M_adv of the at least one side channel to reconstruct the audio signal of the at least one side channel. For example, the audio decoding apparatus 500 may reconstruct, by upsampling, the audio signals of the Ls channel, the Lb channel, the HBL channel, the HBR channel, the Rb channel, and the Rs channel from among the channels of the 7.1.4 channel group.
The audio decoding apparatus 500 may reconstruct, from the audio signal O_tv of the second channel group, an audio signal of the first channel group in which the sound image is realized around the listener, by using the reconstructed audio signal of the at least one side channel. The audio decoding apparatus 500 may reconstruct the audio signal of the first channel group by deriving the main channels and the side channels of the first channel group from the audio signal O_tv of the second channel group, using the audio signals of the at least one side channel.
Each component of the audio encoding apparatus 400 will now be described in more detail with reference to fig. 6.
Fig. 6 is a block diagram of an audio encoding apparatus 400 according to an embodiment of the present disclosure.
The multi-channel audio encoder 410 of the audio encoding apparatus 400 according to the embodiment may transform and encode a first audio signal corresponding to a first channel group into a second audio signal corresponding to a second channel group to obtain a first compressed signal of a base channel group, a second compressed signal of a dependent channel group, and additional information. The multi-channel audio encoder 410 may include a multi-channel audio signal processor 450, a first compressor 411, and an additional information generator 413.
The multi-channel audio transformer 451 according to an embodiment may obtain the second audio signal corresponding to the second channel group by down-mixing the first audio signal corresponding to the first channel group. The multi-channel audio transformer 451 may obtain an audio signal of one channel included in the second channel group by mixing audio signals of at least two channels included in the first channel group according to a channel group transformation rule.
Fig. 7A illustrates an example of a transformation rule between channel groups performed in an audio encoding apparatus according to an embodiment of the present disclosure.
As shown in fig. 7A, the first channel group is a channel group in which the sound image is arranged around the listener, and may be a channel group suitable for a listener-centered sound image reproduction system. For example, the first channel group may be a 7.1.4 channel group including three surround channels in front of the listener (left channel L, center channel C, and right channel R), four surround channels beside and behind the listener (side left channel Ls, side right channel Rs, rear left channel Lb, and rear right channel Rb), one sub-woofer channel LFE in front of the listener, two upper channels in front of the listener (high front left channel HFL and high front right channel HFR), and two upper channels behind the listener (high rear left channel HBL and high rear right channel HBR).
The second channel group is a channel group in which the sound image is arranged around the screen of the display device, and may be a channel group suitable for a screen-centered sound image reproduction system. For example, the second channel group may be a 3.1.2 channel group having three surround channels in front of the listener (left channel L3, center channel C, and right channel R3), one sub-woofer channel LFE in front of the listener, and two upper channels (high front left channel HFL3 and high front right channel HFR3).
As shown in fig. 7A, the multi-channel audio transformer 451 may generate a screen-centered sound image signal corresponding to a channel included in the second channel group from a weighted sum of the audio signals of channels arranged behind the listener and the audio signals of channels arranged in front of the listener in the first channel group.
For example, the multi-channel audio transformer 451 may obtain an audio signal of the left channel L3 of the 3.1.2 channel group by mixing an audio signal of the front left channel L, an audio signal of the left channel Ls, and an audio signal of the rear left channel Lb among channels included in the 7.1.4 channel group. The multi-channel audio transformer 451 may obtain an audio signal of the right channel R3 of the 3.1.2 channel group by mixing an audio signal of the front right channel R, an audio signal of the right channel Rs, and an audio signal of the rear right channel Rb included in the channels of the 7.1.4 channel group.
The multi-channel audio transformer 451 may obtain the audio signal of the high front left channel HFL3 of the 3.1.2 channel group by mixing the audio signal of the side left channel Ls, the audio signal of the rear left channel Lb, the audio signal of the high front left channel HFL, and the audio signal of the high rear left channel HBL among the channels included in the 7.1.4 channel group. The multi-channel audio transformer 451 may obtain the audio signal of the high front right channel HFR3 of the 3.1.2 channel group by mixing the audio signal of the side right channel Rs, the audio signal of the rear right channel Rb, the audio signal of the high front right channel HFR, and the audio signal of the high rear right channel HBR among the channels included in the 7.1.4 channel group.
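The 7.1.4-to-3.1.2 down-mix described above can be sketched as weighted sums over the named channels. This is a minimal illustration, not the patent's implementation: the single down-mix weight w is an assumption, since the patent does not specify the weight values.

```python
import numpy as np

# Sketch of the 7.1.4 -> 3.1.2 down-mix described above. The weight w
# applied to the rear/side contributions is an illustrative assumption.
def downmix_714_to_312(x, w=0.5):
    """x maps 7.1.4 channel names to 1-D sample arrays; returns a 3.1.2 dict."""
    return {
        "L3":   x["L"] + w * (x["Ls"] + x["Lb"]),
        "R3":   x["R"] + w * (x["Rs"] + x["Rb"]),
        "C":    x["C"],                       # center passes through
        "LFE":  x["LFE"],                     # sub-woofer passes through
        "HFL3": x["HFL"] + w * (x["HBL"] + x["Ls"] + x["Lb"]),
        "HFR3": x["HFR"] + w * (x["HBR"] + x["Rs"] + x["Rb"]),
    }
```

Each output channel mirrors one mixing rule from fig. 7A: for example, HFL3 combines HFL with the HBL, Ls, and Lb contributions, as stated in the text.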
Referring back to fig. 6, the mixer 453 according to the embodiment may obtain the audio signal of the base channel group and the audio signal of the dependent channel group by mixing the second audio signal corresponding to the second channel group.
The mixer 453 according to the embodiment may transform the second audio signal corresponding to the second channel group into the audio signal of the base channel group and the audio signal of the dependent channel group, so that the channel group (e.g., the second channel group) having the screen-centered sound image is compatible with the lower channel group (e.g., the mono or stereo channel), and may output the audio signal of the base channel group and the audio signal of the dependent channel group.
Accordingly, according to the output speaker layout of the decoding terminal, only the audio signals of the basic channel group may be output, or the multi-channel audio signal may be reconstructed and output based on the audio signals of the basic channel group and the audio signals of the dependent channel group. The method of dividing the multi-channel audio signal into the base channel group and the sub-channel group and encoding the multi-channel audio signal according to the base channel group and the sub-channel group is the same as that described above with reference to fig. 2A to 2D, and thus a detailed description thereof will be omitted.
The mixer 453 according to the embodiment may obtain the audio signals of the base channels by mixing signals of at least two channels included in the second audio signal corresponding to the second channel group. The mixer 453 may obtain, as the audio signal of the sub-channel group, audio signals of channels other than at least one channel corresponding to the audio signal of the base channel group from channels included in the second channel group.
For example, when the second channel group is a 3.1.2 channel group and the base channel group includes an L2 channel and an R2 channel that constitute a stereo channel, the mixer 453 may obtain the audio signals of the L2 channel and the R2 channel constituting the stereo channel by mixing the audio signals of the left channel L3, the right channel R3, and the center channel C in the 3.1.2 channel group. The mixer 453 may obtain the audio signals of the L2 channel and the R2 channel constituting the stereo channel as the audio signals of the base channel group. The mixer 453 may obtain the audio signals of the channels in the 3.1.2 channel group other than the two channels corresponding to the stereo channels (i.e., the left channel L3 and the right channel R3), namely the center channel C, the sub-woofer channel LFE, the high front left channel HFL3, and the high front right channel HFR3, as the audio signals of the dependent channel group.
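The base/dependent split described above can be sketched as follows. The center-channel mixing gain c_gain is an assumption for illustration; the patent only states that L3, R3, and C are mixed into the stereo pair.

```python
import numpy as np

# Sketch of splitting a 3.1.2 signal into a base channel group (stereo
# L2/R2) and a dependent channel group. The gain c_gain is assumed.
def split_312(x, c_gain=0.5):
    """x maps 3.1.2 channel names to sample arrays."""
    base = {
        "L2": x["L3"] + c_gain * x["C"],   # stereo left
        "R2": x["R3"] + c_gain * x["C"],   # stereo right
    }
    # Channels other than L3/R3 form the dependent channel group.
    dependent = {k: x[k] for k in ("C", "LFE", "HFL3", "HFR3")}
    return base, dependent
```

A decoder that only supports stereo can play the base group alone; a decoder with more speakers can combine both groups, matching the compatibility goal stated above.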
The first compressor 411 may obtain a first compressed signal by compressing audio signals of the base channel group, and may obtain a second compressed signal by compressing audio signals of the dependent channel group. The first compressor 411 may compress an audio signal through processes such as frequency conversion, quantization, and entropy. For example, an audio signal compression method such as the AAC standard or the OPUS standard may be used.
The additional information generator 413 according to the embodiment may obtain additional information from the first audio signal, the first compressed signal of the base channel group, and the second compressed signal of the dependent channel group. The additional information may include information for decoding the multi-channel audio signal based on the audio signal of the base channel and the audio signal of the sub-channel in the decoding terminal.
The additional information generator 413 according to the embodiment may obtain the additional information by separately decoding the first compressed signal, the second compressed signal, and the third compressed signal of the side channel information, obtaining a reconstructed audio signal corresponding to a channel included in the first channel group from the decoded signals, and comparing the reconstructed audio signal with the first audio signal.
For example, the additional information generator 413 may obtain error-cancellation-related information (e.g., a scale factor for error cancellation) as additional information so that an error between the reconstructed audio signal and the first audio signal is minimized.
As another example, the additional information may include an audio object signal indicating at least one of an audio signal, a position, or a direction of an audio object (sound source). Alternatively, the additional information may include information about the total number of audio streams including a base channel audio stream and a dependent channel audio stream. The additional information may include downmix gain information. The additional information may include channel map information. The additional information may include volume information. The additional information may include LFE gain information. The additional information may include dynamic range control information. The additional information may include channel group rendering information. The additional information may include information indicating the number of other coupled audio streams, information indicating a multi-channel group, information about whether dialog and dialog levels exist in the audio signal, information indicating whether LFE is output, information about whether audio objects exist on a screen, information about the presence or absence of continuous channel audio signals, and information about the presence or absence of discontinuous channel audio signals. The additional information may include information about the downmix, the information including at least one downmix weight parameter for reconstructing a downmix matrix of the multi-channel audio signal.
The additional information may be various combinations of the above. In other words, the additional information may include at least one of the foregoing information.
When the first audio signal corresponding to the first channel group is transformed into the second audio signal corresponding to the second channel group, the channel information generator 420 according to the embodiment may generate information on unused or less used side channels. The channel information generator 420 may generate information about at least one side channel included in the first channel group. The channel information generator 420 may include a side channel identifier 421, a down-sampler 423, and a second compressor 425.
The side channel identifier 421 according to an embodiment may identify at least one side channel from among the channels included in the first channel group, and may output an audio signal of the at least one side channel. The side channel identifier 421 may identify, as a side channel, a channel having relatively low correlation with the channels included in the second channel group from among the channels included in the first channel group. Among the plurality of channels of the original signal, a side channel is a channel that is not used, or is minimally used, when converting the original audio signal into a converted audio signal (e.g., from a listener-centered audio signal to a screen-centered audio signal, or vice versa). For example, the side channel identifier 421 may identify channels having relatively low correlation with the channels included in the second channel group, based on the weights applied to the channels of the first channel group in order to transform the first channel group into the second channel group. For example, the side channel identifier 421 may identify, as a side channel, a channel to which a weight less than or equal to a predetermined value is applied from among the channels of the first channel group.
Fig. 7B illustrates an example of a transformation rule between channel groups performed in an audio encoding apparatus according to an embodiment of the present disclosure.
Fig. 7B shows and explains a case where the first channel group includes 7.1.4 channels having a sound image centered on the listener and the second channel group includes 3.1.2 channels having a sound image centered in front of the listener (or centered on the screen). The Ls channel, the Rs channel, the Lb channel, the Rb channel, the HBL channel, and the HBR channel among the 7.1.4 channels are channels having relatively low correlation with the 3.1.2 channels (i.e., channels far from the front of the listener), and can be recognized as side channels.
The audio encoding apparatus 400 according to the embodiment may obtain an audio signal corresponding to a channel included in a second channel group by applying weights to at least one audio signal corresponding to at least one channel included in a first channel group. The audio encoding apparatus 400 may obtain an audio signal corresponding to a channel included in a second channel group by calculating a weighted sum of audio signals corresponding to at least two channels included in a first channel group. The audio encoding device 400 may identify the side channels based on weights applied to the audio signals corresponding to the first channel group to obtain audio signals corresponding to the second channel group.
As shown in fig. 7B, the audio signal of the L3 channel of the 3.1.2 channel group may be represented as a weighted sum of the audio signals of the L channel, the Ls channel, and the Lb channel of the 7.1.4 channel group. When the weight value w1 applied to the L channel is greater than the weight values w2 and w3 applied to the Ls channel and the Lb channel, the audio encoding apparatus 400 may determine the Ls channel and the Lb channel as side channels.
In more detail, as described above with reference to fig. 7A, the L3 channel of the 3.1.2 channel group may be configured as a combination of the L channel, the Ls channel, and the Lb channel of the 7.1.4 channel group. In order for the display device to output an audio signal according to a 3.1.2 speaker channel layout forming a sound image around the screen, the rear channels among the channels included in the 7.1.4 channel group need to be mapped to the front channels. Accordingly, the weight value w1 applied to the L channel arranged in front among the 7.1.4 channels may be greater than the weight values w2 and w3 applied to the Ls channel and the Lb channel, respectively. Based on the applied weight values, the Ls channel and the Lb channel may be determined as the channels least correlated with the L3 channel, and may be determined as side channels. The method of identifying the main channel and the side channels of the L3 channel can be expressed using the following equation.
M_L3 = argmax_{C ∈ C_L3} F(C, L3)

S_L3 = {C | C ∈ C_L3, C ∉ M_L3}

Here, M_L3 indicates a main channel group including at least one main channel of the L3 channel, S_L3 indicates a side channel group including at least one side channel of the L3 channel, and C_L3 indicates the channels of the first channel group used for generating the L3 channel. For example, C_L3 may include the L channel, the Ls channel, and the Lb channel. In the above equations, F represents a similarity function between two channels. F may include, for example, a cross-correlation function.
According to the above equations, the audio encoding apparatus 400 may identify a highly correlated channel (e.g., the L channel) as the main channel from among the channels of the first channel group used for generating the L3 channel (e.g., the L channel, the Ls channel, and the Lb channel), and may identify the channels other than the main channel (e.g., the Ls channel and the Lb channel) as side channels.
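The main/side split by similarity can be sketched as follows. Normalized cross-correlation is used here as the similarity function F, which the patent names only as one example; the channel names match the L3 case in the text.

```python
import numpy as np

# Sketch of side-channel identification: the candidate most similar to
# the down-mixed channel (under F, here normalized cross-correlation)
# becomes the main channel; the rest become side channels.
def identify_side_channels(candidates, target):
    """candidates: dict channel name -> 1-D array; target: down-mixed channel."""
    def ncc(a, b):  # normalized cross-correlation at lag 0
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = {name: ncc(sig, target) for name, sig in candidates.items()}
    main = max(scores, key=scores.get)            # most similar -> main channel
    side = [name for name in candidates if name != main]
    return main, side
```

With L3 dominated by the L channel contribution (as in fig. 7B, where w1 exceeds w2 and w3), this returns L as the main channel and Ls/Lb as side channels.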
In the same manner, by referring to the transformation rule shown in fig. 7B, the audio encoding apparatus 400 may determine that the weight values w1, w6, and w8 are greater than the weight values w2, w3, w4, and w7. Based on this determination, the audio encoding apparatus 400 may determine an L channel, an R channel, an HFL channel, and an HFR channel from among channels of the 7.1.4 channel group as a main channel. The audio encoding apparatus 400 may determine an Ls channel, an Rs channel, an Lb channel, an Rb channel, an HBL channel, and an HBR channel from among channels of the 7.1.4 channel group as side channels.
The present disclosure is not limited to the channel transformation rule shown in fig. 7B, and the weight applied to each channel may vary depending on the implementation. For example, fig. 7B shows that the same weight w1 is applied to the L channel and the R channel of the 7.1.4 channel group. However, various other weights are also applicable.
Embodiments of the present disclosure are not limited to an example in which channels having relatively low correlation with channels included in a second channel group are identified as side channels, and channels satisfying a specific criterion among channels of a first channel group may be identified, or channels determined by a manufacturer of an audio signal in consideration of sound image reproduction performance may be identified as side channels.
The downsampler 423 of fig. 6 according to an embodiment downsamples the audio signal of the at least one side channel, thereby saving resources for transmitting the audio signal. The downsampler 423 may extract a first audio sample group and a second audio sample group from the audio signal of the at least one side channel. For example, the downsampler 423 may arrange the audio samples constituting the audio signal of a side channel along the time axis and extract a first audio sample group and a second audio sample group, each including a plurality of audio samples. The downsampler 423 may obtain downsampled information by combining the first audio sample group with the second audio sample group.
Fig. 8A is a view for explaining downsampling of a side channel audio signal performed by an audio encoding apparatus according to an embodiment of the present disclosure.
As shown in fig. 8A, the audio encoding apparatus 400 according to an embodiment may extract a first audio sample group D_odd of odd indexes and a second audio sample group D_even of even indexes from the audio samples constituting the audio signal of at least one side channel.

For example, when synthesizing the first audio sample group D_odd and the second audio sample group D_even to achieve downsampling, the audio encoding apparatus 400 may use a uniform average filter that assumes uniform importance for each group and each time. The audio encoding apparatus 400 may synthesize the first audio sample group D_odd and the second audio sample group D_even by applying the same weight value (e.g., α = β = 0.5) to all samples regardless of sample group and time, thereby obtaining a downsampled audio signal D of the at least one side channel.

As another example, the audio encoding apparatus 400 may obtain downsampled data with better performance by applying different weights to different audio samples. The audio encoding apparatus 400 may synthesize the first audio sample group D_odd and the second audio sample group D_even by applying importance weight maps α_map and β_map, in which different weight values are assigned to different sample groups and different times, to obtain the downsampled audio signal D of the at least one side channel.
The downsampler 423 according to the embodiment may downsample the at least one third audio signal by obtaining downsampling-related information of the first and second audio sample groups using the AI model and applying the downsampling-related information to the first and second audio sample groups. For example, the down sampler 423 according to the embodiment calculates weights to be differently applied to each audio sample by using the AI model, so that better performance can be obtained compared to a method of synthesizing audio samples by giving uniform weights.
The downsampler 423 may obtain weight values to be applied to the first and second audio sample groups, respectively, by using the AI model. The down sampler 423 may extract features of each audio sample by using a pre-trained AI model, determine the importance of each audio sample based on the extracted features, and calculate a weight value to be applied to each audio sample. The downsampler 423 may obtain a weight map including weight values differently applied to the samples over time or according to the audio sample groups, and apply the obtained weight map to the first audio sample group and the second audio sample group. The down-sampler 423 may obtain, as side channel information, a weighted sum of the first audio sample group and the second audio sample group to which the weight map is applied.
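The odd/even-group synthesis described above can be sketched as follows. With uniform weights (α = β = 0.5) this reduces to a two-tap average; the per-sample importance weight maps would come from the trained AI model, which is not reproduced here.

```python
import numpy as np

# Sketch of the odd/even-group downsampling described above. Defaults
# implement the uniform average filter; passing alpha_map/beta_map
# emulates the AI-derived importance weight maps.
def downsample_side_channel(x, alpha_map=None, beta_map=None):
    """x: 1-D array with an even number of samples; returns len(x)//2 samples."""
    d_odd, d_even = x[0::2], x[1::2]      # the two interleaved sample groups
    if alpha_map is None:
        alpha_map = np.full(d_odd.shape, 0.5)
    if beta_map is None:
        beta_map = np.full(d_even.shape, 0.5)
    return alpha_map * d_odd + beta_map * d_even  # weighted synthesis, factor 1/2
```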
The downsampler 423 according to an embodiment may obtain the respective weights of the audio samples, and may generate downsampled data by applying the obtained weights to the audio samples. The downsampler 423 may determine whether the audio reconstruction performance at the decoding terminal is improved by applying the weights, and may train on the result of the determination. The downsampler 423 may derive the importance weight maps α_map and β_map capable of increasing the audio reconstruction rate by using the trained AI model.
Fig. 8B is a block diagram for explaining the operation of a downsampler of an audio encoding apparatus according to an embodiment of the present disclosure.
Fig. 8B shows an AI model used by the downsampler 423 according to an embodiment. The AI model 802 for downsampling may sample the audio signal of at least one side channel and may perform convolution with a kernel size of K1 and a channel number of C1.
S2D(S) (space to depth) of the AI model 802 refers to an operation of performing sampling by skipping the input audio samples at an interval of S. As shown at reference numeral 803 of fig. 8B, 1DConv(K1, C1) refers to a one-dimensional (1D) convolution layer, and to an operation of separating the signal sampled by S2D(S) into C1 channels. In fig. 8B, 1DRB may refer to a 1D residual block, and PReLU may refer to an activation function.
The basic feature extractor of the AI model 802 may extract features from the input data. For example, the first audio sample group D_odd and the second audio sample group D_even of fig. 8A may be represented at the feature level. The region analyzer may extract local features and perform peripheral region analysis. The importance map generator may extract, for each sample, an importance value to be applied as a weight to each feature. As shown in fig. 8B, each module may include multiple convolution layers.
Fig. 8B is merely an example for explaining the operation of the down-sampler 423. Embodiments of the present disclosure are not limited to the example of fig. 8B. The AI model used by the audio encoding apparatus 400 according to the embodiment may be autonomously determined and extended through learning. The AI model used by audio encoding apparatus 400 may be configured and trained in various ways to improve audio signal reconstruction performance.
Referring back to fig. 6, the down-sampler 423 may down-sample the audio signal of at least one side channel and may obtain side channel information including down-sampled data. The second compressor 425 may obtain a third compressed signal by compressing the side channel information. The second compressor 425 may compress the side channel information through processes such as frequency transformation, quantization, and entropy. For example, an audio signal compression method such as the AAC standard or the OPUS standard may be used.
The bitstream generator 430 according to the embodiment may generate a bitstream according to the first compressed signal of the basic channel group, the second compressed signal of the sub channel group, and the additional information output by the multi-channel audio encoder 410, and the third compressed signal of the side channel generated by the channel information generator 420. The bitstream generator 430 may generate a bitstream through an encapsulation process with respect to the compressed signal. The bitstream generator 430 may generate a bitstream by performing encapsulation such that a first compressed signal of a base channel group is included in a base channel audio stream, a second compressed signal of a sub channel group and a third compressed signal of a side channel are included in a sub channel audio stream, and additional information is included in metadata.
Fig. 9 is a block diagram of an audio decoding apparatus according to an embodiment of the present disclosure.
The information acquirer 510 of the audio decoding apparatus 500 according to the embodiment may identify the base channel audio stream, the sub channel audio stream, and the metadata encapsulated in the bitstream.
The information acquirer 510 may acquire the compressed audio signal of the basic channel group from the basic channel audio stream, acquire the compressed audio signal of the sub channel group and the compressed side channel information from the sub channel audio stream, and acquire the additional information from the metadata.
The multi-channel audio decoder 520 of the audio decoding apparatus 500 according to the embodiment may obtain a first audio signal corresponding to a first channel group by decoding a compressed signal obtained from a bitstream. The multi-channel audio decoder 520 may include a first decompressor 521 and a multi-channel audio signal reconstructor 550.
The first decompressor 521 may obtain at least one audio signal of the basic channel group by performing a decompression process, such as entropy decoding, inverse quantization, and inverse frequency transform, on the compressed audio signal of the basic channel group. The first decompressor 521 may obtain at least one audio signal of at least one sub-channel group by performing a decompression process such as entropy decoding, inverse quantization and inverse frequency transformation on the compressed audio signals of the sub-channel group. For example, an audio signal reconstruction method corresponding to an audio signal compression method such as the AAC standard or the OPUS standard may be used.
The multi-channel audio signal reconstructor 550 according to an embodiment may reconstruct a first audio signal corresponding to a first channel group based on at least one audio signal of a base channel group, at least one audio signal of at least one dependent channel group, and additional information.
The mixer 555 of the multi-channel audio signal reconstructor 550 may obtain a mixed audio signal corresponding to a channel included in the first channel group by mixing at least one audio signal of the base channel group with an audio signal of at least one dependent channel group. The mixer 555 may obtain a weighted sum of the at least one audio signal of the base channel group and the audio signal of the at least one dependent channel group as the mixed audio signal.
For example, when the first channel group is a 3.1.2 channel group and the base channel group includes an L2 channel and an R2 channel that constitute a stereo channel, the mixer 555 may obtain a weighted sum of an audio signal of the L2 channel included in the base channel group and an audio signal of the C channel included in the dependent channel group as an audio signal of the L3 channel of the 3.1.2 channel group. The mixer 555 may obtain a weighted sum of the audio signal of the R2 channel included in the base channel group and the audio signal of the C channel included in the dependent channel group as the audio signal of the R3 channel of the 3.1.2 channel group.
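The weighted-sum mixing above can be sketched as follows. The mixing weight `w_c` and the sample values are hypothetical; in the apparatus, the actual weights are derived from the transmitted additional information.

```python
def mix_weighted(signals, weights):
    """Per-sample weighted sum of equal-length channel signals."""
    n = len(signals[0])
    return [sum(w * s[i] for s, w in zip(signals, weights)) for i in range(n)]

# Hypothetical mixing weight w_c for folding the C channel back into
# L3/R3; the actual value would come from the additional information.
w_c = 0.5
l2 = [1.0, 0.5]    # base channel group, L2 samples
r2 = [0.25, 0.75]  # base channel group, R2 samples
c = [0.5, 0.25]    # dependent channel group, C samples
l3 = mix_weighted([l2, c], [1.0, w_c])  # L3 = L2 + w_c * C
r3 = mix_weighted([r2, c], [1.0, w_c])  # R3 = R2 + w_c * C
```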
The second renderer 552 may obtain a first audio signal corresponding to the first channel group by rendering the mixed audio signal obtained by the mixer 555 based on the additional information. The additional information may include information calculated and transmitted by the audio encoding apparatus 400 to reduce the occurrence of errors during reconstruction of the audio signal.
The sound image reconstructor 530 according to an embodiment may reconstruct a third audio signal of a second channel group, which has more channels than the first channel group, from the first audio signal corresponding to the first channel group by using the side channel information included in the bitstream. The sound image reconstructor 530 may include a second decompressor 531, an upsampler 533, and a refiner 535.
The second decompressor 531 may obtain side channel information by performing decompression processing such as entropy decoding, inverse quantization, and inverse frequency transform on the compressed side channel information. For example, an audio signal reconstruction method corresponding to an audio signal compression method such as the AAC standard or the OPUS standard may be used.
The side channel information may include at least one second audio signal corresponding to at least one channel included in the second channel group, which may be used to reconstruct an audio signal corresponding to the second channel group. The side channel may be a channel having a relatively low correlation with a channel included in the first channel group among channels included in the second channel group.
For example, the first audio signal corresponding to the first channel group may include a multi-channel audio signal having a sound image surrounding the front of the listener, and the third audio signal corresponding to the second channel group may include a multi-channel audio signal having a sound image surrounding the listener. In this case, the side channel information may include side channel components and rear channel components of the listener, which have relatively low correlation with channels included in a first channel group consisting of front channel components of the listener, among channels included in the second channel group. Channels of the second channel group other than the side channels may be identified as the main channels having relatively high correlation with the channels of the first channel group.
However, the embodiments of the present disclosure are not limited to examples in which the side channels include channels having relatively low correlation with the channels included in the first channel group, and the channels satisfying a specific criterion among the channels of the second channel group may be side channels, or the channels desired by the manufacturer of the audio signal may be side channels.
For example, when the first channel group is 3.1.2 channels having sound images around the front of the listener and the second channel group is 7.1.4 channels having sound images around the listener, information on at least six channels is required to reconstruct a third audio signal corresponding to the second channel group from the first audio signal corresponding to the first channel group.
For example, information on the Ls channel, Rs channel, Lb channel, Rb channel, HBL channel, and HBR channel among the channels of the 7.1.4 channel group, which are channels having relatively low correlation with the channels of the 3.1.2 channel group (i.e., channels far from the front of the listener), may be included in the side channel information.
As another example, information regarding the L channel, R channel, HFL channel, HFR channel, C channel, and LFE channel among the channels of the 7.1.4 channel group, which are channels having relatively high correlation with the channels of the 3.1.2 channel group, may be included in the side channel information.
The upsampler 533 according to the embodiment may obtain at least one second audio signal corresponding to at least one channel among channels included in the second channel group by upsampling side channel information. For example, the upsampler 533 may upsample downsampled information included in the side channel information by using the AI model. The AI model used by the upsampler 533 of the audio decoding apparatus 500 may correspond to the AI model used by the downsampler 423 of the audio encoding apparatus 400, and may be an AI model trained to improve the audio signal reconstruction performance. The upsampler 533 may obtain a second audio signal corresponding to at least one side channel by performing time-axis upsampling.
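As a rough illustration of time-axis upsampling, the sketch below doubles the number of samples using fixed interpolation weights; in the apparatus described above, a trained AI model supplies the upsampling behavior instead, so the weights here are purely illustrative.

```python
def upsample_time_axis(x):
    """Upsample by a factor of 2 along the time axis: keep each sample
    and insert an interpolated sample after it. The fixed 0.5/0.5
    weights stand in for what a trained AI model would predict."""
    out = []
    for i, s in enumerate(x):
        nxt = x[i + 1] if i + 1 < len(x) else s  # repeat the last sample at the edge
        out.append(s)
        out.append(0.5 * s + 0.5 * nxt)
    return out
```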
The refiner 535 according to an embodiment may reconstruct a third audio signal corresponding to the second channel group from the first audio signal corresponding to the first channel group by using the second audio signal corresponding to the at least one side channel. Because the encoding terminal downsampled the audio signal of the side channel before transmission, the refiner 535 may refine the audio signal of the at least one side channel and the audio signal of the at least one main channel based on AI. The refiner 535 may alternately and repeatedly refine the audio signal of the at least one side channel and the audio signal of the at least one main channel based on AI.
First, the refiner 535 may obtain the audio signal of at least one main channel of the second channel group from the first audio signal and the audio signal of the at least one side channel, according to a channel group transformation rule between the first channel group and the second channel group. The refiner 535 may set the audio signal of the at least one side channel and the audio signal of the at least one main channel as initial conditions. The refiner 535 may alternately refine the audio signal of the at least one side channel and the audio signal of the at least one main channel by applying the initial conditions to the pre-trained AI model, and may obtain a fourth audio signal including the refined audio signal of the at least one side channel and the refined audio signal of the at least one main channel. The refiner 535 may improve sound image reproduction performance by repeatedly performing the operation of alternately refining the audio signal of the side channel and the audio signal of the main channel. The operation of the refiner 535 will be described in more detail later with reference to figs. 10A to 10C.
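The alternating refinement loop can be sketched as follows. The `refine_main` and `refine_side` callables are hypothetical stand-ins for the trained AI layers, and the toy "bump" functions below exist only to make the control flow testable.

```python
def alternate_refine(side, main, refine_main, refine_side, stages=2):
    """Alternately refine main- and side-channel estimates for a number
    of refinement stages, starting from the initial conditions."""
    for _ in range(stages):
        main = refine_main(main, side)   # refine the main channels first
        side = refine_side(side, main)   # then refine the side channels
    return main, side

# Toy stand-ins for the trained AI layers (hypothetical).
def bump_main(m, s):
    return [v + 1.0 for v in m]

def bump_side(s, m):
    return [v + 2.0 for v in s]

main2, side2 = alternate_refine([0.0], [0.0], bump_main, bump_side, stages=2)
```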
The audio decoding apparatus 500 according to an embodiment may support various channel layouts according to an audio production environment.
The multi-channel audio signal reconstructor 550 of the audio decoding apparatus 500 according to the embodiment may render and output the audio signals of the basic channel group through the first renderer 551. For example, the first renderer 551 may output a mono audio signal or a stereo audio signal from the audio signals of the basic channel group. The audio signals of the basic channel group can be independently decoded and output without requiring information about the audio signals of the dependent channel groups.
The multi-channel audio signal reconstructor 550 of the audio decoding apparatus 500 according to the embodiment may render and output a first audio signal corresponding to the first channel group through the second renderer 552. When audio content having a first channel group surrounding a sound image in front of a listener is consumed, the audio decoding apparatus 500 may not reconstruct the sound image centered on the listener based on the side channel information, and may reconstruct and output a first audio signal corresponding to the first channel group.
The sound image reconstructor 530 of the audio decoding apparatus 500 according to the embodiment may reconstruct and output a third audio signal corresponding to a second channel group centered on the listener from the first audio signal corresponding to the first channel group and the second audio signal corresponding to the at least one side channel.
Fig. 10A is a block diagram for explaining the operation of a sound image reconstructor of an audio decoding apparatus according to an embodiment of the present disclosure.
Fig. 10A illustrates an example of a process of reconstructing an audio signal corresponding to a second channel group of 7.1.4 channels from an audio signal corresponding to a first channel group of 3.1.2 channels, which is performed by the audio decoding apparatus according to the embodiment. However, fig. 10A is merely an example for aiding in the understanding of the operation of the present disclosure. The type of side channel used according to the embodiment, the number of side channels used according to the embodiment, the input channel group, and the target channel group to be reconstructed may vary according to the implementation.
The upsampler 533 of the audio decoding apparatus 500 according to the embodiment may reconstruct an audio signal of at least one side channel by performing time-axis-based upsampling on the downsampled and transmitted side channel information. The upsampler 533 according to the embodiment may upsample the downsampled side channel information M_adv included in the bitstream by using the AI model.
According to the example of fig. 10A, the audio signals O_LS0, O_RS0, O_LB0, O_RB0, O_HBL0, and O_HBR0 of the Ls channel, Rs channel, Lb channel, Rb channel, HBL channel, and HBR channel among the channels of the 7.1.4 channel group may be initially reconstructed as the audio signals of the side channel group by upsampling the side channel information.
The refiner 535 of the audio decoding apparatus 500 according to an embodiment may obtain the audio signals O_L0, O_R0, O_HFL0, and O_HFR0 of the main channel group included in the second channel group from the first audio signal O_TV corresponding to the first channel group and the second audio signals O_LS0, O_RS0, O_LB0, O_RB0, O_HBL0, and O_HBR0 corresponding to the side channel group, according to a channel group transformation rule between the first channel group and the second channel group. The refiner 535 may set the second audio signals O_LS0, O_RS0, O_LB0, O_RB0, O_HBL0, and O_HBR0 corresponding to the side channel group and the audio signals O_L0, O_R0, O_HFL0, and O_HFR0 of the main channel group as initial conditions. The refiner 535 may alternately refine the audio signals of the side channel group and the audio signals of the main channel group by using the AI model, based on the channel group transformation rule between the first channel group and the second channel group, the first audio signal O_TV corresponding to the first channel group, and the initial conditions.
The refiner 535 of the audio decoding apparatus 500 may obtain the audio signals O_L0, O_R0, O_HFL0, and O_HFR0 corresponding to the main channel group from the first audio signal O_TV and the second audio signals O_LS0, O_RS0, O_LB0, O_RB0, O_HBL0, and O_HBR0 corresponding to the side channel group included in the second channel group, according to a transformation rule from the channels included in the second channel group to the channels included in the first channel group. The refiner 535 may refine the second audio signals O_LS0, O_RS0, O_LB0, O_RB0, O_HBL0, and O_HBR0 and the audio signals O_L0, O_R0, O_HFL0, and O_HFR0 by using the AI model.
The refiner 535 according to an embodiment may obtain refined audio signals O_L1, O_R1, O_HFL1, and O_HFR1 by refining the audio signals O_L0, O_R0, O_HFL0, and O_HFR0 through a first layer within the AI model, and may obtain refined audio signals O_LS1, O_RS1, O_LB1, O_RB1, O_HBL1, and O_HBR1 by refining the second audio signals O_LS0, O_RS0, O_LB0, O_RB0, O_HBL0, and O_HBR0 through a second layer within the AI model. The refiner 535 may obtain a third audio signal O_LS2, O_RS2, O_LB2, O_RB2, O_HBL2, O_HBR2, O_L2, O_R2, O_HFL2, and O_HFR2 corresponding to the second channel group from the refined audio signals obtained by the AI model.
The AI model for refining the audio signals of the side channel group and the audio signals of the main channel group is an AI model for reconstructing a third audio signal corresponding to all channels of the second channel group from the first audio signal corresponding to the first channel group and the second audio signal corresponding to the side channel group, and may be an AI model trained to minimize the error between the reconstructed audio signal and the original audio signal.
The refiner 535 of fig. 10A performs the operation of alternately refining the audio signals of the side channel group and the audio signals of the main channel group twice.
As a result of one refinement operation, the refiner 535 may obtain the refined audio signals O_L1, O_R1, O_HFL1, and O_HFR1 of the main channel group and the refined audio signals O_LS1, O_RS1, O_LB1, O_RB1, O_HBL1, and O_HBR1 of the side channel group. As a result of two refinement operations, the refiner 535 may obtain the refined audio signals O_L2, O_R2, O_HFL2, and O_HFR2 of the main channel group and the refined audio signals O_LS2, O_RS2, O_LB2, O_RB2, O_HBL2, and O_HBR2 of the side channel group.
Fig. 10B is a block diagram for explaining respective operations of an up-sampler and a refiner of an audio decoding apparatus according to an embodiment of the present disclosure.
Fig. 10B shows a block diagram of AI models used by up-sampler 533 and refiner 535 according to an embodiment. As shown in fig. 10B, each module may include a plurality of convolution layers.
In fig. 10B, 1DRB may refer to a 1D residual block, and 1DConv(K1, C1) may refer to a 1D convolutional layer. K1 may refer to the kernel size of the convolutional layer, and C1 may refer to the number of channels into which the input of the convolutional layer is split. PReLU may refer to an activation function.
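A minimal single-channel sketch of the 1DRB building block (convolution, PReLU, convolution, plus a skip connection) follows. It assumes zero "same" padding, one channel, and a shared kernel for both convolutions; the real blocks operate on multi-channel tensors with independently learned kernels.

```python
def conv1d_same(x, kernel):
    """1D convolution with zero 'same' padding (single channel)."""
    k = len(kernel)
    pad = k // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * xp[i + j] for j in range(k)) for i in range(len(x))]

def prelu(x, alpha=0.25):
    """PReLU activation: identity for positives, alpha-scaled negatives."""
    return [v if v >= 0 else alpha * v for v in x]

def residual_block_1d(x, kernel):
    """1DRB sketch: conv -> PReLU -> conv, then add the skip connection."""
    y = conv1d_same(x, kernel)
    y = prelu(y)
    y = conv1d_same(y, kernel)
    return [a + b for a, b in zip(x, y)]
```

With the identity kernel `[0.0, 1.0, 0.0]`, the block reduces to `x + prelu(x)`, which makes the skip connection easy to verify.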
The up-sampler 533 according to the embodiment may up-sample data of each channel included in the side channel information by using an AI neural network. The AI model used by the upsampler 533 may correspond to the AI model used by the downsampler 423 of the audio encoding apparatus 400 and may be an AI model trained to improve the audio signal reconstruction performance.
Fig. 10B is merely an example for explaining the operations of the upsampler 533 and the refiner 535. Embodiments of the present disclosure are not limited to the example of fig. 10B. The AI model used by the audio decoding apparatus 500 according to the embodiment may be determined and extended through training. The AI model used by the audio decoding apparatus 500 may be configured and trained in various ways in order to improve audio signal reconstruction performance.
The upsampler 533 may perform upsampling by calculating a weight of each channel included in the side channel information and applying the weight to data of each channel included in the side channel information. The up-sampler 533 may determine whether audio reconstruction performance in the decoding terminal is improved by applying weights, and train the result of the determination. The upsampler 533 may derive weights capable of increasing the audio reconstruction rate by using a trained AI model.
Fig. 10C is a block diagram for explaining the operation of a refiner of an audio decoding apparatus according to an embodiment of the present disclosure.
The graph 1031 of fig. 10C illustrates a case where the audio decoding apparatus according to the embodiment performs the operation of alternately refining the audio signal of the side channel group and the audio signal of the main channel group twice.
Referring to the graph 1031, the audio decoding apparatus 500 according to the embodiment may initially reconstruct the audio signals of the Ls channel, Rs channel, Lb channel, Rb channel, HBL channel, and HBR channel among the channels of the 7.1.4 channel group as the audio signals of the side channel group by upsampling the side channel information. The audio decoding apparatus 500 may obtain the refined audio signals of the main channel group (i.e., the L channel, R channel, HFL channel, and HFR channel) and the refined audio signals of the side channel group through the first stage of the refinement operation. The audio decoding apparatus 500 may obtain the secondarily refined audio signals of the main channel group and the secondarily refined audio signals of the side channel group through the second stage of the refinement operation.
However, embodiments of the present disclosure are not limited thereto. The number of stages and the structure of the alternate refinement operation of the audio decoding apparatus 500 may vary depending on the computing environment and latency conditions.
The audio decoding apparatus 500 according to the embodiment may perform the alternate refinement operation only once.
Referring to the graph 1032, the audio decoding apparatus 500 according to the embodiment may initially reconstruct audio signals of an Ls channel, an Rs channel, an Lb channel, an Rb channel, an HBL channel, and an HBR channel among channels of the 7.1.4 channel group into the audio signals of the side channel group by upsampling the side channel information. The audio decoding apparatus 500 may obtain the refined audio signal of the main channel group (i.e., the L channel, the R channel, the HFL channel, and the HFR channel) and the refined audio signal of the side channel group through the first stage of the refinement operation.
The audio decoding apparatus 500 according to an embodiment may perform the alternate refinement operation on the audio signals of the side channels within the side channel group.
Referring to the graph 1033, the audio decoding apparatus 500 according to the embodiment may initially reconstruct the audio signals of the Ls channel, Rs channel, Lb channel, Rb channel, HBL channel, and HBR channel among the channels of the 7.1.4 channel group as the audio signals of the side channel group by upsampling the side channel information.
In the first stage of the refinement operation, first, the audio decoding apparatus 500 may obtain refined audio signals of the main channel group (i.e., L channel, R channel, HFL channel, and HFR channel) based on the initial audio signals of the side channel group. Next, the audio decoding apparatus 500 may obtain refined audio signals of Ls channels and Rs channels of the side channel group based on the initial audio signal of the side channel group and the refined audio signal of the main channel group. Next, the audio decoding apparatus 500 may obtain refined audio signals of Lb and Rb channels of the side channel group based on the initial audio signal of the side channel group, the refined audio signal of the main channel group, and the refined audio signals of Ls and Rs channels of the side channel group. Next, the audio decoding apparatus 500 may obtain refined audio signals of HBL channels and HBR channels of the side channel group based on the initial audio signal of the side channel group, the refined audio signal of the main channel group, and the refined audio signals of Ls channels, rs channels, lb channels, and Rb channels of the side channel group.
In the second stage of the refinement operation, the audio decoding apparatus 500 may obtain the secondarily refined audio signals of the main channel group (i.e., the L channel, R channel, HFL channel, and HFR channel) based on the refined audio signals of the main channel group and the refined audio signals of the side channel group. Next, the audio decoding apparatus 500 may obtain the secondarily refined audio signals of the Ls channel and the Rs channel of the side channel group based on the refined audio signals of the side channel group and the secondarily refined audio signals of the main channel group. Next, the audio decoding apparatus 500 may obtain the secondarily refined audio signals of the Lb channel and the Rb channel of the side channel group based on the refined audio signals of the side channel group, the secondarily refined audio signals of the main channel group, and the secondarily refined audio signals of the Ls channel and the Rs channel of the side channel group. Next, the audio decoding apparatus 500 may obtain the secondarily refined audio signals of the HBL channel and the HBR channel of the side channel group based on the refined audio signals of the side channel group, the secondarily refined audio signals of the main channel group, and the secondarily refined audio signals of the Ls channel, Rs channel, Lb channel, and Rb channel of the side channel group.
A method of processing an audio signal performed by the audio encoding apparatus 400 according to an embodiment will now be described with reference to the flowchart of fig. 11. Each of the operations shown in fig. 11 may be performed by at least one component included in the audio encoding apparatus 400, and redundant description thereof will be omitted.
Fig. 11 is a flowchart of an audio signal encoding method performed by the audio encoding apparatus 400 according to an embodiment of the present disclosure.
In operation S1101, the audio encoding apparatus 400 according to the embodiment may obtain a second audio signal corresponding to a channel included in a second channel group from a first audio signal corresponding to a channel included in a first channel group. The audio encoding apparatus 400 may obtain a second audio signal corresponding to a second channel group by down-mixing a first audio signal corresponding to a first channel group. For example, the first channel group may include a channel group of the original audio signal, and the second channel group may be constructed by combining at least two channels of channels included in the first channel group.
The audio encoding apparatus 400 according to the embodiment may assign a weight value to channels included in the first channel group based on a correlation of each of the channels included in the first channel group with the second channel group. The audio encoding apparatus 400 may obtain the second audio signal from the first audio signal by calculating a weighted sum of at least two first audio signals among the first audio signals based on weight values assigned to channels included in the first channel group.
The audio encoding apparatus 400 according to the embodiment may down-mix a first audio signal including a multi-channel audio signal centered around a listener into a second audio signal including a multi-channel audio signal centered in front of the listener. The audio encoding apparatus 400 may transform a first audio signal of a first channel group into a second audio signal of a second channel group based on a predetermined channel group transform rule. The audio encoding apparatus 400 may obtain an audio signal of one channel of the second channel group by mixing audio signals of at least two channels of the first channel group.
For example, the audio encoding apparatus 400 may obtain a second audio signal of a 3.1.2 channel group including six channels by down-mixing a first audio signal of a 7.1.4 channel group including twelve channels.
The 7.1.4 channel group may include an L (left) channel, a C (center) channel, an R (right) channel, an Ls (side left) channel, an Rs (side right) channel, an Lb (rear left) channel, an Rb (rear right) channel, an LFE (subwoofer) channel, an HFL (high front left) channel, an HFR (high front right) channel, an HBL (high rear left) channel, and an HBR (high rear right) channel. The 3.1.2 channel group may include an L3 (left) channel, a C (center) channel, an R3 (right) channel, an LFE channel, an HFL3 (high front left) channel, and an HFR3 (high front right) channel.
For example, the audio encoding apparatus 400 may obtain the audio signal of the left channel L3 of the 3.1.2 channel group by mixing the audio signal of the front left channel L, the audio signal of the side left channel Ls, and the audio signal of the rear left channel Lb among the channels included in the 7.1.4 channel group. The audio encoding apparatus 400 may obtain the audio signal of the right channel R3 of the 3.1.2 channel group by mixing the audio signal of the front right channel R, the audio signal of the side right channel Rs, and the audio signal of the rear right channel Rb among the channels included in the 7.1.4 channel group.
The audio encoding apparatus 400 may obtain the audio signal of the high front left channel HFL3 of the 3.1.2 channel group by mixing the audio signal of the side left channel Ls, the audio signal of the rear left channel Lb, the audio signal of the high front left channel HFL, and the audio signal of the high rear left channel HBL among the channels of the 7.1.4 channel group. The audio encoding apparatus 400 may obtain the audio signal of the high front right channel HFR3 of the 3.1.2 channel group by mixing the audio signal of the side right channel Rs, the audio signal of the rear right channel Rb, the audio signal of the high front right channel HFR, and the audio signal of the high rear right channel HBR among the channels of the 7.1.4 channel group. The audio encoding apparatus 400 may obtain the audio signal of the subwoofer channel LFE and the audio signal of the center channel C of the 3.1.2 channel group by applying weights to the audio signal of the subwoofer channel LFE and the audio signal of the center channel C among the channels included in the 7.1.4 channel group.
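The downmix rules above can be sketched as follows. The single weight value `W` and the resulting mixing coefficients are hypothetical; the actual weight values are chosen by the encoder and signaled in the additional information.

```python
# Hypothetical downmix weight; real values are encoder-defined and
# carried in the additional information.
W = 0.5

def downmix_714_to_312(ch):
    """ch maps 7.1.4 channel names to equal-length per-sample lists;
    returns the six channels of the 3.1.2 channel group."""
    def mix(names, weights):
        n = len(ch[names[0]])
        return [sum(w * ch[nm][i] for nm, w in zip(names, weights))
                for i in range(n)]
    return {
        "L3":   mix(["L", "Ls", "Lb"], [1.0, W, W]),
        "R3":   mix(["R", "Rs", "Rb"], [1.0, W, W]),
        "HFL3": mix(["HFL", "HBL", "Ls", "Lb"], [1.0, W, W, W]),
        "HFR3": mix(["HFR", "HBR", "Rs", "Rb"], [1.0, W, W, W]),
        "C":    mix(["C"], [1.0]),
        "LFE":  mix(["LFE"], [1.0]),
    }
```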
In operation S1102, the audio encoding apparatus 400 according to the embodiment may downsample at least one third audio signal corresponding to at least one channel identified from channels included in the first channel group based on correlation with the second channel group by using the AI model. The audio encoding apparatus 400 may obtain side channel information by downsampling an audio signal of at least one side channel included in a first audio signal corresponding to a first channel group.
The audio encoding apparatus 400 according to an embodiment may identify at least one side channel from among channels included in the first channel group based on correlation between channels included in the first channel group and channels included in the second channel group. Alternatively, the audio encoding apparatus 400 may identify at least one channel having a correlation with the second channel group smaller than a preset value from among channels included in the first channel group as a side channel. The audio encoding apparatus 400 may identify channels other than the at least one side channel from among channels included in the first channel group as a main channel.
The audio encoding apparatus 400 according to the embodiment may obtain the second audio signal of the second channel group by down-mixing the first audio signal of the first channel group based on the weight values of the channels allocated to the first channel group. The audio encoding apparatus 400 may identify at least one side channel and at least one main channel from among channels included in the first channel group based on weight values assigned to channels of the first channel group.
The main channel among the channels included in the first channel group may be referred to as a first sub-group of channels, and the side channel among the channels included in the first channel group may be referred to as a second sub-group of channels. When an audio signal corresponding to one channel of the second channel group is obtained by calculating a weighted sum of audio signals corresponding to at least two channels of the first channel group, the audio encoding apparatus 400 may identify a channel having the largest assigned weight value among the at least two channels of the first channel group as a main channel and another channel among the at least two channels of the first channel group as a side channel.
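The main/side identification rule above can be sketched as follows, given the per-output-channel downmix weights. The example weight table is hypothetical and exists only to exercise the rule.

```python
def split_main_side(downmix_weights):
    """downmix_weights maps each output channel of the second channel
    group to a dict {input channel: weight}. The input channel with the
    largest weight is a main channel; other contributors are side channels."""
    main, side = set(), set()
    for contributions in downmix_weights.values():
        best = max(contributions, key=contributions.get)
        main.add(best)
        side.update(c for c in contributions if c != best)
    # A channel tagged as main anywhere is never treated as a side channel.
    return main, side - main

# Hypothetical weight table for two output channels.
weights = {
    "L3":   {"L": 1.0, "Ls": 0.5, "Lb": 0.5},
    "HFL3": {"HFL": 1.0, "Ls": 0.5, "Lb": 0.5, "HBL": 0.5},
}
main, side = split_main_side(weights)
```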
For example, when the first channel group is a 7.1.4 channel group and the second channel group is a 3.1.2 channel group, the audio encoding apparatus 400 may determine the Ls channel, the Rs channel, the Lb channel, the Rb channel, the HBL channel, and the HBR channel of the first channel group as side channels. The audio encoding apparatus 400 may determine the L channel, the R channel, the HFL channel, and the HFR channel as main channels, which are the remaining channels in the channels of the first channel group.
The audio encoding apparatus 400 may encode and output only an audio signal of at least one side channel among channels of the first channel group, except for an audio signal of a main channel. At this time, the audio encoding apparatus 400 may downsample the audio signal of at least one side channel to improve transmission efficiency.
The audio encoding apparatus 400 according to an embodiment may extract a first set of audio samples and a second set of audio samples from an audio signal of at least one side channel. The audio encoding apparatus 400 may downsample the audio signal of the at least one side channel by obtaining a weighted sum of the first set of audio samples and the second set of audio samples.
For example, the audio encoding apparatus 400 may downsample the audio signal of the at least one side channel by applying a uniform weight value to each audio sample included in the first audio sample group and the second audio sample group and to each audio sample group.
As another example, the audio encoding apparatus 400 may downsample the audio signal of the at least one side channel by applying different weight values to each audio sample included in the first audio sample group and the second audio sample group and to each audio sample group.
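The sample-group downsampling above can be sketched as a factor-2 weighted sum of even-indexed and odd-indexed samples. The fixed weights are a stand-in; in the apparatus, a trained AI model supplies the weights.

```python
def downsample_pairwise(x, w_even=0.5, w_odd=0.5):
    """Downsample by 2: each output sample is a weighted sum of one
    sample from the first group (even indices) and one from the second
    group (odd indices). A trained AI model would supply the weights."""
    pairs = zip(x[0::2], x[1::2])
    return [w_even * a + w_odd * b for a, b in pairs]
```

With uniform weights this is a simple pairwise average; distinct weights bias the output toward one sample group, which is the degree of freedom the AI model can exploit.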
The audio encoding apparatus 400 according to the embodiment may obtain downsampling-related information about the first audio sample group and the second audio sample group by using the AI model. The audio encoding apparatus 400 may downsample the at least one third audio signal by applying the downsampling-related information to the first and second audio sample groups. In this case, the AI model may be an AI model trained to obtain downsampling-related information that minimizes the error between the original audio signal and a reconstructed audio signal that is reconstructed based on the second audio signal and the downsampled at least one third audio signal.
For example, the audio encoding apparatus 400 may obtain weight values to be applied to the first audio sample group and the second audio sample group, respectively, by using the AI model. The AI model may be an AI model trained to obtain weight values that minimize the error between the first audio signal and the reconstructed audio signals of all channels of the first channel group, which are to be reconstructed based on the second audio signal and the side channel information.
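A gradient-descent sketch of such training is shown below. Everything here is an assumption made only to render the idea runnable: the decoder is stood in for by a trivial nearest-neighbour upsampler (`repeat_reconstruct`), whereas the text describes a trained AI model on the decoding side.

```python
import numpy as np

def repeat_reconstruct(down, n):
    """Toy decoder stand-in: nearest-neighbour upsampling back to length n."""
    return np.repeat(down, 2)[:n]

def learn_weights(x, steps=300, lr=0.02):
    """Train the weight values (w1, w2) applied to the two sample groups so
    that the error between the original signal x and the reconstruction of
    the weighted-sum downsample is minimized."""
    x = np.asarray(x, dtype=np.float64)
    g1, g2 = x[0::2], x[1::2]           # the two audio sample groups
    w = np.array([1.0, 0.0])            # deliberately biased starting point
    for _ in range(steps):
        down = w[0] * g1 + w[1] * g2
        err = repeat_reconstruct(down, len(x)) - x
        # each downsampled value feeds two reconstructed samples
        grad_down = err[0::2] + err[1::2]
        grad = np.array([grad_down @ g1, grad_down @ g2]) * 2.0 / len(x)
        w -= lr * grad
    return w
```

For this toy repeat-upsampler, pairwise averaging happens to be optimal, so training recovers w ≈ (0.5, 0.5) from the biased start; a trained neural upsampler would generally prefer different weights.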
In operation S1103, the audio encoding apparatus 400 according to the embodiment may generate a bitstream including the second audio signal corresponding to the channels included in the second channel group and the downsampled at least one third audio signal. The audio encoding apparatus 400 may generate the bitstream by encoding the second audio signal and the downsampled at least one third audio signal.
The audio encoding apparatus 400 according to the embodiment may obtain the audio signal of the base channel group and the audio signal of the dependent channel group by mixing the second audio signal. The audio encoding apparatus 400 may obtain a first compressed signal by compressing audio signals of a base channel group, a second compressed signal by compressing audio signals of a dependent channel group, and a third compressed signal by compressing side channel information. The audio encoding apparatus 400 may generate a bitstream including the first compressed signal, the second compressed signal, and the third compressed signal.
For example, the base channel group may include left and right channels constituting a stereo channel. The dependent channel group may include the channels, among the channels included in the second channel group, other than the two channels corresponding to the base channel group.
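The base/dependent split can be sketched as a simple partition of the channel set. The channel names are taken from the 3.1.2 example layout given later in this description (L3, C, R3, LFE, HFL3, HFR3); treating the base group as exactly the stereo pair is the example above, not a fixed rule.

```python
def split_into_groups(audio_312):
    """Split a 3.1.2-channel signal (dict of channel name -> samples) into
    a base channel group (left/right stereo) and a dependent channel group
    (all remaining channels)."""
    base = {ch: audio_312[ch] for ch in ("L3", "R3")}
    dependent = {ch: sig for ch, sig in audio_312.items() if ch not in base}
    return base, dependent
```
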
The audio encoding apparatus 400 according to the embodiment may obtain additional information, that is, information to be used by a decoding terminal to decode a multi-channel audio signal based on the audio signal of the base channel group and the audio signal of the dependent channel group.
For example, the audio encoding apparatus 400 may decode the first compressed signal, the second compressed signal, and the third compressed signal, and may obtain a reconstructed audio signal corresponding to the first channel group from the decoded signals. The audio encoding apparatus 400 may obtain the additional information by comparing the reconstructed audio signal with the first audio signal. The audio encoding apparatus 400 may obtain, as the additional information, an audio signal applicable to a channel included in the reconstructed audio signal so as to minimize the error between the reconstructed audio signal and the first audio signal.
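One hedged guess at a concrete form of such error-minimizing additional information is a per-channel least-squares gain; the text only states that the information minimizes the error between the reconstructed and original signals, so this particular choice is an assumption for illustration.

```python
import numpy as np

def additional_info_gain(reconstructed, original, eps=1e-12):
    """Per-channel least-squares gain g minimizing
    ||g * reconstructed - original||^2 (closed-form solution)."""
    r = np.asarray(reconstructed, dtype=np.float64)
    o = np.asarray(original, dtype=np.float64)
    return float((r @ o) / (r @ r + eps))
```

The decoding terminal would then scale the reconstructed channel by g before output.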
The audio encoding apparatus 400 may generate a bitstream including additional information in addition to the first compressed signal, the second compressed signal, and the third compressed signal.
A method of reconstructing an audio signal from a received bitstream, which is performed by the audio decoding apparatus 500 according to an embodiment, will now be described with reference to the flowchart of fig. 12. Each operation shown in fig. 12 may be performed by at least one component included in the audio decoding apparatus 500, and a redundant description thereof will be omitted.
Fig. 12 is a flowchart of an audio signal decoding method performed by the audio decoding apparatus 500 according to an embodiment of the present disclosure.
In operation S1201, the audio decoding apparatus 500 according to the embodiment may obtain, by decoding a bitstream, a first audio signal corresponding to channels included in a first channel group and a downsampled second audio signal. The audio decoding apparatus 500 may obtain the first audio signal corresponding to the first channel group and side channel information including the downsampled second audio signal.
The audio decoding apparatus 500 according to the embodiment may obtain the audio signal of the base channel group and the audio signal of the dependent channel group by decompressing the bitstream. The audio decoding apparatus 500 may obtain a first audio signal corresponding to the first channel group based on the additional information, the audio signal of the base channel group included in the bitstream, and the audio signal of the dependent channel group.
According to an embodiment, the base channel group may include left and right channels constituting a stereo channel, and the dependent channel group may include the channels, among the channels included in the first channel group, other than the left and right channels corresponding to the base channel group.
The audio decoding apparatus 500 according to the embodiment may obtain a mixed audio signal corresponding to the first channel group by mixing the audio signal of the base channel group and the audio signal of the dependent channel group. The audio decoding apparatus 500 may obtain a first audio signal corresponding to the first channel group by rendering the mixed audio signal based on the additional information.
According to an embodiment, the first audio signal corresponding to the first channel group may comprise a multi-channel audio signal centered in front of the listener. For example, the first channel group may include 3.1.2 channels, and the 3.1.2 channels include an L3 channel, a C channel, an R3 channel, an LFE channel, an HFL3 channel, and an HFR3 channel.
In operation S1202, the audio decoding apparatus 500 according to the embodiment may obtain at least one second audio signal corresponding to at least one channel among channels included in the second channel group by upsampling the downsampled second audio signal using the AI model. The audio decoding apparatus 500 may obtain an audio signal of at least one side channel included in the second channel group by upsampling the side channel information. The first channel group may be a lower channel group including a smaller number of channels than the second channel group.
The audio decoding apparatus 500 according to the embodiment may obtain the audio signal of at least one side channel by upsampling the downsampled second audio signal included in the side channel information using the AI model.
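The shape of the upsampling step can be illustrated with plain linear interpolation. This is only a runnable placeholder: the text specifies a trained AI model for upsampling, so the interpolation rule below is an assumption.

```python
import numpy as np

def upsample_linear(downsampled, factor):
    """Linear-interpolation upsampling of a downsampled side-channel
    signal back to factor-times its length (AI-model stand-in)."""
    n_out = len(downsampled) * factor
    xp = np.arange(len(downsampled)) * factor  # positions of kept samples
    return np.interp(np.arange(n_out), xp, downsampled)
```

Samples past the last transmitted value are held constant, since `np.interp` clamps outside the known range.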
According to an embodiment, the AI model is an AI model for reconstructing audio signals of all channels of the second channel group from a first audio signal corresponding to the first channel group and a second audio signal of at least one side channel of channels included in the second channel group, and may be an AI model trained to minimize an error between the reconstructed audio signal and the original audio signal.
Based on the correlation of each channel included in the second channel group with the first channel group, the channels included in the second channel group may be classified into channels of a first sub-group having a relatively high correlation with the first channel group and channels of a second sub-group having a relatively low correlation with the first channel group. The first subset of channels may be referred to as the primary channels and the second subset of channels may be referred to as the side channels.
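The main/side classification can be sketched as ranking channels by their correlation with the transmitted first-group signal. The use of absolute Pearson correlation against a single first-group mixdown is an assumption; the text does not fix the correlation measure.

```python
import numpy as np

def classify_channels(second_group, first_group_mix, num_side):
    """Split second-group channels into main channels (relatively high
    correlation with the first-group signal) and side channels
    (relatively low correlation). num_side lowest-correlation channels
    become side channels."""
    corr = {name: abs(np.corrcoef(sig, first_group_mix)[0, 1])
            for name, sig in second_group.items()}
    ordered = sorted(corr, key=corr.get)   # ascending correlation
    side = ordered[:num_side]
    main = ordered[num_side:]
    return main, side
```
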
The audio decoding apparatus 500 according to the embodiment may obtain an audio signal of at least one main channel included in the second channel group from the first audio signal corresponding to the first channel group and the audio signal of the at least one side channel, according to a channel group transformation rule between the first channel group and the second channel group.
The audio decoding apparatus 500 according to an embodiment may refine the audio signal of at least one side channel and the audio signal of at least one main channel by using an AI model. The audio decoding apparatus 500 may obtain a third audio signal corresponding to the second channel group based on the refined audio signal of the side channel and the refined audio signal of the main channel.
According to an embodiment, the third audio signal corresponding to the second channel group may comprise a multi-channel audio signal centered on the listener. For example, the second channel group may include 7.1.4 channels, the 7.1.4 channels including an L (left) channel, a C (center) channel, an R (right) channel, an Ls (side left) channel, an Rs (side right) channel, an Lb (rear left) channel, an Rb (rear right) channel, an LFE (subwoofer) channel, an HFL (high front left) channel, an HFR (high front right) channel, an HBL (high rear left) channel, and an HBR (high rear right) channel. In this case, the audio signal of the at least one side channel may include audio signals of an Ls channel, an Rs channel, an Lb channel, an Rb channel, an HBL channel, and an HBR channel of the second channel group.
The audio decoding apparatus 500 according to the embodiment may obtain a fourth audio signal corresponding to the main channels from the first audio signal and the second audio signal corresponding to the channels of the second sub-group, according to a transformation rule from the channels included in the second channel group to the channels included in the first channel group. The audio decoding apparatus 500 may refine the second audio signal and the fourth audio signal by using the AI model. The audio decoding apparatus 500 may obtain third audio signals corresponding to all channels included in the second channel group from the refined second audio signal and the refined fourth audio signal.
For example, the fourth audio signal may be refined by a first layer within the AI model and the second audio signal may be refined by a second layer within the AI model. The refined fourth audio signal may be obtained by inputting the first audio signal, the second audio signal and the fourth audio signal to the first layer. The refined second audio signal may be obtained by inputting the first audio signal, the second audio signal, the refined fourth audio signal, and the value output by the first layer to the second layer.
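The two-layer data flow above can be sketched as a forward pass. The `toy_layer1`/`toy_layer2` functions are stand-ins for trained neural-network layers (an assumption made only so the sketch runs); what matters is the wiring: layer 2 receives the refined fourth signal and a value output by layer 1.

```python
import numpy as np

def two_stage_refine(first, second, fourth, layer1, layer2):
    """Layer 1 refines the main-channel estimate (fourth signal) and emits
    an intermediate value; layer 2 refines the side-channel signal (second
    signal) using the refined fourth signal and that value."""
    refined_fourth, hidden = layer1(first, second, fourth)
    refined_second = layer2(first, second, refined_fourth, hidden)
    return refined_fourth, refined_second

# Toy stand-in layers: layer 1 averages its estimate with the reference
# signal; layer 2 passes the side signal through unchanged.
def toy_layer1(first, second, fourth):
    refined = 0.5 * (np.asarray(fourth) + np.asarray(first))
    return refined, float(refined.mean())

def toy_layer2(first, second, refined_fourth, hidden):
    return np.asarray(second, dtype=np.float64)
```
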
In operation S1203, the audio decoding apparatus 500 according to the embodiment may reconstruct a third audio signal corresponding to a second channel group based on a first audio signal corresponding to a first channel group and a second audio signal of at least one side channel.
As described above, when a channel group is transformed so as to realize a sound image around a screen, the audio encoding apparatus 400 according to the embodiment may determine, as a side channel, a channel whose related information is not used or is least used, from among the channels of the second channel group. However, embodiments of the present disclosure are not limited thereto. The audio encoding apparatus 400 according to the embodiment may further include a sound image characteristic analysis module to change the type and number of side channels according to a change in the sound image characteristics of an input audio signal over time.
Fig. 13 illustrates an example of a transformation between channel groups performed based on sound image characteristic analysis in an audio processing system according to an embodiment of the present disclosure.
The audio encoding apparatus 400 according to an embodiment may receive an audio signal corresponding to a first channel group as an original audio signal. For example, as shown in fig. 13, the audio encoding apparatus 400 may receive an audio signal of 7.1.4 channels including an Ls channel, an Lb channel, an HBL channel, an L channel, an HFL channel, a C channel, an LFE channel, an HFR channel, an R channel, an HBR channel, an Rb channel, and an Rs channel as an original audio signal.
The audio encoding apparatus 400 according to the embodiment may analyze the sound image characteristics of the input original audio signal. The audio encoding apparatus 400 may transform a first channel group of an original audio signal into a second channel group in which a sound image is implemented around a screen of a display device based on the sound image characteristics. For example, the audio encoding apparatus 400 may transform an original audio signal of a 7.1.4 channel group into an audio signal O_tv of a 3.1.2 channel group. The audio encoding apparatus 400 may include the audio signal transformed into the second channel group in a bitstream and transmit the bitstream to the audio decoding apparatus 500.
The audio encoding apparatus 400 may determine at least one side channel from among channels of the first channel group based on the sound image characteristics.
The audio encoding apparatus 400 according to the embodiment may analyze the sound source characteristics of the original audio signal in each time unit by using the AI model. For example, the audio encoding apparatus 400 may analyze whether the original audio signal is a signal including speech of a plurality of speakers or a signal including speech uttered by one speaker. Alternatively, the audio encoding apparatus 400 may analyze sound source distribution characteristics in a horizontal direction or a vertical direction with respect to a listener.
Based on the result of the analysis, the audio encoding apparatus 400 may determine at least one side channel capable of maximizing audio reconstruction performance in the decoding terminal. The audio encoding apparatus 400 may downsample the audio signal of the at least one side channel along the time axis and transmit the downsampled audio signal to the audio decoding apparatus 500.
When the sound image characteristics of the original audio signal vary with time, the type and number of channels determined as side channels from among the channels of the first channel group by the audio encoding apparatus 400 may vary.
The audio encoding apparatus 400 may further include, in the bitstream, a signal M_adv obtained by downsampling, by a factor of 1/s, the N channels determined as side channels among the channels of the first channel group, and may transmit the bitstream to the audio decoding apparatus 500 (where N is an integer greater than 1 and s is a rational number greater than 1).
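A minimal sketch of forming M_adv from the N side channels is given below. Keeping every s-th sample is only one possible downsampling scheme, and s is restricted to an integer here for simplicity even though the text allows a rational factor.

```python
import numpy as np

def downsample_side_channels(side_signals, s):
    """Downsample each of the N side channels along the time axis by a
    factor of 1/s (decimation by an integer s, as an assumption)."""
    return {ch: np.asarray(sig, dtype=np.float64)[::s]
            for ch, sig in side_signals.items()}
```
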
The audio decoding apparatus 500 may upsample the downsampled and transmitted audio signal M_adv of the at least one side channel to reconstruct the audio signal of the at least one side channel.
The audio decoding apparatus 500 may reconstruct, from the audio signal O_tv of the second channel group, an audio signal of the first channel group in which a sound image is realized around the listener, by using the reconstructed audio signal of the at least one side channel. At this time, the audio decoding apparatus 500 may reconstruct the audio signals of the first channel group by reflecting the sound image characteristics.
Fig. 14 shows an example in which an audio processing system downsamples an audio signal of a side channel based on characteristics of the channel according to an embodiment.
The audio encoding apparatus 400 according to an embodiment may downsample the audio signal of the at least one side channel based on a predetermined rule. Downsampling based on a predetermined rule is advantageous for real-time processing because it introduces no additional delay.
According to another embodiment, as shown in fig. 14, the audio encoding apparatus 400 may extract characteristics (e.g., sparsity) of the side channels and may downsample the audio signals of the side channels based on the characteristics of the side channels.
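Sparsity, named above as an example side-channel characteristic, can be quantified in several ways; one common choice is the Hoyer measure based on the L1/L2 ratio. The particular metric is an assumption, since the text names sparsity only as an example.

```python
import numpy as np

def hoyer_sparsity(x, eps=1e-12):
    """Hoyer sparsity of a signal, mapped to [0, 1] with 1 = maximally
    sparse (a single nonzero sample) and 0 = a constant signal."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x * x).sum()) + eps
    return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))
```

A side channel scoring near 1 is mostly silent or impulsive, so it could tolerate a stronger downsampling factor than a dense channel scoring near 0.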
As described above, the audio encoding apparatus 400 according to the embodiment may transform a first audio signal including a listener centered multi-channel audio signal into a second audio signal including a listener front centered multi-channel audio signal, and may transmit the second audio signal including the listener front centered multi-channel audio signal. The audio decoding apparatus 500 may reconstruct a screen-centered multi-channel audio signal or a listener-centered multi-channel audio signal according to a change in a content consumption environment. However, embodiments of the present disclosure are not limited to embodiments in which the original input audio signal comprises a listener centered multi-channel audio signal and the transmitted bitstream comprises a screen centered multi-channel audio signal.
Fig. 15 illustrates an embodiment of an audio processing method according to embodiments of the present disclosure.
For example, when the input audio constitutes a screen-centered sound image, the audio encoding apparatus 400 may obtain, from the input audio, an "intermediate result audio signal" in which the sound image is transformed to be centered on a listening environment (e.g., a listening environment using general 2-channel speakers or a listening environment using 2-channel headphones), and may extract side channel information having minimal correlation with the intermediate result audio signal. The audio encoding apparatus 400 may compress and transmit the intermediate result audio signal and the side channel information. The audio decoding apparatus 500 may reconstruct "an audio signal constituting a screen-centered sound image" from the intermediate result audio signal and the side channel information.
Referring to fig. 15, the audio encoding apparatus 400 may transform an audio signal 1501 constituting a screen-centered sound image into an audio signal 1502 constituting a listening environment-centered sound image, and may transmit the audio signal 1502. At this time, the audio encoding apparatus 400 may also transmit an audio signal of a side channel that is least correlated with the audio signal 1502. The audio decoding apparatus 500 may reconstruct an audio signal 1502 constituting a sound image centered around the listening environment from the transmitted bit stream. The audio decoding apparatus 500 may reconstruct the audio signal 1502 into the audio signal 1503 constituting the screen-centered sound image by using the audio signals of the side channels.
As another example, when the input audio constitutes a listener-centered sound image, the audio encoding apparatus 400 may obtain an intermediate result audio signal according to a listening environment of a user through a device, and may extract side channel information having a minimum correlation with the intermediate result audio signal. The audio encoding apparatus 400 may compress and transmit the intermediate result audio signal and the side channel information. The audio decoding apparatus 500 may reconstruct "an audio signal constituting a screen-centered sound image" from the intermediate result audio signal and the side channel information.
Referring to fig. 15, the audio encoding apparatus 400 may transform an audio signal 1511 constituting an acoustic image centered on a listener into an audio signal 1512 constituting an acoustic image centered on a listening environment, and may transmit the audio signal 1512. For example, the audio signal 1512 constituting a sound image centered on the listening environment may include an audio signal of a 2-channel layout. The audio signal 1512 constituting the listening environment-centered sound image may be an intermediate result audio signal in which sound effects such as channel conversion are emphasized while maintaining the listener-centered sound image. The audio encoding apparatus 400 may also transmit an audio signal of a side channel that is least correlated with the audio signal 1512 constituting the listening environment-centered sound image. The audio decoding apparatus 500 may reconstruct the audio signal 1512 constituting a listening environment centered sound image from the transmitted bitstream. The audio decoding apparatus 500 may reconstruct the audio signal 1512 constituting the sound image centered on the listening environment into the audio signal 1513 constituting the sound image centered on the screen by using the audio signals of the side channels.
The machine-readable storage medium may be provided as a non-transitory storage medium. Here, a "non-transitory storage medium" is a tangible device, meaning only that it does not contain a signal (e.g., electromagnetic waves). The term does not distinguish between a case where data is semi-permanently stored in a storage medium and a case where data is temporarily stored. For example, the non-transitory storage medium may include a buffer that temporarily stores data.
According to embodiments of the present disclosure, methods according to various disclosed embodiments may be provided by being included in a computer program product. The computer program product is an article of commerce and thus can be transacted between a seller and a buyer. The computer program product may be distributed in the form of a device readable storage medium, such as a compact disc read only memory (CD-ROM), or may be distributed (e.g., downloaded or uploaded) directly and online through an application store or between two user devices, such as smartphones. In the case of online distribution, at least a portion of the computer program product (e.g., the downloadable app) may be at least temporarily stored in a device readable storage medium, such as a memory of a manufacturer server, a server of an application store, or a relay server, or may be temporarily generated.

Claims (15)

1. An audio processing apparatus, comprising:
a memory storing one or more instructions; and
a processor configured to execute one or more instructions to
Obtaining a second audio signal corresponding to a second channel included in a second channel group, the second channel group being obtained by combining at least two of first channels included in a first channel group,
Downsampling, by using an Artificial Intelligence (AI) model, at least one third audio signal corresponding to at least one channel identified from among the first channels included in the first channel group based on a correlation between the at least one channel and the second channel group, and
generating a bitstream comprising the second audio signal corresponding to the second channel included in the second channel group and the downsampled at least one third audio signal.
2. The audio processing apparatus of claim 1, wherein the processor is further configured to identify the at least one channel having a correlation with the second channel group below a predetermined value from among first channels included in the first channel group.
3. The audio processing apparatus of claim 1, wherein the processor is further configured to:
assigning a weight value to the first channels included in the first channel group based on a correlation of each of the first channels included in the first channel group with the second channel group;
obtaining a second audio signal from the first audio signals by calculating a weighted sum of at least two first audio signals among the first audio signals based on weight values assigned to the first channels included in the first channel group; and
The at least one channel is identified from the first channels included in the first channel group based on weight values assigned to the first channels included in the first channel group.
4. The audio processing apparatus of claim 3, wherein the processor is further configured to:
obtaining an audio signal corresponding to one channel among the second channels included in the second channel group by summing audio signals corresponding to at least two channels among the first channels included in the first channel group based on weight values assigned to the at least two channels; and
the channel of the at least two channels having the largest assigned weight value is identified as the channel corresponding to the first subset of the first channel group, and the remaining channels of the at least two channels are identified as channels corresponding to the second subset of the first channel group.
5. The audio processing apparatus of claim 1, wherein the processor is further configured to:
extracting a first set of audio samples and a second set of audio samples from the at least one third audio signal;
obtaining downsampling-related information about the first and second sets of audio samples by using the AI model; and
At least one third audio signal is downsampled by applying the downsampling-related information to the first and second groups of audio samples.
6. The audio processing apparatus of claim 5, wherein the AI model is trained to obtain the downsampling-related information by minimizing an error between a reconstructed audio signal and the first audio signal, the reconstructed audio signal being reconstructed based on the second audio signal and the downsampled at least one third audio signal.
7. The audio processing apparatus of claim 1, wherein the processor is further configured to:
obtaining an audio signal of a base channel group and an audio signal of a dependent channel group from a second audio signal corresponding to a second channel included in a second channel group;
obtaining a first compressed signal by compressing the audio signal of the base channel group;
obtaining a second compressed signal by compressing the audio signal of the dependent channel group;
obtaining a third compressed signal by compressing the downsampled at least one third audio signal; and
a bitstream is generated comprising the first compressed signal, the second compressed signal and the third compressed signal.
8. The audio processing apparatus of claim 7, wherein the base channel group includes two channels for stereo reproduction, and
the dependent channel group includes the channels, among the second channels included in the second channel group, other than two channels having a relatively high correlation with the two channels for stereo reproduction.
9. The audio processing apparatus of claim 1, wherein the first audio signal corresponding to the first channel included in the first channel group includes a listener-centered multi-channel audio signal, and
the second audio signal corresponding to the second channel included in the second channel group includes a multi-channel audio signal centered in front of the listener.
10. The audio processing apparatus of claim 1, wherein the first channel group comprises 7.1.4 channels, the 7.1.4 channels comprising a front left channel, a front right channel, a center channel, a left channel, a right channel, a rear left channel, a rear right channel, a subwoofer channel, a front left overhead channel, a front right overhead channel, a rear left overhead channel, and a rear right overhead channel,
the second channel group includes 3.1.2 channels, the 3.1.2 channels include a front left channel, a front right channel, a subwoofer channel, a front left overhead channel, and a front right overhead channel, and
the processor is further configured to identify, from among the first channels included in the first channel group, a left channel, a right channel, a rear left channel, a rear right channel, a rear left overhead channel, and a rear right overhead channel having relatively low correlation with the second channel group as the second channel of the second sub-group.
11. An audio processing apparatus, comprising:
a memory storing one or more instructions; and
a processor configured to execute one or more instructions to
A first audio signal corresponding to a first channel included in a first channel group and a downsampled second audio signal are obtained from a bitstream,
upsampling the downsampled second audio signal by using an Artificial Intelligence (AI) model to obtain at least one second audio signal corresponding to at least one channel among the second channels included in the second channel group, and
reconstructing a third audio signal corresponding to a second channel included in the second channel group from the first audio signal and the at least one second audio signal,
wherein the first channel group includes a smaller number of channels than the second channel group.
12. The audio processing apparatus of claim 11, wherein the second channels included in the second channel group are classified into channels of a first sub-group and channels of a second sub-group based on a correlation of each of the second channels included in the second channel group with the first channel group, and
the processor is further configured to obtain a second audio signal corresponding to a second subset of channels.
13. The audio processing apparatus of claim 12, wherein the processor is further configured to:
obtaining a fourth audio signal corresponding to the channels of the first sub-group from the first audio signal and the second audio signal corresponding to the channels of the second sub-group according to a transformation rule from the second channel included in the second channel group to the first channel included in the first channel group;
refining the second audio signal and the fourth audio signal by using the AI model; and
the third audio signal is obtained from the refined second audio signal and the refined fourth audio signal.
14. The audio processing apparatus of claim 13, wherein the fourth audio signal is refined by a first neural network layer within the AI model and the second audio signal is refined by a second neural network layer within the AI model,
the refined fourth audio signal is obtained by inputting the first audio signal, the second audio signal, and the fourth audio signal to the first neural network layer, and
the refined second audio signal is obtained by inputting the first audio signal, the second audio signal, the refined fourth audio signal, and the value output by the first neural network layer to the second neural network layer.
15. An audio processing method, comprising:
performing audio signal format conversion on an original audio signal by combining at least two channels of a plurality of channels included in the original audio signal to convert the original audio signal into a converted audio signal;
identifying at least one side channel signal and a plurality of base channel signals of the original audio signal based on a correlation of each of a plurality of channels of the original audio signal with the audio signal format conversion;
downsampling at least one side channel signal using an Artificial Intelligence (AI) model; and
a bitstream is generated comprising the converted audio signal and the downsampled at least one side channel signal.
CN202280011465.XA 2021-01-27 2022-01-27 Audio processing apparatus and method Pending CN116762128A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0011915 2021-01-27
KR1020210138834A KR20220108704A (en) 2021-01-27 2021-10-18 Apparatus and method of processing audio
KR10-2021-0138834 2021-10-18
PCT/KR2022/001496 WO2022164229A1 (en) 2021-01-27 2022-01-27 Audio processing device and method

Publications (1)

Publication Number Publication Date
CN116762128A 2023-09-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination