GB2561844A - Spatial audio processing - Google Patents

Spatial audio processing

Info

Publication number
GB2561844A
GB2561844A
Authority
GB
United Kingdom
Prior art keywords
channels
allocation
audio signal
frequency sub
multiple output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1706476.7A
Other versions
GB201706476D0 (en)
Inventor
Antti Johannes Eronen
Jussi Artturi Leppanen
Arto Juhani Lehtiniemi
Tapani Johannes Pihlajakuja
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1706476.7A priority Critical patent/GB2561844A/en
Publication of GB201706476D0 publication Critical patent/GB201706476D0/en
Priority to PCT/FI2018/050288 priority patent/WO2018197747A1/en
Publication of GB2561844A publication Critical patent/GB2561844A/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S7/00: Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30: Control circuits for electronic adaptation of the sound field
    • H04S7/302: Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S2400/00: Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S2420/00: Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01: Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
    • H04S2420/07: Synergistic effects of band splitting and sub-band processing

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

A method 400 for automatically changing an allocation of frequency sub-channels (51, fig 4) to multiple output audio channels (52, fig 4). The frequency sub-channels are of an input audio signal (113, fig 4), and each output audio channel is for rendering at a location within a sound space (10, fig 1). The allocation of frequency sub-channels is automatically changed, and may be in response to one or more changes in a power spectrum of the input audio signal. The automatic change may be in response to an automatic detection of a sub-optimal allocation such as by comparing cost function values of current and putative/alternative allocations. The method may have the steps of automatically determining a first power spectral density function 404 for the input audio signal (Fig 8A); automatically determining a second power spectral density function 414 for each of the allocated putative frequency sub-channels (Figs 8B-8E), performing a running average smoothing over frequency bins using a sliding window for first and second functions 406, 416, comparing the functions 420, and determining whether the putative allocation is better than the current allocation 422 using a mean square error value.

Description

(54) Title of the Invention: Spatial audio processing
Abstract Title: Spatial audio processing for providing a sound object with a spatial extent
At least one drawing originally filed was informal and the print reproduced here is taken from a later filed formal copy.
[Drawing sheets 1/6 to 6/6 (Figures GB2561844A_D0001 to GB2561844A_D0015) are reproduced as images in the original publication; see the Brief Description of the figures below.]
Application No. GB1706476.7
Date: 20 October 2017
Intellectual Property Office
The following terms are registered trade marks and should be read as such wherever they occur in this document: Bluetooth (Page 4).
Intellectual Property Office is an operating name of the Patent Office. www.gov.uk/ipo
TITLE
Spatial audio processing.
TECHNOLOGICAL FIELD
Embodiments of the present invention relate to spatial audio processing. In particular, embodiments relate to providing a sound object with spatial extent.
BACKGROUND
Audio content may or may not be a part of other content. For example, multimedia content comprises a visual content and an audio content. The visual content and/or the audio content may be perceived live or they may be recorded and rendered.
For example, in an augmented reality application, at least part of the visual content is observed by a user via a see-through display while another part of the visual content is displayed on the see-through display. The audio content may be live or it may be rendered to a user.
In a virtual reality application, the visual content and the audio content are both rendered.
It may in some circumstances be desirable to control how a user perceives audio content.
BRIEF SUMMARY
According to various, but not necessarily all, embodiments of the invention there is provided a method comprising: allocating frequency sub-channels of an input audio signal to multiple output audio channels, each output audio channel for rendering at a location within a sound space; and automatically changing an allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
According to various, but not necessarily all, embodiments of the invention there is provided an apparatus comprising: means for allocating frequency sub-channels of an input audio signal to multiple output audio channels, each output audio channel for rendering at a location within a sound space; and means for automatically changing an allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
According to various, but not necessarily all, embodiments of the invention there is provided a computer program that, when run on a processor, enables: allocating frequency sub-channels of an input audio signal to multiple output audio channels, each output audio channel for rendering at a location within a sound space; and automatically changing an allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
According to various, but not necessarily all, embodiments of the invention there is provided an apparatus comprising: at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: allocating frequency sub-channels of an input audio signal to multiple output audio channels, each output audio channel for rendering at a location within a sound space; and automatically changing an allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
According to various, but not necessarily all, embodiments of the invention there is provided examples as claimed in the appended claims.
BRIEF DESCRIPTION
For a better understanding of various examples that are useful for understanding the detailed description, reference will now be made by way of example only to the accompanying drawings in which:
Figs 1A to 1D illustrate examples of a sound space comprising one or more sound objects;
Figs. 2A to 2D illustrate examples of a recorded visual scene that respectively correspond with the sound space illustrated in Figs 1A to 1D;
Fig 3A illustrates an example of a controller and Fig 3B illustrates an example of a computer program;
Fig 4 illustrates an example of a spatial audio processing system comprising a spectral allocation module and a spatial allocation module;
Fig 5 illustrates an example of a method;
Fig 6 illustrates an example of a system;
Fig 7 illustrates an example of a method;
Fig 8A illustrates an example of a power spectral density function for an input audio signal, and each of Figs 8B to 8E illustrates an example of a power spectral density function for each one of the allocated frequency sub-channels of the input audio signal;
Fig 9 illustrates an example of a method for controlling rendering of spatial audio and, in particular, controlling rendering of a sound object that has a changing spatial extent, for example width.
DETAILED DESCRIPTION
The following description describes methods, apparatuses and computer programs that control how audio content is perceived and, in particular, control a perceived size and/or position of a source of the audio content. In some, but not necessarily all examples, spatial audio rendering may be used to render sound sources as sound objects at particular positions within a sound space.
Automatic or user controlled editing of a sound space may occur by, for example, repositioning one or more sound objects or by changing sound characteristics of the sound objects such as a perceived lateral and/or vertical extent of the sound source.
Fig 1A illustrates an example of a sound space 10 comprising a sound object 12 within the sound space 10. The sound object 12 may be a sound object as recorded or it may be a sound object as rendered. It is possible, for example using spatial audio processing, to modify a sound object 12, for example to change its sound or positional characteristics. For example, a sound object can be modified to have a greater volume, to change its position within the sound space 10 (Figs 1B & 1C) and/or to change its spatial extent within the sound space 10 (Fig 1D).
Fig 1B illustrates the sound space 10 before movement of the sound object 12 in the sound space 10. Fig 1C illustrates the same sound space 10 after movement of the sound object 12.
The sound object 12 may be a sound object as recorded and be positioned at the same position as a sound source of the sound object or it may be positioned independently of the sound source.
The position of a sound source may be tracked to render the sound object at the position of the sound source. This may be achieved, for example, when recording by placing a positioning tag on the sound source. The position and the position changes of the sound source can then be recorded. The positions of the sound source may then be used to control a position of the sound object 12. This may be particularly suitable where an up-close microphone such as a boom microphone or a Lavalier microphone is used to record the sound source.
In other examples, the position of the sound source within the visual scene may be determined during recording of the sound source by using spatially diverse sound recording. An example of spatially diverse sound recording is using a microphone array. The phase differences between the sound recorded at the different, spatially diverse microphones provide information that may be used to position the sound source using a beam-forming equation. For example, time-difference-of-arrival (TDOA) based methods for sound source localization may be used.
The positions of the sound source may also be determined by post-production annotation. As another example, positions of sound sources may be determined using Bluetooth-based indoor positioning techniques, or visual analysis techniques, a radar, or any suitable automatic position tracking mechanism.
Fig 1D illustrates a sound space 10 after extension of the sound object 12 in the sound space 10. The sound space 10 of Fig. 1D differs from the sound space 10 of Fig. 1C in that the spatial extent of the sound object 12 has been increased so that the sound object has a greater breadth (greater width).
In some examples, a visual scene 20 may be rendered to a user that corresponds with the rendered sound space 10. The visual scene 20 may be the scene recorded at the same time the sound source that creates the sound object 12 is recorded.
Fig. 2A illustrates an example of a visual scene 20 that corresponds with the sound space 10. Correspondence in this sense means that there is a one-to-one mapping between the sound space 10 and the visual scene 20 such that a position in the sound space 10 has a corresponding position in the visual scene 20 and a position in the visual scene 20 has a corresponding position in the sound space 10. Corresponding also means that the coordinate system of the sound space 10 and the coordinate system of the visual scene 20 are in register such that an object is positioned as a sound object in the sound space and as a visual object in the visual scene at the same common position from the perspective of a user.
The sound space 10 and the visual scene 20 may be three-dimensional.
A portion of the visual scene 20 is associated with a position of visual content representing a sound source 22 within the visual scene 20. The position of the sound source 22 in the visual scene 20 corresponds with a position of the sound object 12 within the sound space 10.
In this example, but not necessarily all examples, the sound source 22 is an active sound source producing sound that is or can be heard by a user, for example via rendering or live, while the user is viewing the visual scene via the display 200.
In some examples, parts of the visual scene 20 are viewed through the display 200 (which would then need to be a see-through display). In other examples, the visual scene 20 is rendered by the display 200.
In an augmented reality application, the display 200 is a see-through display and at least part of the visual scene 20 is a real, live scene viewed through the see-through display 200. The sound source 22 may be a live sound source or it may be a sound source that is rendered to the user. This augmented reality implementation may, for example, be used for capturing an image or images of the visual scene 20 as a photograph or a video.
In another application, the visual scene 20 may be rendered to a user via the display 200, for example, at a location remote from where the visual scene 20 was recorded. This situation is similar to the situation commonly experienced when reviewing images via a television screen, a computer screen or a mediated/virtual/augmented reality headset. In these examples, the visual scene 20 is a rendered visual scene. The active sound source 22 produces rendered sound, unless it has been muted. This implementation may be particularly useful for editing a sound space by, for example, modifying characteristics of sound sources and/or moving sound sources within the visual scene 20.
Fig 2B illustrates a visual scene 20 corresponding to the sound space 10 of Fig 1B, before movement of the sound source 22 in the visual scene 20. Fig 2C illustrates the same visual scene 20 corresponding to the sound space 10 of Fig 1C, after movement of the sound source 22.
Fig 2D illustrates the visual scene 20 after extension of the sound object 12 in the corresponding sound space 10. While the sound space 10 of Fig. 1D differs from the sound space 10 of Fig. 1C in that the spatial extent of the sound object 12 has been increased so that the sound object has a greater breadth, the visual scene 20 is not necessarily changed.
The above described methods may be performed using a controller. An example of a controller 300 is illustrated in Fig 3A.
Implementation of the controller 300 may be as controller circuitry. The controller 300 may be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
As illustrated in Fig 3A the controller 300 may be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 306 in a general-purpose or special-purpose processor 302 that may be stored on a computer readable storage medium (disk, memory etc.) to be executed by such a processor 302.
The processor 302 is configured to read from and write to the memory 304. The processor 302 may also comprise an output interface via which data and/or commands are output by the processor 302 and an input interface via which data and/or commands are input to the processor 302.
The memory 304 stores a computer program 306 comprising computer program instructions (computer program code) that controls the operation of the apparatus 300 when loaded into the processor 302. The computer program instructions, of the computer program 306, provide the logic and routines that enable the apparatus to perform the methods illustrated in the figures. The processor 302 by reading the memory 304 is able to load and execute the computer program 306.
The controller 300 may be part of an apparatus or system 320. The apparatus or system 320 may comprise one or more peripheral components 312. The display 200 is a peripheral component. Other examples of peripheral components may include: an audio output device or interface for rendering or enabling rendering of the sound space 10 to the user; a user input device for enabling a user to control one or more parameters of the method; a positioning system for positioning a sound source; an audio input device such as a microphone or microphone array for recording a sound source; an image input device such as a camera or plurality of cameras.
The apparatus or system 320 may be comprised in a headset for providing mediated reality.
The controller 300 may be configured as a sound rendering engine that is configured to control characteristics of a sound object 12 defined by sound content. For example, the rendering engine may be configured to control the volume of the sound content, a position of the sound object 12 for the sound content within the sound space 10, a spatial extent of a new sound object 12 for the sound content within the sound space 10, and other characteristics of the sound content such as, for example, tone or pitch or spectrum or reverberation etc. The sound object may, for example, be rendered via an audio output device or interface. The sound content may be received by the controller 300.
The sound rendering engine may, for example, comprise a spatial audio processing system 50 that is configured to control the position and/or extent of a sound object 12 within a sound space 10. Fig 4 illustrates an example of a spatial audio processing system 50 comprising a spectral allocation module 70 and a spatial allocation module 72. The spectral allocation module 70 takes frequency sub-channels 51 of a received input audio signal 113 and allocates them to multiple spatial audio channels 52 as allocated frequency sub-channels 53.
In some but not necessarily all examples, the input audio signal 113 comprises a monophonic source signal and comprises, is accompanied with or is associated with one or more spatial processing parameters defining a position and/or spatial extent of the sound source that will render the monophonic source signal.
Each spatial audio channel is for rendering at a different location within a sound space. The spatial allocation module 72 achieves the correct spatial rendering of the spatial audio channels 52 by controlled mixing 74 of the different spatial audio channels 52 across different audio device channels 76 that are rendered by different audio output devices. In this example, there are four audio device channels: one for front right (FR), one for front left (FL), one for rear right (RR) and one for rear left (RL), but other configurations are possible. In other examples, there may be more (e.g. 5.1 or 7.1 surround sound) or fewer (binaural) audio output devices.
The sound space 10 may be considered to be a collection of spatial audio channels 52 where each spatial audio channel 52 is a different direction. In some examples, the collection of spatial audio channels may be globally defined for all sound objects 12. In other examples, the collection of spatial audio channels may be locally defined for each sound object 12. The collection of spatial audio channels may be fixed or may vary dynamically with time.
In some but not necessarily all examples, each spatial audio channel may be rendered as a single rendered sound source using amplitude panning signals 54, for example, using Vector Base Amplitude Panning (VBAP).
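As an illustration of the amplitude panning mentioned above, the following is a minimal two-dimensional VBAP sketch in Python. It is not the patent's implementation: the loudspeaker pair, the angles and the constant-power normalisation are all illustrative assumptions.

```python
# Minimal 2-D Vector Base Amplitude Panning (VBAP) sketch for one loudspeaker pair.
import numpy as np

def vbap_pair_gains(target_deg, left_deg, right_deg):
    """Amplitude-panning gains for a loudspeaker pair enclosing the target direction."""
    t, l, r = np.deg2rad([target_deg, left_deg, right_deg])
    p = np.array([np.cos(t), np.sin(t)])                  # unit vector of the target direction
    L = np.array([[np.cos(l), np.sin(l)],
                  [np.cos(r), np.sin(r)]])                # rows: loudspeaker unit vectors
    g = p @ np.linalg.inv(L)                              # solve p = g @ L for the two gains
    return g / np.linalg.norm(g)                          # constant-power normalisation

# Pan a single spatial audio channel at +10 degrees between speakers at +/-30 degrees.
print(vbap_pair_gains(10.0, 30.0, -30.0))
```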
For example, in spherical polar co-ordinates the direction of the spatial audio channel Snm may be represented by the couplet of polar angle θn and azimuthal angle Φm, where θn is one polar angle in a set of N possible polar angles and Φm is one azimuthal angle in a set of M possible azimuthal angles.
A sound object 12 at position z may be associated with the spatial audio channel Snm that is closest to Arg(z).
If a sound object 12 is associated with a spatial audio channel Snm then it is rendered as a point source.
A sound object 12 may however have spatial extent and be associated with a plurality of spatial audio channels. For example, a sound object 12 may be simultaneously rendered in a set of spatial audio channels {S} defined by Arg(z) and a spatial extent of the sound object 12. That set of spatial audio channels {S} may, for example, include the spatial audio channels Sn'm' for each value of n' between n-δn and n+δn and of m' between m-δm and m+δm, where n and m define the spatial audio channel closest to Arg(z) and δn and δm define in combination a spatial extent of the sound object 12. The value of δn defines a spatial extent in the polar direction and the value of δm defines a spatial extent in the azimuthal direction.
The number of spatial audio channels and their spatial relationship in the set of spatial audio channels {S}, allocated by the spatial allocation module 72 is dependent upon the desired spatial extent of the sound object 12.
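A small sketch of how a set of spatial audio channels {S} might be gathered around the channel closest to Arg(z) given extents δn and δm. The channel grid (8 polar by 16 azimuthal directions), the clamping of polar indices and the wrapping of azimuthal indices are illustrative assumptions rather than details taken from the patent.

```python
# Illustrative selection of the channel set {S} for a sound object with spatial extent.
import numpy as np

N, M = 8, 16                                   # polar x azimuthal grid of channels S_nm
polar = np.linspace(0.0, np.pi, N)
azimuth = np.linspace(-np.pi, np.pi, M, endpoint=False)

def channel_set(obj_polar, obj_azimuth, delta_n, delta_m):
    """Return indices (n', m') of the channels spanned by the object's extent."""
    n = int(np.argmin(np.abs(polar - obj_polar)))        # closest polar index
    m = int(np.argmin(np.abs(azimuth - obj_azimuth)))    # closest azimuthal index
    chans = []
    for dn in range(-delta_n, delta_n + 1):
        for dm in range(-delta_m, delta_m + 1):
            n2 = min(max(n + dn, 0), N - 1)              # clamp the polar index
            m2 = (m + dm) % M                            # wrap the azimuthal index
            chans.append((n2, m2))
    return sorted(set(chans))

# A source straight ahead with a modest azimuthal extent maps to three channels.
print(channel_set(np.pi / 2, 0.0, 0, 1))
```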
A single sound object 12 may be simultaneously rendered in a set of spatial audio channels {S} by decomposing the audio signal representing the sound object 12 into multiple different frequency sub-channels 51 and allocating each frequency sub-channel 51 to one of multiple spectrally-limited audio signals 53.
Each of the multiple spectrally-limited audio signals 53 may have one or more frequency sub-channels 51 allocated to it (as an allocated frequency sub-channel). Each frequency sub-channel 51 may be allocated to only one spectrally-limited audio signal 53 (as an allocated frequency sub-channel).
Each spectrally-limited audio signal 53 is allocated into the set of spatial audio channels {S} 52.
For example, each spectrally-limited audio signal 53 is allocated to one spatial audio channel 52 and each spatial audio channel 52 comprises only one spectrally-limited audio signal 53, that is, there is a one-to-one mapping between the spectrally-limited audio signals and the spatial audio channels at the interface between the spectral allocation module 70 and the spatial allocation module 72. In some but not necessarily all examples, each spectrally-limited audio signal may be rendered as a single sound source using amplitude panning by the spatial allocation module 72.
For example, if the set of spatial audio channels {S} comprised X channels, the audio signal 113 representing the sound object 12 would be separated into X different spectrally-limited audio signals 53 in different non-overlapping frequency bands, each frequency band comprising one or more different frequency sub-channels 51 that may be contiguous and/or non-contiguous. In some but not necessarily all examples, there may be N = 2^n frequency sub-bands, for example, N may be 512 (n=9), 1024 (n=10) or 2048 (n=11). This may be achieved using a filter bank comprising a selective band-pass limited filter for each spectrally-limited audio signal 53/spatial audio channel or, as illustrated in Fig 4, by using digital signal processing to distribute time-frequency bins to different spectrally-limited audio signals 53/spatial audio channels 52. Each of the X different spectrally-limited audio signals 53 in different non-overlapping frequency bands would be provided to only one of the set of spatial audio channels {S}. Each of the set of spatial audio channels {S} would comprise only one of the X different spectrally-limited audio signals in different non-overlapping frequency bands.
Where digital signal processing is used to distribute time-frequency bins to different spatial audio channels, a short-term Fourier transform (STFT) may be used to transform from the time domain to the frequency domain, where selective filtering occurs for each frequency band. The different spectrally-limited audio signals 53 may be created using the same time period or different time periods for each STFT. The different spectrally-limited audio signals 53 may be created by selecting frequency sub-channels 51 of the same bandwidth (different center frequencies) or different bandwidths. The different spatial audio channels {S} into which the spectrally-limited audio signals 53 are placed may be defined by a constant angular distribution, e.g. the same solid angle (ΔΩ = sinθ·Δθ·ΔΦ in spherical coordinates), or by a non-homogeneous angular distribution, e.g. different solid angles.
An inverse transform 78 will be required to convert from the frequency to the time domain. In some examples, this may occur in the spectral allocation module 70 or the spatial allocation module 72 before mixing. In the example illustrated in Fig 4, the inverse transform 78 occurs for each audio device channel 76, after mixing 74, in the spatial allocation module 72.
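The following sketch illustrates the STFT-based variant described above: a mono input is transformed to time-frequency bins, each frequency bin is assigned to one of several channels, and each channel is resynthesised with an inverse STFT. The window length, channel count and the simple interleaved allocation are illustrative assumptions (the patent describes quasi-random allocations, see the Halton-based sketch further below); scipy's stft/istft are used only for convenience.

```python
# Split a mono signal into spectrally-limited signals by assigning STFT bins to channels.
import numpy as np
from scipy.signal import stft, istft

fs = 48_000
x = np.random.randn(fs)                        # stand-in for the mono input signal 113
n_channels = 4
nperseg = 1024

f, t, X = stft(x, fs=fs, nperseg=nperseg)      # time-frequency bins of the input

# Allocation of frequency bins to channels (a simple interleave here for brevity).
allocation = np.arange(len(f)) % n_channels

outputs = []
for c in range(n_channels):
    Xc = np.where((allocation == c)[:, None], X, 0.0)    # keep only this channel's bins
    _, xc = istft(Xc, fs=fs, nperseg=nperseg)            # back to the time domain
    outputs.append(xc)

# The spectrally-limited signals sum back (approximately) to the original input.
y = np.sum(outputs, axis=0)
print(np.allclose(y[:len(x)], x, atol=1e-8))
```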
Which frequency sub-channel 51 is allocated to which spectrally-limited audio signal 53/spatial audio channel 52 in the set of spatial audio channels {S} may be controlled by allocation module 60. The allocation may be a quasi-random allocation or may be determined based on a set of predefined rules. In some but not necessarily all examples, the allocation module 60 is a programmable filter bank.
The predefined rules may, for example, constrain spatial separation of spectrally-adjacent frequency sub-channels 51 to be above a threshold value. Thus frequency sub-channels 51 adjacent in frequency may be separated spatially so that they are not spatially adjacent. In some examples, effective spatial separation of the multiple frequency sub-channels 51 that are adjacent in frequency may be maximized.
The predefined rules may additionally or alternatively define how frequency sub-channels 51 are distributed amongst the spectrally-limited audio signals 53/set of spatial audio channels {S} 52. For example, a low discrepancy sequence, such as a Halton sequence, may be used to quasi-randomly distribute the frequency sub-channels 51 amongst the spectrally-limited audio signals 53/spatial audio channels {S} 52.
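As a sketch of the quasi-random distribution just described, the following assigns sub-channels to spatial audio channels using a Halton sequence; the base, the skip parameter and the channel counts are illustrative assumptions.

```python
# Distribute frequency sub-channels amongst spatial audio channels with a Halton sequence,
# so that spectrally adjacent sub-channels tend to be spread apart spatially.
def halton(index, base=2):
    """Return element `index` (1-based) of the Halton sequence for `base` (radical inverse)."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def distribute(n_subchannels, n_spatial_channels, base=2, skip=0):
    """Map sub-channel k to a spatial channel index, optionally skipping initial elements."""
    return [int(halton(skip + k + 1, base) * n_spatial_channels)
            for k in range(n_subchannels)]

# 16 sub-channels spread quasi-randomly over 4 spatial channels.
print(distribute(16, 4))
```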
Which frequency sub-channel 51 is allocated to which spectrally-limited audio signal 53/ spatial audio channel 52 in the set of spatial audio channels {S} may be dynamically controlled. For example, the allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52,53 may be automatically changed.
Fig 5 illustrates an example of a method 100 comprising: at block 102, allocating frequency sub-channels 51 of an input audio signal 113 to multiple output audio channels 52, 54, each output audio channel 52, 54 for rendering at a location within a sound space; and at block 104, automatically changing an allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52, 54.
The method 100 may be performed by the allocation module 60 and/or the controller 300. The method 100 may be used to improve the perceived spatial uniformity of a rendered spatially extended sound. Distinct audio components of the sound cannot, as a consequence, be heard at distinct spatial positions and the sound is heard as a uniform, spatially extended sound.
The apparatus or controller 300 may therefore comprise: at least one processor 302; and at least one memory 304 including computer program code, the at least one memory 304 and the computer program code configured to, with the at least one processor 302, cause the apparatus 300 at least to perform:
allocating frequency sub-channels 51 of an input audio signal 113 to multiple output audio channels 52, 54, each output audio channel 52, 54 for rendering at a location within a sound space 10; and automatically changing an allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52, 54.
Fig 6 illustrates an example of a system 110 that is configured to perform an example of the method 100. In this example, the allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels (spatial audio channels 52) is dependent upon one or more changes in the input audio signal 113.
In this example, but not necessarily all examples, the system 110 is configured to automatically detect a sub-optimal allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52, and in response to detecting a sub-optimal allocation, automatically uses a new allocation of frequency sub-channels 51 of the input audio signal to multiple output audio channels 52.
The system comprises a spatial extent synthesizer module 114 that changes an allocation of frequency sub-channels 51 of the input audio signal to multiple output audio channels to change a spatial extent of a sound object 12.
The allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels is defined by a distribution 119 provided by the distribution generator 118.
The distribution generator generates the new distribution 119 in response to a control signal 117 from the analyser module 116. The analyser module 116 is configured to automatically detect a sub-optimal allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52, and in response to detecting a sub-optimal allocation, automatically controls the distribution generator 118 to define a new allocation 119 of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52 that is used by spatial extent synthesizer module 114 to change the allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels. Each output audio channel is for rendering at a different location within a sound space 10 and thereby changes a spatial extent of a sound object 12 for rendering in the sound space 10.
In this example, the system 110 is configured to change the allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52 in dependence upon one or more changes in a power spectrum of the input audio signal 113.
In some but not necessarily all examples, the system 110 is configured to change the allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52 to reduce deviation from a power spectrum of the input audio signal 113. This may prevent the power spectrum of each of the allocated frequency sub-channels 53 from deviating significantly (e.g. by more than a threshold value) from a power spectrum of the input audio signal 113.
In some but not necessarily all examples, the system 110 automatically determines a first cost function value for the current allocated frequency sub-channels 53 (based on a current allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels 52) and automatically determines a second cost function value for putative allocated frequency sub-channels (based on a putative allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels 52), and in response to determining that the first cost function value is sufficiently greater than the second cost function value, makes the putative allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels the current allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels. The putative allocated frequency sub-channels therefore become the current allocated frequency sub-channels 53.
The distribution generator module 118 may generate the putative allocation of frequency sub-channels. The analyser module 116 may determine the cost function and compare the cost function values, making the decision whether to change the current allocation of frequency sub-channels of the input audio signal 113.
The cost function may compare one or more parameters of the current input signal 113 with one or more parameters of each of the different output audio channels 52.
The cost function may, for example, be based on different parameters, such as parameters p(f) that vary with frequency f (for example amplitude or power spectral density), or be based on cepstral analysis.
The cost function may for example be based on different combinations of parameters. It may, for example, comprise a function that averages a parameter over a range of frequencies, such as a moving mean calculation.
Determining whether or not to change the allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels (spatial audio channels 52) in dependence upon one or more changes in the input audio signal 113 may occur automatically as a consequence of detection of a change in the input audio signal 113, or may occur automatically and intermittently according to a schedule, for example periodically.
In some but not necessarily all examples, determining whether or not to change the allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels (spatial audio channels 52) in dependence upon one or more changes in the input audio signal 113 may occur automatically as a background process. For example, the second cost function value is determined automatically, for example continuously, intermittently or periodically, for one or more putative allocations of frequency sub-channels of the input audio signal 113 to multiple output audio channels, and in response to determining the second cost function value for a particular putative allocation is sufficiently less than the first cost function value and is the lowest of the second cost function values of the putative allocations, making the particular putative allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels the current allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels.
In some but not necessarily all examples, determining whether or not to change the allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels (spatial audio channels 52) in dependence upon one or more changes in the input audio signal 113 may occur automatically as a reactive process. For example, the first cost function value for a current allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels is determined automatically, for example continuously, intermittently or periodically, and if the first cost function value exceeds a threshold, a new allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels is used.
In this way, a current allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52,54 may be automatically and dynamically adjusted to reduce a cost function value for the current allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels.
In some but not necessarily all examples, automatically changing an allocation of frequency sub-channels of the input audio signal 113 to multiple output audio channels comprises changing a definition of the frequency sub-channels and/or changing a distribution of frequency sub-channels across output channels.
The frequency sub-channels 51 may, for example, each be defined by a center frequency and a bandwidth. The definition of a frequency sub-channel may be changed by changing its center frequency and/or by changing its bandwidth. The definition of the frequency sub-channels 51 may, for example, be changed subject to certain constraints, such as that the frequency sub-channels 51 do not overlap and/or that the frequency sub-channels 51 cover in combination certain frequency ranges.
In some but not necessarily all examples, a slowly varying part of the spectrum may be covered by fewer, wider frequency sub-channels 51 and a more quickly varying part of the spectrum may be covered by more, narrower frequency sub-channels 51.
In some but not necessarily all examples, the lower frequency part of the spectrum may be covered by narrower sub-channels 51 and the higher frequency part of the spectrum may be covered by wider frequency sub-channels 51.
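A sketch of one way the sub-channel definitions could be parameterised by centre frequency and bandwidth while respecting the non-overlap and coverage constraints mentioned above; the logarithmic band spacing and the band count are illustrative assumptions, not details taken from the patent.

```python
# Illustrative sub-channel definitions: narrower bands at low frequencies, wider bands
# at high frequencies, tiling the frequency range with no overlap.
import numpy as np

def make_subchannels(f_min=50.0, f_max=20_000.0, n_bands=20):
    """Return (centre, bandwidth) pairs for logarithmically spaced, non-overlapping bands."""
    edges = np.geomspace(f_min, f_max, n_bands + 1)       # contiguous band edges
    centres = np.sqrt(edges[:-1] * edges[1:])             # geometric centre of each band
    bandwidths = np.diff(edges)                           # each band's width
    return list(zip(centres, bandwidths))

bands = make_subchannels()
print(bands[0], bands[-1])                                # narrow low band, wide high band
```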
The distribution of frequency sub-channels 51 across output channels 52, 54 may be changed by changing the rules used to distribute frequency sub-channels 51 across output channels 52, 54. The rules define how the spectrally-limited audio signals 53 are distributed amongst the set of spatial audio channels {S}. They may or may not include constraints concerning spatial separation of frequency sub-channels 51 that are adjacent in the frequency spectrum.
The distribution of frequency sub-channels across output channels may be changed by changing one or more low-discrepancy sequences used for distribution of frequency sub-channels 51 across spatial audio channels 52.
A position or direction in the sound space 10 may be represented by one or more values derived from one or more low-discrepancy sequences. For example, a point in two dimensions (x, y) (or (|z|, Arg(z)) in polar co-ordinates) may be determined from two low discrepancy sequences, one for x and one for y. For example, a point in three dimensions (x, y, z) may be determined from three low discrepancy sequences, one for x, one for y and one for z.
There are various different examples of low-discrepancy sequences. One example is a Halton sequence. A Halton sequence is defined by a base value and by a skip value. A new Halton sequence is a Halton sequence with a new base value and/or a new skip value. A new Halton sequence may additionally or alternatively be created by scrambling a Halton sequence or leaping or changing leaping in a Halton sequence. Scrambling changes the order of a Halton sequence. Leaping results in certain values in the Halton sequence not being used.
The distribution of frequency sub-channels 51 across output channels 52 may be changed by changing one or more Halton sequences used for distribution of frequency sub-channels across output channels.
The parameters used for sequence generation, for example, base, skip, scrambling or leaping may be changed randomly or in a preset manner.
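The following sketch shows how changing the sequence parameters yields a new allocation: a plain Halton sequence, a scrambled one and a leaped one each map the same sub-channels to different spatial channels. scipy's quasi-Monte Carlo engine is used purely for illustration; the counts, the leap factor and the seed are assumptions.

```python
# Generate new quasi-random allocations by scrambling or leaping a Halton sequence.
import numpy as np
from scipy.stats import qmc

n_subchannels, n_spatial_channels = 16, 4

def allocation_from(points):
    """Quantise values in [0, 1) to spatial channel indices."""
    return (points * n_spatial_channels).astype(int)

plain = qmc.Halton(d=1, scramble=False).random(n_subchannels)[:, 0]
scrambled = qmc.Halton(d=1, scramble=True, seed=7).random(n_subchannels)[:, 0]
leaped = qmc.Halton(d=1, scramble=False).random(3 * n_subchannels)[::3, 0]  # keep every 3rd value

for name, seq in [("plain", plain), ("scrambled", scrambled), ("leaped", leaped)]:
    print(name, allocation_from(seq))
```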
Fig 7 illustrates an example of method 400 that changes the allocation of frequency sub-channels 51 of the input audio signal 113 to multiple output audio channels 52 so that a power spectrum of the frequency sub-channels has less deviation from a power spectrum of the input audio signal 113, i.e. they have the same overall spectral shape.
Digital signal processing is used to distribute time-frequency bins to different spatial audio channels. A short-term Fourier transform (STFT) is used to transform from the time domain to the frequency domain.
The method 400 comprises, at block 402, filtering the input audio signal 113 in the frequency domain; at block 404, automatically determining a first power spectral density function for the input audio signal 113 (Fig 8A), then at block 406, performing a running average smoothing over frequency bins using a sliding window to simplify the first power spectral density function.
The method 400 comprises, at block 412, filtering the input audio signal 113 in the frequency domain according to a putative allocation of frequency sub-channels; at block 414, automatically determining a second power spectral density function for each of the allocated putative frequency sub-channels (Figs 8B-8E); then at block 416, performing a running average smoothing over frequency bins using a sliding window to simplify the second power spectral density function for each allocated putative frequency sub-channel.
The method 400 then comprises, at block 420, comparing the simplified first power spectral density function and the simplified second power spectral density function using a mean square error function.
At block 422, the average mean square error value is compared to the previously stored mean square error (if any).
If the average mean square error value is less than the previously stored mean square error value, then the putative allocation of frequency sub-channels is better than the current allocation of sub-channels, and the putative allocation of frequency sub-channels is used as the current allocation of sub-channels (block 430). The mean square error value is stored in memory for subsequent comparison at block 422 in the next iteration of the method 400. The method 400 may then start again immediately, after a delay or in response to an interrupt.
If the average mean square error value is not less than the previous mean square error value, then the putative allocation of frequency sub-channels is not better than the current allocation of sub-channels, and the current allocation of sub-channels is unchanged. A new putative allocation of frequency sub-channels is generated and the method 400 repeats (block 432). In this example, a new Halton sequence(s) is generated.
In this example, the cost function is based on a mean square error comparison between a putative allocation of frequency sub-channels and the input audio signal 113. When the cost function value for a particular putative allocation is lower than that for the current allocation, the putative allocation becomes the current allocation. Although this method 400 is performed for only a single putative allocation of frequency sub-channels, it will be recognised that it may be performed in parallel simultaneously for multiple putative allocations of frequency sub-channels.
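A compact sketch of the cost computation underlying method 400: smoothed power spectral densities are compared with a mean square error, and the putative allocation wins only if its cost is lower. The use of Welch's method, the 9-bin smoothing window and the normalisation of each PSD to unit area (so that only spectral shape, not level, is compared) are illustrative assumptions.

```python
# Cost of an allocation: MSE between the input's smoothed PSD shape and each channel's.
import numpy as np
from scipy.signal import welch

def smoothed_psd(x, fs, window_bins=9):
    """Welch power spectral density followed by a running average over frequency bins."""
    f, psd = welch(x, fs=fs, nperseg=1024)
    kernel = np.ones(window_bins) / window_bins
    smoothed = np.convolve(psd, kernel, mode="same")      # sliding-window smoothing
    return f, smoothed / (smoothed.sum() + 1e-12)         # normalise so only shape matters

def allocation_cost(x, channel_signals, fs):
    """Average mean square error over the output channels of one (putative) allocation."""
    _, ref = smoothed_psd(x, fs)                          # smoothed PSD of input signal 113
    errors = [np.mean((smoothed_psd(xc, fs)[1] - ref) ** 2) for xc in channel_signals]
    return float(np.mean(errors))

def better_than_current(current_cost, putative_cost):
    """The putative allocation replaces the current one only if its cost is lower."""
    return putative_cost < current_cost

# Usage: compute allocation_cost(...) for the current and for a putative allocation of the
# same input block, then swap allocations when better_than_current(...) returns True.
```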
Fig 9 illustrates an example of a method 500 for controlling rendering of spatial audio and in particular controlling rendering of a sound object 12 that has a changing spatial extent, for example width.
In the previous description, a first allocation of frequency sub-channels 51 of an input audio signal 113 to multiple output audio channels 52, 54 is automatically changed to a second allocation of frequency sub-channels of an input audio signal 113 to multiple output audio channels 52,54. However, the second allocation of frequency sub-channels 51 may not be immediately used and there may be a gradual transition between the first allocation of frequency sub-channels 51 and the second allocation of frequency sub-channels 51.
In the example method 500, illustrated in Fig 9, at block 502 a first allocation of frequency sub-channels 51 is used to render a sound object 12, at block 506 the second allocation of frequency sub-channels 51 is used to render a sound object 12, and between blocks 502 and 506, at block 504, a transitional allocation of frequency sub-channels 51 is used to render the sound object 12.
For example, there may be a cross-fade from the first allocation to the second allocation. There may be an independently controlled cross-fade for each frequency sub-channel 51 such that different frequency sub-channels cross-fade at different rates.
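A sketch of the per-sub-channel cross-fade described above; the frame counts and fade lengths are illustrative assumptions.

```python
# Gradual transition from a first allocation to a second, with independent fade rates.
import numpy as np

def crossfade_gains(n_frames, fade_frames):
    """Per-frame gains for the old and new allocation of one frequency sub-channel."""
    ramp = np.clip(np.arange(n_frames) / max(fade_frames, 1), 0.0, 1.0)
    return 1.0 - ramp, ramp                                # (old-allocation gain, new-allocation gain)

# Different sub-channels may cross-fade at different rates.
fade_lengths = [10, 25, 40]                                # frames, one per sub-channel
for k, length in enumerate(fade_lengths):
    g_old, g_new = crossfade_gains(n_frames=50, fade_frames=length)
    print(f"sub-channel {k}: old gain at frame 20 = {g_old[20]:.2f}")
```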
Referring back to the preceding examples, in some situations, additional processing may be required. For example, when the sound space 10 is rendered to a listener through a head-mounted audio output device, for example headphones or a headset using binaural audio coding, it may be desirable for the rendered sound space to remain fixed in space when the listener turns their head in space. This means that the rendered sound space needs to be rotated relative to the audio output device by the same amount in the opposite sense to the head rotation. The orientation of the rendered sound space tracks with the rotation of the listener’s head so that the orientation of the rendered sound space remains fixed in space and does not move with the listener’s head. The system uses a transfer function to perform a transformation T that rotates the sound objects 12 within the sound space. A head related transfer function (HRTF) interpolator may be used for rendering binaural audio. Vector Base Amplitude Panning (VBAP) may be used for rendering in loudspeaker format (e.g. 5.1) audio.
The distance of a sound object 12 from an origin at the user may be controlled by using a combination of direct and indirect processing of audio signals representing the sound object 12. The audio signals are passed in parallel through a “direct” path and one or more “indirect” paths before the outputs from the paths are mixed together. The direct path represents audio signals that appear, to a listener, to have been received directly from an audio source and an indirect (decorrelated) path represents audio signals that appear to a listener to have been received from an audio source via an indirect path such as a multipath or a reflected path or a refracted path. Modifying the relative gain between the direct path and the indirect paths, changes the perception of the distance D of the sound object 12 from the listener in the rendered sound space 10. Increasing the indirect path gain relative to the direct path gain increases the perception of distance. The decorrelated path may, for example, introduce a pre-delay of at least 2 ms.
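A sketch of the direct/indirect mix for distance control described above; the 1/d gain law is an illustrative assumption, and a plain delayed copy stands in for a real decorrelator.

```python
# Control perceived distance by mixing a direct path with a decorrelated indirect path.
import numpy as np

def render_with_distance(x, fs, distance, ref_distance=1.0):
    """Mix direct and indirect copies of x; larger distance gives more indirect energy."""
    direct_gain = ref_distance / max(distance, ref_distance)    # simple 1/d attenuation
    indirect_gain = 1.0 - direct_gain
    pre_delay = int(0.002 * fs)                                  # at least 2 ms pre-delay
    indirect = np.concatenate([np.zeros(pre_delay), x])[:len(x)] # delayed stand-in for decorrelator
    return direct_gain * x + indirect_gain * indirect

fs = 48_000
x = np.random.randn(fs)
near, far = render_with_distance(x, fs, 1.0), render_with_distance(x, fs, 8.0)
```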
In some but not necessarily all examples, to achieve a sound object with spatial extent (width and/or height and/or depth), the spatial audio channels 52 are treated as spectrally distinct sound objects that are then positioned at suitable widths and/or heights and/or distances using known audio reproduction methods.
For example, in the case of loudspeaker sound reproduction, amplitude panning can be used for positioning a spectrally distinct sound object in the width and/or height dimension, and distance attenuation by gain control and optionally direct to reverberant (indirect) ratio can be used to position spectrally distinct sound objects in the depth dimension.
For example, in the case of binaural rendering, positioning in the width and/or height dimension is obtained by selecting suitable head related transfer function (HRTF) filters (one for the left ear, one for the right ear) for each of the spectrally distinct sound objects depending on its position. A pair of HRTF filters model the path from a point in space to the listener's ears. The HRTF coefficient pairs are stored for all the possible directions of arrival for a sound. Similarly, the distance dimension of a spectrally distinct sound object is controlled by modelling distance attenuation with gain control and optionally direct to reverberant (indirect) ratio.
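A sketch of placing one spectrally distinct sound object binaurally by picking the stored HRTF pair nearest to its azimuth. The HRTF table here is random placeholder data, not measured filters, and the 5-degree grid is an illustrative assumption.

```python
# Binaural placement of one spectrally distinct sound object via nearest stored HRTF pair.
import numpy as np

rng = np.random.default_rng(0)
directions = np.arange(-180, 180, 5)                             # stored azimuths (degrees)
hrtf_left = {d: rng.standard_normal(128) for d in directions}    # placeholder left-ear filters
hrtf_right = {d: rng.standard_normal(128) for d in directions}   # placeholder right-ear filters

def binauralise(x, azimuth_deg):
    """Convolve the signal with the HRTF pair closest to the requested azimuth."""
    d = directions[np.argmin(np.abs(directions - azimuth_deg))]
    left = np.convolve(x, hrtf_left[d])[:len(x)]
    right = np.convolve(x, hrtf_right[d])[:len(x)]
    return np.stack([left, right])                               # stereo (left, right) output

out = binauralise(np.random.randn(48_000), azimuth_deg=37.0)
```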
Thus, assuming that the sound rendering system supports width, then the width of a sound object may be controlled by the spatial allocation module 72. It achieves the correct spatial rendering of the spatial audio channels 52 by controlled mixing 74 of the different spatial audio channels 52 across different width-separated audio device channels 76 that are rendered by different audio output devices.
Thus assuming that the sound rendering system supports height, then the height of a sound object may be controlled in the same manner as a width of a sound object. The spatial allocation module 72 achieves the correct spatial rendering of the spatial audio channels 52 by controlled mixing 74 of the different spatial audio channels 52 across different height-separated audio device channels 76 that are rendered by different audio output devices.
Thus assuming that the sound rendering system supports depth, then the depth of a sound object may be controlled in the same manner as a width of a sound object.
The spatial allocation module 72 achieves the correct spatial rendering of the spatial audio channels 52 by controlled mixing 74 of the different spatial audio channels 52 across different depth-separated audio device channels 76 that are rendered by different audio output devices. However, if that is not possible, the spatial allocation module 72 may achieve the correct spatial rendering of the spatial audio channels 52 by controlled mixing 74 of the different spatial audio channels 52 across different depth-separated spectrally distinct sound objects at different perception distances by modelling distance attenuation using gain control and optionally direct to reverberant (indirect) ratio.
It will therefore be appreciated that the extent of a sound object can be controlled widthwise and/or heightwise and/or depthwise.
Referring back to Figs 3A and 3B, the computer program 306 may arrive at the apparatus 300 via any suitable delivery mechanism 310. The delivery mechanism 310 may be, for example, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a compact disc read-only memory (CD-ROM) or digital versatile disc (DVD), an article of manufacture that tangibly embodies the computer program 306. The delivery mechanism may be a signal configured to reliably transfer the computer program 306. The apparatus 300 may propagate or transmit the computer program 306 as a computer data signal.
Although the memory 304 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/ dynamic/cached storage.
Although the processor 302 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable. The processor 302 may be a single core or multi-core processor.
References to ‘computer-readable storage medium’, ‘computer program product’, ‘tangibly embodied computer program’ etc. or a ‘controller’, ‘computer’, ‘processor’ etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
As used in this application, the term ‘circuitry’ refers to all of the following:
(a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and (b) to combinations of circuits and software (and/or firmware), such as (as applicable): (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and (c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
This definition of ‘circuitry’ applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware. The term “circuitry” would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or other network device.
The blocks illustrated in the enclosed figures may represent steps in a method and/or sections of code in the computer program 306. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the blocks may be varied. Furthermore, it may be possible for some blocks to be omitted.
Where a structural feature has been described, it may be replaced by means for performing one or more of the functions of the structural feature whether that function or those functions are explicitly or implicitly described.
The controller 300 may, for example be a module. ‘Module’ refers to a unit or apparatus that excludes certain parts/components that would be added by an end manufacturer or a user.
The term ‘comprise’ is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use ‘comprise’ with an exclusive meaning then it will be made clear in the context by referring to “comprising only one” or by using “consisting”.
In this brief description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term ‘example’ or ‘for example’ or ‘may’ in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus ‘example’, ‘for example’ or ‘may’ refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a subclass of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example but does not necessarily have to be used in that other example.
Although embodiments of the present invention have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the invention as claimed.
Features described in the preceding description may be used in combinations other than the combinations explicitly described.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.
Whilst endeavoring in the foregoing specification to draw attention to those features of the invention believed to be of particular importance, it should be understood that the Applicant claims protection in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not particular emphasis has been placed thereon.
I/we claim:

Claims (15)

1. A method comprising:
allocating frequency sub-channels of an input audio signal to multiple output audio channels, each output audio channel for rendering at a location within a sound space; and
automatically changing an allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
2. A method as claimed in claim 1, wherein automatically changing the allocation of frequency sub-channels of the input audio signal to multiple output audio channels is dependent upon one or more changes in the input audio signal.
3. A method as claimed in claim 2, wherein the automatic changing of the allocation of frequency sub-channels of the input audio signal to multiple output audio channels is in dependence upon one or more changes in a power spectrum of the input audio signal.
4. A method as claimed in any preceding claim comprising automatically detecting a sub-optimal allocation of frequency sub-channels of the input audio signal to multiple output audio channels, and in response to detecting a sub-optimal allocation, automatically using a new allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
5. A method as claimed in any preceding claim comprising:
automatically determining a first cost function value for a current allocation of frequency sub-channels of the input audio signal to multiple output audio channels;
automatically determining a second cost function value for a putative allocation of frequency sub-channels of the input audio signal to multiple output audio channels; and
in response to determining the first cost function value is sufficiently greater than the second cost function value, making the putative allocation of frequency sub-channels of the input audio signal to multiple output audio channels the current allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
6. A method as claimed in claim 5, wherein the cost function compares one or more parameters of the current input signal with one or more parameters of each of the different output audio channels.
7. A method as claimed in claim 5 or 6, wherein at least the first cost function value for a current allocation of frequency sub-channels of the input audio signal to multiple output audio channels is determined automatically, intermittently.
8. A method as claimed in claim 7, wherein second cost function values for multiple putative allocations of frequency sub-channels of the input audio signal to multiple output audio channels are also determined automatically, intermittently, and in response to determining that the second cost function value for a particular putative allocation is sufficiently less than the first cost function value and is the lowest of the second cost function values of the putative allocations, making the particular putative allocation of frequency sub-channels of the input audio signal to multiple output audio channels the current allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
9. A method as claimed in claim 7, wherein at least the first cost function value for a current allocation of frequency sub-channels of the input audio signal to multiple output audio channels is determined automatically, intermittently, and if the first cost function value exceeds a threshold, using a new allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
10. A method as claimed in any preceding claim, comprising dynamically adjusting a current allocation of frequency sub-channels of the input audio signal to multiple output audio channels to reduce a cost function value for the current allocation of frequency sub-channels of the input audio signal to multiple output audio channels.
11. A method as claimed in any preceding claim wherein automatically changing an allocation of frequency sub-channels of the input audio signal to multiple output audio channels comprises changing a definition of the frequency sub-channels and/or a distribution of frequency sub-channels across output channels.
12. A method as claimed in any preceding claim wherein automatically changing an allocation of frequency sub-channels of the input audio signal to multiple output audio channels comprises changing a distribution of frequency sub-channels across output channels by changing rules used to distribute frequency sub-channels across output channels.
13. A method as claimed in any preceding claim wherein automatically changing an allocation of frequency sub-channels of the input audio signal to multiple output audio channels comprises changing a distribution of frequency sub-channels across output channels by changing one or more low-discrepancy sequences used for distribution.
14. A method comprising:
rendering multiple output audio channels at different locations within a sound space; and
automatically applying a transition effect when the allocation of frequency sub-channels of the input audio signal to multiple output audio channels is changed.
15. An apparatus comprising means for performing the method of any of claims 1 to 14 and/or a computer program that when run on a processor causes the method of any of claims 1 to 14.
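
The claims above recite the allocation of frequency sub-channels to output channels (claims 1 and 13) and a cost-function comparison between a current and a putative allocation (claims 5 and 6) only in general terms. The Python sketch below is offered purely as a non-authoritative illustration of how such steps might be realised; it is not part of the specification or claims. The van der Corput low-discrepancy sequence, the Welch/smoothed-PSD cost, the fixed switching margin, and all function names are assumptions introduced here for illustration, as are the NumPy and SciPy dependencies.

import numpy as np
from scipy.signal import welch

def van_der_corput(n, base=2):
    # Low-discrepancy (van der Corput) sequence of length n in [0, 1).
    seq = np.zeros(n)
    for i in range(n):
        k, denom, x = i + 1, 1.0, 0.0
        while k > 0:
            denom *= base
            k, rem = divmod(k, base)
            x += rem / denom
        seq[i] = x
    return seq

def allocate(num_subchannels, num_outputs, base=2):
    # Distribute frequency sub-channels across output channels using a
    # low-discrepancy sequence (one possible reading of claim 13).
    return (van_der_corput(num_subchannels, base) * num_outputs).astype(int)

def smoothed_psd(signal, fs, win=5):
    # Welch power spectral density followed by a running-average smoothing
    # over frequency bins (an assumed choice of spectral parameter).
    _, psd = welch(signal, fs=fs, nperseg=1024)
    kernel = np.ones(win) / win
    return np.convolve(psd, kernel, mode="same")

def allocation_cost(input_signal, output_signals, fs):
    # Cost of an allocation: mean squared error between the normalised,
    # smoothed PSD of the input signal and that of each output channel
    # (one possible cost function in the sense of claim 6). How the output
    # channel signals are synthesised from the allocated sub-channels is
    # outside the scope of this sketch.
    ref = smoothed_psd(input_signal, fs)
    ref = ref / (ref.sum() + 1e-12)
    errors = []
    for y in output_signals:
        p = smoothed_psd(y, fs)
        p = p / (p.sum() + 1e-12)
        errors.append(np.mean((ref - p) ** 2))
    return float(np.mean(errors))

def should_reallocate(current_cost, putative_cost, margin=0.1):
    # Adopt the putative allocation only if its cost is sufficiently lower
    # than the cost of the current allocation (cf. claim 5); the 10 %
    # margin is an arbitrary illustrative threshold.
    return putative_cost < current_cost * (1.0 - margin)

if __name__ == "__main__":
    print(allocate(32, 4))  # output-channel index assigned to each of 32 sub-channels

In a running system such functions might be evaluated intermittently, with allocation_cost() computed for the current allocation and for one or more putative allocations and should_reallocate() deciding whether to switch, broadly in the manner of claims 7 to 9; none of these specifics are mandated by the claims.
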
GB1706476.7A 2017-04-24 2017-04-24 Spatial audio processing Withdrawn GB2561844A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1706476.7A GB2561844A (en) 2017-04-24 2017-04-24 Spatial audio processing
PCT/FI2018/050288 WO2018197747A1 (en) 2017-04-24 2018-04-24 Spatial audio processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1706476.7A GB2561844A (en) 2017-04-24 2017-04-24 Spatial audio processing

Publications (2)

Publication Number Publication Date
GB201706476D0 GB201706476D0 (en) 2017-06-07
GB2561844A true GB2561844A (en) 2018-10-31

Family

ID=58795609

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1706476.7A Withdrawn GB2561844A (en) 2017-04-24 2017-04-24 Spatial audio processing

Country Status (2)

Country Link
GB (1) GB2561844A (en)
WO (1) WO2018197747A1 (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8712061B2 (en) * 2006-05-17 2014-04-29 Creative Technology Ltd Phase-amplitude 3-D stereo encoder and decoder
WO2011072729A1 (en) * 2009-12-16 2011-06-23 Nokia Corporation Multi-channel audio processing
US9420394B2 (en) * 2011-02-16 2016-08-16 Apple Inc. Panning presets
EP2830332A3 (en) * 2013-07-22 2015-03-11 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method, signal processing unit, and computer program for mapping a plurality of input channels of an input channel configuration to output channels of an output channel configuration

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030023429A1 (en) * 2000-12-20 2003-01-30 Octiv, Inc. Digital signal processing techniques for improving audio clarity and intelligibility
US20120008798A1 (en) * 2010-07-12 2012-01-12 Creative Technology Ltd Method and Apparatus For Stereo Enhancement Of An Audio System

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021021682A1 (en) * 2019-07-30 2021-02-04 Dolby Laboratories Licensing Corporation Rendering audio over multiple speakers with multiple activation criteria
US11659332B2 (en) 2019-07-30 2023-05-23 Dolby Laboratories Licensing Corporation Estimating user location in a system including smart audio devices
US11917386B2 (en) 2019-07-30 2024-02-27 Dolby Laboratories Licensing Corporation Estimating user location in a system including smart audio devices
US11968268B2 (en) 2019-07-30 2024-04-23 Dolby Laboratories Licensing Corporation Coordination of audio devices
US12003946B2 (en) 2019-07-30 2024-06-04 Dolby Laboratories Licensing Corporation Adaptable spatial audio playback
US12003673B2 (en) 2019-07-30 2024-06-04 Dolby Laboratories Licensing Corporation Acoustic echo cancellation control for distributed audio devices
US12003933B2 (en) 2019-07-30 2024-06-04 Dolby Laboratories Licensing Corporation Rendering audio over multiple speakers with multiple activation criteria
US12022271B2 (en) 2019-07-30 2024-06-25 Dolby Laboratories Licensing Corporation Dynamics processing across devices with differing playback capabilities

Also Published As

Publication number Publication date
GB201706476D0 (en) 2017-06-07
WO2018197747A1 (en) 2018-11-01

Similar Documents

Publication Publication Date Title
US11979733B2 (en) Methods and apparatus for rendering audio objects
GB2561844A (en) Spatial audio processing
US10136240B2 (en) Processing audio data to compensate for partial hearing loss or an adverse hearing environment
US9712939B2 (en) Panning of audio objects to arbitrary speaker layouts
US20170309289A1 (en) Methods, apparatuses and computer programs relating to modification of a characteristic associated with a separated audio signal
WO2018197748A1 (en) Spatial audio processing
US11348288B2 (en) Multimedia content
US10375472B2 (en) Determining azimuth and elevation angles from stereo recordings
US20170289724A1 (en) Rendering audio objects in a reproduction environment that includes surround and/or height speakers
US11627427B2 (en) Enabling rendering, for consumption by a user, of spatial audio content
US20210076152A1 (en) Controlling rendering of a spatial audio scene
WO2018193163A1 (en) Enhancing loudspeaker playback using a spatial extent processed audio signal
CN115190414A (en) Apparatus and method for audio processing
US11032639B2 (en) Determining azimuth and elevation angles from stereo recordings
RU2803638C2 (en) Processing of spatially diffuse or large sound objects
WO2023131398A1 (en) Apparatus and method for implementing versatile audio object rendering

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)