WO2019106221A1 - Processing of spatial audio parameters - Google Patents

Processing of spatial audio parameters

Info

Publication number
WO2019106221A1
WO2019106221A1 · PCT/FI2017/050833
Authority
WO
WIPO (PCT)
Prior art keywords
spatial audio
centroid
audio
vector
directional
Application number
PCT/FI2017/050833
Other languages
French (fr)
Inventor
Anssi RÄMÖ
Miikka Vilermo
Mikko Tammi
Adriana Vasilache
Lasse Laaksonen
Original Assignee
Nokia Technologies Oy
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to PCT/FI2017/050833 priority Critical patent/WO2019106221A1/en
Publication of WO2019106221A1 publication Critical patent/WO2019106221A1/en

Classifications

    • H04S 3/008 Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • G10L 19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L 19/0204 Speech or audio signal analysis-synthesis for redundancy reduction using spectral analysis with subband decomposition
    • H04R 1/406 Arrangements for obtaining desired directional characteristics only, by combining a number of identical microphone transducers
    • H04R 2430/21 Direction finding using differential microphone array [DMA]
    • H04R 3/005 Circuits for combining the signals of two or more microphones
    • H04S 2400/01 Multi-channel (more than two input channels) sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H04S 2400/03 Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S 2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems
    • H04S 2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H04S 2420/03 Application of parametric coding in stereophonic audio systems
    • H04S 2420/11 Application of ambisonics in stereophonic audio systems

Definitions

  • the present application relates to apparatus and methods for processing directional information relating to spatial audio capture parameters.
  • Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters.
  • parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands.
  • These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array.
  • These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
  • the directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
  • a parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can also be utilized as the spatial metadata for an audio codec.
  • these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
  • the stereo signal could be encoded, for example, with an EVS, IVAS or AAC encoder.
  • a decoder can decode the audio signals into PCM signals, and process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
  • the aforementioned solution is particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand- alone microphone arrays). It may also be desirable for such an encoder to have also other input types than microphone-array captured signals, for example, loudspeaker signals, audio object signals, or Ambisonic signals.
  • each time/frequency sub band can be assigned to multiple audio source directions in order to improve the perceptual performance of the encoder.
  • the multiple sound source directions allocated to each frequency band will need to be encoded which will increase the metadata bitrate. Therefore there is an interest in encoding the sound source directions as efficiently as possible, whilst maintaining the perceptual quality of the overall encoded spatial audio signal.
  • a method comprising: determining for an audio signal of two or more audio signals a plurality of spatial audio directional vectors; partitioning a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; assigning a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid; assigning a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
  • Assigning the audio source direction to the set of spatial audio direction vectors associated with the first centroid and a further audio source direction to the set of spatial audio direction vectors associated with the second centroid may further comprise: determining an energy value for the set of spatial audio directional vectors associated with the first centroid; determining an energy value for the set of spatial audio directional vectors associated with the second centroid; comparing the energy value for the set of spatial audio directional vectors associated with the first centroid with the energy value for the set of spatial audio directional vectors associated with the second centroid; and determining that the audio source direction is a dominant audio source direction and the further audio source direction is a secondary audio source direction when the energy value for the set of spatial audio directional vectors associated with the first centroid is greater than the energy value for the set of spatial audio directional vectors associated with the second centroid.
  • the first vector distance measure may be a minimum vector distance measure and the second vector distance measure may be other than the minimum vector distance measure.
  • the first spatial audio directional vector and the second spatial audio directional vector may be of a frequency sub band or frequency bin of the audio signal.
  • Each of the plurality of spatial audio directional vectors may comprise: an elevation component; and an azimuth component.
  • the vector space may be partitioned using a K-medoids partitioning algorithm.
  • the plurality of spatial audio directional vectors may form spatial audio metadata.
  • an apparatus comprising means for determining for an audio signal of two or more audio signals a plurality of spatial audio directional vectors; means for partitioning a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; means for assigning a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid; means for assigning a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; means for assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and means for assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
  • the means for assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid and the means for assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid may further comprise: means for determining an energy value for the set of spatial audio directional vectors associated with the first centroid; means for determining an energy value for the set of spatial audio directional vectors associated with the second centroid; means for comparing the energy value for the set of spatial audio directional vectors associated with the first centroid with the energy value for the set of spatial audio directional vectors associated with the second centroid; and means for determining that the audio source direction is a dominant audio source direction and the further audio source direction is a secondary audio source direction when the energy value for the set of spatial audio directional vectors associated with the first centroid is greater than the energy value for the set of spatial audio directional vectors associated with the second centroid.
  • the first vector distance measure may be a minimum vector distance measure and the second vector distance measure may be other than the minimum vector distance measure.
  • the first spatial audio directional vector and the second spatial audio directional vector may be associated with a frequency sub band or frequency bin of the audio signal.
  • Each of the plurality of spatial audio directional vectors may comprise: an elevation component; and an azimuth component.
  • the vector space may be partitioned using a K-medoids partitioning algorithm.
  • the plurality of spatial audio directional vectors may be spatial audio metadata.
  • an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to: determine for an audio signal of two or more audio signals a plurality of spatial audio directional vectors; partition a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; assign a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid; assign a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; assign an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and assign a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
  • the at least one memory and the computer program code configured to assign an audio source direction to the set of spatial audio direction vectors associated with the first centroid and assign a further audio source direction to the set of spatial audio direction vectors associated with the second centroid may be further configured to: determine an energy value for the set of spatial audio directional vectors associated with the first centroid; determine an energy value for the set of spatial audio directional vectors associated with the second centroid; compare the energy value for the set of spatial audio directional vectors associated with the first centroid with the energy value for the set of spatial audio directional vectors associated with the second centroid; and determine that the audio source direction is a dominant audio source direction and the further audio source direction is a secondary audio source direction when the energy value for the set of spatial audio directional vectors associated with the first centroid is greater than the energy value for the set of spatial audio directional vectors associated with the second centroid.
  • the first vector distance measure may be a minimum vector distance measure and the second vector distance measure may be other than the minimum vector distance measure.
  • the first spatial audio directional vector and the second spatial audio directional vector may be associated with a frequency sub band or frequency bin of the audio signal.
  • Each of the plurality of spatial audio directional vectors may comprise: an elevation component; and an azimuth component.
  • the vector space may be partitioned using a K-medoids partitioning algorithm.
  • the plurality of spatial audio directional vectors may be spatial audio metadata.
  • according to a third aspect there is provided computer program code arranged to realize the following when executed by a processor: determine for an audio signal of two or more audio signals a plurality of spatial audio directional vectors; partition a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; assign a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid; assign a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; assign an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and assign a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
  • Figure 2 shows schematically the analysis processor as shown in Figure 1 according to some embodiments;
  • Figure 3 shows a flow diagram of part of the operation of the metadata encoder/quantizer as shown in Figure 1 according to some embodiments;
  • Figure 4 shows a flow diagram of the operation of the system as shown in Figure 1 according to some embodiments;
  • Figure 5 shows a flow diagram of the operation of the analysis processor as shown in Figure 2 according to some embodiments;
  • Figure 6 shows a vector space of audio source directional parameters/vectors as received by the metadata encoder/quantizer as shown in Figure 1;
  • Figure 7 shows a partitioning of the vector space of Figure 6;
  • Figure 8 shows the classification of the audio source directional parameters/vectors of Figure 6 into a dominant audio source direction and a secondary audio source direction;
  • Figure 9 shows schematically an example device suitable for implementing the apparatus shown.
  • Embodiments of the Application: In order to represent immersive spatial sound, a number of concepts have emerged, of which perhaps the most common current method is to represent spatial sound as a set of waveform streams or channel signals, where each signal can be designated to feed a particular loudspeaker in a known prescribed position relative to the listener position.
  • in addition to the channel signals there is typically audio-related metadata relevant for the playback of the audio content captured as the individual channel signals.
  • the metadata may contain data for controlling a rendering process in a playback system and may contain information relating to the spatial characteristics of the individual audio streams.
  • Such data may comprise information on the spatial source direction such as the azimuth and elevation (or any other type of spatial direction representation) associated with each channel signal which can be used to assist in the rendering of the channel signals in the spatial audio playback system.
  • a coding gain may be achieved if the audio source directional parameters (azimuth and elevation) associated with the dominant audio source direction (over the frequency bands of a channel signal) are more closely clustered together as a representation in an n-dimensional vector space.
  • the coding gain can also be improved for the audio source directional parameters associated with the secondary audio source direction.
  • the audio source directional parameters over the sub bands of a channel will be fairly well aligned towards a single direction for the dominant audio source direction and also towards a different single direction for the secondary audio source direction. In this case the clustering of directional data points in the azimuth elevation vector space would exhibit little variance.
  • audio source directional parameters over the sub bands of a channel will have a propensity to be aligned in a number of directions for both dominant classified audio sources and secondary classified audio sources.
  • This can be envisaged by considering the classification of audio source directional parameters over a number of consecutive sub bands, where the classification of an audio source directional parameter can flip between dominant and secondary classifications as the sub bands of the channel signal are traversed.
  • the source data points assigned to each classification of dominant and secondary may be spread throughout the vector space thereby exhibiting poor clustering. This scenario may be prevalent in the case of an audio scene comprising a strong acoustic reflection.
  • the n-dimensional vector space is a two-dimensional vector space.
  • the input format may be any suitable input format, such as multi-channel loudspeaker, ambisonic (FOA/HOA) etc. It is understood that in some embodiments the channel location is based on a location of the microphone or is a virtual location or direction.
  • the output of the example system is a multi-channel loudspeaker arrangement. However it is understood that the output may be rendered to the user via means other than loudspeakers, e.g. with head-tracked headphones using Head Related Transfer Functions.
  • the multi-channel loudspeaker signals may be generalised to be two or more playback audio signals.
  • the concept as described herein is to increase the clustering of audio source directional parameters of spatial metadata in order to improve the subsequent coding and quantization of said parameters.
  • the proposed metadata index may then be used alongside a downmix signal ('channels') to define a parametric immersive format that can be utilized, e.g., for the Immersive Voice and Audio Services (IVAS) codec.
  • the system 100 is shown with an 'analysis' part 121 and a 'synthesis' part 131.
  • the 'analysis' part 121 is the part from receiving the multi-channel loudspeaker signals up to an encoding of the metadata and downmix signal, and the 'synthesis' part 131 is the part from a decoding of the encoded metadata and downmix signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).
  • the input to the system 100 and the‘analysis’ part 121 is the multi-channel signals 102.
  • a microphone channel signal input is described; however, any suitable input (or synthetic multi-channel) format may be implemented in other embodiments.
  • the multi-channel signals are passed to a downmixer 103 and to an analysis processor 105.
  • the downmixer 103 is configured to receive the multi-channel signals and downmix the signals to a determined number of channels and output the downmix signals 104.
  • the downmixer 103 may be configured to generate a two-channel audio downmix of the multi-channel signals.
  • the determined number of channels may be any suitable number of channels.
  • the downmixer 103 is optional and the multi-channel signals are passed unprocessed to an encoder 107 in the same manner as the downmix signals are in this example.
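  • purely as an illustrative sketch of such a downmixer (the 5.0 channel ordering and the gains below are assumptions for the example, not taken from this application), a two-channel downmix may be formed as a fixed matrix applied to the input channels:

```python
import numpy as np

# Hypothetical 5.0 channel order: [FL, FR, C, SL, SR].
# The gains are illustrative only; any suitable downmix matrix could be used.
DOWNMIX = np.array([
    [1.0, 0.0, 0.7071, 0.7071, 0.0],    # left  = FL + 0.71*C + 0.71*SL
    [0.0, 1.0, 0.7071, 0.0,    0.7071], # right = FR + 0.71*C + 0.71*SR
])

def downmix_to_stereo(channels: np.ndarray) -> np.ndarray:
    """channels: (5, n_samples) multi-channel input -> (2, n_samples) downmix."""
    return DOWNMIX @ channels
```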
  • the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 106 associated with the multi-channel signals and thus associated with the downmix signals 104.
  • the analysis processor 105 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, an audio source directional parameter 108, an energy ratio parameter 110, a coherence parameter 112, and a diffuseness parameter 114.
  • the audio source directional, energy ratio and diffuseness parameters may in some embodiments be considered to be spatial audio parameters.
  • the spatial audio parameters comprise parameters which aim to characterize the sound-field created by the multi-channel signals (or two or more playback audio signals in general).
  • the coherence parameters may be considered to be signal relationship audio parameters which aim to characterize the relationship between the multi-channel signals.
  • the parameters generated may differ from frequency band to frequency band.
  • in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted.
  • a practical example of this may be that for some frequency bands such as the lowest or the highest band some of the parameters are not required for perceptual reasons.
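  • one way to picture the per-band metadata described above is as a simple record per time-frequency analysis interval; the field set below follows the parameters named in this application, while the types and the optionality of the fields are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpatialMetadataTile:
    """Spatial metadata for one (band k, frame n) tile; an illustrative sketch.

    Optional fields model bands where, for perceptual reasons, only a
    subset of the parameters is generated and transmitted."""
    azimuth: float                      # radians, -pi..pi
    elevation: float                    # radians, -pi/2..pi/2
    energy_ratio: float                 # direct-to-total energy ratio, 0..1
    coherence: Optional[float] = None
    diffuseness: Optional[float] = None
```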
  • the downmix signals 104 and the metadata 106 may be passed to an encoder 107.
  • the encoder 107 may comprise an IVAS stereo core 109 which is configured to receive the downmix (or otherwise) signals 104 and generate a suitable encoding of these audio signals.
  • the encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the encoding may be implemented using any suitable scheme.
  • the encoder 107 may furthermore comprise a metadata encoder or quantizer 111 which is configured to receive the metadata and output an encoded or compressed form of the information.
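  • as a minimal sketch of what such a quantizer might do for the directional components (uniform scalar quantization with illustrative bit allocations; the application does not specify this particular scheme):

```python
import numpy as np

def quantize_direction(azimuth: float, elevation: float,
                       az_bits: int = 7, el_bits: int = 5):
    """Uniformly quantize azimuth (-pi..pi) and elevation (-pi/2..pi/2) to indices."""
    az_levels, el_levels = 2 ** az_bits, 2 ** el_bits
    # Map each angle to 0..1, then to the nearest of the available levels.
    az_idx = int(np.clip(round((azimuth / (2 * np.pi) + 0.5) * (az_levels - 1)),
                         0, az_levels - 1))
    el_idx = int(np.clip(round((elevation / np.pi + 0.5) * (el_levels - 1)),
                         0, el_levels - 1))
    return az_idx, el_idx
```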
  • the encoder 107 may further interleave, multiplex to a single data stream, or embed the metadata within encoded downmix signals before transmission or storage, as shown in Figure 1 by the dashed line.
  • the multiplexing may be implemented using any suitable scheme.
  • the received or retrieved data may be received by a decoder/demultiplexer 133.
  • the decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a downmix extractor 135 which is configured to decode the audio signals to obtain the downmix signals.
  • the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded metadata and generate metadata.
  • the decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the decoded metadata and downmix audio signals may be passed to a synthesis processor 139.
  • the system 100 'synthesis' part 131 further shows a synthesis processor 139 configured to receive the downmix and the metadata and re-create in any suitable format a synthesized spatial audio in the form of multi-channel signals 110 (these may be multi-channel loudspeaker format or in some embodiments any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the downmix signals and the metadata.
  • First the system (analysis part) is configured to receive multi-channel audio signals as shown in Figure 4 by step 401.
  • then the system is configured to generate a downmix of the multi-channel signals as shown in Figure 4 by step 403.
  • the system is then configured to analyse the signals to generate metadata such as direction parameters, energy ratio parameters, diffuseness parameters and coherence parameters as shown in Figure 4 by step 405.
  • the system is then configured to encode for storage/transmission the downmix signal and metadata as shown in Figure 4 by step 407. After this the system may store/transmit the encoded downmix and metadata as shown in Figure 4 by step 409.
  • the system may retrieve/receive the encoded downmix and metadata as shown in Figure 4 by step 411.
  • the system is configured to extract the downmix and metadata from encoded downmix and metadata parameters, for example demultiplex and decode the encoded downmix and metadata parameters, as shown in Figure 4 by step 413.
  • the system (synthesis part) is configured to synthesize an output multi-channel audio signal based on the extracted downmix of the multi-channel audio signals and the metadata with coherence parameters as shown in Figure 4 by step 415.
  • the analysis processor 105 in some embodiments comprises a time-frequency domain transformer 201.
  • the time-frequency domain transformer 201 is configured to receive the multi-channel signals 102 and apply a suitable time to frequency domain transform such as a Short Time Fourier Transform (STFT) in order to convert the input time domain signals into suitable time-frequency signals.
  • These time-frequency signals may be passed to a direction analyser 203 and to a signal analyser 205.
  • the time-frequency signals 202 may be represented in the time-frequency domain representation as S_i(b, n), where b is the frequency bin index and n is the time index; here n can be considered as a time index with a lower sampling rate than that of the original time-domain signals.
  • Each subband k has a lowest bin b_k,low and a highest bin b_k,high, and the subband contains all bins from b_k,low to b_k,high.
  • the widths of the subbands can approximate any suitable distribution, for example the Equivalent Rectangular Bandwidth (ERB) scale or the Bark scale.
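  • the following sketch illustrates the bin-to-band grouping described above: an STFT produces the time-frequency signals, and the bins b are collected into subbands k between band edges; the edges below are illustrative Bark-like values, not values taken from this application:

```python
import numpy as np
from scipy.signal import stft

def time_frequency_bands(x: np.ndarray, fs: int, n_fft: int = 1024):
    """Return time-frequency signals S[i, b, n] and (b_low, b_high) per subband.

    x: (n_channels, n_samples) time-domain input. Band edges approximate a
    Bark-like resolution and are illustrative only."""
    _, _, S = stft(x, fs=fs, nperseg=n_fft)  # S: (n_channels, n_bins, n_frames)
    edges_hz = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                9500, 12000, fs / 2]
    hz_per_bin = fs / n_fft
    bands = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        b_low, b_high = int(lo / hz_per_bin), int(hi / hz_per_bin) - 1
        if b_high >= b_low:
            bands.append((b_low, b_high))  # subband k contains bins b_low..b_high
    return S, bands
```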
  • the analysis processor 105 comprises a direction analyser 203.
  • the direction analyser 203 may be configured to receive the time- frequency signals 202 and based on these signals estimate audio source directional parameters 108.
  • the audio source directional parameters may be determined based on any audio based 'direction' determination.
  • the direction analyser 203 is configured to estimate the direction with two or more signal inputs. This represents the simplest configuration to estimate a 'direction'; more complex processing may be performed with even more signals.
  • the direction analyser 203 may thus be configured to provide an azimuth for each frequency band and temporal frame, denoted as azimuth φ(k, n), and elevation θ(k, n).
  • the direction parameter 108 may also be passed to a signal analyser 205.
  • the direction analyser 203 is configured to determine an energy ratio parameter 110.
  • the energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from a direction.
  • the direct-to-total energy ratio r(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter.
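  • one known way to obtain such estimates is the intensity-based (DirAC-style) sketch below for first-order Ambisonic time-frequency signals; it is illustrative only (this application does not mandate a particular estimator) and uses a stability measure of the directional estimate as the ratio, one of the options mentioned above:

```python
import numpy as np

def direction_and_ratio(W, X, Y, Z):
    """Estimate azimuth phi(k, n), elevation theta(k, n) and a ratio r(k, n)
    for one band from complex STFT values of shape (n_frames,); a sketch."""
    # Active intensity vector per frame; sound arrives from the opposite direction.
    I = np.stack([np.real(np.conj(W) * X),
                  np.real(np.conj(W) * Y),
                  np.real(np.conj(W) * Z)])
    doa = -I
    azimuth = np.arctan2(doa[1], doa[0])
    elevation = np.arctan2(doa[2], np.hypot(doa[0], doa[1]))
    # Stability of the directional estimate: norm of the mean intensity over
    # the mean intensity norm, in 0..1 (1 = a perfectly stable direction).
    r = np.linalg.norm(I.mean(axis=1)) / (np.linalg.norm(I, axis=0).mean() + 1e-12)
    return azimuth, elevation, float(np.clip(r, 0.0, 1.0))
```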
  • the estimated direction 108 parameters may be output (and passed to an encoder).
  • the estimated energy ratio parameters 110 may be passed to a signal analyser 205.
  • the analysis processor 105 comprises a signal analyser 205.
  • the signal analyser 205 is configured to receive parameters (such as the azimuth φ(k, n) and elevation θ(k, n) 108, and the direct-to-total energy ratios r(k, n) 110) from the direction analyser 203.
  • the signal analyser 205 may be further configured to receive the time-frequency signals S_i(b, n) 202 from the time-frequency domain transformer 201. All of these are in the time-frequency domain; b is the frequency bin index, k is the frequency band index (each band potentially consists of several bins b), n is the time index, and i is the channel.
  • the parameters may be combined over several time indices. The same applies to the frequency axis: as has been expressed, the direction of several frequency bins b could be expressed by one direction parameter in band k consisting of several frequency bins b. The same applies to all of the discussed spatial parameters herein.
  • the signal analyser 205 is configured to produce a number of signal parameters. In the following disclosure there are two parameters: coherence and diffuseness, both analysed in the time-frequency domain. In addition, in some embodiments the signal analyser 205 is configured to modify the estimated energy ratios r(k, n). The signal analyser 205 is configured to generate the coherence and diffuseness parameters based on any suitable known method.
  • the first operation is one of receiving time domain multichannel (loudspeaker) audio signals as shown in Figure 5 by step 501.
  • the next operation is applying a time domain to frequency domain transform (e.g. STFT) as shown in Figure 5 by step 503, and then applying direction analysis to determine the direction and energy ratio parameters as shown in Figure 5 by step 505.
  • then applying analysis to determine coherence parameters (such as surrounding and/or spread coherence parameters) and diffuseness parameters is shown in Figure 5 by step 507.
  • the energy ratio may also be modified based on the determined coherence parameters in this step.
  • in Figure 3 there is a flow chart depicting part of the operation of the metadata encoder/quantizer 111. Specifically, Figure 3 depicts the audio source directional parameter metadata clustering procedure according to some embodiments, which may be performed as part of the metadata encoding operation within the metadata encoder/quantizer 111.
  • the metadata encoder/quantizer 111 may be arranged to receive for each time-frequency analysis window the audio source directional parameters corresponding to each frequency bin k. This is shown in Figure 3 as the processing step 301.
  • the audio source directional parameters/vectors as received in processing step 301 can be considered as a data set of audio source directional parameters/vectors.
  • the data set will be a two-dimensional vector data set where each audio source directional parameter/vector within the data set comprises the components of azimuth and elevation values.
  • the data set can comprise the audio source directional parameters/vectors (comprising the azimuth and elevation values) associated with the frequency bins k of the analysed channel signal, as produced by the analyser processor 105.
  • for each sub band there can be one audio source directional parameter/vector which may be classified as the dominant audio source direction and another audio source directional parameter/vector which may be classified as the secondary audio source direction.
  • Figure 6 depicts a data set of audio source directional parameters/vectors analysed from a channel signal with 24 sub bands (or 24 frequency bins).
  • Figure 6 depicts one component of the directional vector as the elevation, expressed in radians and ranging from −π/2 to π/2 (605), and the other component as the azimuth, also expressed in radians, ranging from −π to π (607). It is to be appreciated that the audio source directional parameters/vectors shown in Figure 6 are not closely clustered and that they exhibit a high variance. As mentioned before this may be attributed to the overall acoustic space not having a clear dominant audio source.
  • the audio source direction parameters/vectors may be initially unclassified in terms of an audio source direction when entering the metadata encoder/quantizer 111.
  • a known data set clustering or partitioning algorithm may then be applied to the full data set of directional parameters/vectors in order to determine clustering centroids and corresponding Voronoi regions.
  • the clustering algorithm used may be known to the skilled person.
  • suitable known algorithms may be any one of the following: K-medoids, Lloyd-Max or K-means.
  • Figure 7 depicts the result of using a K-medoids partition algorithm on the audio source directional parameter/vector data set of Figure 6. It can be seen that the algorithm has determined two clusters, a first cluster with centroid 701 and a second cluster with centroid 703.
  • the processing step of performing a partitioning algorithm on the audio source directional parameter/vector data set in order to find clustering centroids is shown as processing step 303 in Figure 3.
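  • a minimal sketch of such a partitioning (K-medoids with k = 2 over the two-dimensional azimuth/elevation data set) is given below; plain Euclidean distance in the flat azimuth-elevation plane is an illustrative assumption, and a production implementation might prefer an angle-aware, wrap-around distance:

```python
import numpy as np

def k_medoids(points: np.ndarray, k: int = 2, n_iter: int = 50) -> np.ndarray:
    """Return indices of k medoid points for a (n, 2) set of (azimuth, elevation)
    vectors; a sketch assuming every cluster stays non-empty."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (n, n)
    medoids = np.random.default_rng(0).choice(len(points), k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(d[:, medoids], axis=1)  # nearest-medoid assignment
        # For each cluster, move the medoid to the member with the smallest
        # total distance to the other members.
        new = np.array([np.flatnonzero(labels == c)[
            np.argmin(d[np.ix_(labels == c, labels == c)].sum(axis=1))]
            for c in range(k)])
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids
```
  • the medoid coordinates, points[k_medoids(points)], then play the role of the two cluster centroids 701 and 703 of Figure 7.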
  • each sub band or frequency division is analysed for a number of audio source directional vectors, whereby each audio source directional parameter/vector is assigned to one of a number of different classifications of audio source directions.
  • each sub band has two audio source directional vectors as stated above.
  • the metadata encoder/quantizer 111 may then be arranged to effectively reclassify (or classify) the data set of audio source directional parameters/vectors according to the clustering centroids as determined by processing step 303.
  • the reclassification may be performed over each sub band pair of audio source directional parameters/vectors by assigning one of the pair of audio source directional parameters/vectors to a particular centroid on the condition that the distance between the assigned audio source directional vector and the particular centroid is a minimum.
  • a distance measure is determined between a first audio source directional parameter/vector and the first centroid and a distance measure is determined between the first audio source directional parameter/vector and the second centroid.
  • the process can be repeated for the second audio source directional parameter/vector of the pair of audio source directional parameters/vectors.
  • a distance measure is determined between the second audio source directional parameter/vector and the first centroid and a distance measure between the second audio source directional parameter/vector and the second centroid.
  • the distance measures can then be compared in order to determine which audio source directional parameter/vector of the pair of audio source directional parameters/vectors for a particular sub band has a minimum distance measure irrespective of which centroid it is in relation to.
  • the audio source directional parameter/vector with the minimum distance measure may then be assigned to a set of vector points of the centroid which provides the minimum distance measure.
  • the processing step of determining the audio source directional parameter/vector of a pair of audio source directional vectors of a sub band which gives a minimum distance against all centroids and assigning the audio source directional parameter/vector to the set of vector points of the centroid which provides the minimum distance between the directional parameter/vector and the centroid is shown as processing step 305 in Figure 3.
  • the remaining unassigned audio source directional parameter/vector of the pair of audio source directional parameters/vectors can be assigned by default to the set of vector points of the remaining centroid; in other words, the set of vector points associated with the centroid which has not been assigned the minimum-distance directional parameter/vector.
  • This processing step is shown as processing step 307 in Figure 3.
  • an illustrative example of a particular instance of the above processing steps may be viewed in relation to Figure 7: if a first audio source directional parameter/vector is determined to be closer to the centroid of cluster 701 than to the centroid of cluster 703, the second audio source directional parameter/vector is also determined to be closer to the centroid of cluster 701 than to the centroid of cluster 703, and the distance measure of the first audio source directional parameter/vector to the centroid of cluster 701 is smaller than the distance measure of the second audio source directional parameter/vector to cluster 701, then in this particular example the first audio source directional parameter/vector can be assigned (and classified) to the set of vector points associated with the centroid 701 and the second audio source directional parameter/vector can be assigned to the set of vector points associated with the centroid 703.
  • the processing steps 305 to 307 can be repeated for other pairs of audio source directional parameters/vectors, where each pair corresponds to a different sub band of the plurality of sub bands of the channel signal. This is depicted in Figure 3 as the decision step 309, which determines whether a further pair of audio source directional parameters/vectors corresponding to a further sub band of the channel signal remains to be processed.
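  • the per-sub-band assignment of processing steps 305 and 307 may be sketched as follows (illustrative only): the vector/centroid pair with the globally minimum distance is assigned first, and the remaining vector of the pair is assigned by default to the remaining centroid:

```python
import numpy as np

def assign_pair(pair: np.ndarray, centroids: np.ndarray):
    """pair: (2, 2) the sub band's two (azimuth, elevation) vectors;
    centroids: (2, 2) the two cluster centroids.
    Returns (vector index for centroid 0, vector index for centroid 1)."""
    d = np.linalg.norm(pair[:, None, :] - centroids[None, :, :], axis=-1)  # (2, 2)
    v, c = np.unravel_index(np.argmin(d), d.shape)  # global minimum-distance pair
    assignment = {c: v, 1 - c: 1 - v}               # leftover vector -> leftover centroid
    return assignment[0], assignment[1]
```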
  • audio source directional labels are assigned to the set of vector points associated with each centroid.
  • the vector data set associated with a first centroid can be assigned the directional label of the dominant audio source direction and the vector data set associated with the second centroid can be assigned the directional label of secondary audio source direction.
  • the process of assigning an audio source directional label to a particular vector data set associated with a centroid can comprise determining for the particular vector data set a value for the energy contribution of the audio source directional vectors assigned to the particular centroid.
  • this can be achieved by using the direct-to-total energy ratio parameter which is determined for each audio source direction parameter/vector by the direction analyser 203 and forms part of the metadata set.
  • the energy contribution of the audio source directional vectors can be determined for each centroid by summing the energy of the spectral coefficients of each sub band and assigning a portion of the energy value to a particular centroid in accordance with the energy ratio parameter for the sub band and the audio source directional vector assigned to the cluster associated with the particular centroid.
  • the energy values associated with the audio source directional parameters/vectors assigned to the vector data set of the particular centroid can then be summed to give a total energy value or energy ratio value for each centroid of the partitioned data set.
  • the processing step of determining an energy value of the audio source directional parameters/vectors assigned to the vector data set of a particular centroid is shown as processing step 311 in Figure 3.
  • the audio source directional parameters/vectors assigned to the cluster centroid which has the highest overall energy level are assigned as vectors with a dominant audio source direction. This is shown as processing step 313 in Figure 3.
  • the audio source directional parameters/vectors assigned to the cluster centroid which does not have the highest overall energy level are assigned as vectors with a secondary audio source direction. This is shown in Figure 3 as processing step 315.
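  • processing steps 311 to 315 may be sketched as follows (illustrative only): each sub band's summed spectral energy is split onto the two centroids in accordance with the energy ratio parameters and the earlier assignments, and the centroid with the larger total is labelled dominant:

```python
import numpy as np

def label_centroids(band_energy, ratios, assignments):
    """band_energy: (n_bands,) summed spectral energy per sub band;
    ratios: (n_bands, 2) direct-to-total ratio of each sub band's two vectors;
    assignments: (n_bands, 2) centroid index (0 or 1) of each vector.
    Returns (dominant_centroid, secondary_centroid)."""
    totals = np.zeros(2)
    for k in range(len(band_energy)):
        for j in range(2):  # each directional vector contributes its energy share
            totals[assignments[k][j]] += ratios[k][j] * band_energy[k]
    dominant = int(np.argmax(totals))
    return dominant, 1 - dominant
```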
  • spatial audio processing takes place in frequency bands.
  • Those bands could be for example, the frequency bins of the time-frequency transform, or frequency bands combining several bins.
  • the combination could be such that it approximates properties of human hearing, such as the Bark frequency resolution.
  • we could measure and process the audio in time-frequency areas combining several of the frequency bins b and/or time indices n. For simplicity, these aspects were not expressed by all of the equations above.
  • typically one set of parameters such as one direction is estimated for that time-frequency area, and all time-frequency samples within that area are synthesized according to that set of parameters, such as that one direction parameter.
  • the device may be any suitable electronics device or apparatus.
  • the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device 1400 comprises at least one processor or central processing unit 1407.
  • the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1400 comprises a memory 1411.
  • the at least one processor 1407 is coupled to the memory 1411.
  • the memory 1411 can be any suitable storage means.
  • the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
  • the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • the device 1400 comprises a user interface 1405.
  • the user interface 1405 can be coupled in some embodiments to the processor 1407.
  • the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
  • the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
  • the user interface 1405 can enable the user to obtain information from the device 1400.
  • the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
  • the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
  • the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
  • the device 1400 comprises an input/output port 1409.
  • the input/output port 1409 in some embodiments comprises a transceiver.
  • the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IrDA).
  • the transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
  • the device 1400 may be employed as at least part of the synthesis device.
  • the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code.
  • the input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

There is disclosed inter alia an apparatus having means for determining a plurality of spatial audio directional vectors; means for partitioning a vector space of the plurality of spatial audio directional vectors into a plurality of partitions; means for assigning a first spatial audio directional vector to a set of spatial audio directional vectors associated with a first centroid; means for assigning a second spatial audio directional vector to a set of spatial audio directional vectors associated with a further centroid; means for assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and means for assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.

Description

PROCESSING OF SPATIAL AUDIO PARAMETERS
Field
The present application relates to apparatus and methods for processing directional information relating to spatial audio capture parameters.
Background
Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and an effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
The directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
A parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can also be utilized as the spatial metadata for an audio codec. For example, these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata. The stereo signal could be encoded, for example, with an EVS, IVAS or AAC encoder. A decoder can decode the audio signals into PCM signals, and process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
The aforementioned solution is particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand- alone microphone arrays). It may also be desirable for such an encoder to have also other input types than microphone-array captured signals, for example, loudspeaker signals, audio object signals, or Ambisonic signals.
With respect to the directional components of the metadata, which may comprise an elevation, azimuth (and diffuseness) of a resulting direction, each time/frequency sub band can be assigned to multiple audio source directions in order to improve the perceptual performance of the encoder. However, the multiple sound source directions allocated to each frequency band will need to be encoded, which will increase the metadata bitrate. Therefore there is an interest in encoding the sound source directions as efficiently as possible, whilst maintaining the perceptual quality of the overall encoded spatial audio signal.
Summary
This invention proceeds from the consideration that it is desirable to encode audio source direction information for each frequency sub band of a captured multichannel spatial audio signal as efficiently as possible whilst maintaining the perceptual quality of the overall encoded spatial audio signal. To this end there is provided according to a first aspect a method comprising: determining for an audio signal of two or more audio signals a plurality of spatial audio directional vectors; partitioning a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; assigning a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid; assigning a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid. Assigning the audio source direction to the set of spatial audio direction vectors associated with the first centroid and a further audio source direction to the set of spatial audio direction vectors associated with the second centroid may further comprise: determining an energy value for the set of spatial audio directional vectors associated with the first centroid; determining an energy value for the set of spatial audio directional vectors associated with the second centroid; comparing the energy value for the set of spatial audio directional vectors associated with the first centroid with the energy value for the set of spatial audio directional vectors associated with the second centroid; and determining that the audio source direction is a dominant audio source direction and the further audio source direction is a secondary audio source direction when the energy value for the set of spatial audio directional vectors associated with the first centroid is greater than the energy value for the set of spatial audio directional vectors associated with the second centroid.
The first vector distance measure may be a minimum vector distance measure and the second vector distance measure may be other than the minimum vector distance measure.
The first spatial audio directional vector and the second spatial audio directional vector may be of a frequency sub band or frequency bin of the audio signal.
Each of the plurality of spatial audio directional vectors may comprise: an elevation component; and an azimuth component.
The vector space may be partitioned using a K-medoids partitioning algorithm.
The plurality of spatial audio directional vectors may form spatial audio metadata.
There is provided according to a second aspect an apparatus comprising means for determining for an audio signal of two or more audio signals a plurality of spatial audio directional vectors; means for partitioning a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; means for assigning a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid; means for assigning a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; means for assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and means for assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
The means for assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid and the means for assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid may further comprise: means for determining an energy value for the set of spatial audio directional vectors associated with the first centroid; means for determining an energy value for the set of spatial audio directional vectors associated with the second centroid; means for comparing the energy value for the set of spatial audio directional vectors associated with the first centroid with the energy value for the set of spatial audio directional vectors associated with the second centroid; and means for determining that the audio source direction is a dominant audio source direction and the further audio source direction is a secondary audio source direction when the energy value for the set of spatial audio directional vectors associated with the first centroid is greater than the energy value for the set of spatial audio directional vectors associated with the second centroid.
The first vector distance measure may be a minimum vector distance measure and the second vector distance measure may be other than the minimum vector distance measure.
The first spatial audio directional vector and the second spatial audio directional vector may be associated with a frequency sub band or frequency bin of the audio signal.
Each of the plurality of spatial audio directional vectors may comprise: an elevation component; and an azimuth component.

The vector space may be partitioned using a K-medoids partitioning algorithm.
The plurality of spatial audio directional vectors may be spatial audio metadata.
There is provided according to a third aspect an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to: determine for an audio signal of two or more audio signals a plurality of spatial audio directional vectors; partition a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; assign a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid; assign a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; assign an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and assign a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
The at least one memory and the computer program code configured to assign an audio source direction to the set of spatial audio direction vectors associated with the first centroid and assign a further audio source direction to the set of spatial audio direction vectors associated with the second centroid may be further configured to: determine an energy value for the set of spatial audio directional vectors associated with the first centroid; determine an energy value for the set of spatial audio directional vectors associated with the second centroid; compare the energy value for the set of spatial audio directional vectors associated with the first centroid with the energy value for the set of spatial audio directional vectors associated with the second centroid; and determine that the audio source direction is a dominant audio source direction and the further audio source direction is a secondary audio source direction when the energy value for the set of spatial audio directional vectors associated with the first centroid is greater than the energy value for the set of spatial audio directional vectors associated with the second centroid.
The first vector distance measure may be a minimum vector distance measure and the second vector distance measure may be other than the minimum vector distance measure.
The first spatial audio directional vector and the second spatial audio directional vector may be associated with a frequency sub band or frequency bin of the audio signal.
Each of the plurality of spatial audio directional vectors may comprise: an elevation component; and an azimuth component.
The vector space may be partitioned using a K-medoids partitioning algorithm.
The plurality of spatial audio directional vectors may be spatial audio metadata.
There is provided according to a fourth aspect computer program code arranged to realize the following when executed by a processor: determine for an audio signal of two or more audio signals a plurality of spatial audio directional vectors; partition a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; assign a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid; assign a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; assign an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and assign a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows schematically the analysis processor as shown in figure 1 according to some embodiments;
Figure 3 shows a flow diagram of part of the operation of the metadata encoder/quantizer as shown in figure 1 according to some embodiments;
Figure 4 shows a flow diagram of the operation of the system as shown in figure 1 according to some embodiments;
Figure 5 shows a flow diagram of the operation of the analysis processor as shown in figure 2 according to some embodiments;
Figure 6 shows a vector space of audio source directional parameters/vectors as received by the metadata encoder/quantizer as shown in figure 1;
Figure 7 shows a partitioning of the vector space of figure 6;
Figure 8 shows the classification of audio source directional parameters/vectors of figure 6 into a dominant audio source direction and a secondary audio source direction; and
Figure 9 shows schematically an example device suitable for implementing the apparatus shown.
Embodiments of the Application
In order to represent immersive spatial sound, a number of concepts have emerged, of which perhaps the most common is to represent spatial sound as a set of waveform streams or channel signals where each signal can be designated to feed a particular loudspeaker in a known prescribed position relative to the listener position. Accompanying the channel signals there is typically audio related metadata relevant for the playback of the audio content captured as the individual channel signals. In particular the metadata may contain data for controlling a rendering process in a playback system and may contain information relating to the spatial characteristics of the individual audio streams. Such data may comprise information on the spatial source direction, such as the azimuth and elevation (or any other type of spatial direction representation) associated with each channel signal, which can be used to assist in the rendering of the channel signals in the spatial audio playback system.
It is known that there is a perceptual performance benefit if multiple source directions are assigned to each frequency band of the multichannel captured spatial audio signal, of which there is typically a dominant audio source direction and a less significant audio source direction which is termed a secondary audio source direction. Obviously one of the issues with having multiple spatial source directions for each frequency band of a channel signal is the coding required to represent the information as part of the spatial metadata accompanying the channel signals. To this end it can be beneficial for any subsequent coding or quantization process if the audio source directional parameters associated with a particular audio source direction (i.e. a dominant audio source direction or a secondary audio source direction) for a channel signal are more closely clustered together. For instance, a coding gain may be achieved if the audio source directional parameters (azimuth and elevation) associated with the dominant audio source direction (over the frequency bands of a channel signal) form tighter clusters when represented in an n-dimensional vector space. Similarly the coding gain can also be improved for the audio source directional parameters associated with the secondary audio source direction. Naturally one would expect that, for an acoustic scene in which there is one dominant sound source, the audio source directional parameters over the sub bands of a channel will be fairly well aligned towards a single direction for the dominant audio source direction and also towards a different single direction for the secondary audio source direction. In this case the clustering of directional data points in the azimuth-elevation vector space would exhibit little variance.
However, should the acoustic scene comprise a number of sources of equal loudness, then the audio source directional parameters over the sub bands of a channel will have a propensity to be aligned in a number of directions for both dominant classified audio sources and secondary classified audio sources. This can be envisaged by considering the classification of audio source directional parameters over a number of consecutive sub bands, where the classification of an audio source directional parameter can flip between dominant and secondary classifications as the sub bands of the signal channel are traversed. In scenarios such as these the source data points assigned to each classification of dominant and secondary may be spread throughout the vector space, thereby exhibiting poor clustering. This scenario may be prevalent in the case of an audio scene comprising a strong acoustic reflection.
In the particular instance when the audio source directional parameter comprises two vector components, such as azimuth and elevation, the n-dimensional vector space is a two dimensional vector space.
The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective spatial analysis derived metadata parameters for multi-channel input format audio signals. In the following discussion the multi-channel system is discussed with respect to a multi-channel microphone implementation. However the input format may be any suitable input format, such as multi-channel loudspeaker, ambisonic (FOA/HOA), etc. It is understood that in some embodiments the channel location is based on a location of the microphone or is a virtual location or direction. Furthermore the output of the example system is a multi-channel loudspeaker arrangement. However it is understood that the output may be rendered to the user via means other than loudspeakers, e.g. with head-tracked headphones using Head Related Transfer Functions. Furthermore the multi-channel loudspeaker signals may be generalised to be two or more playback audio signals.
The concept as described herein is to increase the clustering of audio source directional parameters of spatial metadata in order to improve the subsequent coding and quantization of said parameters.
The proposed metadata index may then be used alongside a downmix signal (‘channels’) to define a parametric immersive format that can be utilized, e.g., for the Immersive Voice and Audio Services (IVAS) codec.
With respect to figure 1 an example apparatus and system for implementing embodiments of the application are shown. The system 100 is shown with an ‘analysis’ part 121 and a ‘synthesis’ part 131. The ‘analysis’ part 121 is the part from receiving the multi-channel loudspeaker signals up to an encoding of the metadata and downmix signal and the ‘synthesis’ part 131 is the part from a decoding of the encoded metadata and downmix signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).
The input to the system 100 and the ‘analysis’ part 121 is the multi-channel signals 102. In the following examples a microphone channel signal input is described, however any suitable input (or synthetic multi-channel) format may be implemented in other embodiments.
The multi-channel signals are passed to a downmixer 103 and to an analysis processor 105.
In some embodiments the downmixer 103 is configured to receive the multi-channel signals and downmix the signals to a determined number of channels and output the downmix signals 104. For example the downmixer 103 may be configured to generate a 2 audio channel downmix of the multi-channel signals. The determined number of channels may be any suitable number of channels. In some embodiments the downmixer 103 is optional and the multi-channel signals are passed unprocessed to an encoder 107 in the same manner as the downmix signals are in this example. In some embodiments the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 106 associated with the multi-channel signals and thus associated with the downmix signals 104. The analysis processor 105 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, an audio source directional parameter 108, an energy ratio parameter 110, a coherence parameter 112, and a diffuseness parameter 114. The audio source directional, energy ratio and diffuseness parameters may in some embodiments be considered to be spatial audio parameters. In other words the spatial audio parameters comprise parameters which aim to characterize the sound-field created by the multi-channel signals (or two or more playback audio signals in general). The coherence parameters may be considered to be signal relationship audio parameters which aim to characterize the relationship between the multi-channel signals.
In some embodiments the parameters generated may differ from frequency band to frequency band. Thus for example in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted. A practical example of this may be that for some frequency bands such as the lowest or the highest band some of the parameters are not required for perceptual reasons. The downmix signals 104 and the metadata 106 may be passed to an encoder 107.
The encoder 107 may comprise an IVAS stereo core 109 which is configured to receive the downmix (or otherwise) signals 104 and generate a suitable encoding of these audio signals. The encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs. The encoding may be implemented using any suitable scheme. The encoder 107 may furthermore comprise a metadata encoder or quantizer 111 which is configured to receive the metadata and output an encoded or compressed form of the information. In some embodiments the encoder 107 may further interleave, multiplex to a single data stream or embed the metadata within the encoded downmix signals before transmission or storage, shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.
On the decoder side, the received or retrieved data (stream) may be received by a decoder/demultiplexer 133. The decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a downmix extractor 135 which is configured to decode the audio signals to obtain the downmix signals. Similarly the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded metadata and generate metadata. The decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
The decoded metadata and downmix audio signals may be passed to a synthesis processor 139.
The system 100 ‘synthesis’ part 131 further shows a synthesis processor 139 configured to receive the downmix and the metadata and to re-create in any suitable format a synthesized spatial audio in the form of multi-channel signals 110 (these may be in multichannel loudspeaker format or in some embodiments any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the downmix signals and the metadata.
With respect to Figure 4 an example flow diagram of the overview shown in Figure 1 is shown.
First the system (analysis part) is configured to receive multi-channel audio signals as shown in Figure 4 by step 401.
Then the system (analysis part) is configured to generate a downmix of the multi-channel signals as shown in Figure 4 by step 403.
Also the system (analysis part) is configured to analyse signals to generate metadata such as direction parameters; energy ratio parameters; diffuseness parameters and coherence parameters as shown in Figure 4 by step 405.
The system is then configured to encode for storage/transmission the downmix signal and metadata as shown in Figure 4 by step 407. After this the system may store/transmit the encoded downmix and metadata as shown in Figure 4 by step 409.
The system may retrieve/receive the encoded downmix and metadata as shown in Figure 4 by step 411.
Then the system is configured to extract the downmix and metadata from encoded downmix and metadata parameters, for example demultiplex and decode the encoded downmix and metadata parameters, as shown in Figure 4 by step 413.
The system (synthesis part) is configured to synthesize an output multi- channel audio signal based on extracted downmix of multi-channel audio signals and metadata with coherence parameters as shown in Figure 4 by step 415.
With respect to Figure 2 an example analysis processor 105 (as shown in Figure 1) according to some embodiments is described in further detail. The analysis processor 105 in some embodiments comprises a time-frequency domain transformer 201.
In some embodiments the time-frequency domain transformer 201 is configured to receive the multi-channel signals 102 and apply a suitable time to frequency domain transform such as a Short Time Fourier Transform (STFT) in order to convert the input time domain signals into suitable time-frequency signals. These time-frequency signals may be passed to a direction analyser 203 and to a signal analyser 205.
Thus for example the time-frequency signals 202 may be represented in the time-frequency domain representation as

s_i(b, n),

where b is the frequency bin index, n is the frame index and i is the channel index. In another expression, n can be considered as a time index with a lower sampling rate than that of the original time-domain signals. These frequency bins can be grouped into subbands that group one or more of the bins into a band index k = 0, ..., K-1. Each subband k has a lowest bin b_k,low and a highest bin b_k,high, and the subband contains all bins from b_k,low to b_k,high. The widths of the subbands can approximate any suitable distribution, for example the Equivalent Rectangular Bandwidth (ERB) scale or the Bark scale.

In some embodiments the analysis processor 105 comprises a direction analyser 203. The direction analyser 203 may be configured to receive the time-frequency signals 202 and based on these signals estimate audio source directional parameters 108. The audio source directional parameters may be determined based on any audio based ‘direction’ determination.
For example in some embodiments the direction analyser 203 is configured to estimate the direction with two or more signal inputs. This represents the simplest configuration to estimate a ‘direction’; more complex processing may be performed with even more signals.
The direction analyser 203 may thus be configured to provide an azimuth and an elevation for each frequency band and temporal frame, denoted as azimuth φ(k,n) and elevation θ(k,n). The direction parameter 108 may also be passed to a signal analyser 205.
In some embodiments, further to the audio source directional parameter, the direction analyser 203 is configured to determine an energy ratio parameter 110. The energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from a direction. The direct-to-total energy ratio r(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter.
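The description deliberately leaves the estimators open (any audio based ‘direction’ determination; any suitable method to obtain a ratio parameter). Purely as one illustrative possibility, a DirAC-style estimate from a first-order Ambisonic (B-format) capture might be sketched as follows in Python; the B-format input, the normalisation and all names are assumptions of this sketch rather than the described apparatus:

import numpy as np

def direction_and_ratio(W, X, Y, Z):
    # W, X, Y, Z: complex STFT values of one time-frequency tile (k, n),
    # e.g. averaged over the bins of sub band k (an assumption here).
    # Active sound intensity vector (sign/normalisation conventions vary).
    ix = np.real(np.conj(W) * X)
    iy = np.real(np.conj(W) * Y)
    iz = np.real(np.conj(W) * Z)
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.hypot(ix, iy))
    # Stability-style direct-to-total ratio: intensity magnitude over energy.
    energy = 0.5 * (abs(W)**2 + abs(X)**2 + abs(Y)**2 + abs(Z)**2)
    ratio = min(1.0, np.sqrt(ix**2 + iy**2 + iz**2) / max(energy, 1e-12))
    return azimuth, elevation, ratio

Other estimators, for example delay-and-correlate schemes over microphone pairs, would equally satisfy the description above.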
The estimated direction parameters 108 may be output (and passed to an encoder). The estimated energy ratio parameters 110 may be passed to a signal analyser 205.
In some embodiments the analysis processor 105 comprises a signal analyser 205. The signal analyser 205 is configured to receive parameters (such as the azimuth φ(k,n) and elevation θ(k,n) 108, and the direct-to-total energy ratios r(k,n) 110) from the direction analyser 203. The signal analyser 205 may be further configured to receive the time-frequency signals s_i(b, n) 202 from the time-frequency domain transformer 201. All of these are in the time-frequency domain; b is the frequency bin index, k is the frequency band index (each band potentially consists of several bins b), n is the time index, and i is the channel index. Although directions and ratios are here expressed for each time index n, in some embodiments the parameters may be combined over several time indices. The same applies for the frequency axis: as has been expressed, the direction of several frequency bins b could be expressed by one direction parameter in band k consisting of several frequency bins b. The same applies for all of the discussed spatial parameters herein.
The signal analyser 205 is configured to produce a number of signal parameters. In the following disclosure there are two parameters: coherence and diffuseness, both analysed in the time-frequency domain. In addition, in some embodiments the signal analyser 205 is configured to modify the estimated energy ratios r(k,n). The signal analyser 205 is configured to generate the coherence and diffuseness parameters based on any suitable known method.
With respect to Figure 5 a flow diagram summarising the operations of the analysis processor 105 are shown.
The first operation is one of receiving time domain multichannel (loudspeaker) audio signals as shown in Figure 5 by step 501.
Following this is applying a time domain to frequency domain transform (e.g. STFT) to generate suitable time-frequency domain signals for analysis as shown in Figure 5 by step 503.
Then applying direction analysis to determine direction and energy ratio parameters is shown in Figure 5 by step 505.
Then applying analysis to determine coherence parameters (such as surrounding and/or spread coherence parameters) and diffuseness parameters is shown in Figure 5 by step 507. In some embodiments the energy ratio may also be modified based on the determined coherence parameters in this step.
The final operation being one of outputting the determined parameters is shown in Figure 5 by step 509.
With respect to Figure 3 there is a flow chart depicting part of the operation of the metadata encoder/quantizer 111. Specifically, Figure 3 depicts the audio source directional parameter metadata clustering procedure according to some embodiments which may be performed as part of the metadata encoding operation within the metadata encoder/quantizer 111.
As mentioned above, the metadata encoder/quantizer 111 may be arranged to receive for each time-frequency analysis window the audio source directional parameters corresponding to each frequency bin k. This is shown in Figure 3 as the processing step 301.
Initially the audio source directional parameters/vectors as received in processing step 301 can be considered as a data set of audio source directional parameters/vectors. For instance, in embodiments which comprise the directional parameters of azimuth and elevation, the data set will be a two dimensional vector data set where each audio source directional parameter/vector within the data set comprises the components of azimuth and elevation values. Furthermore it is to be understood that the data set can comprise the audio source directional parameters/vectors (comprising the azimuth and elevation values) associated with the frequency bins k of the analysed channel signal, as produced by the analysis processor 105. In embodiments which have two classifications of audio source direction, such as a dominant audio source direction and a secondary audio source direction, there can be two audio source directional vectors per frequency sub division or sub band k, in which one audio source directional parameter/vector can be classified as a first audio source direction and a further audio source directional parameter/vector can be classified as a second audio source direction. For instance, in the above case for each sub band there can be one audio source directional parameter/vector which may be classified as the dominant audio source direction and another audio source directional parameter/vector which may be classified as the secondary audio source direction. In this regard Figure 6 depicts a data set of audio source directional parameters/vectors analysed as a channel signal with 24 sub bands (or 24 frequency bins). In total there are 48 audio source directional parameters/vectors. In other words there are two audio source directional vectors per sub band (or frequency division) in which one audio source directional vector is classified by the analysis processor 105 as a dominant audio source direction (depicted in Figure 6 as a diamond symbol 601) and the other audio source directional vector is classified as a secondary source direction vector (depicted in Figure 6 as a cross symbol 603). Figure 6 depicts one component of the directional vector as the elevation in radians, ranging from -π/2 to π/2 605, and the other component as the azimuth, also expressed in radians, ranging from -π to π 607. It is to be appreciated that the audio source directional parameters/vectors shown in Figure 6 are not closely clustered and that they currently exhibit a high variance. As mentioned before this may be attributed to the overall acoustic space not having a clear dominant audio source.
It is to be appreciated that in some embodiments the audio source direction parameters/vectors may be initially unclassified in terms of an audio source direction when entering the metadata encoder/quantizer 111.
A known data set clustering or partitioning algorithm may then be applied to the full data set of directional parameters/vectors in order to determine clustering centroids and corresponding Voronoi regions. In other words, the vector space corresponding to the data set of directional parameters/vectors is partitioned into two spatial domains. Therefore in embodiments there can be one clustering centroid corresponding to the dominant audio source direction and one clustering centroid for the secondary source direction. In embodiments the clustering algorithm used may be any suitable algorithm known to the skilled person, for example K-medoids, Lloyd-Max or K-means.
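By way of illustration only, a minimal K-medoids sketch over such a two dimensional azimuth/elevation data set might read as below. A plain Euclidean distance in the (azimuth, elevation) plane is assumed for brevity, although a distance respecting the angular wrap-around at ±π (or a great-circle distance) may be preferable in practice; the function and variable names are illustrative, not taken from the description above.

import numpy as np

def k_medoids(points, k=2, iters=50, seed=0):
    # points: N x 2 array of (azimuth, elevation) directional vectors.
    # Returns the k medoid vectors (the clustering centroids of step 303).
    rng = np.random.default_rng(seed)
    medoids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every current medoid.
        d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members) == 0:
                continue
            # New medoid: the member minimising total distance to the others.
            cost = np.linalg.norm(members[:, None, :] - members[None, :, :],
                                  axis=2).sum(axis=1)
            new_medoids[c] = members[cost.argmin()]
        if np.allclose(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

For the example of Figure 6 the input would be a 48 x 2 array (two directional vectors for each of the 24 sub bands), and the two returned medoids would play the role of the centroids 701 and 703 of Figure 7.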
Figure 7 depicts the result of using a K-medoids partition algorithm on the audio source directional parameter/vector data set of Figure 6. It can be seen that the algorithm has determined two clusters, a first cluster with centroid 701 and a second cluster with centroid 703.
The processing step of performing a partitioning algorithm on the audio source directional parameter/vector data set in order to find clustering centroids is shown as processing step 303 in Figure 3.
As mentioned above each sub band or frequency division is analysed for a number of audio source directional vectors, whereby each audio source directional parameter/vector is assigned to one of a number of different classifications of audio source directions. In the examples above there have been two classifications of audio source directions, given as a dominant audio source direction and a secondary audio source direction. Consequently in embodiments each sub band has two audio source directional vectors as stated above. In embodiments the metadata encoder/quantizer 111 may then be arranged to effectively reclassify (or classify) the data set of audio source directional parameters/vectors according to the clustering centroids as determined by processing step 303.
In embodiments the reclassification (or classification step) may be performed over each sub band pair of audio source directional parameters/vectors by assigning one of the pair of audio source directional parameters/vectors to a particular centroid on the condition that the distance between the assigned audio source directional vector and the particular centroid is a minimum. In other words for a pair of audio source directional parameters/vectors, on a per sub band basis, a distance measure is determined between a first audio source directional parameter/vector and the first centroid and a distance measure is determined between the first audio source directional parameter/vector and the second centroid. The process can be repeated for the second audio source directional parameter/vector of the pair of audio source directional parameters/vectors. In other words a distance measure is determined between the second audio source directional parameter/vector and the first centroid and a distance measure between the second audio source directional parameter/vector and the second centroid. The distance measures can then be compared in order to determine which audio source directional parameter/vector of the pair of audio source directional parameters/vectors for a particular sub band has a minimum distance measure irrespective of which centroid it is in relation to. The audio source directional parameter/vector with the minimum distance measure may then be assigned to a set of vector points of the centroid which provides the minimum distance measure.
The processing step of determining the audio source directional parameter/vector of a pair of audio source directional vectors of a sub band which gives a minimum distance against all centroids and assigning the audio source directional parameter/vector to the set of vector points of the centroid which provides the minimum distance between the directional parameter/vector and the centroid is shown as processing step 305 in Figure 3.
Upon determination of the minimum distance audio source directional parameter/vector, the remaining unassigned audio source directional parameter/vector of the pair can be assigned by default to the set of vector points of the remaining centroid; in other words, the set of vector points associated with the centroid which has not been assigned the minimum distance directional parameter/vector. This processing step is shown as processing step 307 in Figure 3.
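Processing steps 305 and 307 might be sketched as follows; vec_a and vec_b denote the pair of directional vectors of one sub band, centroid_0 and centroid_1 the two centroids from the partitioning step, and all names are illustrative assumptions rather than the notation of the description:

import numpy as np

def assign_pair(vec_a, vec_b, centroid_0, centroid_1):
    # Step 305: find which vector of the pair attains the overall minimum
    # distance to any centroid, and assign it to that centroid's point set.
    dists = {
        ('a', 0): np.linalg.norm(vec_a - centroid_0),
        ('a', 1): np.linalg.norm(vec_a - centroid_1),
        ('b', 0): np.linalg.norm(vec_b - centroid_0),
        ('b', 1): np.linalg.norm(vec_b - centroid_1),
    }
    (winner, c), _ = min(dists.items(), key=lambda kv: kv[1])
    # Step 307: the remaining vector goes to the other centroid by default.
    # Returns (vector assigned to centroid 0, vector assigned to centroid 1).
    if c == 0:
        return (vec_a, vec_b) if winner == 'a' else (vec_b, vec_a)
    return (vec_b, vec_a) if winner == 'a' else (vec_a, vec_b)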
An illustrative example of a particular instance of the above processing steps may be viewed in relation to Figure 7. If a first audio source directional parameter/vector is determined to be closer to the centroid of cluster 701 than the centroid of cluster 703, the second audio source directional parameter/vector is also determined to be closer to the centroid of cluster 701 than the centroid of cluster 703, and the distance measure of the first audio source directional parameter/vector to the centroid of cluster 701 is smaller than the distance measure of the second audio source directional parameter/vector to the centroid of cluster 701, then in this particular example the first audio source directional parameter/vector can be assigned (and classified) to the set of vector points associated with the centroid 701 and the second audio source directional parameter/vector can be assigned to the set of vector points associated with the centroid 703.
The processing steps 305 to 307 can be repeated for other pairs of audio source directional parameters/vectors, where each pair corresponds to a different sub band of the plurality of sub bands of the channel signal. This is depicted in Figure 3 as the decision step 309, which determines whether a further pair of audio source directional parameters/vectors corresponding to a further sub band of the channel signal remains to be processed.
After the audio source directional parameter/vector data set has in effect been reclassified (or classified) in accordance with processing steps 303 to 307, audio source directional labels are assigned to the set of vector points associated with each centroid. In other words the vector data set associated with a first centroid can be assigned the directional label of the dominant audio source direction and the vector data set associated with the second centroid can be assigned the directional label of the secondary audio source direction. In embodiments the process of assigning an audio source directional label to a particular vector data set associated with a centroid can comprise determining for the particular vector data set a value for the energy contribution of the audio source directional vectors assigned to the particular centroid. In embodiments this can be achieved by using the direct-to-total energy ratio parameter which is determined for each audio source direction parameter/vector by the direction analyser 203 and forms part of the metadata set. For instance, the energy contribution of the audio source directional vectors can be determined for each centroid by summing the energy of the spectral coefficients of each sub band and assigning a portion of the energy value to a particular centroid in accordance with the energy ratio parameter for the sub band and the audio source directional vector assigned to the cluster associated with the particular centroid. The energy values associated with the audio source directional parameters/vectors assigned to the vector data set of a particular centroid can then be summed to give a total energy value or energy ratio value for each centroid of the partitioned data set. The processing step of determining an energy value of the audio source directional parameters/vectors assigned to the vector data set of a particular centroid is shown as processing step 311 in Figure 3.
The audio source directional parameters/vectors assigned to the cluster centroid which has the highest overall energy level are assigned as vectors with a dominant audio source direction. This is shown as processing step 313 in Figure 3.
The audio source directional parameters/vectors assigned to the cluster centroid which does not have the highest overall energy level are assigned as vectors with a secondary audio source direction. This is shown in Figure 3 as processing step 315.
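A hedged sketch of processing steps 311 to 315 follows. Here band_energy[k] is assumed to be the summed energy of the spectral coefficients of sub band k, ratio_a[k] and ratio_b[k] the direct-to-total energy ratios of the two directional vectors of that sub band, and assign_a[k], assign_b[k] the centroid indices (0 or 1) those vectors were given in steps 305 and 307; none of these names come from the description above.

import numpy as np

def label_clusters(band_energy, ratio_a, ratio_b, assign_a, assign_b):
    # Step 311: accumulate the energy portion of every directional vector
    # into the total of the centroid it was assigned to.
    totals = np.zeros(2)
    for k in range(len(band_energy)):
        totals[assign_a[k]] += band_energy[k] * ratio_a[k]
        totals[assign_b[k]] += band_energy[k] * ratio_b[k]
    # Steps 313 and 315: the centroid with the highest overall energy is
    # labelled the dominant audio source direction, the other the secondary.
    dominant = int(totals.argmax())
    secondary = 1 - dominant
    return dominant, secondary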
The effect of the above reclassification steps of the directional vectors can be seen in Figure 8, where it is evident that there is a tighter clustering of data points or audio source directional parameters/vectors of the same classification. For instance the diamond symbols 801 show the clustering of the audio source directional parameters/vectors which have been classified as belonging to the dominant audio source direction, and the cross symbols 803 show the clustering of the audio source directional parameters/vectors which have been classified as belonging to the secondary audio source direction. By comparing Figure 6 with Figure 8 it can be seen that the clustering of the audio source directional parameters/vectors may be tighter for each classification of audio source direction.
Furthermore it is to be appreciated that the above processing steps can result in the further advantage of improving the coding gain of a subsequent coding or quantisation step for the audio source directional parameters/vectors. This effect may be attributed to the reclassified audio source directional parameters/vectors having a lower variance which results in a more efficient range of values for quantisation.
Although not repeated throughout the document, it is to be understood that spatial audio processing, both typically and in this context, takes place in frequency bands. Those bands could be, for example, the frequency bins of the time-frequency transform, or frequency bands combining several bins. The combination could be such that it approximates properties of human hearing, such as the Bark frequency resolution. In other words, in some cases, we could measure and process the audio in time-frequency areas combining several of the frequency bins b and/or time indices n. For simplicity, these aspects were not expressed by all of the equations above. In case many time-frequency samples are combined, typically one set of parameters such as one direction is estimated for that time-frequency area, and all time-frequency samples within that area are synthesized according to that set of parameters, such as that one direction parameter.
The usage of a frequency resolution for parameter analysis that is different than the frequency resolution of the applied filter-bank is a typical approach in spatial audio processing systems.
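As a final illustrative sketch of the bin-to-band grouping discussed above (assuming a uniform STFT filter-bank; the band edge table and all names are invented for the example), grouping frequency bins b into analysis bands k could look like:

import numpy as np

def group_bins(band_edges_hz, fft_size=1024, sample_rate=48000):
    # Map each Bark/ERB-like band edge (in Hz) to an STFT bin index and
    # return, for each band k, its lowest and highest bin (b_k,low, b_k,high).
    hz_per_bin = sample_rate / fft_size
    edges = np.round(np.asarray(band_edges_hz) / hz_per_bin).astype(int)
    return [(edges[k], max(edges[k + 1] - 1, edges[k]))
            for k in range(len(edges) - 1)]

# Example: a coarse, illustrative 5-band split up to 24 kHz.
bands = group_bins([0, 400, 1000, 4000, 12000, 24000])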
With respect to Figure 9 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods such as described herein.
In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IrDA).
The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims

1. A method comprising:
determining for an audio signal of two or more audio signals a plurality of spatial audio directional vectors;
partitioning a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid;
assigning a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid;
assigning a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure;
assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and
assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
2. The method as claimed in Claim 1, wherein assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid and a further audio source direction to the set of spatial audio direction vectors associated with the second centroid further comprises:
determining an energy value for the set of spatial audio directional vectors associated with the first centroid;
determining an energy value for the set of spatial audio directional vectors associated with the second centroid;
comparing the energy value for the set of spatial audio directional vectors associated with the first centroid with the energy value for the set of spatial audio directional vectors associated with the second centroid; and determining that the audio source direction is a dominant audio source direction and the further audio source direction is a secondary audio source direction when the energy value for the set of spatial audio directional vectors associated with the first centroid is greater than the energy value for the set of spatial audio directional vectors associated with the second centroid.
3. The method as claimed in Claims 1 and 2, wherein the first vector distance measure is a minimum vector distance measure and the second vector distance measure is other than the minimum vector distance measure.
4. The method as claimed in Claims 1 to 3, wherein the first spatial audio directional vector and the second spatial audio directional vector are associated with a frequency sub band or frequency bin of the audio signal.
5. The method as claimed in Claims 1 to 4, wherein each of the plurality of spatial audio directional vectors comprises: an elevation component; and an azimuth component.
6. The method as claimed in Claims 1 to 5, wherein the vector space is partitioned using a K-medoids partitioning algorithm.
7. The method as claimed in Claims 1 to 6, wherein the plurality of spatial audio directional vectors is spatial audio metadata.
8. An apparatus comprising:
means for determining for an audio signal of two or more audio signals a plurality of spatial audio directional vectors;
means for partitioning a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid; means for assigning a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid;
means for assigning a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a further centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure; means for assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid; and
means for assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid.
9. The apparatus as claimed in Claim 8, wherein the means for assigning an audio source direction to the set of spatial audio direction vectors associated with the first centroid and the means for assigning a further audio source direction to the set of spatial audio direction vectors associated with the second centroid further comprises:
means for determining an energy value for the set of spatial audio directional vectors associated with the first centroid;
means for determining an energy value for the set of spatial audio directional vectors associated with the second centroid;
means for comparing the energy value for the set of spatial audio directional vectors associated with the first centroid with the energy value for the set of spatial audio directional vectors associated with the second centroid; and
means for determining that the audio source direction is a dominant audio source direction and the further audio source direction is a secondary audio source direction when the energy value for the set of spatial audio directional vectors associated with the first centroid is greater than the energy value for the set of spatial audio directional vectors associated with the second centroid.
10. The apparatus as claimed in Claims 8 and 9, wherein the first vector distance measure is a minimum vector distance measure and the second vector distance measure is other than the minimum vector distance measure.
11. The apparatus as claimed in Claims 8 to 10, wherein the first spatial audio directional vector and the second spatial audio directional vector are associated with a frequency sub-band or frequency bin of the audio signal.
12. The apparatus as claimed in Claims 8 to 11, wherein each of the plurality of spatial audio directional vectors comprises: an elevation component; and an azimuth component.
13. The apparatus as claimed in Claims 8 to 12, wherein the vector space is partitioned using a K-medoids partitioning algorithm.
14. The apparatus as claimed in Claims 8 to 13, wherein the plurality of spatial audio directional vectors is spatial audio metadata.
15. Computer program code arranged to realize the following when executed by a processor:
determine for an audio signal of two or more audio signals a plurality of spatial audio directional vectors;
partition a vector space of the plurality of spatial audio directional vectors into a plurality of partitions, wherein each partition comprises a centroid;
assign a first spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a first centroid, wherein the first spatial audio directional vector has a first vector distance measure to the first centroid;
assign a second spatial audio directional vector of the plurality of spatial audio directional vectors to a set of spatial audio directional vectors associated with a second centroid, wherein the second spatial audio directional vector has a vector distance measure which is a second vector distance measure;
assign an audio source direction to the set of spatial audio directional vectors associated with the first centroid; and
assign a further audio source direction to the set of spatial audio directional vectors associated with the second centroid.
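
The claims describe the partitioning step only in functional terms, so the following minimal sketch, in Python with NumPy, illustrates one plausible reading of Claims 1, 6 and 8: per-tile direction vectors given as azimuth and elevation (Claim 5) are mapped onto the unit sphere and split into sets with a plain K-medoids loop (reading the claims' "K-medoids" literally). All function and variable names are illustrative assumptions of this sketch, not taken from the application.

import numpy as np

def to_unit_vectors(azimuth, elevation):
    # Map (azimuth, elevation) pairs in radians to 3-D unit vectors so
    # that angular closeness becomes Euclidean closeness on the sphere.
    return np.stack([np.cos(elevation) * np.cos(azimuth),
                     np.cos(elevation) * np.sin(azimuth),
                     np.sin(elevation)], axis=-1)

def assign_to_nearest(vectors, medoids):
    # Assignment step: each vector joins the set whose centroid (medoid)
    # it is closest to -- the minimum vector distance measure of Claims 3
    # and 10; any other distance is "other than the minimum".
    dists = np.linalg.norm(vectors[:, None, :] - medoids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def k_medoids(vectors, k, iterations=20, seed=0):
    # Partition the vector space into k sets, each represented by a
    # medoid drawn from the input vectors themselves.
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(vectors), size=k, replace=False)
    for _ in range(iterations):
        labels = assign_to_nearest(vectors, vectors[medoid_idx])
        new_idx = medoid_idx.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # empty set: keep the previous medoid
            intra = np.linalg.norm(vectors[members][:, None, :]
                                   - vectors[members][None, :, :],
                                   axis=2).sum(axis=1)
            new_idx[c] = members[intra.argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break  # medoids stable: converged
        medoid_idx = new_idx
    labels = assign_to_nearest(vectors, vectors[medoid_idx])
    return labels, vectors[medoid_idx]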
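
Continuing the sketch, the energy comparison of Claims 2 and 9 can be read as summing a per-tile energy over each set and ordering the sets: the highest-energy set supplies the dominant audio source direction and the next set the secondary direction. The helper below reuses the hypothetical functions above, and the usage lines feed it random metadata for a 200-tile frame purely for illustration.

def rank_directions_by_energy(labels, tile_energy, medoids):
    # Claims 2 and 9: sum the energy of the tiles in each set and order
    # the sets so that the highest-energy set gives the dominant source
    # direction and the runner-up the secondary direction.
    set_energy = np.array([tile_energy[labels == c].sum()
                           for c in range(len(medoids))])
    order = np.argsort(set_energy)[::-1]  # descending total energy
    return [(medoids[c], set_energy[c]) for c in order]

rng = np.random.default_rng(1)
azimuth = rng.uniform(-np.pi, np.pi, 200)       # one direction per tile
elevation = rng.uniform(-np.pi / 2, np.pi / 2, 200)
energy = rng.uniform(0.0, 1.0, 200)             # one energy per tile
labels, medoids = k_medoids(to_unit_vectors(azimuth, elevation), k=2)
(dominant, e_dom), (secondary, e_sec) = rank_directions_by_energy(labels, energy, medoids)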

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/FI2017/050833 WO2019106221A1 (en) 2017-11-28 2017-11-28 Processing of spatial audio parameters

Publications (1)

Publication Number Publication Date
WO2019106221A1 2019-06-06

Family

ID=66664385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2017/050833 WO2019106221A1 (en) 2017-11-28 2017-11-28 Processing of spatial audio parameters

Country Status (1)

Country Link
WO (1) WO2019106221A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022152960A1 (en) * 2021-01-18 2022-07-21 Nokia Technologies Oy Transforming spatial audio parameters
US11765536B2 (en) 2018-11-13 2023-09-19 Dolby Laboratories Licensing Corporation Representing spatial audio by means of an audio signal and associated metadata
TWI821966B (en) * 2019-10-30 2023-11-11 美商杜拜研究特許公司 Method, system and non-transitory computer-readable medium of encoding and decoding immersive voice and audio services bitstreams

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114121A1 (en) * 2003-11-26 2005-05-26 Inria Institut National De Recherche En Informatique Et En Automatique Perfected device and method for the spatialization of sound
US20140226838A1 (en) * 2013-02-13 2014-08-14 Analog Devices, Inc. Signal source separation
US20150156578A1 (en) * 2012-09-26 2015-06-04 Foundation for Research and Technology - Hellas (F.O.R.T.H) Institute of Computer Science (I.C.S.) Sound source localization and isolation apparatuses, methods and systems
US20160337776A1 (en) * 2014-01-09 2016-11-17 Dolby Laboratories Licensing Corporation Spatial error metrics of audio content
US20170339506A1 (en) * 2014-12-11 2017-11-23 Dolby Laboratories Licensing Corporation Metadata-preserved audio object clustering


Similar Documents

Publication Publication Date Title
US20230197086A1 (en) The merging of spatial audio parameters
US20230402053A1 (en) Combining of spatial audio parameters
CN111316353A (en) Determining spatial audio parameter encoding and associated decoding
WO2020016479A1 (en) Sparse quantization of spatial audio parameters
EP3808106A1 (en) Spatial audio capture, transmission and reproduction
WO2019106221A1 (en) Processing of spatial audio parameters
US11096002B2 (en) Energy-ratio signalling and synthesis
EP3991170A1 (en) Determination of spatial audio parameter encoding and associated decoding
GB2572761A (en) Quantization of spatial audio parameters
RU2648632C2 (en) Multi-channel audio signal classifier
GB2574873A (en) Determination of spatial audio parameter encoding and associated decoding
US20230178085A1 (en) The reduction of spatial audio parameters
US20240046939A1 (en) Quantizing spatial audio parameters
US20240079014A1 (en) Transforming spatial audio parameters
US20230410823A1 (en) Spatial audio parameter encoding and associated decoding
US20240185869A1 (en) Combining spatial audio streams
WO2020193865A1 (en) Determination of the significance of spatial audio parameters and associated encoding
WO2022200666A1 (en) Combining spatial audio streams
GB2598773A (en) Quantizing spatial audio parameters
CN116547749A (en) Quantization of audio parameters

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 17933386

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry into the European phase

Ref document number: 17933386

Country of ref document: EP

Kind code of ref document: A1