WO2017148523A1 - Non-parametric audio classification - Google Patents

Non-parametric audio classification

Info

Publication number
WO2017148523A1
Authority
WO
WIPO (PCT)
Prior art keywords
class
per
cluster
classification
classification device
Prior art date
Application number
PCT/EP2016/054586
Other languages
French (fr)
Inventor
Volodya Grancharov
Sigurdur Sverrisson
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2016/054586 priority Critical patent/WO2017148523A1/en
Publication of WO2017148523A1 publication Critical patent/WO2017148523A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/285 Memory allocation or algorithm optimisation to reduce hardware requirements
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/085 Methods for reducing search complexity, pruning

Definitions

  • Embodiments presented herein relate to a method, a classification device, a computer program, and a computer program product for non-parametric audio classification.
  • audio mining is a technique by which the content of an audio signal (comprising an audio waveform) can be automatically analyzed and searched. It is commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio.
  • the audio signal will typically be processed by a speech recognition system in order to identify word or phoneme units that are likely to occur in the spoken content.
  • this information can be used to identify a language used in the audio signal, which speaker is producing the audio waveform, the gender of the speaker producing the audio waveform, etc.
  • This information may either be used immediately in pre-defined searches for keywords, languages, speakers, gender (a real-time word spotting system), or the output of the speech recognizer may be stored in an index file.
  • One or more audio mining index files can then be loaded at a later date in order to run searches for any of the above parameters (keywords, languages, speakers, gender, etc.).
  • the parameters can be represented by classes. That is, assuming that the audio signal is to be classified in terms of language, there may be a set of classes where each class represents a unique language, and where the classification intends to determine which one of these languages is used in the audio signal.
  • a probabilistic technique attempts to discover the unknown class by estimating a probability density function.
  • there are two major classes of such techniques, namely parametric (also known as model-based) techniques and non-parametric techniques.
  • parametric techniques assume a known form of the underlying probability density function and adjust the model parameters to available training data. This technique has low computational and storage requirements, and it can be applied with a limited amount of training data.
  • One disadvantage is that the form of the underlying probability density function is not known in most practical applications, and therefore a mismatch between the assumed form of the probability density function and the true form of the probability density function might occur.
  • a k-Nearest-Neighbor approach attempts to estimate posterior probabilities P(ω_m|x) for an unlabeled observation point x from a set of L pre-stored labeled training samples in the following way.
  • In a first step, a cell is centered around x and grown until the cell captures the k nearest neighbors of x. For example, k could be selected as √L.
  • In a second step, if k_m of these samples are labeled ω_m, then the posterior probabilities are estimated as P(ω_m|x) = k_m / k.
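The two steps above can be sketched as follows. The toy one-dimensional samples, the labels, and the helper name are invented for illustration; only the k = √L choice and the k_m / k estimate come from the text:

```python
import math

def knn_posteriors(x, samples, labels, num_classes):
    # k is chosen as the square root of the number of stored samples L.
    L = len(samples)
    k = max(1, round(math.sqrt(L)))
    # Step 1: grow a cell around x until it captures the k nearest neighbors.
    nearest = sorted(range(L), key=lambda i: abs(samples[i] - x))[:k]
    # Step 2: estimate P(class m | x) = k_m / k from the neighbor labels.
    counts = [0] * num_classes
    for i in nearest:
        counts[labels[i]] += 1
    return [c / k for c in counts]

# Toy 1-D training set: class 0 clusters near 0.0, class 1 near 10.0.
samples = [0.1, -0.2, 0.3, 9.8, 10.1, 10.3, 0.0, 9.9, 10.0]
labels = [0, 0, 0, 1, 1, 1, 0, 1, 1]
print(knn_posteriors(0.2, samples, labels, 2))  # -> [1.0, 0.0]
```

Note the cost driver visible even in this sketch: every query touches all L stored samples, which is exactly the complexity the embodiments avoid.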
  • An object of embodiments herein is to provide efficient audio classification.
  • a method for non-parametric audio classification is performed by a classification device.
  • the method comprises obtaining a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors.
  • the method comprises determining per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property.
  • the method comprises classifying the input sequence to belong to the class for which the per-class posterior probability is largest.
  • this method is flexible with respect to the probability density function of the short-term frequency representation of an audio waveform by using a non-parametric estimation technique, whilst only requiring complexity comparable to that of parametric estimation techniques.
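The obtain/determine/classify steps can be sketched end to end as follows. The pre-stored per-cluster posteriors, the weights, and the input are invented numbers; only the combination rule follows the text: each per-class posterior is a weighted sum of per-cluster posteriors, accumulated (here in the log domain) over all input vectors, and the class with the largest posterior wins:

```python
import math

def classify(per_cluster_post, weights):
    # per_cluster_post[k][m] = pre-stored P(class m | cluster k),
    # weights[n][k] = contribution of cluster k for input vector n.
    num_classes = len(per_cluster_post[0])
    log_post = [0.0] * num_classes
    for lam in weights:  # one weight row per input vector
        for m in range(num_classes):
            p = sum(w * per_cluster_post[k][m] for k, w in enumerate(lam))
            log_post[m] += math.log(p)
    # Classify to the class with the largest accumulated posterior.
    return max(range(num_classes), key=lambda m: log_post[m])

# Two clusters, two classes; cluster 0 favors class 0, cluster 1 class 1.
per_cluster_post = [[0.9, 0.1], [0.2, 0.8]]
weights = [[0.7, 0.3], [0.8, 0.2]]  # both input vectors lie near cluster 0
print(classify(per_cluster_post, weights))  # -> 0
```

Note that only the small table of per-cluster posteriors is consulted at classification time, not the raw training samples, which is where the complexity advantage over plain k-Nearest-Neighbor comes from.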
  • a classification device for non-parametric audio classification.
  • the classification device comprises processing circuitry.
  • the processing circuitry is configured to cause the classification device to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors.
  • the processing circuitry is configured to cause the classification device to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property.
  • the processing circuitry is configured to cause the classification device to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
  • a classification device for non-parametric audio classification comprises processing circuitry and a computer program product.
  • the computer program product stores instructions that, when executed by the processing circuitry, cause the classification device to perform steps, or operations.
  • the steps, or operations cause the classification device to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors.
  • the steps, or operations cause the classification device to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property.
  • the steps, or operations cause the classification device to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
  • a classification device for non-parametric audio classification.
  • the classification device comprises an obtain module configured to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors.
  • the classification device comprises a determine module configured to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property.
  • the classification device comprises a classify module configured to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
  • a computer program for non-parametric audio classification comprising computer program code which, when run on a classification device, causes the classification device to perform a method according to the first aspect.
  • a computer program product comprising a computer program according to the fifth aspect and a computer readable storage medium on which the computer program is stored.
  • any advantage of the first aspect may equally apply to the second, third, fourth, fifth, and/or sixth aspect, respectively, and vice versa.
  • Other objectives, features and advantages of the enclosed embodiments will be apparent from the following detailed disclosure, from the attached dependent claims as well as from the drawings.
  • Fig. 1 is a schematic diagram illustrating probability density functions of two classes
  • Fig. 2 is a schematic block diagram of a classification device according to an embodiment
  • Fig. 3 is a schematic diagram showing functional units of a classification device according to an embodiment
  • Fig. 4 is a schematic diagram showing functional modules of a classification device according to an embodiment
  • Figs. 5 and 6 are flowcharts of methods according to embodiments.
  • Fig. 7 shows one example of a computer program product comprising a computer readable storage medium according to an embodiment.
  • In order to obtain non-parametric audio classification there is provided a classification device 200, a method performed by the classification device 200, and a computer program product comprising code, for example in the form of a computer program, that when run on the classification device 200, causes the classification device 200 to perform the method.
  • Fig. 1 at 100 and 110 illustrates probability density functions of two classes ω_1 and ω_2.
  • a data point x_n = 4 (as identified by reference numeral 130 in Fig. 1) is observed.
  • the task of the classification device is to assign the data point x_n to one of the classes ω_1 and ω_2, without any assumption of the underlying form of the probability density functions 100, 110. Further explanation of Fig. 1 will be provided below.
  • Fig. 2 is a schematic block diagram of a classification device 200 according to an embodiment.
  • the classification device 200 comprises a classification module 210 and an optional training module 220.
  • When the training module 220 is not present, it is assumed that the classification module 210 is provided with values determined by an external training module 220.
  • Figs. 5 and 6 are flow charts illustrating embodiments of methods for non-parametric audio classification. The methods are performed by the classification device 200.
  • the methods are advantageously provided as computer programs 320.
  • Fig. 5 illustrates a method for non-parametric audio classification as performed by the classification device 200 according to an embodiment.
  • Step S102: The classification device 200 obtains a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors x_n.
  • Step S104: The classification device 200 determines per-class posterior probabilities P(ω_m|x) for at least two classes ω_m. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities, and each class represents a unique audio classification property.
  • Step S106: The classification device 200 classifies the input sequence x to belong to the class ω_m for which the per-class posterior probability P(ω_m|x) is largest.
  • Fig. 6 illustrates methods for non-parametric audio classification as performed by the classification device 200 according to further embodiments. Steps S102, S104, S106 are performed as in Fig. 5 and a repeated description of those steps is therefore omitted.
  • the weighted sum P(ω_m|x_n) is determined using cluster contribution weights λ_k.
  • the cluster contribution weights λ_k are defined by distances Δ_k between the input sequence x and a set of clusters c_k.
  • the weighted sums P(ω_m|x_n), when summed over all input vectors x_n, define the per-class posterior probability P(ω_m|x) for this given class ω_m.
  • each class ω_m represents a unique language, a unique speaker, or a unique gender.
  • the classification device 200 can obtain the short-term frequency representation of the multimedia signal as in step S102.
  • the short-term frequency representation is provided by mel-frequency cepstral coefficients (MFCCs).
  • MFCCs are coefficients that collectively make up a mel-frequency cepstrum (MFC).
  • MFC is a representation of the short-term power spectrum of the audio waveform, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
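The nonlinear mel scale referred to above can be sketched as follows. The 2595/700 constants are the common HTK-style convention, an assumption here, since the text does not fix a particular mel formula:

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel mapping (assumed convention):
    # mel = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# By construction of the constants, 1000 Hz maps to roughly 1000 mel,
# while higher frequencies are compressed.
print(round(hz_to_mel(1000.0)))  # close to 1000
print(round(hz_to_mel(8000.0)))  # far less than 8000
```

This compression is what makes the MFC a perceptually motivated representation of the short-term power spectrum.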
  • the classification device 200 can obtain the MFCCs.
  • the MFCCs are made readily available to the classification device 200.
  • this embodiment requires a module configured to provide the MFCCs from the audio waveform.
  • the classification device 200 receives the audio waveform and extracts the MFCCs from the audio waveform.
  • the classification device 200 is configured to perform step S102a: Step S102a: The classification device 200 extracts the MFCCs from the audio waveform. How to extract MFCCs from an audio waveform is as such known in the art and further description thereof is therefore omitted.
  • Step S102a is performed as part of step S102.
  • Each input vector x_n can then correspond to a vector of MFCCs. Assuming that the audio waveform is composed of frames, there is then one vector of MFCCs per frame.
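The one-vector-per-frame layout can be sketched as follows. The frame length, hop size, and the placeholder log-energy feature are invented; a real system would compute a full MFCC vector per frame:

```python
import math

def frame_features(waveform, frame_len=160, hop=80):
    # Split a waveform into overlapping frames and compute one feature
    # vector per frame. A trivial [log-energy] vector stands in for a
    # real MFCC vector; the point is the layout of the input sequence
    # x = (x_1, ..., x_N), one vector x_n per frame.
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    return [[math.log(sum(s * s for s in f) + 1e-12)] for f in frames]

# Invented 100 ms test tone at an assumed 8 kHz sampling rate.
waveform = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
x = frame_features(waveform)
print(len(x))  # number of input vectors (one per frame)
```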
  • the audio waveform represents a speech signal.
  • step S102 of obtaining, the step S104 of determining, and the step S106 of classifying are performed in an audio mining application.
  • the methods for non-parametric audio classification comprise a training stage and a classification stage.
  • the per-cluster posterior probabilities P(ω_m|μ_k) are determined through training on a training sequence y of MFCCs.
  • the training could be based on k-means clustering of the training sequence y. Further details thereof will now be disclosed.
  • Z_k denotes the set of points that belong to cluster c_k (that is, Z_k is a sub-set of y in which every point is closer to μ_k than to any other mean μ_j).
  • the median absolute deviation factor ρ_k is a statistic, robust to outliers, that captures variations inside cluster c_k.
  • each per-cluster posterior probability P(ω_m|μ_k) for a particular class ω_m and a particular cluster centroid μ_k represents a conditional probability of the particular class ω_m given the particular cluster centroid μ_k.
  • This correct class is denoted ω_m*.
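The training stage described above, k-means clustering of a labeled training sequence followed by per-cluster posterior estimation, can be sketched as follows. The 1-D data, labels, cluster count, and iteration budget are invented for illustration:

```python
import random

def kmeans(points, K, iters=20, seed=0):
    # Plain k-means on 1-D points: assign each point to its nearest
    # mean, recompute each mean, repeat for a fixed iteration budget.
    rng = random.Random(seed)
    means = rng.sample(points, K)
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for p in points:
            clusters[min(range(K), key=lambda k: abs(p - means[k]))].append(p)
        means = [sum(c) / len(c) if c else means[k]
                 for k, c in enumerate(clusters)]
    return means

def per_cluster_posteriors(points, labels, means, num_classes):
    # P(class m | cluster k): fraction of the points assigned to
    # cluster k whose training label is m.
    counts = [[0] * num_classes for _ in means]
    for p, m in zip(points, labels):
        k = min(range(len(means)), key=lambda k: abs(p - means[k]))
        counts[k][m] += 1
    return [[c / max(1, sum(row)) for c in row] for row in counts]

# Invented labeled training sequence: class 0 near 1.0, class 1 near 5.0.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = [0, 0, 0, 1, 1, 1]
means = kmeans(points, 2)
print(sorted(per_cluster_posteriors(points, labels, means, 2)))
```

Only the centroids, the spread factors, and this small posterior table need to be pre-stored for the classification stage; the training sequence itself can be discarded.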
  • the classification device 200 is configured to determine a set of distances Δ_k from each data point x_n in the input sequence x to all clusters c_k. According to an embodiment there is one cluster contribution weight λ_k for each input vector x_n. The classification device 200 is then configured to determine the cluster contribution weights for input vector x_n by performing step S104a:
  • Step S104a: The classification device 200 determines one distance Δ_k between the input vector x_n and each of the clusters c_k. Step S104a is performed as part of step S104.
  • each of the distances Δ_k is made inversely proportional to the median absolute deviation factor ρ_k relating to a spread of points inside cluster c_k.
  • the distance between point x_n and cluster c_k is defined as:
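A sketch of such a spread-scaled distance follows. The exact formula is not reproduced in this text, so dividing a plain distance by the cluster's median absolute deviation is an assumed form, chosen only to match the stated inverse proportionality:

```python
import statistics

def mad(cluster_points, mu):
    # Median absolute deviation of a cluster's points around its
    # centroid: a spread statistic robust to outliers, standing in
    # for the text's factor rho_k.
    return statistics.median(abs(p - mu) for p in cluster_points)

def scaled_distance(x_n, mu_k, rho_k):
    # Assumed form: plain distance divided by the cluster's spread
    # factor, making the result inversely proportional to rho_k.
    return abs(x_n - mu_k) / rho_k

cluster = [4.0, 4.5, 5.0, 5.5, 6.0]   # invented points of one cluster
rho = mad(cluster, 5.0)               # 0.5
print(scaled_distance(4.0, 5.0, rho))  # -> 2.0
```

The effect of the scaling is that a wide, diffuse cluster appears "closer" than a tight one at equal raw distance, which is the behavior the text describes.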
  • the classification device 200 is configured to determine the weighted sum by performing step S104b:
  • Step S104b: The classification device 200 determines one cluster contribution weight λ_k for each distance Δ_k. The cluster contribution weight λ_k is inversely proportional to the distance Δ_k from which it is determined. Step S104b is performed as part of step S104.
  • each cluster contribution weight λ_k can be inversely proportional to a sum based on all distances Δ_k.
  • the cluster contribution weights λ_k are determined as:
  • The expression involves an expansion constant. Values that could be used for the expansion constant are between 2 and 8, based on what property the non-parametric audio classification is to classify and on the dimensionality of the input sequence x.
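One way to realize such weights is sketched below. The exact expression is not reproduced in this text, so raising each inverse distance to an exponent gamma (standing in for the unnamed expansion constant) and normalizing over all distances is an assumption consistent with the surrounding description:

```python
def contribution_weights(distances, gamma=4):
    # Each weight is inversely proportional to its distance raised to
    # an assumed expansion constant gamma (the text suggests values in
    # the 2..8 range); the sum over all distances normalizes the
    # weights so they add up to one.
    inv = [1.0 / (d ** gamma) for d in distances]
    total = sum(inv)
    return [w / total for w in inv]

lam = contribution_weights([0.5, 1.0, 2.0, 4.0])
print([round(w, 4) for w in lam])  # the nearest cluster dominates
```

A larger expansion constant sharpens the weighting toward the single nearest cluster, while a smaller one lets several clusters contribute.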
  • the classification device 200 is configured to estimate the posterior probabilities at point x_n from the pre-stored per-cluster posterior probabilities P(ω_m|μ_k) and the cluster contribution weights λ_k, as: P(ω_m|x_n) = Σ_k λ_k · P(ω_m|μ_k).
  • Only the information determined during the training stage, together with the cluster contribution weights, is used to estimate these posterior probabilities.
  • the classification device 200 is configured to determine the per-class posterior probabilities P(ω_m|x) for a given class ω_m by performing step S104c:
  • Step S104c: The classification device 200 sums the weighted sum P(ω_m|x_n) over all input vectors x_n for said given class ω_m. Step S104c is performed as part of step S104. Logarithmic values of the weighted sum P(ω_m|x_n) can be summed over all input vectors x_n to determine a logarithmic value of the per-class posterior probability P(ω_m|x) for said given class ω_m. That is, to assign the optimal class ω_m* to the input sequence x, the classification device 200 can be configured to determine the log-probability of the entire input sequence x (over all n observations) as: log P(ω_m|x) = Σ_n log P(ω_m|x_n).
  • the classification device 200 is then configured to determine the class ω_m that corresponds to the largest posterior as:
  • ω_m* = argmax_m { log P(ω_m|x) }
  • the optimal class ω_m* is thus found as the class ω_m with the largest posterior probability, given the input sequence x.
  • a training sequence y representing data that belong to one of two classes ω_1, ω_2 is generated, as illustrated by the probability density functions 100 and 110 in Fig. 1, where probability density function 100 represents class ω_1, and where probability density function 110 represents class ω_2.
  • the cluster centers μ_1, μ_2, μ_3, and μ_4 are identified at reference numerals 120a, 120b, 120c, and 120d and have values (from left to right in Fig. 1) 4.91, 8.89, 10.99, and 15.09, which are also reflected in the first column of Table 1.
  • the training stage of the presented algorithm obtains four clusters, and the corresponding posterior probabilities P(ω_m|μ_k) as provided in Table 1.
  • Table 1 Clusters and pre-stored posterior probabilities obtained from a training sequence.
  • a data point x_n = 4 (as identified by reference numeral 130 in Fig. 1) is observed and the task of the classification device 200 is to assign the data point x_n to one of the classes ω_1 and ω_2.
  • Evaluating the posterior probabilities as disclosed above yields, for one of the classes, the value 0.1509.
  • Fig. 3 schematically illustrates, in terms of a number of functional units, the components of a classification device 200 according to an embodiment.
  • Processing circuitry 310 is provided using any combination of one or more of a suitable central processing unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), etc., capable of executing software instructions stored in a computer program product 710 (as in Fig. 7), e.g. in the form of a storage medium 330.
  • the processing circuitry 310 may further be provided as at least one application specific integrated circuit (ASIC), or field programmable gate array (FPGA).
  • the processing circuitry 310 is configured to cause the classification device 200 to perform a set of operations, or steps, S102-S106, as disclosed above.
  • the storage medium 330 may store the set of operations.
  • the processing circuitry 310 may be configured to retrieve the set of operations from the storage medium 330 to cause the classification device 200 to perform the set of operations.
  • the set of operations may be provided as a set of executable instructions.
  • the processing circuitry 310 is thereby arranged to execute methods as herein disclosed.
  • the storage medium 330 may also comprise persistent storage, which, for example, can be any single one or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory.
  • the classification device 200 may further comprise a communications interface 320 configured for communications with another device, for example to obtain the MFCCs as in step S102 and to provide a result of the classification as performed in step S106.
  • the communications interface 320 may comprise one or more transmitters and receivers, comprising analogue and digital components.
  • the processing circuitry 310 controls the general operation of the classification device 200 e.g. by sending data and control signals to the communications interface 320 and the storage medium 330, by receiving data and reports from the communications interface 320, and by retrieving data and instructions from the storage medium 330.
  • Other components, as well as the related functionality, of the classification device 200 are omitted in order not to obscure the concepts presented herein.
  • Fig. 4 schematically illustrates, in terms of a number of functional modules, the components of a classification device 200 according to an embodiment.
  • the classification device 200 of Fig. 4 comprises a number of functional modules: an obtain module 310a configured to perform step S102, a determine module 310b configured to perform step S104, and a classify module 310c configured to perform step S106.
  • the classification device 200 of Fig. 4 may further comprise a number of optional functional modules, such as any of a determine module 310d configured to perform step S104a, a determine module 310e configured to perform step S104b, a sum module 310f configured to perform step S104c, and an extract module 310g configured to perform step S102a.
  • each functional module 310a-310g may in one embodiment be implemented only in hardware and in another embodiment with the help of software, i.e., the latter embodiment having computer program instructions stored on the storage medium 330 which, when run on the processing circuitry, make the classification device 200 perform the corresponding steps mentioned above in conjunction with Fig. 4.
  • To the extent that the functional modules correspond to parts of a computer program, they do not need to be separate modules therein, but the way in which they are implemented in software is dependent on the programming language used.
  • one or more or all functional modules 310a-310g may be implemented by the processing circuitry 310, possibly in cooperation with functional units 320 and/or 330.
  • the processing circuitry 310 may thus be configured to fetch, from the storage medium 330, instructions as provided by a functional module 310a-310g and to execute these instructions, thereby performing any steps as will be disclosed herein.
  • the classification device 200 may be provided as a standalone device or as a part of at least one further device.
  • the classification device 200 may be provided in an audio mining device.
  • functionality of the classification device 200 may be distributed between at least two devices, or nodes. These at least two nodes, or devices, may either be part of the same network part or may be spread between at least two such network parts.
  • a first portion of the instructions performed by the classification device 200 may be executed in a first device, and a second portion of the instructions performed by the classification device 200 may be executed in a second device; the herein disclosed embodiments are not limited to any particular number of devices on which the instructions performed by the classification device 200 may be executed.
  • the methods according to the herein disclosed embodiments are suitable to be performed by a classification device 200 residing in a cloud computational environment. Therefore, although a single processing circuitry 310 is illustrated in Fig. 3 the processing circuitry 310 may be distributed among a plurality of devices, or nodes. The same applies to the functional modules 310a-310g of Fig. 4 and the computer program 720 of Fig. 7 (see below).
  • Fig. 7 shows one example of a computer program product 710 comprising a computer readable storage medium 730.
  • a computer program 720 can be stored, which computer program 720 can cause the processing circuitry 310 and thereto operatively coupled entities and devices, such as the communications interface 320 and the storage medium 330, to execute methods according to embodiments described herein.
  • the computer program 720 and/or computer program product 710 may thus provide means for performing any steps as herein disclosed.
  • the computer program product 710 is illustrated as an optical disc, such as a CD (compact disc) or a DVD (digital versatile disc) or a Blu-Ray disc.
  • the computer program product 710 could also be embodied as a memory, such as a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM) and more particularly as a non-volatile storage medium of a device in an external memory such as a USB (Universal Serial Bus) memory or a Flash memory, such as a compact Flash memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

There is provided mechanisms for non-parametric audio classification. A method is performed by a classification device. The method comprises obtaining a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The method comprises determining per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The method comprises classifying the input sequence to belong to the class for which the per-class posterior probability is largest.

Description

NON-PARAMETRIC AUDIO CLASSIFICATION
TECHNICAL FIELD
Embodiments presented herein relate to a method, a classification device, a computer program, and a computer program product for non-parametric audio classification.
BACKGROUND
In general terms, audio mining is a technique by which the content of an audio signal (comprising an audio waveform) can be automatically analyzed and searched. It is commonly used in the field of automatic speech
recognition, where the analysis tries to identify any speech within the audio. The audio signal will typically be processed by a speech recognition system in order to identify word or phoneme units that are likely to occur in the spoken content. In turn, this information can be used to identify a language used in the audio signal, which speaker is producing the audio waveform, the gender of the speaker producing the audio waveform, etc. This information may either be used immediately in pre-defined searches for keywords, languages, speakers, gender (a real-time word spotting system), or the output of the speech recognizer may be stored in an index file. One or more audio mining index files can then be loaded at a later date in order to run searches for any of the above parameters (keywords, languages, speakers, gender, etc.).
The parameters can be represented by classes. That is, assuming that the audio signal is to be classified in terms of language, there may be a set of classes where each class represents a unique language, and where the classification intends to determine which one of these languages is used in the audio signal.
Generally, probabilistic techniques attempt to discover the unknown class by estimating a probability density function. In general terms, there are two major classes of such techniques, namely parametric (also known as model-based) techniques and non-parametric techniques. Generally, parametric techniques assume a known form of the underlying probability density function and adjust the model parameters to available training data. This technique has low computational and storage requirements, and it can be applied with a limited amount of training data. One disadvantage is that the form of the underlying probability density function is not known in most practical applications, and therefore a mismatch between the assumed form of the probability density function and the true form of the probability density function might occur.
Generally, non-parametric techniques have no prior assumption of the form of the underlying probability density function, and therefore do not suffer from the above-mentioned model mismatch. For example, a k-Nearest-Neighbor approach attempts to estimate posterior probabilities P(ω_m|x) for an unlabeled observation point x from a set of L pre-stored labeled training samples in the following way. In a first step, a cell is centered around x and grown until the cell captures the k nearest neighbors of x. For example, k could be selected as √L. In a second step, if k_m of these samples are labeled ω_m, then the posterior probabilities are estimated as:
P(ω_m|x) = k_m / k,  m = 1, …, M    (1)
For a reliable estimate, k has to be large and all k neighbors have to be close to x. This could be achieved by pre-storing large amounts of labeled training data, i.e., L→∞. Hence, non-parametric techniques come with high computational complexity and storage requirements, which are prohibitive for many practical applications.
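For illustration only, the k-Nearest-Neighbor estimate of Equation (1) may be sketched as follows. All names are illustrative and not part of the herein disclosed embodiments:

```python
import numpy as np

def knn_posteriors(x, train_x, train_labels, num_classes, k=None):
    """Estimate P(w_m | x) as k_m / k from the k nearest labeled samples."""
    L = len(train_x)
    if k is None:
        k = max(1, int(np.sqrt(L)))  # k = sqrt(L), one common choice
    dists = np.linalg.norm(train_x - x, axis=1)   # grow a cell around x
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    counts = np.bincount(train_labels[nearest], minlength=num_classes)
    return counts / k                             # eq. (1): P(w_m | x) = k_m / k
```

Note that the entire labeled training set must be stored and searched for every observation point, which illustrates the complexity and storage issue discussed above.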
Hence, there is still a need for an improved classification of audio.

SUMMARY
An object of embodiments herein is to provide efficient audio classification.
According to a first aspect there is presented a method for non-parametric audio classification. The method is performed by a classification device. The method comprises obtaining a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The method comprises determining per- class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The method comprises classifying the input sequence to belong to the class for which the per-class posterior probability is largest.
Advantageously this provides efficient audio classification.
Advantageously this method is flexible with respect to the probability density function of the short-term frequency representation of an audio waveform by using a non-parametric estimation technique whilst only requiring low complexity requirements comparable to the complexity requirements of parametric based estimation techniques.
According to a second aspect there is presented a classification device for non-parametric audio classification. The classification device comprises processing circuitry. The processing circuitry is configured to cause the classification device to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The processing circuitry is configured to cause the classification device to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The processing circuitry is configured to cause the classification device to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
Advantageously the proposed classification device requires low
computational effort for performing the non-parametric audio classification and is therefore practically implementable. According to a third aspect there is presented a classification device for non- parametric audio classification. The classification device comprises processing circuitry and a computer program product. The computer program product stores instructions that, when executed by the processing circuitry, causes the classification device to perform steps, or operations. The steps, or operations, cause the classification device to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The steps, or operations, cause the classification device to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The steps, or operations, cause the classification device to classify the input sequence to belong to the class for which the per- class posterior probability is largest.
According to a fourth aspect there is presented a classification device for non- parametric audio classification. The classification device comprises an obtain module configured to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The classification device comprises a determine module configured to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The classification device comprises a classify module configured to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
According to a fifth aspect there is presented a computer program for non- parametric audio classification, the computer program comprising computer program code which, when run on a classification device, causes the classification device to perform a method according to the first aspect. According to a sixth aspect there is presented a computer program product comprising a computer program according to the fifth aspect and a computer readable storage medium on which the computer program is stored.
It is to be noted that any feature of the first, second, third, fourth, fifth and sixth aspects may be applied to any other aspect, wherever appropriate.
Likewise, any advantage of the first aspect may equally apply to the second, third, fourth, fifth, and/or sixth aspect, respectively, and vice versa. Other objectives, features and advantages of the enclosed embodiments will be apparent from the following detailed disclosure, from the attached dependent claims as well as from the drawings.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the element, apparatus, component, means, step, etc." are to be interpreted openly as referring to at least one instance of the element, apparatus, component, means, step, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.
BRIEF DESCRIPTION OF THE DRAWINGS
The inventive concept is now described, by way of example, with reference to the accompanying drawings, in which:
Fig. l is a schematic diagram illustrating probability density functions of two classes;
Fig. 2 is a schematic block diagram of a classification device according to an embodiment;
Fig. 3 is a schematic diagram showing functional units of a classification device according to an embodiment;
Fig. 4 is a schematic diagram showing functional modules of a classification device according to an embodiment; Figs. 5 and 6 are flowcharts of methods according to embodiments; and
Fig. 7 shows one example of a computer program product comprising computer readable storage medium according to an embodiment.
DETAILED DESCRIPTION
The inventive concept will now be described more fully hereinafter with reference to the accompanying drawings, in which certain embodiments of the inventive concept are shown. This inventive concept may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided by way of example so that this disclosure will be thorough and complete, and will fully convey the scope of the inventive concept to those skilled in the art. Like numbers refer to like elements throughout the description. Any step or feature illustrated by dashed lines should be regarded as optional.
The embodiments disclosed herein relate to non-parametric audio
classification. In order to obtain non-parametric audio classification there is provided a classification device, a method performed by the classification device, a computer program product comprising code, for example in the form of a computer program, that when run on a classification device, causes the classification device 200 to perform the method.
As an introductory example, reference is now made to Fig. 1. Fig. 1 illustrates at 100 and 110 probability density functions of two classes ω1 and ω2. Assume that a data point xn = 4 (as identified by reference numeral 130 in Fig. 1) is observed. The task of the classification device is to assign the data point xn to one of the classes ω1 and ω2, without having any assumption of the underlying form of the probability density functions 100, 110. Further explanation of Fig. 1 will be provided below.
Reference is now made to Fig. 2. Fig. 2 is a schematic block diagram of a classification device 200 according to an embodiment. According to the embodiment of Fig. 2, the classification device 200 comprises a classification module 210 and an optional training module 220. In embodiments where the training module 220 is not present it is assumed that the classification module 210 is provided with values determined by an external training module 220.
Figs. 5 and 6 are flow charts illustrating embodiments of methods for non- parametric audio classification. The methods are performed by the classification device 200. The methods are advantageously provided as computer programs 720.
Reference is now made to Fig. 5 illustrating a method for non-parametric audio classification as performed by the classification device 200 according to an embodiment.
Step S102: The classification device 200 obtains a short-term frequency representation of an audio waveform. The short-term frequency representation defines an input sequence x which is assumed to be divided into input vectors xn. More particularly, the input sequence x can be assumed to consist of N, D-dimensional, vectors and hence be written as x = {xn}, n = 1, ..., N.
Step S104: The classification device 200 determines per-class posterior probabilities P(ωm|x) for at least two classes ωm. Each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes ωm. Each class ωm (in the set of classes {ωm}, m = 1, ..., M) represents a unique audio classification property.

Step S106: The classification device 200 classifies the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
Embodiments relating to further details of non-parametric audio
classification as performed by the classification device 200 will now be disclosed. Reference is made to Fig. 6 illustrating methods for non- parametric audio classification as performed by the classification device 200 according to further embodiments. Steps S102, S104, S106 are performed as in Fig. 5 and a repeated description of those steps is therefore omitted.
According to an embodiment the weighted sum P(ωm|xn) is determined using cluster contribution weights λk. The cluster contribution weights λk are defined by distances Δk between the input sequence x and a set of clusters ck. Each cluster ck belongs to one of the at least two classes ωm. In general there are M classes, and this is formally denoted by a set of classes {ωm}, m = 1, ..., M. For a given class ωm the weighted sums P(ωm|xn), when summed over all input vectors xn, define the per-class posterior probability P(ωm|x) for this given class ωm.
There could be different properties that the at least two classes ωm represent. For example, the non-parametric audio classification can be performed to classify languages, speakers, and/or genders. Hence, according to an embodiment each class ωm represents a unique language, a unique speaker, or a unique gender.
There could be different ways for the classification device 200 to obtain the short-term frequency representation of the audio waveform as in step S102. According to an embodiment the short-term frequency representation is provided by mel-frequency cepstral components (MFCCs). In this respect, the MFCCs are coefficients that collectively make up a mel-frequency cepstrum (MFC). The MFC is a representation of the short-term power spectrum of the audio waveform, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

There could be different ways for the classification device 200 to obtain the MFCCs. According to one embodiment the MFCCs are made readily available to the classification device 200. Hence, this embodiment requires a module configured to provide the MFCCs from the audio waveform. According to another embodiment the classification device 200 receives the audio waveform and extracts the MFCCs from the audio waveform. Hence, according to an embodiment the classification device 200 is configured to perform step S102a:

Step S102a: The classification device 200 extracts the MFCCs from the audio waveform. How to extract MFCCs from an audio waveform is as such known in the art and further description thereof is therefore omitted. Step S102a is performed as part of step S102. Each input vector xn can then correspond to a vector of MFCCs. Assuming that the audio waveform is composed of frames, there is then one vector of MFCCs per frame.
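For orientation only, the MFCC extraction of step S102a may be sketched as follows. This is a simplified, hypothetical single-frame implementation; the filterbank size, FFT length, and the omission of pre-emphasis and liftering are illustrative choices, not requirements of the herein disclosed embodiments:

```python
import numpy as np

def mfcc(frame, sr=16000, n_mels=20, n_ceps=13):
    """Simplified MFCC of one frame: windowed power spectrum -> triangular
    mel filterbank -> log -> DCT-II. Omits pre-emphasis and liftering."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n))) ** 2

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters equally spaced on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, len(spec)))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_mel = np.log(fbank @ spec + 1e-10)

    # DCT-II of the log mel energies gives the cepstral coefficients
    i = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (i[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return dct @ log_mel
```

Applying this per frame yields one vector of MFCCs per frame, i.e., one input vector xn per frame.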
There could be different types of audio waveforms. According to an
embodiment the audio waveform represents a speech signal.
There could be different types of audio classification of which the herein disclosed methods for non-parametric audio classification could be part. According to an embodiment the step S102 of obtaining, the step S104 of determining, and the step S106 of classifying are performed in an audio mining application.
According to an embodiment the methods for non-parametric audio classification comprise a training stage and a classification stage.
Aspects of the training stage, as implemented by the training module 220, will now be disclosed in detail.
According to an embodiment the per-cluster posterior probabilities P(ωm|μk) are determined through training on a training sequence y of MFCCs.
The training could be based on k-means clustering of the training sequence y. Further details thereof will now be disclosed. Let the training sequence y consist of L, D-dimensional, vectors y = {yl}, l = 1, ..., L, where each vector yl is linked to its corresponding class ωm. First the training sequence y is by the classification device 200 organized in K clusters {ck}, k = 1, ..., K, by means of a k-means clustering algorithm. This results in a codebook with K, D-dimensional, centroids μk. In addition to its centroid, each cluster ck is determined by its squared median absolute deviation factor pk, which the classification device 200 determines as:

pk = median(|Zk − median(Zk)|)² (2)

Here, Zk denotes the set of points that belong to cluster ck (that is, Zk is a sub-set of y in which every point is closer to μk than to any other centroid μj, j ≠ k). In general, the median absolute deviation factor pk is a statistic, robust to outliers, that captures variations inside cluster ck. Hence, according to an embodiment where each cluster ck has a cluster centroid μk, each per-cluster posterior probability P(ωm|μk) for a particular class ωm and a particular cluster centroid μk represents a conditional probability of the particular class ωm given the particular cluster centroid μk.
After the set of clusters ck ≡ {μk, pk}, for k = 1, ..., K, is determined, the classification device 200 links each of the clusters ck to a table of posterior probabilities for each of the M classes, P(ωm|μk), m = 1, ..., M. In order to do so the classification device 200 implements the following operations:

    for k = 1, ..., K
        determine the size of Zk and store it in a variable Lk
        for m = 1, ..., M
            count the number of labels of class ωm and denote it as Lk,m
            store P(ωm|μk) as: P(ωm|μk) = Lk,m / Lk
        end for
    end for
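The training-stage operations above may, for illustration, be sketched as follows. The k-means initialization and the application of Equation (2) over all coordinates of a cluster are hypothetical implementation choices; all names are illustrative:

```python
import numpy as np

def train_clusters(y, labels, K, M, iters=20, seed=0):
    """Training-stage sketch: k-means codebook, squared MAD factor p_k (eq. 2),
    and per-cluster posterior table P(w_m | mu_k) = L_{k,m} / L_k."""
    rng = np.random.default_rng(seed)
    mu = y[rng.choice(len(y), size=K, replace=False)].copy()
    for _ in range(iters):
        # assign each training vector to its nearest centroid, then update
        assign = np.argmin(((y[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(assign == k):
                mu[k] = y[assign == k].mean(axis=0)
    p = np.empty(K)
    post = np.zeros((K, M))              # table of P(w_m | mu_k)
    for k in range(K):
        Zk = y[assign == k]              # points closest to centroid mu_k
        dev = np.abs(Zk - np.median(Zk, axis=0))
        p[k] = np.median(dev) ** 2       # squared median absolute deviation, eq. (2)
        Lk = len(Zk)
        for m in range(M):
            post[k, m] = np.count_nonzero(labels[assign == k] == m) / Lk
    return mu, p, post
```

Only the K centroids, the K deviation factors, and the K-by-M posterior table need to be stored for the classification stage, in contrast to the full training set required by the k-Nearest-Neighbor approach.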
Aspects of the classification stage, as implemented by the classification module 210, will now be disclosed in detail.
The classification device 200 is configured to, during the classification stage, assign an unlabeled sequence, as defined by the input sequence x as obtained in step S102, to the correct class from the at least two classes {ωm}, m = 1, ..., M. This correct class is denoted ωm*. As above, the input sequence x is assumed to consist of N, D-dimensional, vectors x = {xn}, n = 1, ..., N.
The classification device 200 is configured to determine a set of distances Δk from each data point xn in the input sequence x to all clusters ck. According to an embodiment there is one cluster contribution weight λk for each input vector xn. The classification device 200 is then configured to determine the cluster contribution weights λk for input vector xn by performing step S104a:

Step S104a: The classification device 200 determines one distance Δk between the input vector xn and each of the clusters ck. Step S104a is performed as part of step S104.
Further, according to an embodiment each of the distances Δk is made inversely proportional to the median absolute deviation factor pk relating to a spread of points inside cluster ck. According to some aspects the distance between point xn and cluster ck is defined as:

Δk(xn, ck) = ||xn − μk||² / pk (3)

For data point xn this results in K distances {Δk}, k = 1, ..., K. The inverse of these distances can be used to weigh the contribution of the clusters in the probability density function estimation at point xn. The distances {Δk} from the data point xn to all clusters ck are used to create a weighted sum of the per-cluster posterior probabilities. This weighted sum serves as an estimate of the posterior probabilities at the observation point xn. According to an embodiment the classification device 200 is configured to determine the weighted sum P(ωm|xn) by performing step S104b:

Step S104b: The classification device 200 determines one cluster contribution weight λk for each distance Δk. The cluster contribution weight λk is inversely proportional to the distance Δk from which it is determined. Step S104b is performed as part of step S104. Further, each cluster contribution weight λk can be inversely proportional to a sum based on all distances Δk. According to some aspects the cluster contribution weights λk are determined as:
λk = Δk^(−η) / Σj=1,...,K Δj^(−η) (4)
Here, η is an expansion constant. Values that could be used for the expansion constant η are between 2 and 8, based on what property the non-parametric audio classification is to be performed to classify and on the dimensionality of the input sequence x.
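Steps S104a and S104b may, for one input vector xn, be sketched as follows. The weights are assumed to be formed by normalizing the inverse-power distances Δk^(−η) over all clusters, an interpretation consistent with the worked example of Fig. 1; all names are illustrative:

```python
import numpy as np

def cluster_weights(x_n, mu, p, eta=2.0):
    """Distances to all clusters (eq. 3) and contribution weights (eq. 4).

    x_n: (D,) input vector; mu: (K, D) centroids; p: (K,) MAD factors."""
    delta = ((x_n - mu) ** 2).sum(axis=-1) / p   # eq. (3): ||x_n - mu_k||^2 / p_k
    inv = delta ** (-eta)                        # closer clusters contribute more
    return delta, inv / inv.sum()                # eq. (4): weights sum to one
```

The normalization makes the weights of all K clusters sum to one, so the weighted sum of per-cluster posteriors in Equation (5) is itself a valid probability distribution over the M classes.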
According to some aspects the classification device 200 is configured to estimate the posterior probabilities at point xn from the pre-stored per-cluster posterior probabilities P(ωm|μk) and the cluster contribution weights λk, as:

P(ωm|xn) = Σk=1,...,K λk P(ωm|μk), m = 1, ..., M (5)

It is here noted that Lk,m/Lk as determined during the training stage is used to represent P(ωm|μk).
Further, according to an embodiment there is one weighted sum P(ωm|xn) per class ωm and per input vector xn, and all weighted sums P(ωm|xn) for one class ωm are combined over all input vectors xn to define one per-class posterior probability P(ωm|x).
According to an embodiment the classification device 200 is configured to determine the per-class posterior probability P(ωm|x) for a given class ωm by performing step S104c:

Step S104c: The classification device 200 sums the weighted sum P(ωm|xn) over all input vectors xn for said given class ωm. Step S104c is performed as part of step S104. Logarithmic values of the weighted sum P(ωm|xn) can be summed over all input vectors xn to determine a logarithmic value of the per-class posterior probability P(ωm|x)
for said given class ωm. That is, to assign the optimal class ωm* to the input sequence x, the classification device 200 can be configured to determine the log-probability of the entire input sequence x (over all N observations) as follows:

log(P(ωm|x)) = Σn=1,...,N log(P(ωm|xn)) (6)

The classification device 200 is then configured to determine the class ωm that corresponds to the largest posterior as:

ωm* = argmax over ωm of { log(P(ωm|x)) } (7)

The optimal class ωm* is thus found as the class ωm with the largest posterior probability, given the input sequence x.
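Putting the classification stage together, Equations (5), (6) and (7) may be sketched as follows. The cluster contribution weights are formed by normalizing the inverse-power distances, an interpretation of Equation (4) consistent with the worked example; all names are illustrative:

```python
import numpy as np

def classify(x_seq, mu, p, post, eta=2.0):
    """Classification-stage sketch: per-vector posteriors (5), log-sum (6), argmax (7).

    x_seq: (N, D) input vectors; mu: (K, D) centroids; p: (K,) MAD factors;
    post:  (K, M) pre-stored table of P(w_m | mu_k) from the training stage."""
    log_p = np.zeros(post.shape[1])
    for x_n in x_seq:
        delta = ((x_n - mu) ** 2).sum(axis=-1) / p   # eq. (3)
        lam = delta ** (-eta)
        lam /= lam.sum()                             # eq. (4)
        log_p += np.log(lam @ post + 1e-300)         # eqs. (5) and (6)
    return int(np.argmax(log_p))                     # eq. (7): index of w_m*
```

Note that the per-vector cost is dominated by the K distance evaluations against the codebook, independently of the size of the original training set.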
A non-limiting illustrative example of at least some of the herein disclosed embodiments will be provided next. A training sequence y representing data that belong to one of two classes ω1, ω2 is generated, as illustrated by the probability density functions 100 and 110 in Fig. 1, where probability density function 100 represents class ω1, and where probability density function 110 represents class ω2. The cluster centers μ1, μ2, μ3, and μ4 are identified at reference numerals 120a, 120b, 120c, and 120d and have values (from left to right in Fig. 1) 4.91, 8.89, 10.99, and 15.09, which are also reflected in the first column of Table 1.
The training stage of the presented algorithm obtains four clusters, and the corresponding posterior probabilities P(ωm|μk) as provided in Table 1.

Clusters                       Posterior probabilities
c1: μ1 = 4.91,  p1 = 1.36      P(ω1|μ1) = 0.85   P(ω2|μ1) = 0.15
c2: μ2 = 8.89,  p2 = 1.71      P(ω1|μ2) = 0.43   P(ω2|μ2) = 0.57
c3: μ3 = 10.99, p3 = 1.39      P(ω1|μ3) = 0.56   P(ω2|μ3) = 0.44
c4: μ4 = 15.09, p4 = 1.71      P(ω1|μ4) = 0.12   P(ω2|μ4) = 0.88

Table 1: Clusters and pre-stored posterior probabilities obtained from a training sequence.
For the classification stage it is assumed that a data point xn = 4 (as identified by reference numeral 130 in Fig. 1) is observed and the task of the classification device 200 is to assign the data point xn to one of the classes ω1 and ω2.
The classification device 200 determines the distances to all clusters by implementing Equation (3), and thus obtains Δk(xn, ck) = {0.6089, 13.9837, 35.1512, 71.9229}. The classification device 200 determines the cluster contributions at the given point xn by implementing Equation (4) and obtains λk = {0.9977, 0.0019, 0.0003, 0.0001}. The classification device 200 determines the two posterior probabilities at point xn by implementing Equation (5) and thus obtains P(ω1|xn) = 0.8491 and P(ω2|xn) = 0.1509. The classification device 200 determines, by implementing Equations (6) and (7), that the point xn = 4 belongs to class ω1.
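For illustration, the numbers in this example can be reproduced with the following sketch; the expansion constant η = 2 in Equation (4) is an assumption that matches the quoted values:

```python
import numpy as np

# Table 1: centroids, squared MAD factors, and per-cluster posteriors
mu = np.array([4.91, 8.89, 10.99, 15.09])
p = np.array([1.36, 1.71, 1.39, 1.71])
post = np.array([[0.85, 0.15],
                 [0.43, 0.57],
                 [0.56, 0.44],
                 [0.12, 0.88]])

x_n = 4.0
delta = (x_n - mu) ** 2 / p              # eq. (3): {0.6089, 13.9837, 35.1512, 71.9229}
lam = delta ** -2.0
lam /= lam.sum()                         # eq. (4): {0.9977, 0.0019, 0.0003, 0.0001}
P = lam @ post                           # eq. (5): P(w1|x_n) = 0.8491, P(w2|x_n) = 0.1509
winner = int(np.argmax(np.log(P)))       # eqs. (6)-(7): class w1 (index 0)
```

The nearest cluster c1 dominates the weighted sum, so the estimated posteriors at xn = 4 essentially follow the table entries for c1.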
Fig. 3 schematically illustrates, in terms of a number of functional units, the components of a classification device 200 according to an embodiment.
Processing circuitry 310 is provided using any combination of one or more of a suitable central processing unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), etc., capable of executing software instructions stored in a computer program product 710 (as in Fig. 7), e.g. in the form of a storage medium 330. The processing circuitry 310 may further be provided as at least one application specific integrated circuit (ASIC), or field
programmable gate array (FPGA). Particularly, the processing circuitry 310 is configured to cause the
classification device 200 to perform a set of operations, or steps, S102-S106, as disclosed above. For example, the storage medium 330 may store the set of operations, and the processing circuitry 310 may be configured to retrieve the set of operations from the storage medium 330 to cause the classification device 200 to perform the set of operations. The set of operations may be provided as a set of executable instructions.
Thus the processing circuitry 310 is thereby arranged to execute methods as herein disclosed. The storage medium 330 may also comprise persistent storage, which, for example, can be any single one or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory. The classification device 200 may further comprise a communications interface 320 configured for communications with another device, for example to obtain the MFCCs as in step S102 and to provide a result of the classification as performed in step S106. As such the
communications interface 320 may comprise one or more transmitters and receivers, comprising analogue and digital components. The processing circuitry 310 controls the general operation of the classification device 200 e.g. by sending data and control signals to the communications interface 320 and the storage medium 330, by receiving data and reports from the communications interface 320, and by retrieving data and instructions from the storage medium 330. Other components, as well as the related
functionality, of the classification device 200 are omitted in order not to obscure the concepts presented herein.
Fig. 4 schematically illustrates, in terms of a number of functional modules, the components of a classification device 200 according to an embodiment. The classification device 200 of Fig. 4 comprises a number of functional modules; an obtain module 310a configured to perform step S102, a determine module 310b configured to perform step S104, and a classify module 310c configured to perform step S106. The classification device 200 of Fig. 4 may further comprise a number of optional functional modules, such as any of a determine module 310d configured to perform step S104a, a determine module 310e configured to perform step S104b, a sum module 310f configured to perform step S104c, and an extract module 310g configured to perform step S102a. In general terms, each functional module 310a-310g may in one embodiment be implemented only in hardware and in another embodiment with the help of software, i.e., the latter embodiment having computer program instructions stored on the storage medium 330 which when run on the processing circuitry make the classification device 200 perform the corresponding steps mentioned above in conjunction with Fig. 4.
It should also be mentioned that even though the modules correspond to parts of a computer program, they do not need to be separate modules therein, but the way in which they are implemented in software is dependent on the programming language used. Preferably, one or more or all functional modules 310a-310g may be implemented by the processing circuitry 310, possibly in cooperation with functional units 320 and/or 330. The processing circuitry 310 may thus be configured to from the storage medium 330 fetch instructions as provided by a functional module 310a-310g and to execute these instructions, thereby performing any steps as disclosed herein.
The classification device 200 may be provided as a standalone device or as a part of at least one further device. For example, the classification device 200 may be provided in an audio mining device. Alternatively, functionality of the classification device 200 may be distributed between at least two devices, or nodes. These at least two nodes, or devices, may either be part of the same network part or may be spread between at least two such network parts.
Thus, a first portion of the instructions performed by the classification device 200 may be executed in a first device, and a second portion of the instructions performed by the classification device 200 may be executed in a second device; the herein disclosed embodiments are not limited to any particular number of devices on which the instructions performed by the classification device 200 may be executed. Hence, the methods according to the herein disclosed embodiments are suitable to be performed by a classification device 200 residing in a cloud computational environment. Therefore, although a single processing circuitry 310 is illustrated in Fig. 3, the processing circuitry 310 may be distributed among a plurality of devices, or nodes. The same applies to the functional modules 310a-310g of Fig. 4 and the computer program 720 of Fig. 7 (see below).
Fig. 7 shows one example of a computer program product 710 comprising computer readable storage medium 730. On this computer readable storage medium 730, a computer program 720 can be stored, which computer program 720 can cause the processing circuitry 310 and thereto operatively coupled entities and devices, such as the communications interface 320 and the storage medium 330, to execute methods according to embodiments described herein. The computer program 720 and/or computer program product 710 may thus provide means for performing any steps as herein disclosed. In the example of Fig. 7, the computer program product 710 is illustrated as an optical disc, such as a CD (compact disc) or a DVD (digital versatile disc) or a Blu-Ray disc. The computer program product 710 could also be embodied as a memory, such as a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM) and more particularly as a non-volatile storage medium of a device in an external memory such as a USB (Universal Serial Bus) memory or a Flash memory, such as a compact Flash memory. Thus, while the computer program 720 is here schematically shown as a track on the depicted optical disk, the computer program 720 can be stored in any way which is suitable for the computer program product 710.
The inventive concept has mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the inventive concept, as defined by the appended patent claims.

CLAIMS
1. A method for non-parametric audio classification, the method being performed by a classification device (200), the method comprising:

obtaining (S102) a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;

determining (S104) per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities for the at least two classes, and wherein each class ωm represents a unique audio classification property; and

classifying (S106) the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
2. The method according to claim 1, wherein the weighted sum P(ωm|xn) is determined using cluster contribution weights λk defined by distances Δk between the input sequence x and a set of clusters ck.

3. The method according to any of the preceding claims, wherein, for a given class, the weighted sum P(ωm|xn) when summed over all input vectors xn defines the per-class posterior probability P(ωm|x) for said given class.

4. The method according to any of the preceding claims, wherein there is one cluster contribution weight λk for each input vector xn, and wherein determining the cluster contribution weight λk for input vector xn comprises: determining (S104a) one distance Δk between the input vector xn and each of the clusters ck.

5. The method according to claim 4, wherein each of the distances Δk is made inversely proportional to a median absolute deviation factor pk relating to a spread of points inside cluster ck.
6. The method according to claim 4 or 5, wherein determining the weighted sum P( 6)m \xn) of per-cluster posterior probabilities P( 6)m |μ¾9 comprises:
determining (Si04b) one cluster contribution weight λ¾ for each distance Ak, and wherein the cluster contribution weight λ¾ is inversely proportional to the distance Ak from which it is determined.
7. The method according to claim 6, wherein each cluster contribution weight λk is inversely proportional to a sum based on all distances Δk.
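Claims 4 to 7 specify how the per-vector cluster contribution weights are formed. One plausible realisation, sketched below for a single input vector, assumes Euclidean distances, a per-cluster median-absolute-deviation factor ρk supplied from training (claim 5), and normalisation by the sum of inverse distances (claims 6 and 7); these specific choices are illustrative assumptions, not requirements of the claims.

```python
import numpy as np

def contribution_weights(x_n, centroids, rho):
    """One weight lambda_k per cluster for a single input vector x_n.

    x_n       : (D,) one input vector
    centroids : (K, D) cluster centroids mu_k
    rho       : (K,) median absolute deviation factor per cluster;
                a larger spread shrinks the effective distance (claim 5).
    """
    # Distance Delta_k from x_n to each cluster, scaled so that Delta_k
    # is inversely proportional to rho_k.
    delta = np.linalg.norm(x_n - centroids, axis=1) / rho

    # lambda_k inversely proportional to Delta_k, normalised by a sum
    # over all distances (claims 6 and 7); weights sum to one.
    inv = 1.0 / (delta + 1e-12)
    return inv / inv.sum()
```

For example, a vector at distance 1 from one centroid and 4 from another (equal ρk) receives weights 0.8 and 0.2: the nearer cluster's stored posterior dominates the weighted sum.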
8. The method according to claim 6 or 7, wherein there is one weighted sum P(ωm|xn) per class ωm and per input vector xn, and wherein all weighted sums P(ωm|xn) for one class ωm are combined over all input vectors xn to define one per-class posterior probability P(ωm|x).
9. The method according to claim 8, wherein determining the per-class posterior probabilities P(ωm|x) for a given class ωm comprises:
summing (S104c) the weighted sum P(ωm|xn) over all input vectors xn for said given class ωm.
10. The method according to claim 9, wherein logarithmic values of the weighted sum P(ωm|xn) are summed over all input vectors xn to determine a logarithmic value of the per-class posterior probability P(ωm|x) for said given class ωm.
11. The method according to any of the preceding claims, wherein each cluster Ck has a cluster centroid μk, and wherein each per-cluster posterior probability P(ωm|μk) for a particular class ωm and a particular cluster centroid μk represents a conditional probability of said particular class ωm given said particular cluster centroid μk.
12. The method according to any of the preceding claims, wherein the per-cluster posterior probabilities P(ωm|μk) are determined through training on a training sequence y.
13. The method according to claim 12, wherein the training is based on k-means clustering of the training sequence y.
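Claims 12 and 13 only state that the per-cluster posteriors are obtained by training on a labelled sequence y using k-means. The sketch below shows one plausible realisation: plain Lloyd iterations with a deterministic initialisation, hard assignments, and relative class frequencies within each cluster as the stored posteriors P(ωm|μk). All of these choices (initialisation, frequency-based posteriors, function and variable names) are assumptions for illustration.

```python
import numpy as np

def train(Y, labels, K, n_classes, iters=20):
    """Cluster training vectors and tabulate P(w_m | mu_k).

    Y        : (T, D) training vectors from the training sequence y
    labels   : (T,) class index of each training vector
    Returns (centroids of shape (K, D), cluster_post of shape (K, M)).
    """
    # Deterministic initialisation: K vectors spread evenly over Y
    # (a production k-means would typically use random restarts).
    idx = np.linspace(0, len(Y) - 1, K).astype(int)
    centroids = Y[idx].astype(float)

    for _ in range(iters):  # Lloyd iterations, i.e. k-means (claim 13)
        d = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = Y[assign == k].mean(axis=0)

    # Per-cluster posterior P(w_m | mu_k): relative class frequencies
    # among the training vectors assigned to cluster k.
    post = np.zeros((K, n_classes))
    for k in range(K):
        members = labels[assign == k]
        if len(members):
            post[k] = np.bincount(members, minlength=n_classes) / len(members)
    return centroids, post
```

On well-separated training data, each cluster becomes class-pure, so the stored posteriors approach indicator vectors; on overlapping data they encode soft class membership per cluster.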
14. The method according to any of the preceding claims, wherein each class ωm represents a unique language, a unique speaker, or a unique gender.
15. The method according to claim 1, wherein the short-term frequency representation is provided by mel-frequency cepstral components, MFCCs.
16. The method according to claim 15, further comprising:
extracting (S102a) the MFCCs from the audio waveform.
17. The method according to claim 16, wherein each input vector xn corresponds to a vector of MFCCs, wherein the audio waveform is composed of frames, and wherein there is one vector of MFCCs per frame.
18. The method according to claim 16 or 17, wherein the audio waveform represents a speech signal.
19. The method according to any of the preceding claims, wherein said obtaining, said determining, and said classifying are performed in an audio mining application.
20. A classification device (200) for non-parametric audio classification, the classification device (200) comprising processing circuitry (310), the processing circuitry being configured to cause the classification device (200) to:
obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;
determine per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes, and wherein each class ωm represents a unique audio classification property; and
classify the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
21. A classification device (200) for non-parametric audio classification, the classification device (200) comprising:
processing circuitry (310); and
a computer program product (710) storing instructions that, when executed by the processing circuitry (310), cause the classification device (200) to:
obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;
determine per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes, and wherein each class ωm represents a unique audio classification property; and
classify the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
22. A classification device (200) for non-parametric audio classification, the classification device (200) comprising:
an obtain module (310a) configured to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;
a determine module (310b) configured to determine per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes, and wherein each class ωm represents a unique audio classification property; and
a classify module (310c) configured to classify the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
23. A computer program (720) for non-parametric audio classification, the computer program comprising computer code which, when run on processing circuitry (310) of a classification device (200), causes the classification device (200) to:
obtain (S102) a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;
determine (S104) per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes, and wherein each class ωm represents a unique audio classification property; and
classify (S106) the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
24. A computer program product (710) comprising a computer program (720) according to claim 23, and a computer readable storage medium (730) on which the computer program is stored.
PCT/EP2016/054586 2016-03-03 2016-03-03 Non-parametric audio classification WO2017148523A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2016/054586 WO2017148523A1 (en) 2016-03-03 2016-03-03 Non-parametric audio classification


Publications (1)

Publication Number Publication Date
WO2017148523A1 true WO2017148523A1 (en) 2017-09-08

Family

ID=55484980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2016/054586 WO2017148523A1 (en) 2016-03-03 2016-03-03 Non-parametric audio classification

Country Status (1)

Country Link
WO (1) WO2017148523A1 (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040221163A1 (en) * 2003-05-02 2004-11-04 Jorgensen Jimi T. Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
EDMONDO TRENTIN ET AL: "A survey of hybrid ANN/HMM models for automatic speech recognition", NEUROCOMPUTING, vol. 37, no. 1-4, 1 April 2001 (2001-04-01), pages 91 - 126, XP055131961, ISSN: 0925-2312, DOI: 10.1016/S0925-2312(00)00308-8 *
FRITSCH JÜRGEN ET AL: "Applying Divide and Conquer to Large Scale Pattern Recognition Tasks", 1 January 1901, CORRECT SYSTEM DESIGN; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CHAM, PAGE(S) 311 - 338, ISBN: 978-3-642-23953-3, ISSN: 0302-9743, XP047292588 *
MARIO BKASSINY ET AL: "A Survey on Machine-Learning Techniques in Cognitive Radios", IEEE COMMUNICATIONS SURVEYS AND TUTORIALS, INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, US, vol. 15, no. 3, 1 July 2013 (2013-07-01), pages 1136 - 1159, XP011523247, ISSN: 1553-877X, DOI: 10.1109/SURV.2012.100412.00017 *
SAMSUDIN N A ET AL: "Nearest neighbour group-based classification", PATTERN RECOGNITION, ELSEVIER, GB, vol. 43, no. 10, 1 October 2010 (2010-10-01), pages 3458 - 3467, XP027095435, ISSN: 0031-3203, [retrieved on 20100512] *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389576A (en) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 The optimization method and system of compressed speech recognition modeling
CN108389576B (en) * 2018-01-10 2020-09-01 苏州思必驰信息科技有限公司 Method and system for optimizing compressed speech recognition model
CN111583963A (en) * 2020-05-18 2020-08-25 合肥讯飞数码科技有限公司 Method, device and equipment for detecting repeated audio and storage medium
CN111583963B (en) * 2020-05-18 2023-03-21 合肥讯飞数码科技有限公司 Repeated audio detection method, device, equipment and storage medium


Legal Events

NENP: Non-entry into the national phase (Ref country code: DE)
121: Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16708395; Country of ref document: EP; Kind code of ref document: A1)
122: Ep: PCT application non-entry in European phase (Ref document number: 16708395; Country of ref document: EP; Kind code of ref document: A1)