WO2017148523A1 - Non-parametric audio classification - Google Patents

Non-parametric audio classification

Info

Publication number
WO2017148523A1
Authority
WO
WIPO (PCT)
Prior art keywords
class
per
cluster
classification
classification device
Prior art date
Application number
PCT/EP2016/054586
Other languages
French (fr)
Inventor
Volodya Grancharov
Sigurdur Sverrisson
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2016/054586 priority Critical patent/WO2017148523A1/en
Publication of WO2017148523A1 publication Critical patent/WO2017148523A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/285 Memory allocation or algorithm optimisation to reduce hardware requirements
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/085 Methods for reducing search complexity, pruning

Definitions

  • Embodiments presented herein relate to a method, a classification device, a computer program, and a computer program product for non-parametric audio classification.
  • audio mining is a technique by which the content of an audio signal (comprising an audio waveform) can be automatically analyzed and searched. It is commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio.
  • the audio signal will typically be processed by a speech recognition system in order to identify word or phoneme units that are likely to occur in the spoken content.
  • this information can be used to identify a language used in the audio signal, which speaker is producing the audio waveform, the gender of the speaker producing the audio waveform, etc.
  • This information may either be used immediately in pre-defined searches for keywords, languages, speakers, gender (a real-time word spotting system), or the output of the speech recognizer may be stored in an index file.
  • One or more audio mining index files can then be loaded at a later date in order to run searches for any of the above parameters (keywords, languages, speakers, gender, etc.).
  • the parameters can be represented by classes. That is, assuming that the audio signal is to be classified in terms of language, there may be a set of classes where each class represents a unique language, and where the classification intends to determine which one of these languages is used in the audio signal.
  • a probabilistic technique attempts to discover the unknown class by estimating a probability density function.
  • there are two major classes of such techniques, namely parametric (also known as model-based) techniques and non-parametric techniques.
  • parametric techniques assume a known form of the underlying probability density function and adjust the model parameters to available training data. This technique has low computational and storage requirements, and it can be applied with a limited amount of training data.
  • One disadvantage is that the form of the underlying probability density function is not known in most practical applications, and therefore a mismatch between the assumed form of the probability density function and the true form of the probability density function might occur.
  • a k-Nearest-Neighbor approach attempts to estimate posterior probabilities P(ω_m|x) for an unlabeled observation point x from a set of L pre-stored labeled training samples in the following way.
  • In a first step, a cell is centered around x and grown until the cell captures the k nearest neighbors of x. For example, k could be selected as √L.
  • In a second step, if k_m of these samples are labeled ω_m, then the posterior probabilities are estimated as P(ω_m|x) = k_m / k.
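The two steps above can be sketched as follows. The toy one-dimensional samples, the labels, and the helper name are invented for illustration; only the k = √L choice and the k_m / k estimate come from the text:

```python
import math

def knn_posteriors(x, samples, labels, num_classes):
    # k is chosen as the square root of the number of stored samples L.
    L = len(samples)
    k = max(1, round(math.sqrt(L)))
    # Step 1: grow a cell around x until it captures the k nearest neighbors.
    nearest = sorted(range(L), key=lambda i: abs(samples[i] - x))[:k]
    # Step 2: estimate P(class m | x) = k_m / k from the neighbor labels.
    counts = [0] * num_classes
    for i in nearest:
        counts[labels[i]] += 1
    return [c / k for c in counts]

# Toy 1-D training set: class 0 clusters near 0.0, class 1 near 10.0.
samples = [0.1, -0.2, 0.3, 9.8, 10.1, 10.3, 0.0, 9.9, 10.0]
labels = [0, 0, 0, 1, 1, 1, 0, 1, 1]
print(knn_posteriors(0.2, samples, labels, 2))  # -> [1.0, 0.0]
```

Note the cost driver visible even in this sketch: every query touches all L stored samples, which is exactly the complexity the embodiments avoid.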
  • An object of embodiments herein is to provide efficient audio classification.
  • a method for non-parametric audio classification is performed by a classification device.
  • the method comprises obtaining a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors.
  • the method comprises determining per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property.
  • the method comprises classifying the input sequence to belong to the class for which the per-class posterior probability is largest.
  • this method is flexible with respect to the probability density function of the short-term frequency representation of an audio waveform by using a non-parametric estimation technique, whilst only requiring complexity comparable to that of parametric estimation techniques.
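The obtain/determine/classify steps can be sketched end to end as follows. The pre-stored per-cluster posteriors, the weights, and the input are invented numbers; only the combination rule follows the text: each per-class posterior is a weighted sum of per-cluster posteriors, accumulated (here in the log domain) over all input vectors, and the class with the largest posterior wins:

```python
import math

def classify(per_cluster_post, weights):
    # per_cluster_post[k][m] = pre-stored P(class m | cluster k),
    # weights[n][k] = contribution of cluster k for input vector n.
    num_classes = len(per_cluster_post[0])
    log_post = [0.0] * num_classes
    for lam in weights:  # one weight row per input vector
        for m in range(num_classes):
            p = sum(w * per_cluster_post[k][m] for k, w in enumerate(lam))
            log_post[m] += math.log(p)
    # Classify to the class with the largest accumulated posterior.
    return max(range(num_classes), key=lambda m: log_post[m])

# Two clusters, two classes; cluster 0 favors class 0, cluster 1 class 1.
per_cluster_post = [[0.9, 0.1], [0.2, 0.8]]
weights = [[0.7, 0.3], [0.8, 0.2]]  # both input vectors lie near cluster 0
print(classify(per_cluster_post, weights))  # -> 0
```

Note that only the small table of per-cluster posteriors is consulted at classification time, not the raw training samples, which is where the complexity advantage over plain k-Nearest-Neighbor comes from.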
  • a classification device for non-parametric audio classification.
  • the classification device comprises processing circuitry.
  • the processing circuitry is configured to cause the classification device to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors.
  • the processing circuitry is configured to cause the classification device to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property.
  • the processing circuitry is configured to cause the classification device to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
  • a classification device for non-parametric audio classification comprises processing circuitry and a computer program product.
  • the computer program product stores instructions that, when executed by the processing circuitry, cause the classification device to perform steps, or operations.
  • the steps, or operations cause the classification device to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors.
  • the steps, or operations cause the classification device to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property.
  • the steps, or operations cause the classification device to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
  • a classification device for non-parametric audio classification.
  • the classification device comprises an obtain module configured to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors.
  • the classification device comprises a determine module configured to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property.
  • the classification device comprises a classify module configured to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
  • a computer program for non-parametric audio classification comprising computer program code which, when run on a classification device, causes the classification device to perform a method according to the first aspect.
  • a computer program product comprising a computer program according to the fifth aspect and a computer readable storage medium on which the computer program is stored.
  • any advantage of the first aspect may equally apply to the second, third, fourth, fifth, and/or sixth aspect, respectively, and vice versa.
  • Other objectives, features and advantages of the enclosed embodiments will be apparent from the following detailed disclosure, from the attached dependent claims as well as from the drawings.
  • Fig. 1 is a schematic diagram illustrating probability density functions of two classes
  • Fig. 2 is a schematic block diagram of a classification device according to an embodiment
  • Fig. 3 is a schematic diagram showing functional units of a classification device according to an embodiment
  • Fig. 4 is a schematic diagram showing functional modules of a classification device according to an embodiment
  • Figs. 5 and 6 are flowcharts of methods according to embodiments.
  • Fig. 7 shows one example of a computer program product comprising a computer readable storage medium according to an embodiment.
  • In order to obtain non-parametric audio classification there is provided a classification device 200, a method performed by the classification device 200, and a computer program product comprising code, for example in the form of a computer program, that when run on the classification device 200, causes the classification device 200 to perform the method.
  • Fig. 1 at 100 and 110 illustrates probability density functions of two classes ω_1 and ω_2.
  • a data point x_n = 4 (as identified by reference numeral 130 in Fig. 1) is observed.
  • the task of the classification device is to assign the data point x_n to one of the classes ω_1 and ω_2, without any assumption of the underlying form of the probability density functions 100, 110. Further explanation of Fig. 1 will be provided below.
  • Fig. 2 is a schematic block diagram of a classification device 200 according to an embodiment.
  • the classification device 200 comprises a classification module 210 and an optional training module 220.
  • When the training module 220 is not present, it is assumed that the classification module 210 is provided with values determined by an external training module 220.
  • Figs. 5 and 6 are flow charts illustrating embodiments of methods for non-parametric audio classification. The methods are performed by the classification device 200.
  • the methods are advantageously provided as computer programs 320.
  • Fig. 5 illustrates a method for non-parametric audio classification as performed by the classification device 200 according to an embodiment.
  • Step S102: The classification device 200 obtains a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors x_n.
  • Step S104: The classification device 200 determines per-class posterior probabilities P(ω_m|x) for at least two classes ω_m. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities, and each class represents a unique audio classification property.
  • Step S106: The classification device 200 classifies the input sequence x to belong to the class ω_m for which the per-class posterior probability P(ω_m|x) is largest.
  • Fig. 6 illustrates methods for non-parametric audio classification as performed by the classification device 200 according to further embodiments. Steps S102, S104, S106 are performed as in Fig. 5 and a repeated description of those steps is therefore omitted.
  • the weighted sum P(ω_m|x_n) is determined using cluster contribution weights λ_k.
  • the cluster contribution weights λ_k are defined by distances Δ_k between the input sequence x and a set of clusters c_k.
  • the weighted sums P(ω_m|x_n), when summed over all input vectors x_n, define the per-class posterior probability P(ω_m|x) for this given class ω_m.
  • each class ω_m represents a unique language, a unique speaker, or a unique gender.
  • the classification device 200 can obtain the short-term frequency representation of the multimedia signal as in step S102.
  • the short-term frequency representation is provided by mel-frequency cepstral coefficients (MFCCs).
  • MFCCs are coefficients that collectively make up a mel-frequency cepstrum (MFC).
  • MFC is a representation of the short-term power spectrum of the audio waveform, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
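The nonlinear mel scale referred to above can be sketched as follows. The 2595/700 constants are the common HTK-style convention, an assumption here, since the text does not fix a particular mel formula:

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel mapping (assumed convention):
    # mel = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# By construction of the constants, 1000 Hz maps to roughly 1000 mel,
# while higher frequencies are compressed.
print(round(hz_to_mel(1000.0)))  # close to 1000
print(round(hz_to_mel(8000.0)))  # far less than 8000
```

This compression is what makes the MFC a perceptually motivated representation of the short-term power spectrum.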
  • the classification device 200 can obtain the MFCCs.
  • the MFCCs are made readily available to the classification device 200.
  • this embodiment requires a module configured to provide the MFCCs from the audio waveform.
  • the classification device 200 receives the audio waveform and extracts the MFCCs from the audio waveform.
  • the classification device 200 is configured to perform step S102a: Step S102a: The classification device 200 extracts the MFCCs from the audio waveform. How to extract MFCCs from an audio waveform is as such known in the art and further description thereof is therefore omitted.
  • Step S102a is performed as part of step S102.
  • Each input vector x_n can then correspond to a vector of MFCCs. Assuming that the audio waveform is composed of frames, there is then one vector of MFCCs per frame.
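The one-vector-per-frame layout can be sketched as follows. The frame length, hop size, and the placeholder log-energy feature are invented; a real system would compute a full MFCC vector per frame:

```python
import math

def frame_features(waveform, frame_len=160, hop=80):
    # Split a waveform into overlapping frames and compute one feature
    # vector per frame. A trivial [log-energy] vector stands in for a
    # real MFCC vector; the point is the layout of the input sequence
    # x = (x_1, ..., x_N), one vector x_n per frame.
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    return [[math.log(sum(s * s for s in f) + 1e-12)] for f in frames]

# Invented 100 ms test tone at an assumed 8 kHz sampling rate.
waveform = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
x = frame_features(waveform)
print(len(x))  # number of input vectors (one per frame)
```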
  • the audio waveform represents a speech signal.
  • step S102 of obtaining, the step S104 of determining, and the step S106 of classifying are performed in an audio mining application.
  • the methods for non-parametric audio classification comprise a training stage and a classification stage.
  • the per-cluster posterior probabilities P(ω_m|μ_k) are determined through training on a training sequence y of MFCCs.
  • the training could be based on k-means clustering of the training sequence y. Further details thereof will now be disclosed.
  • Z_k denotes the set of points that belong to cluster c_k (that is, Z_k is a sub-set of y in which every point is closer to μ_k than to any other mean μ_j).
  • the median absolute deviation factor ρ_k is a statistic, robust to outliers, that captures variations inside cluster c_k.
  • each per-cluster posterior probability P(ω_m|μ_k) for a particular class ω_m and a particular cluster centroid μ_k represents a conditional probability of the particular class ω_m given the particular cluster centroid μ_k.
  • This correct class is denoted ω_m*.
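The training stage described above, k-means clustering of a labeled training sequence followed by per-cluster posterior estimation, can be sketched as follows. The 1-D data, labels, cluster count, and iteration budget are invented for illustration:

```python
import random

def kmeans(points, K, iters=20, seed=0):
    # Plain k-means on 1-D points: assign each point to its nearest
    # mean, recompute each mean, repeat for a fixed iteration budget.
    rng = random.Random(seed)
    means = rng.sample(points, K)
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for p in points:
            clusters[min(range(K), key=lambda k: abs(p - means[k]))].append(p)
        means = [sum(c) / len(c) if c else means[k]
                 for k, c in enumerate(clusters)]
    return means

def per_cluster_posteriors(points, labels, means, num_classes):
    # P(class m | cluster k): fraction of the points assigned to
    # cluster k whose training label is m.
    counts = [[0] * num_classes for _ in means]
    for p, m in zip(points, labels):
        k = min(range(len(means)), key=lambda k: abs(p - means[k]))
        counts[k][m] += 1
    return [[c / max(1, sum(row)) for c in row] for row in counts]

# Invented labeled training sequence: class 0 near 1.0, class 1 near 5.0.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = [0, 0, 0, 1, 1, 1]
means = kmeans(points, 2)
print(sorted(per_cluster_posteriors(points, labels, means, 2)))
```

Only the centroids, the spread factors, and this small posterior table need to be pre-stored for the classification stage; the training sequence itself can be discarded.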
  • the classification device 200 is configured to determine a set of distances Δ_k from each data point x_n in the input sequence x to all clusters c_k. According to an embodiment there is one cluster contribution weight λ_k for each input vector x_n. The classification device 200 is then configured to determine the cluster contribution weights for input vector x_n by performing step S104a:
  • Step S104a: The classification device 200 determines one distance Δ_k between the input vector x_n and each of the clusters c_k. Step S104a is performed as part of step S104.
  • each of the distances Δ_k is made inversely proportional to the median absolute deviation factor ρ_k relating to a spread of points inside cluster c_k.
  • the distance between point x_n and cluster c_k is defined as:
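A sketch of such a spread-scaled distance follows. The exact formula is not reproduced in this text, so dividing a plain distance by the cluster's median absolute deviation is an assumed form, chosen only to match the stated inverse proportionality:

```python
import statistics

def mad(cluster_points, mu):
    # Median absolute deviation of a cluster's points around its
    # centroid: a spread statistic robust to outliers, standing in
    # for the text's factor rho_k.
    return statistics.median(abs(p - mu) for p in cluster_points)

def scaled_distance(x_n, mu_k, rho_k):
    # Assumed form: plain distance divided by the cluster's spread
    # factor, making the result inversely proportional to rho_k.
    return abs(x_n - mu_k) / rho_k

cluster = [4.0, 4.5, 5.0, 5.5, 6.0]   # invented points of one cluster
rho = mad(cluster, 5.0)               # 0.5
print(scaled_distance(4.0, 5.0, rho))  # -> 2.0
```

The effect of the scaling is that a wide, diffuse cluster appears "closer" than a tight one at equal raw distance, which is the behavior the text describes.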
  • the classification device 200 is configured to determine the weighted sum by performing step S104b:
  • Step S104b: The classification device 200 determines one cluster contribution weight λ_k for each distance Δ_k. The cluster contribution weight λ_k is inversely proportional to the distance Δ_k from which it is determined. Step S104b is performed as part of step S104.
  • each cluster contribution weight λ_k can be inversely proportional to a sum based on all distances Δ_k.
  • the cluster contribution weights λ_k are determined as:
  • The expression involves an expansion constant. Values that could be used for the expansion constant are between 2 and 8, based on what property the non-parametric audio classification is to classify and on the dimensionality of the input sequence x.
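One way to realize such weights is sketched below. The exact expression is not reproduced in this text, so raising each inverse distance to an exponent gamma (standing in for the unnamed expansion constant) and normalizing over all distances is an assumption consistent with the surrounding description:

```python
def contribution_weights(distances, gamma=4):
    # Each weight is inversely proportional to its distance raised to
    # an assumed expansion constant gamma (the text suggests values in
    # the 2..8 range); the sum over all distances normalizes the
    # weights so they add up to one.
    inv = [1.0 / (d ** gamma) for d in distances]
    total = sum(inv)
    return [w / total for w in inv]

lam = contribution_weights([0.5, 1.0, 2.0, 4.0])
print([round(w, 4) for w in lam])  # the nearest cluster dominates
```

A larger expansion constant sharpens the weighting toward the single nearest cluster, while a smaller one lets several clusters contribute.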
  • the classification device 200 is configured to estimate the posterior probabilities at point x_n from the pre-stored per-cluster posterior probabilities P(ω_m|μ_k) and the cluster contribution weights λ_k, as: P(ω_m|x_n) = Σ_k λ_k · P(ω_m|μ_k).
  • Only the information determined during the training stage, together with the cluster contribution weights, is used to estimate these posterior probabilities.
  • the classification device 200 is configured to determine the per-class posterior probabilities P(ω_m|x) for a given class ω_m by performing step S104c:
  • Step S104c: The classification device 200 sums the weighted sum P(ω_m|x_n) over all input vectors x_n for said given class ω_m. Step S104c is performed as part of step S104. Logarithmic values of the weighted sum P(ω_m|x_n) can be summed over all input vectors x_n to determine a logarithmic value of the per-class posterior probability P(ω_m|x) for said given class ω_m. That is, to assign the optimal class ω_m* to the input sequence x, the classification device 200 can be configured to determine the log-probability of the entire input sequence x (over all n observations) as: log P(ω_m|x) = Σ_n log P(ω_m|x_n).
  • the classification device 200 is then configured to determine the class ω_m that corresponds to the largest posterior as:
  • ω_m* = argmax_m { log P(ω_m|x) }
  • the optimal class ω_m* is thus found as the class ω_m with the largest posterior probability, given the input sequence x.
  • a training sequence y representing data that belong to one of two classes ω_1, ω_2 is generated, as illustrated by the probability density functions 100 and 110 in Fig. 1, where probability density function 100 represents class ω_1, and where probability density function 110 represents class ω_2.
  • the cluster centers μ_1, μ_2, μ_3, and μ_4 are identified at reference numerals 120a, 120b, 120c, and 120d and have values (from left to right in Fig. 1) 4.91, 8.89, 10.99, and 15.09, which are also reflected in the first column of Table 1.
  • the training stage of the presented algorithm obtains four clusters, and the corresponding posterior probabilities P(ω_m|μ_k) as provided in Table 1.
  • Table 1 Clusters and pre-stored posterior probabilities obtained from a training sequence.
  • a data point x_n = 4 (as identified by reference numeral 130 in Fig. 1) is observed and the task of the classification device 200 is to assign the data point x_n to one of the classes ω_1 and ω_2.
  • Evaluating the posterior probabilities as disclosed above yields, for one of the classes, the value 0.1509.
  • Fig. 3 schematically illustrates, in terms of a number of functional units, the components of a classification device 200 according to an embodiment.
  • Processing circuitry 310 is provided using any combination of one or more of a suitable central processing unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), etc., capable of executing software instructions stored in a computer program product 710 (as in Fig. 7), e.g. in the form of a storage medium 330.
  • the processing circuitry 310 may further be provided as at least one application specific integrated circuit (ASIC), or field programmable gate array (FPGA).
  • the processing circuitry 310 is configured to cause the classification device 200 to perform a set of operations, or steps, S102-S106, as disclosed above.
  • the storage medium 330 may store the set of operations.
  • the processing circuitry 310 may be configured to retrieve the set of operations from the storage medium 330 to cause the classification device 200 to perform the set of operations.
  • the set of operations may be provided as a set of executable instructions.
  • the processing circuitry 310 is thereby arranged to execute methods as herein disclosed.
  • the storage medium 330 may also comprise persistent storage, which, for example, can be any single one or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory.
  • the classification device 200 may further comprise a communications interface 320 configured for communications with another device, for example to obtain the MFCCs as in step S102 and to provide a result of the classification as performed in step S106.
  • the communications interface 320 may comprise one or more transmitters and receivers, comprising analogue and digital components.
  • the processing circuitry 310 controls the general operation of the classification device 200 e.g. by sending data and control signals to the communications interface 320 and the storage medium 330, by receiving data and reports from the communications interface 320, and by retrieving data and instructions from the storage medium 330.
  • Other components, as well as the related functionality, of the classification device 200 are omitted in order not to obscure the concepts presented herein.
  • Fig. 4 schematically illustrates, in terms of a number of functional modules, the components of a classification device 200 according to an embodiment.
  • the classification device 200 of Fig. 4 comprises a number of functional modules: an obtain module 310a configured to perform step S102, a determine module 310b configured to perform step S104, and a classify module 310c configured to perform step S106.
  • the classification device 200 of Fig. 4 may further comprise a number of optional functional modules, such as any of a determine module 310d configured to perform step S104a, a determine module 310e configured to perform step S104b, a sum module 310f configured to perform step S104c, and an extract module 310g configured to perform step S102a.
  • each functional module 310a-310g may in one embodiment be implemented only in hardware and in another embodiment with the help of software, i.e., the latter embodiment having computer program instructions stored on the storage medium 330 which, when run on the processing circuitry, make the classification device 200 perform the corresponding steps mentioned above in conjunction with Fig. 4.
  • To the extent that the functional modules correspond to parts of a computer program, they do not need to be separate modules therein, but the way in which they are implemented in software is dependent on the programming language used.
  • one or more or all functional modules 310a-310g may be implemented by the processing circuitry 310, possibly in cooperation with functional units 320 and/or 330.
  • the processing circuitry 310 may thus be configured to fetch, from the storage medium 330, instructions as provided by a functional module 310a-310g and to execute these instructions, thereby performing any steps as will be disclosed herein.
  • the classification device 200 may be provided as a standalone device or as a part of at least one further device.
  • the classification device 200 may be provided in an audio mining device.
  • functionality of the classification device 200 may be distributed between at least two devices, or nodes. These at least two nodes, or devices, may either be part of the same network part or may be spread between at least two such network parts.
  • a first portion of the instructions performed by the classification device 200 may be executed in a first device, and a second portion of the instructions performed by the classification device 200 may be executed in a second device; the herein disclosed embodiments are not limited to any particular number of devices on which the instructions performed by the classification device 200 may be executed.
  • the methods according to the herein disclosed embodiments are suitable to be performed by a classification device 200 residing in a cloud computational environment. Therefore, although a single processing circuitry 310 is illustrated in Fig. 3 the processing circuitry 310 may be distributed among a plurality of devices, or nodes. The same applies to the functional modules 310a-310g of Fig. 4 and the computer program 720 of Fig. 7 (see below).
  • Fig. 7 shows one example of a computer program product 710 comprising a computer readable storage medium 730.
  • a computer program 720 can be stored, which computer program 720 can cause the processing circuitry 310 and thereto operatively coupled entities and devices, such as the communications interface 320 and the storage medium 330, to execute methods according to embodiments described herein.
  • the computer program 720 and/or computer program product 710 may thus provide means for performing any steps as herein disclosed.
  • the computer program product 710 is illustrated as an optical disc, such as a CD (compact disc) or a DVD (digital versatile disc) or a Blu-Ray disc.
  • the computer program product 710 could also be embodied as a memory, such as a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM) and more particularly as a non-volatile storage medium of a device in an external memory such as a USB (Universal Serial Bus) memory or a Flash memory, such as a compact Flash memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

There is provided mechanisms for non-parametric audio classification. A method is performed by a classification device. The method comprises obtaining a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The method comprises determining per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The method comprises classifying the input sequence to belong to the class for which the per-class posterior probability is largest.

Description

NON-PARAMETRIC AUDIO CLASSIFICATION
TECHNICAL FIELD
Embodiments presented herein relate to a method, a classification device, a computer program, and a computer program product for non-parametric audio classification.
BACKGROUND
In general terms, audio mining is a technique by which the content of an audio signal (comprising an audio waveform) can be automatically analyzed and searched. It is commonly used in the field of automatic speech
recognition, where the analysis tries to identify any speech within the audio. The audio signal will typically be processed by a speech recognition system in order to identify word or phoneme units that are likely to occur in the spoken content. In turn, this information can be used to identify a language used in the audio signal, which speaker is producing the audio waveform, the gender of the speaker producing the audio waveform, etc. This information may either be used immediately in pre-defined searches for keywords, languages, speakers, gender (a real-time word spotting system), or the output of the speech recognizer may be stored in an index file. One or more audio mining index files can then be loaded at a later date in order to run searches for any of the above parameters (keywords, languages, speakers, gender, etc.).
The parameters can be represented by classes. That is, assuming that the audio signal is to be classified in terms of language, there may be a set of classes where each class represents a unique language, and where the classification intends to determine which one of these languages is used in the audio signal.
Generally, probabilistic techniques attempt to discover the unknown class by estimating a probability density function. In general terms, there are two major classes of such techniques, namely parametric (also known as model-based) techniques and non-parametric techniques. Generally, parametric techniques assume a known form of the underlying probability density function and adjust the model parameters to available training data. This technique has low computational and storage requirements, and it can be applied with a limited amount of training data. One disadvantage is that the form of the underlying probability density function is not known in most practical applications, and therefore a mismatch between the assumed form of the probability density function and the true form of the probability density function might occur.
Generally, non-parametric techniques have no prior assumption of the form of the underlying probability density function, and therefore do not suffer from the above-mentioned model mismatch. For example, a k-Nearest-Neighbor approach attempts to estimate posterior probabilities P(ω_m|x) for an unlabeled observation point x from a set of L pre-stored labeled training samples in the following way. In a first step, a cell is centered around x and grown until the cell captures the k nearest neighbors of x. For example, k could be selected as √L. In a second step, if k_m of these samples are labeled ω_m, then the posterior probabilities are estimated as:
P(ω_m|x) = k_m / k,  m = 1, …, M    (1)
For a reliable estimate, k has to be large and all k neighbors have to be close to x. This could be achieved by pre-storing large amounts of labeled training data, i.e., L→∞. Hence, non-parametric techniques come with high computational complexity and storage requirements, which are prohibitive for many practical applications.
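For illustration only, the k-Nearest-Neighbor estimate of Equation (1) may be sketched as follows. All names are illustrative and not part of the herein disclosed embodiments:

```python
import numpy as np

def knn_posteriors(x, train_x, train_labels, num_classes, k=None):
    """Estimate P(w_m | x) as k_m / k from the k nearest labeled samples."""
    L = len(train_x)
    if k is None:
        k = max(1, int(np.sqrt(L)))  # k = sqrt(L), one common choice
    dists = np.linalg.norm(train_x - x, axis=1)   # grow a cell around x
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    counts = np.bincount(train_labels[nearest], minlength=num_classes)
    return counts / k                             # eq. (1): P(w_m | x) = k_m / k
```

Note that the entire labeled training set must be stored and searched for every observation point, which illustrates the complexity and storage issue discussed above.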
Hence, there is still a need for an improved classification of audio.

SUMMARY
An object of embodiments herein is to provide efficient audio classification.
According to a first aspect there is presented a method for non-parametric audio classification. The method is performed by a classification device. The method comprises obtaining a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The method comprises determining per- class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The method comprises classifying the input sequence to belong to the class for which the per-class posterior probability is largest.
Advantageously this provides efficient audio classification.
Advantageously this method is flexible with respect to the probability density function of the short-term frequency representation of an audio waveform by using a non-parametric estimation technique whilst only requiring low complexity requirements comparable to the complexity requirements of parametric based estimation techniques.
According to a second aspect there is presented a classification device for non-parametric audio classification. The classification device comprises processing circuitry. The processing circuitry is configured to cause the classification device to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The processing circuitry is configured to cause the classification device to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The processing circuitry is configured to cause the classification device to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
Advantageously the proposed classification device requires low
computational effort for performing the non-parametric audio classification and is therefore practically implementable. According to a third aspect there is presented a classification device for non- parametric audio classification. The classification device comprises processing circuitry and a computer program product. The computer program product stores instructions that, when executed by the processing circuitry, causes the classification device to perform steps, or operations. The steps, or operations, cause the classification device to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The steps, or operations, cause the classification device to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The steps, or operations, cause the classification device to classify the input sequence to belong to the class for which the per- class posterior probability is largest.
According to a fourth aspect there is presented a classification device for non- parametric audio classification. The classification device comprises an obtain module configured to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence divided into input vectors. The classification device comprises a determine module configured to determine per-class posterior probabilities for at least two classes. Each per-class posterior probability is based on a weighted sum of pre-stored per-cluster posterior probabilities for the at least two classes. Each class represents a unique audio classification property. The classification device comprises a classify module configured to classify the input sequence to belong to the class for which the per-class posterior probability is largest.
According to a fifth aspect there is presented a computer program for non- parametric audio classification, the computer program comprising computer program code which, when run on a classification device, causes the classification device to perform a method according to the first aspect. According to a sixth aspect there is presented a computer program product comprising a computer program according to the fifth aspect and a computer readable storage medium on which the computer program is stored.
It is to be noted that any feature of the first, second, third, fourth, fifth and sixth aspects may be applied to any other aspect, wherever appropriate.
Likewise, any advantage of the first aspect may equally apply to the second, third, fourth, fifth, and/or sixth aspect, respectively, and vice versa. Other objectives, features and advantages of the enclosed embodiments will be apparent from the following detailed disclosure, from the attached dependent claims as well as from the drawings.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the element, apparatus, component, means, step, etc." are to be interpreted openly as referring to at least one instance of the element, apparatus, component, means, step, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.
BRIEF DESCRIPTION OF THE DRAWINGS
The inventive concept is now described, by way of example, with reference to the accompanying drawings, in which:
Fig. l is a schematic diagram illustrating probability density functions of two classes;
Fig. 2 is a schematic block diagram of a classification device according to an embodiment;
Fig. 3 is a schematic diagram showing functional units of a classification device according to an embodiment;
Fig. 4 is a schematic diagram showing functional modules of a classification device according to an embodiment; Figs. 5 and 6 are flowcharts of methods according to embodiments; and
Fig. 7 shows one example of a computer program product comprising computer readable storage medium according to an embodiment.
DETAILED DESCRIPTION
The inventive concept will now be described more fully hereinafter with reference to the accompanying drawings, in which certain embodiments of the inventive concept are shown. This inventive concept may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided by way of example so that this disclosure will be thorough and complete, and will fully convey the scope of the inventive concept to those skilled in the art. Like numbers refer to like elements throughout the description. Any step or feature illustrated by dashed lines should be regarded as optional.
The embodiments disclosed herein relate to non-parametric audio
classification. In order to obtain non-parametric audio classification there is provided a classification device, a method performed by the classification device, a computer program product comprising code, for example in the form of a computer program, that when run on a classification device, causes the classification device 200 to perform the method.
As an introductory example, reference is now made to Fig. 1. Fig. 1 illustrates at 100 and 110 probability density functions of two classes ω1 and ω2. Assume that a data point xn = 4 (as identified by reference numeral 130 in Fig. 1) is observed. The task of the classification device is to assign the data point xn to one of the classes ω1 and ω2, without having any assumption of the underlying form of the probability density functions 100, 110. Further explanation of Fig. 1 will be provided below.
Reference is now made to Fig. 2. Fig. 2 is a schematic block diagram of a classification device 200 according to an embodiment. According to the embodiment of Fig. 2, the classification device 200 comprises a classification module 210 and an optional training module 220. In embodiments where the training module 220 is not present it is assumed that the classification module 210 is provided with values determined by an external training module 220.
Figs. 5 and 6 are flow charts illustrating embodiments of methods for non- parametric audio classification. The methods are performed by the classification device 200. The methods are advantageously provided as computer programs 720.
Reference is now made to Fig. 5 illustrating a method for non-parametric audio classification as performed by the classification device 200 according to an embodiment.
Step S102: The classification device 200 obtains a short-term frequency representation of an audio waveform. The short-term frequency representation defines an input sequence x which is assumed to be divided into input vectors xn. More particularly, the input sequence x can be assumed to consist of N, D-dimensional, vectors and hence be written as x = {xn}, n = 1, ..., N.
Step S104: The classification device 200 determines per-class posterior probabilities P(ωm|x) for at least two classes ωm. Each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes ωm. Each class ωm (in the set of classes {ωm}, m = 1, ..., M) represents a unique audio classification property.

Step S106: The classification device 200 classifies the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
Embodiments relating to further details of non-parametric audio
classification as performed by the classification device 200 will now be disclosed. Reference is made to Fig. 6 illustrating methods for non- parametric audio classification as performed by the classification device 200 according to further embodiments. Steps S102, S104, S106 are performed as in Fig. 5 and a repeated description of those steps is therefore omitted.
According to an embodiment the weighted sum P(ωm|xn) is determined using cluster contribution weights λk. The cluster contribution weights λk are defined by distances Δk between the input sequence x and a set of clusters ck. Each cluster ck belongs to one of the at least two classes ωm. In general there are M classes, and this is formally denoted by a set of classes {ωm}, m = 1, ..., M. For a given class ωm the weighted sums P(ωm|xn), when summed over all input vectors xn, define the per-class posterior probability P(ωm|x) for this given class ωm.
There could be different properties that the at least two classes ωm represent. For example, the non-parametric audio classification can be performed to classify languages, speakers, and/or genders. Hence, according to an embodiment each class ωm represents a unique language, a unique speaker, or a unique gender.
There could be different ways for the classification device 200 to obtain the short-term frequency representation of the audio waveform as in step S102. According to an embodiment the short-term frequency representation is provided by mel-frequency cepstral components (MFCCs). In this respect, the MFCCs are coefficients that collectively make up a mel-frequency cepstrum (MFC). The MFC is a representation of the short-term power spectrum of the audio waveform, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

There could be different ways for the classification device 200 to obtain the MFCCs. According to one embodiment the MFCCs are made readily available to the classification device 200. Hence, this embodiment requires a module configured to provide the MFCCs from the audio waveform. According to another embodiment the classification device 200 receives the audio waveform and extracts the MFCCs from the audio waveform. Hence, according to an embodiment the classification device 200 is configured to perform step S102a:

Step S102a: The classification device 200 extracts the MFCCs from the audio waveform. How to extract MFCCs from an audio waveform is as such known in the art and further description thereof is therefore omitted. Step S102a is performed as part of step S102. Each input vector xn can then correspond to a vector of MFCCs. Assuming that the audio waveform is composed of frames, there is then one vector of MFCCs per frame.
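For orientation only, the MFCC extraction of step S102a may be sketched as follows. This is a simplified, hypothetical single-frame implementation; the filterbank size, FFT length, and the omission of pre-emphasis and liftering are illustrative choices, not requirements of the herein disclosed embodiments:

```python
import numpy as np

def mfcc(frame, sr=16000, n_mels=20, n_ceps=13):
    """Simplified MFCC of one frame: windowed power spectrum -> triangular
    mel filterbank -> log -> DCT-II. Omits pre-emphasis and liftering."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n))) ** 2

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters equally spaced on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, len(spec)))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_mel = np.log(fbank @ spec + 1e-10)

    # DCT-II of the log mel energies gives the cepstral coefficients
    i = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (i[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return dct @ log_mel
```

Applying this per frame yields one vector of MFCCs per frame, i.e., one input vector xn per frame.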
There could be different types of audio waveforms. According to an
embodiment the audio waveform represents a speech signal.
There could be different types of audio classification of which the herein disclosed methods for non-parametric audio classification could be part. According to an embodiment the step S102 of obtaining, the step S104 of determining, and the step S106 of classifying are performed in an audio mining application.
According to an embodiment the methods for non-parametric audio classification comprise a training stage and a classification stage.
Aspects of the training stage, as implemented by the training module 220, will now be disclosed in detail.
According to an embodiment the per-cluster posterior probabilities P(ωm|μk) are determined through training on a training sequence y of MFCCs.
The training could be based on k-means clustering of the training sequence y. Further details thereof will now be disclosed. Let the training sequence y consist of L, D-dimensional, vectors y = {yl}, l = 1, ..., L, where each vector yl is linked to its corresponding class ωm. First the training sequence y is by the classification device 200 organized in K clusters {ck}, k = 1, ..., K, by means of a k-means clustering algorithm. This results in a codebook with K, D-dimensional, centroids μk. In addition to its centroid, each cluster ck is determined by its squared median absolute deviation factor pk, which the classification device 200 determines as:

pk = median(|Zk − median(Zk)|)² (2)

Here, Zk denotes the set of points that belong to cluster ck (that is, Zk is a sub-set of y in which every point is closer to μk than to any other centroid μj, j ≠ k). In general, the median absolute deviation factor pk is a statistic, robust to outliers, that captures variations inside cluster ck. Hence, according to an embodiment where each cluster ck has a cluster centroid μk, each per-cluster posterior probability P(ωm|μk) for a particular class ωm and a particular cluster centroid μk represents a conditional probability of the particular class ωm given the particular cluster centroid μk.
After the set of clusters ck ≡ {μk, pk}, for k = 1, ..., K, is determined, the classification device 200 links each of the clusters ck to a table of posterior probabilities for each of the M classes, P(ωm|μk), m = 1, ..., M. In order to do so the classification device 200 implements the following operations:

    for k = 1, ..., K
        determine the size of Zk and store it in a variable Lk
        for m = 1, ..., M
            count the number of labels of class ωm and denote it as Lk,m
            store P(ωm|μk) as: P(ωm|μk) = Lk,m / Lk
        end for
    end for
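The training-stage operations above may, for illustration, be sketched as follows. The k-means initialization and the application of Equation (2) over all coordinates of a cluster are hypothetical implementation choices; all names are illustrative:

```python
import numpy as np

def train_clusters(y, labels, K, M, iters=20, seed=0):
    """Training-stage sketch: k-means codebook, squared MAD factor p_k (eq. 2),
    and per-cluster posterior table P(w_m | mu_k) = L_{k,m} / L_k."""
    rng = np.random.default_rng(seed)
    mu = y[rng.choice(len(y), size=K, replace=False)].copy()
    for _ in range(iters):
        # assign each training vector to its nearest centroid, then update
        assign = np.argmin(((y[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(assign == k):
                mu[k] = y[assign == k].mean(axis=0)
    p = np.empty(K)
    post = np.zeros((K, M))              # table of P(w_m | mu_k)
    for k in range(K):
        Zk = y[assign == k]              # points closest to centroid mu_k
        dev = np.abs(Zk - np.median(Zk, axis=0))
        p[k] = np.median(dev) ** 2       # squared median absolute deviation, eq. (2)
        Lk = len(Zk)
        for m in range(M):
            post[k, m] = np.count_nonzero(labels[assign == k] == m) / Lk
    return mu, p, post
```

Only the K centroids, the K deviation factors, and the K-by-M posterior table need to be stored for the classification stage, in contrast to the full training set required by the k-Nearest-Neighbor approach.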
Aspects of the classification stage, as implemented by the classification module 210, will now be disclosed in detail.
The classification device 200 is configured to, during the classification stage, assign an unlabeled sequence, as defined by the input sequence x as obtained in step S102, to the correct class from the at least two classes {ωm}, m = 1, ..., M. This correct class is denoted ωm*. As above, the input sequence x is assumed to consist of N, D-dimensional, vectors x = {xn}, n = 1, ..., N.
The classification device 200 is configured to determine a set of distances Δk from each data point xn in the input sequence x to all clusters ck. According to an embodiment there is one cluster contribution weight λk for each input vector xn. The classification device 200 is then configured to determine the cluster contribution weights λk for input vector xn by performing step S104a:

Step S104a: The classification device 200 determines one distance Δk between the input vector xn and each of the clusters ck. Step S104a is performed as part of step S104.
Further, according to an embodiment each of the distances Δk is made inversely proportional to the median absolute deviation factor pk relating to a spread of points inside cluster ck. According to some aspects the distance between point xn and cluster ck is defined as:

Δk(xn, ck) = ||xn − μk||² / pk (3)

For data point xn this results in K distances {Δk}, k = 1, ..., K. The inverse of these distances can be used to weigh the contribution of the clusters in the probability density function estimation at point xn. The distances {Δk} from the data point xn to all clusters ck are used to create a weighted sum of the per-cluster posterior probabilities. This weighted sum serves as an estimate of the posterior probabilities at the observation point xn. According to an embodiment the classification device 200 is configured to determine the weighted sum P(ωm|xn) by performing step S104b:

Step S104b: The classification device 200 determines one cluster contribution weight λk for each distance Δk. The cluster contribution weight λk is inversely proportional to the distance Δk from which it is determined. Step S104b is performed as part of step S104. Further, each cluster contribution weight λk can be inversely proportional to a sum based on all distances Δk. According to some aspects the cluster contribution weights λk are determined as:
λk = Δk^(−η) / Σj=1,...,K Δj^(−η) (4)
Here, η is an expansion constant. Values that could be used for the expansion constant η are between 2 and 8, based on what property the non-parametric audio classification is to be performed to classify and on the dimensionality of the input sequence x.
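Steps S104a and S104b may, for one input vector xn, be sketched as follows. The weights are assumed to be formed by normalizing the inverse-power distances Δk^(−η) over all clusters, an interpretation consistent with the worked example of Fig. 1; all names are illustrative:

```python
import numpy as np

def cluster_weights(x_n, mu, p, eta=2.0):
    """Distances to all clusters (eq. 3) and contribution weights (eq. 4).

    x_n: (D,) input vector; mu: (K, D) centroids; p: (K,) MAD factors."""
    delta = ((x_n - mu) ** 2).sum(axis=-1) / p   # eq. (3): ||x_n - mu_k||^2 / p_k
    inv = delta ** (-eta)                        # closer clusters contribute more
    return delta, inv / inv.sum()                # eq. (4): weights sum to one
```

The normalization makes the weights of all K clusters sum to one, so the weighted sum of per-cluster posteriors in Equation (5) is itself a valid probability distribution over the M classes.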
According to some aspects the classification device 200 is configured to estimate the posterior probabilities at point xn from the pre-stored per-cluster posterior probabilities P(ωm|μk) and the cluster contribution weights λk, as:

P(ωm|xn) = Σk=1,...,K λk P(ωm|μk), m = 1, ..., M (5)

It is here noted that Lk,m/Lk as determined during the training stage is used to represent P(ωm|μk).
Further, according to an embodiment there is one weighted sum P(ωm|xn) per class ωm and per input vector xn, and all weighted sums P(ωm|xn) for one class ωm are combined over all input vectors xn to define one per-class posterior probability P(ωm|x).
According to an embodiment the classification device 200 is configured to determine the per-class posterior probability P(ωm|x) for a given class ωm by performing step S104c:

Step S104c: The classification device 200 sums the weighted sum P(ωm|xn) over all input vectors xn for said given class ωm. Step S104c is performed as part of step S104. Logarithmic values of the weighted sum P(ωm|xn) can be summed over all input vectors xn to determine a logarithmic value of the per-class posterior probability P(ωm|x)
for said given class ωm. That is, to assign the optimal class ωm* to the input sequence x, the classification device 200 can be configured to determine the log-probability of the entire input sequence x (over all N observations) as follows:

log(P(ωm|x)) = Σn=1,...,N log(P(ωm|xn)) (6)

The classification device 200 is then configured to determine the class ωm that corresponds to the largest posterior as:

ωm* = argmax over ωm of { log(P(ωm|x)) } (7)

The optimal class ωm* is thus found as the class ωm with the largest posterior probability, given the input sequence x.
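Putting the classification stage together, Equations (5), (6) and (7) may be sketched as follows. The cluster contribution weights are formed by normalizing the inverse-power distances, an interpretation of Equation (4) consistent with the worked example; all names are illustrative:

```python
import numpy as np

def classify(x_seq, mu, p, post, eta=2.0):
    """Classification-stage sketch: per-vector posteriors (5), log-sum (6), argmax (7).

    x_seq: (N, D) input vectors; mu: (K, D) centroids; p: (K,) MAD factors;
    post:  (K, M) pre-stored table of P(w_m | mu_k) from the training stage."""
    log_p = np.zeros(post.shape[1])
    for x_n in x_seq:
        delta = ((x_n - mu) ** 2).sum(axis=-1) / p   # eq. (3)
        lam = delta ** (-eta)
        lam /= lam.sum()                             # eq. (4)
        log_p += np.log(lam @ post + 1e-300)         # eqs. (5) and (6)
    return int(np.argmax(log_p))                     # eq. (7): index of w_m*
```

Note that the per-vector cost is dominated by the K distance evaluations against the codebook, independently of the size of the original training set.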
A non-limiting illustrative example of at least some of the herein disclosed embodiments will be provided next. A training sequence y representing data that belong to one of two classes ω1, ω2 is generated, as illustrated by the probability density functions 100 and 110 in Fig. 1, where probability density function 100 represents class ω1, and where probability density function 110 represents class ω2. The cluster centers μ1, μ2, μ3, and μ4 are identified at reference numerals 120a, 120b, 120c, and 120d and have values (from left to right in Fig. 1) 4.91, 8.89, 10.99, and 15.09, which are also reflected in the first column of Table 1.
The training stage of the presented algorithm obtains four clusters, and the corresponding posterior probabilities P(ωm|μk) as provided in Table 1.

Clusters                       Posterior probabilities
c1: μ1 = 4.91,  p1 = 1.36      P(ω1|μ1) = 0.85   P(ω2|μ1) = 0.15
c2: μ2 = 8.89,  p2 = 1.71      P(ω1|μ2) = 0.43   P(ω2|μ2) = 0.57
c3: μ3 = 10.99, p3 = 1.39      P(ω1|μ3) = 0.56   P(ω2|μ3) = 0.44
c4: μ4 = 15.09, p4 = 1.71      P(ω1|μ4) = 0.12   P(ω2|μ4) = 0.88

Table 1: Clusters and pre-stored posterior probabilities obtained from a training sequence.
For the classification stage it is assumed that a data point xn = 4 (as identified by reference numeral 130 in Fig. 1) is observed and the task of the classification device 200 is to assign the data point xn to one of the classes ω1 and ω2.
The classification device 200 determines the distances to all clusters by implementing Equation (3), and thus obtains Δk(xn, ck) = {0.6089, 13.9837, 35.1512, 71.9229}. The classification device 200 determines the cluster contributions at the given point xn by implementing Equation (4) and obtains λk = {0.9977, 0.0019, 0.0003, 0.0001}. The classification device 200 determines the two posterior probabilities at point xn by implementing Equation (5) and thus obtains P(ω1|xn) = 0.8491 and P(ω2|xn) = 0.1509. The classification device 200 determines, by implementing Equations (6) and (7), that the point xn = 4 belongs to class ω1.
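For illustration, the numbers in this example can be reproduced with the following sketch; the expansion constant η = 2 in Equation (4) is an assumption that matches the quoted values:

```python
import numpy as np

# Table 1: centroids, squared MAD factors, and per-cluster posteriors
mu = np.array([4.91, 8.89, 10.99, 15.09])
p = np.array([1.36, 1.71, 1.39, 1.71])
post = np.array([[0.85, 0.15],
                 [0.43, 0.57],
                 [0.56, 0.44],
                 [0.12, 0.88]])

x_n = 4.0
delta = (x_n - mu) ** 2 / p              # eq. (3): {0.6089, 13.9837, 35.1512, 71.9229}
lam = delta ** -2.0
lam /= lam.sum()                         # eq. (4): {0.9977, 0.0019, 0.0003, 0.0001}
P = lam @ post                           # eq. (5): P(w1|x_n) = 0.8491, P(w2|x_n) = 0.1509
winner = int(np.argmax(np.log(P)))       # eqs. (6)-(7): class w1 (index 0)
```

The nearest cluster c1 dominates the weighted sum, so the estimated posteriors at xn = 4 essentially follow the table entries for c1.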
Fig. 3 schematically illustrates, in terms of a number of functional units, the components of a classification device 200 according to an embodiment.
Processing circuitry 310 is provided using any combination of one or more of a suitable central processing unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), etc., capable of executing software instructions stored in a computer program product 710 (as in Fig. 7), e.g. in the form of a storage medium 330. The processing circuitry 310 may further be provided as at least one application specific integrated circuit (ASIC), or field
programmable gate array (FPGA). Particularly, the processing circuitry 310 is configured to cause the
classification device 200 to perform a set of operations, or steps, S102-S106, as disclosed above. For example, the storage medium 330 may store the set of operations, and the processing circuitry 310 may be configured to retrieve the set of operations from the storage medium 330 to cause the classification device 200 to perform the set of operations. The set of operations may be provided as a set of executable instructions.
Thus the processing circuitry 310 is thereby arranged to execute methods as herein disclosed. The storage medium 330 may also comprise persistent storage, which, for example, can be any single one or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory. The classification device 200 may further comprise a communications interface 320 configured for communications with another device, for example to obtain the MFCCs as in step S102 and to provide a result of the classification as performed in step S106. As such the
communications interface 320 may comprise one or more transmitters and receivers, comprising analogue and digital components. The processing circuitry 310 controls the general operation of the classification device 200 e.g. by sending data and control signals to the communications interface 320 and the storage medium 330, by receiving data and reports from the communications interface 320, and by retrieving data and instructions from the storage medium 330. Other components, as well as the related
functionality, of the classification device 200 are omitted in order not to obscure the concepts presented herein.
Fig. 4 schematically illustrates, in terms of a number of functional modules, the components of a classification device 200 according to an embodiment. The classification device 200 of Fig. 4 comprises a number of functional modules; an obtain module 310a configured to perform step S102, a determine module 310b configured to perform step S104, and a classify module 310c configured to perform step S106. The classification device 200 of Fig. 4 may further comprise a number of optional functional modules, such as any of a determine module 310d configured to perform step S104a, a determine module 310e configured to perform step S104b, a sum module 310f configured to perform step S104c, and an extract module 310g configured to perform step S102a. In general terms, each functional module 310a-310g may in one embodiment be implemented only in hardware and in another embodiment with the help of software, i.e., the latter embodiment having computer program instructions stored on the storage medium 330 which when run on the processing circuitry make the classification device 200 perform the corresponding steps mentioned above in conjunction with Fig. 4.
It should also be mentioned that even though the modules correspond to parts of a computer program, they do not need to be separate modules therein, but the way in which they are implemented in software is dependent on the programming language used. Preferably, one or more or all functional modules 310a-310g may be implemented by the processing circuitry 310, possibly in cooperation with functional units 320 and/or 330. The processing circuitry 310 may thus be configured to from the storage medium 330 fetch instructions as provided by a functional module 310a-310g and to execute these instructions, thereby performing any steps as disclosed herein.
The classification device 200 may be provided as a standalone device or as a part of at least one further device. For example, the classification device 200 may be provided in an audio mining device. Alternatively, functionality of the classification device 200 may be distributed between at least two devices, or nodes. These at least two nodes, or devices, may either be part of the same network part or may be spread between at least two such network parts.
Thus, a first portion of the instructions performed by the classification device 200 may be executed in a first device, and a second portion of the instructions performed by the classification device 200 may be executed in a second device; the herein disclosed embodiments are not limited to any particular number of devices on which the instructions performed by the classification device 200 may be executed. Hence, the methods according to the herein disclosed embodiments are suitable to be performed by a classification device 200 residing in a cloud computational environment. Therefore, although a single processing circuitry 310 is illustrated in Fig. 3, the processing circuitry 310 may be distributed among a plurality of devices, or nodes. The same applies to the functional modules 310a-310g of Fig. 4 and the computer program 720 of Fig. 7 (see below).
Fig. 7 shows one example of a computer program product 710 comprising computer readable storage medium 730. On this computer readable storage medium 730, a computer program 720 can be stored, which computer program 720 can cause the processing circuitry 310 and thereto operatively coupled entities and devices, such as the communications interface 320 and the storage medium 330, to execute methods according to embodiments described herein. The computer program 720 and/or computer program product 710 may thus provide means for performing any steps as herein disclosed. In the example of Fig. 7, the computer program product 710 is illustrated as an optical disc, such as a CD (compact disc) or a DVD (digital versatile disc) or a Blu-Ray disc. The computer program product 710 could also be embodied as a memory, such as a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM) and more particularly as a non-volatile storage medium of a device in an external memory such as a USB (Universal Serial Bus) memory or a Flash memory, such as a compact Flash memory. Thus, while the computer program 720 is here schematically shown as a track on the depicted optical disk, the computer program 720 can be stored in any way which is suitable for the computer program product 710.
The inventive concept has mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the inventive concept, as defined by the appended patent claims.

CLAIMS
1. A method for non-parametric audio classification, the method being performed by a classification device (200), the method comprising:

obtaining (S102) a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;

determining (S104) per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities for the at least two classes, and wherein each class ωm represents a unique audio classification property; and

classifying (S106) the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
2. The method according to claim 1, wherein the weighted sum P(ωm|xn) is determined using cluster contribution weights λk defined by distances Δk between the input sequence x and a set of clusters ck.

3. The method according to any of the preceding claims, wherein, for a given class, the weighted sum P(ωm|xn) when summed over all input vectors xn defines the per-class posterior probability P(ωm|x) for said given class.

4. The method according to any of the preceding claims, wherein there is one cluster contribution weight λk for each input vector xn, and wherein determining the cluster contribution weight λk for input vector xn comprises: determining (S104a) one distance Δk between the input vector xn and each of the clusters ck.

5. The method according to claim 4, wherein each of the distances Δk is made inversely proportional to a median absolute deviation factor pk relating to a spread of points inside cluster ck.
6. The method according to claim 4 or 5, wherein determining the weighted sum P( 6)m \xn) of per-cluster posterior probabilities P( 6)m |μ¾9 comprises:
determining (Si04b) one cluster contribution weight λ¾ for each distance Ak, and wherein the cluster contribution weight λ¾ is inversely proportional to the distance Ak from which it is determined.
7. The method according to claim 6, wherein each cluster contribution weight λk is inversely proportional to a sum based on all distances Δk.
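Claims 4 to 7 specify how the per-vector cluster contribution weights are formed. One plausible realisation, sketched below for a single input vector, assumes Euclidean distances, a per-cluster median-absolute-deviation factor ρk supplied from training (claim 5), and normalisation by the sum of inverse distances (claims 6 and 7); these specific choices are illustrative assumptions, not requirements of the claims.

```python
import numpy as np

def contribution_weights(x_n, centroids, rho):
    """One weight lambda_k per cluster for a single input vector x_n.

    x_n       : (D,) one input vector
    centroids : (K, D) cluster centroids mu_k
    rho       : (K,) median absolute deviation factor per cluster;
                a larger spread shrinks the effective distance (claim 5).
    """
    # Distance Delta_k from x_n to each cluster, scaled so that Delta_k
    # is inversely proportional to rho_k.
    delta = np.linalg.norm(x_n - centroids, axis=1) / rho

    # lambda_k inversely proportional to Delta_k, normalised by a sum
    # over all distances (claims 6 and 7); weights sum to one.
    inv = 1.0 / (delta + 1e-12)
    return inv / inv.sum()
```

For example, a vector at distance 1 from one centroid and 4 from another (equal ρk) receives weights 0.8 and 0.2: the nearer cluster's stored posterior dominates the weighted sum.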
8. The method according to claim 6 or 7, wherein there is one weighted sum P(ωm|xn) per class ωm and per input vector xn, and wherein all weighted sums P(ωm|xn) for one class ωm are combined over all input vectors xn to define one per-class posterior probability P(ωm|x).
9. The method according to claim 8, wherein determining the per-class posterior probabilities P(ωm|x) for a given class ωm comprises:
summing (S104c) the weighted sum P(ωm|xn) over all input vectors xn for said given class ωm.
10. The method according to claim 9, wherein logarithmic values of the weighted sum P(ωm|xn) are summed over all input vectors xn to determine a logarithmic value of the per-class posterior probability P(ωm|x) for said given class ωm.
11. The method according to any of the preceding claims, wherein each cluster Ck has a cluster centroid μk, and wherein each per-cluster posterior probability P(ωm|μk) for a particular class ωm and a particular cluster centroid μk represents a conditional probability of said particular class ωm given said particular cluster centroid μk.
12. The method according to any of the preceding claims, wherein the per-cluster posterior probabilities P(ωm|μk) are determined through training on a training sequence y.
13. The method according to claim 12, wherein the training is based on k-means clustering of the training sequence y.
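Claims 12 and 13 only state that the per-cluster posteriors are obtained by training on a labelled sequence y using k-means. The sketch below shows one plausible realisation: plain Lloyd iterations with a deterministic initialisation, hard assignments, and relative class frequencies within each cluster as the stored posteriors P(ωm|μk). All of these choices (initialisation, frequency-based posteriors, function and variable names) are assumptions for illustration.

```python
import numpy as np

def train(Y, labels, K, n_classes, iters=20):
    """Cluster training vectors and tabulate P(w_m | mu_k).

    Y        : (T, D) training vectors from the training sequence y
    labels   : (T,) class index of each training vector
    Returns (centroids of shape (K, D), cluster_post of shape (K, M)).
    """
    # Deterministic initialisation: K vectors spread evenly over Y
    # (a production k-means would typically use random restarts).
    idx = np.linspace(0, len(Y) - 1, K).astype(int)
    centroids = Y[idx].astype(float)

    for _ in range(iters):  # Lloyd iterations, i.e. k-means (claim 13)
        d = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = Y[assign == k].mean(axis=0)

    # Per-cluster posterior P(w_m | mu_k): relative class frequencies
    # among the training vectors assigned to cluster k.
    post = np.zeros((K, n_classes))
    for k in range(K):
        members = labels[assign == k]
        if len(members):
            post[k] = np.bincount(members, minlength=n_classes) / len(members)
    return centroids, post
```

On well-separated training data, each cluster becomes class-pure, so the stored posteriors approach indicator vectors; on overlapping data they encode soft class membership per cluster.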
14. The method according to any of the preceding claims, wherein each class ωm represents a unique language, a unique speaker, or a unique gender.
15. The method according to claim 1, wherein the short-term frequency representation is provided by mel-frequency cepstral components, MFCCs.
16. The method according to claim 15, further comprising:
extracting (S102a) the MFCCs from the audio waveform.
17. The method according to claim 16, wherein each input vector xn corresponds to a vector of MFCCs, wherein the audio waveform is composed of frames, and wherein there is one vector of MFCCs per frame.
18. The method according to claim 16 or 17, wherein the audio waveform represents a speech signal.
19. The method according to any of the preceding claims, wherein said obtaining, said determining, and said classifying are performed in an audio mining application.
20. A classification device (200) for non-parametric audio classification, the classification device (200) comprising processing circuitry (310), the processing circuitry being configured to cause the classification device (200) to:
obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;
determine per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes, and wherein each class ωm represents a unique audio classification property; and
classify the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
21. A classification device (200) for non-parametric audio classification, the classification device (200) comprising:
processing circuitry (310); and
a computer program product (710) storing instructions that, when executed by the processing circuitry (310), cause the classification device (200) to:
obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;
determine per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes, and wherein each class ωm represents a unique audio classification property; and
classify the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
22. A classification device (200) for non-parametric audio classification, the classification device (200) comprising:
an obtain module (310a) configured to obtain a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;
a determine module (310b) configured to determine per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes, and wherein each class ωm represents a unique audio classification property; and
a classify module (310c) configured to classify the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
23. A computer program (720) for non-parametric audio classification, the computer program comprising computer code which, when run on processing circuitry (310) of a classification device (200), causes the classification device (200) to:
obtain (S102) a short-term frequency representation of an audio waveform, the short-term frequency representation defining an input sequence x divided into input vectors xn;
determine (S104) per-class posterior probabilities P(ωm|x) for at least two classes, wherein each per-class posterior probability P(ωm|x) is based on a weighted sum P(ωm|xn) of pre-stored per-cluster posterior probabilities P(ωm|μk) for the at least two classes, and wherein each class ωm represents a unique audio classification property; and
classify (S106) the input sequence x to belong to the class ωm for which the per-class posterior probability P(ωm|x) is largest.
24. A computer program product (710) comprising a computer program (720) according to claim 23, and a computer readable storage medium (730) on which the computer program is stored.
PCT/EP2016/054586 2016-03-03 2016-03-03 Non-parametric audio classification WO2017148523A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2016/054586 WO2017148523A1 (en) 2016-03-03 2016-03-03 Non-parametric audio classification


Publications (1)

Publication Number Publication Date
WO2017148523A1 true WO2017148523A1 (en) 2017-09-08

Family

ID=55484980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2016/054586 WO2017148523A1 (en) 2016-03-03 2016-03-03 Non-parametric audio classification

Country Status (1)

Country Link
WO (1) WO2017148523A1 (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040221163A1 (en) * 2003-05-02 2004-11-04 Jorgensen Jimi T. Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
EDMONDO TRENTIN ET AL: "A survey of hybrid ANN/HMM models for automatic speech recognition", NEUROCOMPUTING, vol. 37, no. 1-4, 1 April 2001 (2001-04-01), pages 91 - 126, XP055131961, ISSN: 0925-2312, DOI: 10.1016/S0925-2312(00)00308-8 *
FRITSCH JÜRGEN ET AL: "Applying Divide and Conquer to Large Scale Pattern Recognition Tasks", 1 January 1901, CORRECT SYSTEM DESIGN; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CHAM, PAGE(S) 311 - 338, ISBN: 978-3-642-23953-3, ISSN: 0302-9743, XP047292588 *
MARIO BKASSINY ET AL: "A Survey on Machine-Learning Techniques in Cognitive Radios", IEEE COMMUNICATIONS SURVEYS AND TUTORIALS, INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, US, vol. 15, no. 3, 1 July 2013 (2013-07-01), pages 1136 - 1159, XP011523247, ISSN: 1553-877X, DOI: 10.1109/SURV.2012.100412.00017 *
SAMSUDIN N A ET AL: "Nearest neighbour group-based classification", PATTERN RECOGNITION, ELSEVIER, GB, vol. 43, no. 10, 1 October 2010 (2010-10-01), pages 3458 - 3467, XP027095435, ISSN: 0031-3203, [retrieved on 20100512] *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389576A (en) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 The optimization method and system of compressed speech recognition modeling
CN108389576B (en) * 2018-01-10 2020-09-01 苏州思必驰信息科技有限公司 Method and system for optimizing compressed speech recognition model
CN111583963A (en) * 2020-05-18 2020-08-25 合肥讯飞数码科技有限公司 Method, device and equipment for detecting repeated audio and storage medium
CN111583963B (en) * 2020-05-18 2023-03-21 合肥讯飞数码科技有限公司 Repeated audio detection method, device, equipment and storage medium


Legal Events

NENP: Non-entry into the national phase (Ref country code: DE)
121: Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16708395; Country of ref document: EP; Kind code of ref document: A1)
122: Ep: PCT application non-entry in European phase (Ref document number: 16708395; Country of ref document: EP; Kind code of ref document: A1)