GB2523730A - Processing audio data to produce metadata - Google Patents

Processing audio data to produce metadata

Info

Publication number
GB2523730A
GB2523730A GB1401218.1A GB201401218A
Authority
GB
United Kingdom
Prior art keywords
derived
dimensional
vector
keywords
dimensional vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1401218.1A
Other versions
GB201401218D0 (en)
Inventor
David Marston
Chris Baume
Panos Kudumakis
Mathieu Barthet
Gyorgy Fazekas
Andrew Hill
Mark Sandler
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BROADCHART INTERNAT Ltd
British Broadcasting Corp
Queen Mary University of London
Original Assignee
BROADCHART INTERNAT Ltd
British Broadcasting Corp
Queen Mary University of London
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BROADCHART INTERNAT Ltd, British Broadcasting Corp, Queen Mary University of London filed Critical BROADCHART INTERNAT Ltd
Priority to GB1401218.1A priority Critical patent/GB2523730A/en
Publication of GB201401218D0 publication Critical patent/GB201401218D0/en
Priority to PCT/GB2015/050151 priority patent/WO2015110823A1/en
Publication of GB2523730A publication Critical patent/GB2523730A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/075Musical metadata derived from musical analysis or for use in electrophonic musical instruments
    • G10H2240/085Mood, i.e. generation, detection or selection of a particular emotional content or atmosphere in a musical piece

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In a method for automatic retrieval of audio content, features (e.g. 63 different features such as tempo, loudness and spectral slope; see Table 1) are derived from a sample audio file to produce an F-dimensional (e.g. 32-D) vector, which is converted into an M-dimensional (e.g. 5-D) vector via a machine learning module acting on trained audio content (e.g. 450 different mood keywords such as joyful, confident, exciting) in order to allow selection and retrieval of music audio files. A database of audio tracks having accurately chosen metadata may be used to train the machine learning in order to generate metadata (i.e. estimate mood tags) without human input.

Description

PROCESSING AUDIO DATA TO PRODUCE METADATA
BACKGROUND OF THE INVENTION
This invention relates to a system and method for processing audio data to produce metadata and for controlling retrieval and output of music audio files.
Audio content, such as music, may be stored in a variety of formats and can have accompanying metadata describing the content that may be stored with the content or separately. Recorded music comprises tracks, movements, albums or other useful divisions. For simplicity, we will refer to a portion of music or other audio data, however divided, as audio content.
It is convenient to store metadata related to audio content to assist in the storage and retrieval of audio content from databases for use with guides. Such metadata may be represented graphically for user selection or may be used by systems for processing the audio content. Example metadata includes the content's title, textual description and genre.
There can be problems in appropriately deriving and using metadata. Curated music collections require many man-hours of work to maintain. Further, the use of keywords alone can be an inaccurate representation of complex aspects of music.
There can also be problems in the reliability of created metadata, particularly where the metadata requires some form of human intervention, rather than automated machine processing. If the metadata is not reliable, then the extraction process will again lead to poor results sets.
SUMMARY OF THE INVENTION
We have appreciated the need to convert between audio content and metadata using techniques that improve on the ability of the metadata to represent the associated audio content, to allow improved retrieval and output.
In broad terms, the invention provides a system and method for converting between audio content and metadata using a conversion between an M-dimensional vector mood space, derived from metadata of audio content training data, and an F-dimensional vector feature space derived from features extracted from the audio content training data. The use of mapping between the two vector spaces provides more accurate and faster searching and output techniques than prior arrangements.
The invention is defined in the claims to which reference is now directed.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will now be described in more detail by way of example with reference to the drawings, in which: Figure 1 is a diagram of the main functional components of a system embodying the invention; Figure 2 is a diagrammatic representation of an algorithm embodying the invention; Figure 3 shows an overview of one specific use case of an embodiment of the invention; and Figure 4 is a graph showing an analysis of accuracy of the embodiment.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The invention may be embodied in a method and system for processing metadata related to audio data (which may also be referred to as audio content) to produce an output signal. The output signal may be used for controlling a display, initiating playback or controlling other hardware. A system embodying the invention is shown in Figure 1. The system may be implemented as dedicated hardware or as a process within a larger system.
The system 100 embodying the invention has an input 2 for receiving audio content that is to be used for the purpose of deriving a machine learning conversion. The content input 2 may comprise an input from a database of audio content such as a music library of audio music tracks which may be gathered as single tracks or albums or otherwise categorised into a structured store. Each portion of audio content, such as a given music track, comprises audio data as well as metadata, particularly keywords used to tag the audio content. The audio content is first provided to a keyword analyser 4.
The keyword analyser 4 analyses the set of keywords found within all of the training audio content and derives a conversion between each keyword and a multidimensional vector M, so that each keyword may be easily converted to such a multidimensional vector. In the example embodiment, the total set of keywords comprises 450 distinct words for the training content. The M dimensional space preferably has five dimensions and we will refer to this space as a "mood space".
Each axis of the mood space has no particular meaning and is simply a useful representation by which the keywords may be classified.
The M dimensional vectors may be derived in a variety of ways, including techniques based on user input in which users are asked to score words for similarity, automated processes in which words are compared using libraries or a thesaurus, or a statistical process from which the relationship between the keywords can be inferred. The important point to note is that a large set of words is converted such that each word may be represented by a vector in an M dimensional space. The number of dimensions is less than the number of distinct words and preferably, for computational reduction, is significantly less than the number of words. As noted above, the number of dimensions preferred, based on analysis, is M = 5, being considerably less than the 450 distinct training words.
The next component in the system 100 is a feature analyser 6, which also receives as an input the audio content at audio content input 2. The purpose of the feature analyser is to produce a feature vector in a feature vector space F of F dimensions. Features of audio content include matters such as tempo, loudness, spectral slope and other such aspects of audio understood in the art. The feature analyser extracts a set of features from each portion of audio content and converts these to an F dimensional vector in the F dimensional space. In the example embodiment, 63 differing features of audio music, discussed later, are applied to the feature analyser, which reduces the set of features down to an F dimensional vector. The preferred number of dimensions for the F dimensional vector is 32.
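By way of a purely illustrative sketch of this feature extraction step, the following Python code assembles a fixed-length per-track feature vector. The embodiment itself uses Vamp plugins run through Sonic Annotator, so the use of the librosa library and the particular features and summary statistics below are assumptions for illustration only.

import numpy as np
import librosa

def extract_feature_vector(path):
    # Tracks are down-mixed to mono, as in the embodiment.
    y, sr = librosa.load(path, mono=True)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)   # crude rhythmic proxy
    rms = librosa.feature.rms(y=y)                          # loudness proxy
    zcr = librosa.feature.zero_crossing_rate(y)
    # Summarise each time-varying feature with its mean and standard deviation
    # to give one F-dimensional vector per track.
    parts = []
    for f in (centroid, onset_env, rms, zcr):
        parts.extend([float(np.mean(f)), float(np.std(f))])
    return np.asarray(parts)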
The output from the two analyser stages, therefore, is an M dimensional vector representing the audio content in mood space and an F dimensional vector representing the content in feature space. Each portion of content has one M dimensional vector and one F dimensional vector. It would be possible to extend this to have more than one such vector for each piece of content, but one is preferred in the embodiment for simplicity. The machine learning stage 8 receives the F dimensional and M dimensional vectors for each portion of audio content and, using machine learning techniques, derives a conversion between the M dimensional vectors and F dimensional vectors. The preferred machine learning operation is discussed in greater detail later. In broad terms, the analysis involves deriving a conversion or mapping between the M dimensional vector and F dimensional vector by looking for a correlation between the M dimensional vector and F dimensional vector in the training data set. The larger the training data size, up to a threshold, the more accurate the correlation between the two data sets. The output of the machine learning module 8 is a mapping or conversion between the M dimensional vectors and F dimensional vectors which may be used for analysis of specific samples of audio content.
The conversion or mapping is provided to a converter 10 which is arranged to operate a reverse process in which vectors in the F dimensional feature space can be converted to vectors in the M dimensional mood space. A given sample of audio content may then be tagged with one or more M dimensional vectors that represent the mood of the music. This representation of mood is a vector that does not directly have any meaning, but as it has been derived by a model that has used natural language and statistical processing from keywords that do have meaning, the mood vector will have a useful purpose in looking for similarities between portions of music.
A particular sample of audio content such as a music track may be input at sample input 12 to the converter 10 and the process of feature extraction operated within the converter 10 to derive the F dimensional vector. The conversion or mapping to an M dimensional vector is performed and an output asserted on output line 11. The output may comprise the M dimensional vector for that piece of audio content. Alternatively, the converter may further convert the M dimensional vector to one or more keywords using an M dimensional vector to keyword conversion also provided from the learning module 8. Alternatively, the output 11 may assert a signal back to the content input 2 to retrieve a similar piece of audio content directly from the body of audio content, or to a different music database comprising content that has either been tagged with keywords or that is automatically tagged with M dimensional vectors. In this way, the system may be implemented as part of a database system, a music player or other user device which can automatically retrieve, select, display in a list or play music tracks.
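As a minimal sketch of how the converter 10 and output 11 might operate at retrieval time, the following assumes one trained regressor per mood dimension and Euclidean distance as the similarity measure; the function names and data layout are assumptions, not taken from the claims.

import numpy as np

def mood_vector(sample_features, regressors):
    # Map an F dimensional feature vector to an M dimensional mood vector,
    # using one trained regressor per mood dimension.
    return np.array([r.predict(sample_features.reshape(1, -1))[0] for r in regressors])

def retrieve_similar(sample_features, regressors, library_moods, k=10):
    # Return indices of the k library tracks whose stored mood vectors lie
    # closest (in Euclidean distance) to the sample's derived mood vector.
    m = mood_vector(sample_features, regressors)
    distances = np.linalg.norm(library_moods - m, axis=1)   # library_moods: (n_tracks, M)
    return np.argsort(distances)[:k]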
Figure 2 summarises the process operated by the above system, which may be conveniently split into two stages. The first stage is a training stage comprising analysis of keywords 20, analysis of features 22 and the derivation of a conversion 24 from a training set of data. The second stage then extracts features from a specific sample of audio content at a feature extraction stage 26, converts those features to vectors at conversion stage 28 and asserts an output at output step 30 in the manner described above. The steps operated in Figure 2 may be operated within one device. Preferably, though, steps 20 to 24 are provided in systems for deriving a conversion and steps 26 to 30 are provided in a user device such as a music player.
A particular use case for the techniques described will now be set out in relation to Figure 3. The example is for generating mood metadata for commercial music which itself has no metadata that can be used for creative purposes such as for music selection. The training data provided is so-called production music; this is music that is typically not sold to the public but is mainly used in the film and TV industry. Databases of such production music are available that have been manually catalogued with high quality keywords and so are useful in providing training data. Commercial music is generally available music for purchase, such as singles and albums available in hard copy or for download; this is not generally well catalogued and indexed with keywords.
As shown in Figure 3, the production music training database is sampled, audio features extracted and training and tests performed to derive the best audio features for selection. Alongside this, the editorial tags such as keywords are retrieved, filtered to remove redundancies in information, and a final set of editorial tags derived to create mood values. The selected audio features and mood values are then input to a regressor training and testing module and this process is repeated until the best performing features are found and final training models derived, giving a track-to-mood mapping that may be used within a converter. Commercial music may then be input to a converter using the track-to-mood mapping, which may be effectively operated in reverse to give estimated mood tags as a result.
The feature extraction process will now be described in greater detail. There are many feature extraction software packages available, including MATLAB-based ones such as MIR Toolbox, MA Toolbox and PsySound3, or open source C++/Python libraries such as Marsyas, CLAM, LibXtract, Aubio and YAAFE. However, it remains difficult to know whether audio features computed by different audio feature extraction and analysis tools are mutually compatible or interchangeable. Moreover, if different tools were used in the same experiment, the outputs typically need conversion to some sort of common format, and for reproducibility, this glue code needs to evolve with the changes of the tools themselves.
To resolve these issues, we used the Vamp plugin architecture developed at QMUL as a standardised way of housing feature extraction algorithms. A large number of algorithms are available as Vamp plugins, and as they can all be used from the command line using Sonic Annotator, it is easy to extract a wide variety of features. Five Vamp plugin collections were selected for use as part of the project: the BBC plugin set, the QMUL plugin set, NNLS Chroma, Mazurka, and LibXtract. The plugins developed and used in this work were released as open source software available online.
We analysed 63 features computed from these Vamp plugins by using them as the input to a four-mood classifier. The results showed that some of the features (e.g. spectral kurtosis and skewness) had no correlation with the four basic moods which were considered in the experiments, so these were not included in further extraction processes. Of the remaining 47 algorithms, 40 were used with their default settings, while the remaining ones were set up with a variety of configurations, producing a total of 59 features. These are listed in Table 1.
The inventors used a high-performance computing cluster, which houses over 5,000 Intel Sandy Bridge cores, to extract the features from 128,024 music tracks. The tracks were first down-mixed to mono in order to save time when transferring the files over FTP. Since all of the algorithms use a mono input, this does not affect the result. Once the files were on the cluster, a separate task was run for each music track in which Sonic Annotator was used to extract each feature. As a result of parallelisation, all designated features were extracted from the collection in less than seven hours.
Example features are listed in Table 1 at the end of this description.
The next stage, feature selection, will now be described. Feature selection is the process of selecting a subset of features for the purpose of removing redundant data. By utilising this pre-processing stage, the accuracy and speed of machine learning-based systems can be improved. Generally speaking, feature selection involves choosing subsets of features and evaluating them until a stopping criterion is reached. The different subset evaluation techniques broadly fall into three categories: the filter model, the wrapper model and the hybrid model. The filter model relies on general characteristics of the data to evaluate feature subsets, whereas the wrapper model uses the performance of a predetermined algorithm (such as a support vector machine) as the evaluation criterion. The wrapper model gives superior performance as it finds features best suited to the chosen algorithm, but it is more computationally expensive and specific to that algorithm. The hybrid model attempts to combine the advantages of both.
Most studies which have employed feature selection in the context of music mood/genre classification have used the filter model, with the ReliefF algorithm being particularly popular. Notably, however, there are a couple of studies that have successfully used the wrapper method. Due to the superior performance of the wrapper method, this was chosen for the evaluation.
The music data used for evaluating the features was randomly selected from a production music library, with each track coming from a different album (to avoid skewing data with an 'album effect' due to similarity of tracks), being over 30 seconds in length (to avoid sound effects and shortened versions of tracks) and having explicitly labelled mood tags. This resulted in 1,760 tracks, whose features were scaled to fall between 0 and 1, before the tracks were randomly split into two-thirds training (1,173) and one-third testing (587). Where the feature is time-varying, the following six metrics were used to summarise the output: mean, standard deviation, minimum, maximum, median and mode. Although some of these statistics assume a Gaussian distribution of audio features (an assumption which clearly does not hold in all cases), we found that the above combination of metrics provides a reasonable compromise. Using a bag-of-frames approach as an alternative would require storing large amounts of frame-wise feature data, which isn't practical given the size of the target music collection in which our method will be applied.
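A short sketch of this summarisation and scaling, assuming each time-varying feature arrives as a one-dimensional array of frame values; the histogram-based mode is an implementation assumption, since the mode of a continuous signal is not otherwise defined in the text.

import numpy as np

def summarise(frames):
    # Six summary statistics: mean, standard deviation, minimum, maximum,
    # median and (histogram-approximated) mode.
    counts, edges = np.histogram(frames, bins=50)
    mode = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])
    return np.array([frames.mean(), frames.std(), frames.min(),
                     frames.max(), np.median(frames), mode])

def min_max_scale(X):
    # Scale each feature column to the range [0, 1] before the random
    # two-thirds / one-third train/test split.
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi - lo == 0, 1, hi - lo)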
The features were evaluated by using combinations of them as the input, and a five-dimensional mood representation as the output. The mood model used is based on the structure of the keywords in the production music database.
The system was implemented by using five support vector regressors (SVRs), each based on a polynomial kernel. The implementation used the scikit-learn Python module. Although the RBF kernel has often been used for MER, a recent study has shown the polynomial kernel to be faster and more accurate.
Two-fold cross-validation was used with each regressor to perform a grid search.
Each regressor was trained using the optimum parameters from cross-validation and evaluated against the test set, using absolute error as the performance metric.
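A minimal sketch of this training and evaluation step with scikit-learn; the parameter grid is not given in the text and is assumed here for illustration.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

def train_mood_regressors(X_train, Y_train, X_test, Y_test):
    # One polynomial-kernel SVR per mood dimension, tuned by a two-fold
    # cross-validated grid search and scored by absolute error on the test set.
    grid = {"C": [0.1, 1, 10], "degree": [2, 3], "epsilon": [0.01, 0.1]}  # assumed grid
    regressors, errors = [], []
    for d in range(Y_train.shape[1]):   # e.g. five mood dimensions
        search = GridSearchCV(SVR(kernel="poly"), grid, cv=2,
                              scoring="neg_mean_absolute_error")
        search.fit(X_train, Y_train[:, d])
        regressors.append(search.best_estimator_)
        errors.append(mean_absolute_error(Y_test[:, d],
                                          search.best_estimator_.predict(X_test)))
    return regressors, np.array(errors)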
As there are over 5.7 x 10^17 different possible combinations of the 59 features, it would have been impractical to perform an exhaustive search, so a forward sequential search was chosen, where features are added successively.
To avoid the problem of the subset size increasing exponentially, the following algorithm was developed:
1. Start with a set containing every combination of N features.
2. Evaluate the performance of each and choose the M best combinations.
3. Generate a new set of combinations by adding every one of the 59 features to each of the top M, to make combinations with (N + 1) features.
4. Repeat from step 2.
To maximise the computing time available, the parameters were set as N = 2 and M = 12.
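A sketch of this search in Python, where evaluate(subset) stands for training and testing a regressor on the candidate feature subset and returning its absolute error; that helper, and the stopping size of 20 features, are assumptions.

from itertools import combinations

def forward_search(n_features, evaluate, N=2, M=12, max_size=20):
    # Step 1: start with a set containing every combination of N features.
    candidates = [frozenset(c) for c in combinations(range(n_features), N)]
    best_subset, best_error = None, float("inf")
    size = N
    while size <= max_size:
        # Step 2: evaluate each candidate and keep the M best combinations.
        scored = sorted(((evaluate(s), s) for s in candidates), key=lambda t: t[0])[:M]
        if scored[0][0] < best_error:
            best_error, best_subset = scored[0]
        # Step 3: extend each of the top M by every feature not already in it,
        # producing combinations with (size + 1) features.
        candidates = list({s | {f} for _, s in scored
                           for f in range(n_features) if f not in s})
        size += 1   # Step 4: repeat from step 2.
    return best_subset, best_error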
Figure 4 shows the best absolute error achieved for each regressor when combining up to 20 different features, with the minima marked as triangles. Table 2 shows which features were used to achieve those minima. The full results show the features that were used for every point of the graph in Figure 4.
The overall minimum mean error achieved by using the best feature combinations for each regressor was 0.1699. From the shape of the plots in Figure 4, we can see that using more features produces diminishing returns. In each case, the error reaches a minimum before reaching a baseline. This could be avoided by using cross-validation, but that would make the process somewhere in the order of 100 times slower, which is prohibitively slow given the mildness of the over-fitting.
Table 2 shows that mood prediction benefits from a wide variety of features (32 in this case) from every category. However, some of the regressors were more reliant on certain categories than others. For example, SVR1 uses harmonic and rhythmic features more than spectral ones. This suggests that it may be advantageous to optimise the features for individual dimensions.
The design of specific mood models will now be described in greater detail. In the same way that a measure of similarity between tracks can be derived from tag co-occurrence counts, a measure of similarity between tags can be derived from their co-occurrence over tracks.
In the case of curated editorial metadata, tracks are associated with a list of unique tags judged to be the most appropriate by professional music experts.
Hence, a given tag is only attributed once to a track, unlike for social tags, for which a large number of users set tags to tracks. Initially the mood tags were cleaned by correcting misspellings (around 100 errors out of 2,398 mood tags), removing duplicates (338 duplicates, yielding 2,060 unique tags), and stripping white spaces and punctuation marks. Instead of following a bag-of-words approach, for which the meaning of certain tags with multiple words can be lost (e.g. "guilty pleasure"), we collated words of alternate forms to further process them as single entities (using a hyphen between the words). The vocabulary used in the editorial annotations is composed of conventional words and does not have the idiosyncrasies of social tags, which often include sentences, informal expressions (e.g. "good for dancing to in a goth bar") or artists' names. For this reason, we did not have to tokenise the tags with a stop-list (to remove words such as "a", "and" and "the", for instance). However, we used a stemming algorithm to detect tags with similar base parts (e.g. "joyful" and "joy"), as these refer to identical emotional concepts. 1,873 mood-related stems were obtained out of the 2,060 unique mood tags. In order to reduce the size of the stem vector while maintaining the richness of the track descriptions, we only kept stems which were associated with at least 100 tracks in further analyses. This stem filtering process yielded a list of 453 stems which provided at least one mood tag for each of the 183,176 tracks from the ILM dataset.
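A hedged sketch of this tag-cleaning pipeline follows; the NLTK Porter stemmer stands in for the unnamed stemming algorithm, and the helper names and data layout are assumptions.

from collections import Counter
from nltk.stem import PorterStemmer

def clean_and_filter_tags(track_tags, min_tracks=100):
    # track_tags: dict mapping track id -> list of raw mood tags.
    # Returns dict mapping track id -> set of kept stems.
    stemmer = PorterStemmer()

    def to_stem(tag):
        tag = "-".join(tag.strip().lower().split())                   # hyphenate multi-word tags
        tag = "".join(ch for ch in tag if ch.isalnum() or ch == "-")  # strip punctuation
        return stemmer.stem(tag)                                      # collapse similar base parts

    stems_per_track = {t: {to_stem(tag) for tag in tags} for t, tags in track_tags.items()}
    counts = Counter(s for stems in stems_per_track.values() for s in stems)
    kept = {s for s, n in counts.items() if n >= min_tracks}          # keep well-populated stems
    return {t: stems & kept for t, stems in stems_per_track.items()}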
The associations between tracks and stems are provided in a document-term matrix T, where T_{ij} = 1 if track j is annotated with stem i, and 0 otherwise.
The stem pairwise co-occurrences over tracks, c_{ij}, are given by c_{ij} = |X_i ∩ X_j|, where X_i is the set of tracks annotated with stem i and |·| is the cardinality operator. The measure of dissimilarity between stems, s_{ij}, is computed as s_{ij} = 1 - c_{ij} / max(c), where max(c) is the maximum of the pairwise stem co-occurrences in the ILM dataset (26,859).
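As a short numerical sketch, assuming a binary track-by-stem matrix T with tracks in rows and stems in columns, consistent with the definitions above:

import numpy as np

def stem_dissimilarity(T):
    # T: binary matrix of shape (n_tracks, n_stems); T[t, i] = 1 if track t
    # is annotated with stem i.
    C = T.T @ T                 # c_ij: number of tracks shared by stems i and j
    np.fill_diagonal(C, 0)      # exclude self co-occurrence (an assumption)
    S = 1.0 - C / C.max()       # s_ij = 1 - c_ij / max(c), as above
    return S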
Non-metric multidimensional scaling (MDS) analyses were then applied to the stem dissimilarity matrix, S = (s_ij).
Four outlier stems presenting a null or very small co-occurrence measure with all the other stems were discarded, so as not to bias the MDS analysis (this yielded a list of 449 stems).
We have plotted the evolution of Kruskal's stress as the number of dimensions D increases from 1 to 13. Following a rule of thumb for MDS, acceptable, good and excellent representations are obtained for D = 3 (stress < 0.2), D = 5 (stress < 0.1) and D = 11 (stress < 0.05). Interestingly, five dimensions yield a good representation (elbow of the scree plot). This result suggests that more than three dimensions are required to accurately categorise mood terms in the context of production music, which contrasts with the classical three-dimensional emotion model (arousal, valence and dominance). In further analyses, we mapped the mood stems back to mood tags to uncover the meaning of the dimensions.
Interestingly, analysis revealed that three out of the five MDS dimensions are significantly correlated with the arousal and/or valence and/or dominance dimensions, showing that the 5-D MDS configuration captures aspects of the core emotion dimensions.
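A minimal sketch of the non-metric MDS step with scikit-learn, where the parameter choices are assumptions:

from sklearn.manifold import MDS

def embed_stems(S, n_dims=5, seed=0):
    # Embed the stem dissimilarity matrix S into an n_dims-dimensional mood
    # space using non-metric multidimensional scaling.
    mds = MDS(n_components=n_dims, metric=False, dissimilarity="precomputed",
              random_state=seed, n_init=4)
    coords = mds.fit_transform(S)    # one n_dims-dimensional point per stem
    return coords, mds.stress_       # stress can be compared across values of n_dims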
We devised several methods to summarise the tags of a track in a given multidimensional mood space. Let T denote the tag matrix representing the coordinates of the tags i of a track across the dimensions j of the mood space. For the methods described below, the tag summary matrix Y is obtained by multiplying the tag matrix with a weight matrix W = {w_i}.
Maximum Term Frequency (MTF): this method assumes that a track is best represented by the tag from its set of tags which has the highest term frequency (TF) in the dataset. The weights w_i for the N tags of a track are w_i = 1 if TF(t_i) = max_{1<=j<=N} TF(t_j), and 0 otherwise.
Centroid (CEN): this method summarises the tags of a track by their centroid or geometrical mean in the mood space. The tag weights are hence given by w_i = 1/N.
Term-Frequency Weighted Centroid (TFW): this method summarises the tags of a track by their centroid after attributing to each tag a weight proportional to its term frequency (TF), w_i = TF(t_i) / Σ_j TF(t_j). Hence the centroid is attracted towards the tag of highest term frequency.
Inverse Term-Frequency Weighted Centroid (ITF): conversely, this method attributes more weight to the tag of lowest term frequency, following the assumption that this tag may convey more specific information about the song: w_i = (1/TF(t_i)) / Σ_j (1/TF(t_j)).
Mean and Variance (MVA): rather than summarising the tags of a track by a point in the space, this method assumes that the tags can be represented by a Gaussian distribution. The tag summary matrix Y is given by the mean μ(T) and variance σ(T) of the tag matrix: Y = {μ(T); σ(T)}.
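A compact sketch of these weighting schemes, assuming coords holds the mood-space coordinates of a track's tags (one row per tag) and tf their term frequencies in the dataset:

import numpy as np

def summarise_tags(coords, tf, method="CEN"):
    # coords: (n_tags, n_dims) mood-space coordinates of a track's tags.
    # tf: (n_tags,) term frequencies of those tags in the dataset.
    n = len(tf)
    if method == "MTF":                      # keep only the most frequent tag(s)
        w = (tf == tf.max()).astype(float)
    elif method == "CEN":                    # plain centroid: equal weights
        w = np.ones(n)
    elif method == "TFW":                    # weights proportional to TF
        w = tf.astype(float)
    elif method == "ITF":                    # weights proportional to 1/TF
        w = 1.0 / tf
    else:
        raise ValueError(method)
    w = w / w.sum()
    return w @ coords                        # weighted centroid in the mood space

The MVA variant would instead return the per-dimension mean and variance of coords rather than a single point.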
Model Derived from Mood Taxonomy (CLUST): Popular mood keywords were added to an initial selection provided by QMUL to create a list of 355 mood words. Over 95% of the production music library contained at least one of these 355 words. Each of these words was placed in one of 27 categories, which became the starting point for a cluster-based model.
Each category was treated as a cluster containing several mood words. Many of these clusters could be considered to overlap in their mood, some were clearly opposites, while others had little in common. To convert these clusters into dimensions, the overlapping ones were combined into single dimensions, any opposite clusters were converted into negative (-ve) values of the dimension they were opposite to, and the non-overlapping clusters were treated as new dimensions. Using this method, the 27 clusters were converted to 10 dimensions, giving each of the 355 mood words 10-dimensional mood values.
The allocation of words to clusters and of clusters to dimensions was based on only one person's opinion. The choice of 10 dimensions was a compromise between combining clusters that are too dissimilar and having too sparse a model. To illustrate the process, the first three dimensions represent the following mood clusters: 1) Confident (+ve scale), Caution & Doubt (-ve scale); 2) Sad & Serious (+ve scale), Happy & Optimistic (-ve scale); and 3) Exciting (+ve scale), Calm (-ve scale).
As each music track is associated with several mood tags, each mapped to 10-dimensional values, tags had to be combined. The simplest and most obvious way would be to take the mean of all the mood values to generate a single 10-dimensional value for a track. However, it was felt that a music track can be represented by moods that differ significantly, so combining them into a single mood would be too crude. Therefore, a method (denoted PEA) to generate two mood values per track was devised. This method uses clustering of the 10-D scores, where close scores are combined together. The means of the two most significant clusters are then calculated, resulting in two 10-D mood values for each track. A weight was assigned to each value according to the size of the cluster.
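A hedged sketch of this two-value summarisation; the text does not name the clustering algorithm, so two-cluster k-means is used here purely as an assumption.

import numpy as np
from sklearn.cluster import KMeans

def two_mood_values(tag_moods, seed=0):
    # tag_moods: (n_tags, 10) array of the 10-D mood values of a track's tags.
    # Returns two cluster-mean mood values and weights proportional to cluster size.
    if len(tag_moods) < 2:
        return np.vstack([tag_moods, tag_moods]), np.array([0.5, 0.5])
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(tag_moods)
    means = np.vstack([tag_moods[km.labels_ == c].mean(axis=0) for c in (0, 1)])
    weights = np.bincount(km.labels_, minlength=2) / len(tag_moods)
    return means, weights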
For the purposes of searching a database of tracks with mood values assigned to them, a distance measurement is required to find which tracks most closely match each other. For the MDS-based models, distances between tracks were obtained using either the Euclidean distance between tag summary vectors (methods MTF, CEN, TFW, ITF), or the Kullback-Leibler (KL) divergence between the Gaussian representations of the tags (method MVA). As the model described above allocates two 10-D mood values per track (method PEA), a weighted Euclidean measure was used which exploited the weighting values associated with each of the two 10-D mood values. This is shown in equation (5), where ms(i, k) is the mood of the seed track (where i is the value index and k is the dimension index), mt(j, k) is the mood expressed by the test track (where j is the value index), ws(i) is the seed track weighting, and wt(j) is the test track weighting.
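Since equation (5) is not reproduced here, the following is only a plausible reading of the weighted measure described above, combining the two mood values of each track through their weights; it should be read as an assumption rather than the claimed formula.

import numpy as np

def weighted_mood_distance(ms, ws, mt, wt):
    # ms, mt: (2, 10) arrays holding the two 10-D mood values of the seed and
    # test tracks; ws, wt: their (2,) weight vectors.
    d = 0.0
    for i in range(2):
        for j in range(2):
            d += ws[i] * wt[j] * np.linalg.norm(ms[i] - mt[j])
    return d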
The embodying system may be implemented in hardware in a number of ways. In a preferred embodiment, the content store 2 may be an online service such as cable television or Internet TV. Similarly, the general user parameters store 6 may be an online service provided as part of cable television delivery, periodically downloaded or provided once to the metadata processor and then retained by the processor. The remaining functionality may be implemented within a client device such as a set top box, PC, TV or other client device for AV content.
In an alternative implementation, the client device 100 may include the content store 2 and the general user parameters store 6 so that the device can operate in a standalone mode to derive the output signal and optionally automatically retrieve AV content from the store 2.
Table 1: the 59 audio features used, grouped by category, including spectral, harmonic, rhythmic and energy/temporal features computed from the Vamp plugins listed above.
Table 2: the feature combinations used by each regressor to achieve the minimum errors shown in Figure 4.

Claims (6)

  1. CLAIMS
1. A system for automated controlling of retrieval and output of music audio files, comprising: - a training input for receiving music audio files each having one or more associated keywords from a set of keywords; - an analyser arranged to convert keywords to M dimensional vectors in a vector space M, where M is less than the total number of distinct keywords in the set of keywords; - an analyser arranged to sample features of the music audio files and to produce an F dimensional vector in a vector space F representing each music audio file; - a machine learning module arranged to derive a conversion between M dimensional vectors and F dimensional vectors; - a sample input arranged to receive a sample audio file, to extract features and to produce a derived F dimensional vector in vector space F; - a converter arranged to convert the F dimensional vector to a derived M dimensional vector in vector space M using the derived conversion; - an output arranged to allow selection and retrieval of music audio files using the derived M dimensional vector.
  2. 2. A system according to claim 1, wherein the converter is further arranged to derive one or more keywords from the derived M dimensional vector.
  3. 3. A system according to claim 2, wherein the output is arranged to allow selection and retrieval of music audio files using the derived one or more keywords.
  4. 4. A system according to any preceding claim, wherein the output is arranged to control a display to produce a list of titles of audio content.
  5. 5. A system according to any preceding claim, wherein the output is arranged to automatically retrieve audio content.
  6. 6. A system for automated retrieval of audio content, comprising: - a sample input arranged to receive a sample audio file, to extract features and to produce a derived F dimensional vector in vector space F; - a converter arranged to convert the F dimensional vector to a derived M dimensional vector in vector space M using a stored derived conversion; - an output arranged to allow selection and retrieval of music audio files using the derived M dimensional vector; - wherein the stored derived conversion is derived using a machine learning module arranged to derive a conversion between M dimensional vectors and F dimensional vectors from training audio content.
7. A method for automated retrieval and output of music audio files, comprising: - receiving music audio files each having one or more associated keywords from a set of keywords; - converting keywords to M dimensional vectors in a vector space M, where M is less than the total number of distinct keywords in the set of keywords; - sampling features of the music audio files to produce an F dimensional vector in a vector space F representing each music audio file; - deriving a conversion between M dimensional vectors and F dimensional vectors; - receiving a sample audio file, extracting features and producing a derived F dimensional vector in vector space F; - converting the F dimensional vector to a derived M dimensional vector in vector space M using the derived conversion; and - selecting and retrieving music audio files using the derived M dimensional vector.
8. A method according to claim 7, comprising deriving one or more keywords from the derived M dimensional vector.
9. A method according to claim 8, comprising providing selection and retrieval of music audio files using the derived one or more keywords.
10. A method according to any of claims 7 to 9, comprising controlling a display to produce a list of titles of audio content.
11. A method according to any preceding claim, comprising automatically retrieving audio content.
12. A method for automated retrieval of audio content, comprising: - receiving a sample audio file, extracting features and producing a derived F dimensional vector in vector space F; - converting the F dimensional vector to a derived M dimensional vector in vector space M using a derived conversion; - providing selection and retrieval of music audio files using the derived M dimensional vector; - wherein the derived conversion is derived using a machine learning module arranged to derive a conversion between M dimensional vectors and F dimensional vectors from training audio content.
GB1401218.1A 2014-01-24 2014-01-24 Processing audio data to produce metadata Withdrawn GB2523730A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1401218.1A GB2523730A (en) 2014-01-24 2014-01-24 Processing audio data to produce metadata
PCT/GB2015/050151 WO2015110823A1 (en) 2014-01-24 2015-01-23 Processing audio data to produce metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1401218.1A GB2523730A (en) 2014-01-24 2014-01-24 Processing audio data to produce metadata

Publications (2)

Publication Number Publication Date
GB201401218D0 GB201401218D0 (en) 2014-03-12
GB2523730A true GB2523730A (en) 2015-09-09

Family

ID=50287507

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1401218.1A Withdrawn GB2523730A (en) 2014-01-24 2014-01-24 Processing audio data to produce metadata

Country Status (2)

Country Link
GB (1) GB2523730A (en)
WO (1) WO2015110823A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170140261A1 (en) * 2015-11-17 2017-05-18 Spotify Ab Systems, methods and computer products for determining an activity

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050211071A1 (en) * 2004-03-25 2005-09-29 Microsoft Corporation Automatic music mood detection
US20080022844A1 (en) * 2005-08-16 2008-01-31 Poliner Graham E Methods, systems, and media for music classification
WO2011148149A1 (en) * 2010-05-28 2011-12-01 British Broadcasting Corporation Processing audio-video data to produce metadata

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201022968A (en) * 2008-12-10 2010-06-16 Univ Nat Taiwan A multimedia searching system, a method of building the system and associate searching method thereof
US8805854B2 (en) * 2009-06-23 2014-08-12 Gracenote, Inc. Methods and apparatus for determining a mood profile associated with media data
US8686270B2 (en) * 2010-04-16 2014-04-01 Sony Corporation Apparatus and method for classifying, displaying and selecting music files

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050211071A1 (en) * 2004-03-25 2005-09-29 Microsoft Corporation Automatic music mood detection
US20080022844A1 (en) * 2005-08-16 2008-01-31 Poliner Graham E Methods, systems, and media for music classification
WO2011148149A1 (en) * 2010-05-28 2011-12-01 British Broadcasting Corporation Processing audio-video data to produce metadata

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170140261A1 (en) * 2015-11-17 2017-05-18 Spotify Ab Systems, methods and computer products for determining an activity
US11436472B2 (en) * 2015-11-17 2022-09-06 Spotify Ab Systems, methods and computer products for determining an activity

Also Published As

Publication number Publication date
WO2015110823A1 (en) 2015-07-30
GB201401218D0 (en) 2014-03-12

Similar Documents

Publication Publication Date Title
Li et al. Toward intelligent music information retrieval
Levy et al. Music information retrieval using social tags and audio
JP5537649B2 (en) Method and apparatus for data retrieval and indexing
Chechik et al. Large-scale content-based audio retrieval from text queries
Chai et al. Music thumbnailing via structural analysis
US20180341686A1 (en) System and method for data search based on top-to-bottom similarity analysis
Janssen et al. Finding repeated patterns in music: State of knowledge, challenges, perspectives
Bardeli Similarity search in animal sound databases
Farajzadeh et al. PMG-Net: Persian music genre classification using deep neural networks
Baume Evaluation of acoustic features for music emotion recognition
Torres et al. Identifying Words that are Musically Meaningful.
Liang et al. Music genre classification with the million song dataset
Monti et al. An ensemble approach of recurrent neural networks using pre-trained embeddings for playlist completion
Fried et al. AudioQuilt: 2D Arrangements of Audio Samples using Metric Learning and Kernelized Sorting.
Makris et al. The greek music dataset
Makris et al. The greek audio dataset
GB2523730A (en) Processing audio data to produce metadata
Barthet et al. Design and evaluation of semantic mood models for music recommendation
Arjannikov et al. An Association-based Approach to Genre Classification in Music.
US20200293574A1 (en) Audio Search User Interface
Cui et al. Quest: querying music databases by acoustic and textual features
Foucard et al. Exploring new features for music classification
Kanchana et al. Comparison of genre based tamil songs classification using term frequency and inverse document frequency
Weck et al. Wikimute: A web-sourced dataset of semantic descriptions for music audio
Zeng et al. Semantic highlight retrieval

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)