CN116530009A - Automatic generation and selection of target profiles for dynamic equalization of audio content - Google Patents


Info

Publication number
CN116530009A
Authority
CN
China
Prior art keywords
audio content
cluster
processor
feature vector
reference audio
Legal status
Pending
Application number
CN202180079841.4A
Other languages
Chinese (zh)
Inventor
G·琴加莱
N·L·恩格尔
P·W·斯坎内尔
D·斯卡伊尼
Current Assignee
Dolby International AB
Dolby Laboratories Licensing Corp
Original Assignee
Dolby International AB
Dolby Laboratories Licensing Corp
Application filed by Dolby International AB, Dolby Laboratories Licensing Corp filed Critical Dolby International AB
Priority claimed from PCT/US2021/059827 external-priority patent/WO2022115303A1/en
Publication of CN116530009A publication Critical patent/CN116530009A/en



Abstract

In an embodiment, a method includes: filtering reference audio content items to separate them into different frequency bands; extracting, for each frequency band, a first feature vector from at least a portion of each of the reference audio content items, wherein the first feature vector comprises at least one audio characteristic of the reference audio content item; obtaining at least one semantic tag from at least a portion of each of the reference audio content items; obtaining, for each frequency band, a second feature vector consisting of the first feature vector and the at least one semantic tag; generating cluster feature vectors representing cluster centroids based on the second feature vectors; separating the reference audio content items according to the cluster feature vectors; and calculating an average target profile for each cluster based on the reference audio content items in that cluster.

Description

Automatic generation and selection of target profiles for dynamic equalization of audio content
Cross Reference to Related Applications
The present application claims priority from Spanish patent application No. P202031189, filed November 27, 2020, and U.S. provisional application No. 63/145,017, filed February 3, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates generally to audio signal processing, and more particularly to dynamic equalization of audio content.
Background
Dynamic Equalization (DEQ) is a technique that modifies the spectral profile and dynamic range of audio content (e.g., music or speech files) by applying time-dependent and frequency-dependent gains to the audio content so that it matches the spectral profile and dynamic range of a particular reference audio content. In DEQ, the spectral profile and dynamic range of the audio content are represented by a set of quantiles per frequency band that describe the statistical distribution of energy in the audio content. This set of quantiles is commonly referred to as a "profile" of the audio content. After obtaining the profile of the reference audio content (hereinafter referred to as the "target profile") and the profile of the input audio content (hereinafter referred to as the "input profile"), the DEQ calculates gains that are applied to the input audio content such that the profile of the input audio content is modified to match that of the reference audio content.
Existing DEQ techniques obtain target profiles by manually selecting a set of reference audio content from which the target profiles may be generated. Once the target profiles are generated, the selection of the desired target profile for a given input audio content is left to the user. Selecting a particular target profile from the plurality of target profiles has a significant impact on the subjective quality of the audio content after DEQ is applied. Selecting an inappropriate target profile may result in perceived degradation of the input audio content. Relevant examples of perceptual degradation include processing classical music recordings (e.g., wide dynamic range, mellow treble balance) to match Electronic Dance Music (EDM) recordings (e.g., small dynamic range, prominent treble), or processing male speech recordings to match the Equalization (EQ) contours of female speech recordings.
When multiple reference audio content items are available, it is desirable to automatically generate multiple significantly different target profiles. Further, in the event that user input is not desired and multiple target profiles are available, it is desirable to automatically select the appropriate target profile.
Disclosure of Invention
Embodiments for automatically generating and selecting a target profile for a DEQ of audio content are disclosed.
In an embodiment, a method of automatically generating a target profile for dynamic equalization of audio content includes: obtaining a reference audio content item; filtering the reference audio content item to separate the reference audio content item into different frequency bands of a frequency spectrum of the reference audio content item; extracting, for each frequency band, a first feature vector from at least a portion of each of the reference audio content items, wherein the first feature vector comprises at least one audio characteristic of the reference audio content item; obtaining at least one semantic tag describing the reference audio content item from at least a portion of each of the reference audio content items; obtaining a second feature vector consisting of a first feature vector and at least one semantic tag for each frequency band; generating cluster feature vectors representing centroids of clusters based on the second feature vectors, wherein each reference audio content item is assigned to at least one cluster; separating the reference audio content items according to the cluster feature vectors; calculating an average target profile for each cluster based on the reference audio content items in the cluster; and storing the average target profile and the corresponding cluster feature vector for each cluster in a storage device.
In an embodiment, generating the cluster feature vector representing the centroid of the cluster includes generating the cluster feature vector using only at least one audio characteristic of the first feature vector.
In an embodiment, generating the cluster feature vector representing the centroid of the cluster includes generating the cluster feature vector using only at least one semantic tag of the second feature vector.
In an embodiment, a method of automatically generating a target profile for dynamic equalization of audio content includes: obtaining, with at least one processor, a first set of reference audio content items; filtering, with at least one processor, the first set of reference audio content items to separate the first set of reference audio content items into different frequency bands of a spectrum of the reference audio content items; extracting, with at least one processor, semantic tag feature vectors from the first set of reference audio content items, wherein the semantic tag feature vectors include semantic tags describing the first set of reference audio content items; generating, with the at least one processor and based on the semantic tag feature vectors, a first set of cluster feature vectors representing a first set of centroids of the first set of clusters; separating, with at least one processor, the first set of reference audio content items into a first set of clusters according to the first set of cluster feature vectors; for each cluster in the first set of clusters and for each frequency band: extracting, with at least one processor, an audio characteristic feature vector from the reference audio content items assigned to the cluster, wherein the audio characteristic feature vector comprises audio characteristics of a first set of reference audio content items; generating, with the at least one processor and based on the audio characteristic feature vectors, a second set of cluster feature vectors representing a second set of centroids of the second set of clusters; separating, with at least one processor, the reference audio content items into a second set of clusters according to a second set of cluster feature vectors; calculating, with the at least one processor, an average target profile for each cluster in the second set of clusters based on the reference audio content items in the clusters; and storing, with the at least one processor, the average target profile and the corresponding second set of cluster feature vectors in a storage device.
In an embodiment, a method of automatically generating a target profile for dynamic equalization of audio content includes: obtaining, with at least one processor, a reference audio content item; filtering, with the at least one processor, the reference audio content item to separate the reference audio content item into different frequency bands of a frequency spectrum of the reference audio content item; extracting, for each frequency band, an audio characteristic feature vector from the reference audio content item, wherein the audio characteristic feature vector comprises audio characteristics of the reference audio content item; generating, with the at least one processor and based on the audio characteristic feature vectors, cluster feature vectors representing centroids of clusters, wherein each reference audio content item is assigned to at least one cluster; separating, with at least one processor, the reference audio content items according to the cluster feature vectors; assigning a semantic tag to each cluster feature vector based on semantic tags associated with respective reference audio content items in the cluster; calculating, with at least one processor, an average target profile for each cluster based on the semantic tags assigned to the clusters; and storing, with the at least one processor, the average target profile for each cluster and the corresponding cluster feature vectors in a storage device.
In an embodiment, the at least one audio characteristic is average energy.
In an embodiment, the at least one audio characteristic is a dynamic range based on a difference between two percentiles in an energy distribution of the reference audio content item.
In an embodiment, the at least one audio characteristic is a spectral slope comprising a line fitting an average energy between two frequency bands.
In an embodiment, the at least one audio characteristic is spectral flux.
In an embodiment, the at least one audio characteristic is a crest factor.
In an embodiment, the at least one audio characteristic is a zero-crossing rate, which can be used to effectively distinguish, for example, music from speech and thus improve clustering.
In an embodiment, k-means clustering is used to generate cluster feature vectors.
In an embodiment, the method further comprises: obtaining, with at least one processor, a number of unique style tags represented in the reference audio content; and setting the minimum number of clusters equal to the number of styles represented in the reference audio content.
In an embodiment, the number of unique style tags is obtained from at least one of an audio content classifier, metadata of a reference audio content item, or from a human listener.
In an embodiment, the semantic tags are obtained from at least one of an audio content classifier, metadata of a reference audio content item, or from a human listener.
In an embodiment, only a portion of the audio content item (e.g., the first 30 seconds) is used to calculate the audio characteristics and/or semantic tags.
In an embodiment, a method of automatically selecting a target profile for dynamic equalization of audio content includes: obtaining, with at least one processor, an input audio content item; filtering, with at least one processor, the input audio content item to separate the input audio content item into different frequency bands of a frequency spectrum of the input audio content item; extracting, for each frequency band, a first feature vector from the input audio content item, wherein the first feature vector comprises at least one audio characteristic of the input audio content item; obtaining at least one semantic tag describing an input audio content item; obtaining a second feature vector consisting of a first feature vector and at least one semantic tag for each frequency band; calculating, with the at least one processor, a distance measure between the second feature vector and a plurality of cluster feature vectors corresponding to the plurality of target profiles, wherein the plurality of cluster feature vectors each include at least one audio characteristic of the reference audio content item and at least one semantic tag describing the reference audio content item; selecting, with the at least one processor, a particular target profile from the plurality of target profiles that corresponds to a smallest distance metric of the calculated distance metrics; and applying, with the at least one processor, dynamic equalization to the input audio content item using the particular target profile.
In an embodiment, the distance metric is a euclidean distance metric.
In an embodiment, the method further comprises: determining, with the at least one processor, that the minimum distance is greater than a threshold; rejecting the selected target profile; selecting, with the at least one processor, a default target profile; and applying dynamic equalization to the input audio content item using a default target profile with the at least one processor.
In an embodiment, the default target profile is an average target profile calculated by averaging at least one of the plurality of target profiles or the plurality of reference audio content items, or may be a target profile for another cluster.
In an embodiment, a system includes: one or more processors; and a non-transitory computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to perform the operations of any of the methods described above.
In an embodiment, a non-transitory computer-readable medium stores instructions that, when executed by one or more processors, cause the one or more processors to perform the operations of any of the methods described above.
Other embodiments disclosed herein are directed to systems, apparatuses, and computer readable media. The details of the disclosed embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Particular embodiments disclosed herein provide one or more of the following advantages. The disclosed automatic generation and selection of target profiles for DEQ of audio content improves on conventional manual selection of a target profile by the user by ensuring that the selected target profile is appropriate for the audio content and does not degrade the audio content when applied during DEQ.
Drawings
In the drawings, specific arrangements or sequences of illustrative elements, such as those representing devices, units, instruction blocks, and data elements, are shown for ease of description. However, those skilled in the art will appreciate that the specific order or arrangement of the illustrative elements in the drawings is not meant to imply that a particular order or sequence of processing, or separation of processes, is required. Furthermore, the inclusion of a schematic element in the figures is not meant to imply that this element is required in all embodiments, or that the features represented by this element may not be included in or combined with other elements in some embodiments.
Furthermore, where a connecting element, such as a solid or dashed line or an arrow, is used in the drawings to illustrate a connection, relationship, or association between or among two or more other illustrative elements, the absence of any such connecting element is not meant to imply that no connection, relationship, or association can exist. In other words, some connections, relationships, or associations between elements are not shown in the drawings so as not to obscure the present disclosure. In addition, for ease of description, a single connecting element is used to represent multiple connections, relationships, or associations between elements. For example, where a connecting element represents a communication of signals, data, or instructions, those skilled in the art will understand that such an element represents one or more signal paths that may be required to effect the communication.
Fig. 1 is a block diagram of a system for automatically generating a target profile for a DEQ of audio content according to an embodiment.
Fig. 2 illustrates clustering of feature vectors according to an embodiment.
FIG. 3 is a block diagram of a system for automatically generating a target profile using multiple clustering stages, according to an embodiment.
Fig. 4 is a block diagram of a system for automatically selecting a target profile for a DEQ of audio content, according to an embodiment.
Fig. 5 is a flowchart of a process of automatically generating a target profile for a DEQ of audio content, according to an embodiment.
Fig. 6 is a flowchart of a process of automatically selecting a target profile for a DEQ of audio content, according to an embodiment.
Fig. 7 is a block diagram of an example device architecture for implementing the features and processes described with reference to fig. 1-6, according to an embodiment.
The same reference symbols in the various drawings indicate the same elements.
Detailed Description
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the various described embodiments. It will be apparent to one of ordinary skill in the art that the various described embodiments may be practiced without these specific details. Well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. Several features are described below, each of which may be used independently of the others or in any combination with other features.
Nomenclature
As used herein, the term "comprising" and variants thereof are to be interpreted as open-ended terms, meaning "including, but not limited to." The term "or" is to be read as "and/or" unless the context clearly indicates otherwise. The term "based on" is to be read as "based at least in part on." The terms "one example embodiment" and "example embodiment" are to be interpreted as "at least one example embodiment." The term "another embodiment" is to be read as "at least one other embodiment." The terms "determine," "determining," or "determined" are to be construed as obtaining, receiving, calculating, computing, estimating, predicting, or deriving. Furthermore, in the following description and claims, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
Overview
Given a set of reference content (e.g., tens of well-recorded music or voice files), when applying DEQ to audio content it is desirable to extract a plurality of significantly different but related target profiles instead of a single target profile representing the average properties of the audio content. For example, a set of well-recorded songs may include songs from different music styles (e.g., classical, jazz, rock, EDM, hip-hop), from different eras (e.g., the '70s versus the '80s), or played with different instruments (e.g., vocals, guitar, piano, etc.). Creating a different target profile for each style, instrument, or era yields multiple target profiles that are more likely to cover each particular use case than a single target profile representing the average properties of the audio content. This is because different styles, eras, or instruments generally reflect different spectral and dynamic range characteristics.
For a given input audio content, generating target profiles and then selecting the appropriate target profile are two related processes. The features used to generate the target profiles are typically the same features used to select the appropriate target profile, where they are compared using the concept of "similarity." It is therefore important that features yielding a robust classification of music style, instrument, or era be included both during the generation of the target profiles and in the selection of the target profile for the DEQ of the audio content.
Automatically generating a target profile
Fig. 1 is a block diagram of a system 100 for automatically generating a target profile for a DEQ of audio content, according to an embodiment. The system 100 comprises a reference content database 101, a filter bank 102, a feature extraction unit 103, a clustering unit 104, an average target profile generator unit 105 and a target profile database 106.
Each item in the reference audio content database 101 is analyzed by the filter bank 102, and a set of features F(i), where f denotes a frequency band and i denotes the i-th item, is extracted by the feature extraction unit 103 from the per-band outputs item(f, i) of each item. In an embodiment, the filter bank 102 is configured to output frequency bands suitable for a particular application. For example, for speech enhancement applications, the filter bank 102 outputs only frequency bands within the speech frequency range.
The feature extraction unit 103 receives the items item(f, i) output by the filter bank 102, extracts features from each item, and concatenates the extracted features into a feature vector F(i) for the item. In an embodiment, the features may include, but are not limited to: the average energy per band E(f, i), the dynamic range per band DR(f, i), and the spectral slope SS(f, i). DR(f, i) is defined as the difference between two predefined percentiles (e.g., the difference between the 95th and 10th percentiles) of the energy distribution E(f) of item i in frequency band f, and the spectral slope SS(f, i) is defined as the slope of a line fitted to E(f, i) between two predefined frequency bands f1 and f2 (e.g., f1 = 100 Hz and f2 = 5 kHz).
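For illustration only (not part of the patent text), the three per-band features described above could be computed with NumPy roughly as follows; the array layout, the percentile choices, and the log-frequency axis for the slope fit are assumptions of this sketch:

```python
import numpy as np

def band_features(band_levels_db, band_centers_hz,
                  f1=100.0, f2=5000.0, hi_pct=95, lo_pct=10):
    """Sketch of E(f, i), DR(f, i), and SS(f, i) for one item i.

    band_levels_db: (num_frames, num_bands) per-frame band levels in dB,
    e.g., taken from the filter bank output item(f, i).
    """
    # Average energy per band: E(f, i)
    E = band_levels_db.mean(axis=0)

    # Dynamic range per band: difference between two percentiles of the
    # per-frame level distribution (e.g., 95th minus 10th)
    DR = (np.percentile(band_levels_db, hi_pct, axis=0)
          - np.percentile(band_levels_db, lo_pct, axis=0))

    # Spectral slope: slope of a line fitted to E(f) between f1 and f2
    # (fitted here on a log-frequency axis, an assumption of this sketch)
    sel = (band_centers_hz >= f1) & (band_centers_hz <= f2)
    slope, _ = np.polyfit(np.log10(band_centers_hz[sel]), E[sel], 1)

    return np.concatenate([E, DR, [slope]])  # concatenated feature vector F(i)
```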
Alternatively, or in addition to the average energy E(f, i) per band, other frequency-dependent, wideband, or time-domain indicators may be used, such as spectral flux, the peak-to-root-mean-square ratio (i.e., the "crest factor"), and the zero-crossing rate (ZCR). The ZCR is the rate at which the audio signal changes sign from positive to negative or from negative to positive; it can be used for pitch detection, for classifying percussive sounds, and for voice activity detection (VAD) to determine whether human speech is present in the audio signal. Spectral flux is a measure of how quickly the power spectrum of an audio signal changes, calculated by comparing the power spectrum of one frame with that of the previous frame. The peak-to-root-mean-square ratio (crest factor) indicates the relative balance of the peak and average energy of the audio signal over a short period of time.
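A minimal sketch of these three indicators under the definitions given above (frame layout and normalization are assumptions of the example):

```python
import numpy as np

def spectral_flux(power_spec):
    """Average frame-to-frame change of the power spectrum.
    power_spec: (num_frames, num_bins)."""
    return float(np.mean(np.sum(np.diff(power_spec, axis=0) ** 2, axis=1)))

def crest_factor(x):
    """Peak-to-RMS ratio of a short time-domain excerpt x."""
    rms = np.sqrt(np.mean(x ** 2))
    return float(np.max(np.abs(x)) / max(rms, 1e-12))

def zero_crossing_rate(x):
    """Fraction of consecutive samples whose sign changes."""
    return float(np.mean(np.abs(np.diff(np.signbit(x).astype(int)))))
```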
The clustering unit 104 applies a clustering algorithm (e.g., k-means clustering) to the feature vectors F(i) to generate k feature vectors FC(k) representing the centroids of different clusters of reference audio content items, as shown in fig. 2. The reference audio content items are separated according to the clusters to which they are assigned in the clustering process, and the content belonging to each cluster k is used by the average target profile generation unit 105 to calculate the average target profile TP(k) for that cluster. TP(k) is stored in the target profile database 106 together with its corresponding feature vector FC(k) for the automatic target profile selection process described with reference to fig. 4.
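As a sketch of how the clustering unit 104 and the average target profile generator 105 might be realized with scikit-learn (the names FC and TP follow the description above; everything else, including the k-means settings, is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_target_profiles(F, profiles, k):
    """F: (num_items, dim) feature vectors F(i); profiles: (num_items, ...)
    per-item spectral profiles; k: number of clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(F)
    FC = km.cluster_centers_  # cluster feature vectors FC(k)
    # Average target profile TP(k): mean profile of the items in cluster k
    TP = np.stack([profiles[km.labels_ == c].mean(axis=0) for c in range(k)])
    return FC, TP
```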
In an embodiment, the target profile is a spectral profile of a reference song or track, or of a collection of reference songs/tracks. A target profile may also be built for a vocal or instrument track, for example, by using a collection of male singer recordings, bass recordings, etc. The term "song" or "track" is used generically to refer to each piece in the collection. If the target profile is generated from more than one song/track, the songs/tracks are normalized to the same loudness before the profile is calculated. In an embodiment, loudness is calculated according to European Broadcasting Union (EBU) Recommendation R128. After normalization, statistics are built by analyzing the aggregate frames of all songs/tracks (as if all songs/tracks had been concatenated into a single piece).
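A possible loudness-normalization step, assuming the third-party pyloudnorm package (which implements the ITU-R BS.1770 measurement underlying EBU R128); the -23 LUFS target is an illustrative choice, not one specified by the description:

```python
import pyloudnorm as pyln  # third-party package, assumed available

def normalize_loudness(audio, sample_rate, target_lufs=-23.0):
    meter = pyln.Meter(sample_rate)              # BS.1770 loudness meter
    loudness = meter.integrated_loudness(audio)  # measured integrated LUFS
    return pyln.normalize.loudness(audio, loudness, target_lufs)
```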
Multiple target profiles for the DEQ process may be generated and stored, such as target profiles corresponding to different music styles, instrument tracks (e.g., vocals, bass, drums, etc.), or movie material (e.g., dialog, effects, music, etc.). In some applications, multiple target profiles may be provided within the same group to allow the user to select and change the resulting output effect. For example, different vocal target profiles may be provided that represent different vocal mixing styles or techniques used by content creators.
In an embodiment, a system for generating a target profile includes a frame generator, a window function, a filter bank, a level detector, and a quantile generator. The spectral profile of an input audio signal is the statistical distribution, computed across audio frames, of the level in each of its frequency bands. The frame generator divides the input audio signal s(t) into frames of n_frame samples (e.g., 4096 samples), with an overlap of n_overlap samples (e.g., 2048 samples) between consecutive frames, where the input audio signal at frame n is referred to as s(n). A window function (e.g., a fade-in/fade-out window) is applied to each frame n to ensure smooth interpolation between successive frames. In an embodiment, a Hanning window is used. The filter bank divides the windowed signal s(t) into Nb bands (e.g., 83 bands or sub-bands), where the signal in band f at the nth frame is referred to as s(n, f). The level detector calculates the level L_in(n, f) of the input audio signal in each frequency band f at each frame n, where E(n, f) is the energy of the input audio signal in frequency band f at a given frame n. The level is the energy converted into dB:
[1] L_in(n, f) = 10·log10(E(n, f)).
In an embodiment, when the energy of a new frame is calculated in each band, the results may be smoothed over time using a first-order low-pass filter, for example as described by the following formula:
[2] E_smooth(n, f) = E_smooth(n-1, f)·α + E(n, f)·(1-α),
where the coefficient α is selected from attack or release coefficients derived from different attack/release time constants, depending on whether the energy at the current frame is greater or less than the smoothed value at the previous frame, respectively.
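A direct sketch of Eq. [2] with per-band attack/release selection (the coefficient values here are assumptions, not values from the description):

```python
import numpy as np

def smooth_band_energy(E, alpha_attack=0.9, alpha_release=0.99):
    """First-order low-pass smoothing per Eq. [2]; E: (num_frames, num_bands)."""
    Es = np.empty_like(E)
    Es[0] = E[0]
    for n in range(1, len(E)):
        # attack coefficient when the energy rises above the smoothed value,
        # release coefficient when it falls below it
        alpha = np.where(E[n] > Es[n - 1], alpha_attack, alpha_release)
        Es[n] = Es[n - 1] * alpha + E[n] * (1.0 - alpha)
    return Es
```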
The quantile generator generates the quantile curves corresponding to each spectral profile. For example, in each frequency band f, the xth quantile q_x(f) of the level distribution is calculated as the value below which x% of the levels across frames in that band fall. If the signal is multi-channel, the level in each band at a given frame n can be calculated, for example, from the root-mean-square (RMS) average of the energy across the channels:
[3] E(n, f) = sqrt((1/C)·Σ_c E_c(n, f)²), where C is the number of channels and E_c(n, f) is the energy of channel c.
other options, such as taking the maximum value across channels, would typically lead to similar results, but may be preferred in certain fields (e.g., application to 5.1 tracks).
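Putting the frame generator, window, filter bank, level detector, and quantile generator together, a mono-signal sketch could look as follows; the STFT-based band grouping and the quantile set are assumptions of the example (the description's 83-band filter bank is not reproduced here):

```python
import numpy as np
from scipy.signal import stft

def spectral_profile(x, fs, n_frame=4096, n_overlap=2048,
                     band_edges=None, quantiles=(10, 50, 95)):
    """Quantile curves q_x(f) of the per-band level distribution."""
    f, _, X = stft(x, fs=fs, window='hann', nperseg=n_frame, noverlap=n_overlap)
    power = np.abs(X) ** 2                         # energy per frame and bin
    if band_edges is None:
        band_edges = np.geomspace(50.0, fs / 2, num=21)  # illustrative bands
    levels = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sel = (f >= lo) & (f < hi)
        E = power[sel].sum(axis=0) + 1e-12         # E(n, f) per band
        levels.append(10.0 * np.log10(E))          # Eq. [1]: L_in(n, f)
    levels = np.array(levels)                      # (num_bands, num_frames)
    # q_x(f): value below which x% of the per-frame levels in band f fall
    return np.percentile(levels, quantiles, axis=1)
```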
Fig. 2 shows clustering of feature vectors F(i) according to an embodiment. In the illustrated example, a k-means clustering algorithm is applied to the feature vectors F(i), where k = 5. The output of the k-means clustering algorithm is 5 clusters 201-205 of different sizes. Each of the clusters 201-205 has a centroid 201-1, 202-1, 203-1, 204-1, and 205-1, respectively, indicated by a solid triangle. In this example, the average energy E(f, i) per band, the dynamic range DR(f, i) per band, and the spectral slope SS(f, i) form part of the feature vector F(i). The k-means clustering algorithm generates k initial means (centroids), assigns each feature vector F(i) to the cluster with the nearest mean (centroid) based on a distance calculation (e.g., Euclidean distance), and then updates or recalculates the means (centroids) based on the feature vectors assigned to each cluster. The assignment of feature vectors to clusters and the centroid updates are repeated until convergence is reached (e.g., the assignments no longer change).
Although k-means clustering is described in the above examples, any suitable clustering algorithm may be used, including, but not limited to: k-medoids, fuzzy C-means, Gaussian mixture models with expectation-maximization training, k-means++, the Hartigan-Wong method for k-means, K-SVD, and Principal Component Analysis (PCA). The initial means (centroids) may be generated using any suitable method, including, but not limited to: random partitioning, Forgy, Maximin, and Bradley and Fayyad.
In addition to audio characteristics, semantic tag features may be clustered, including but not limited to: tags generated by an audio content classifier, tags retrieved from metadata, or tags provided by a human listener through, for example, a user interface. Some examples of semantic tag features include, but are not limited to, information about the style of a song (e.g., rock, jazz, etc.), a list of the instruments present in a song (e.g., vocals, drums, guitar, etc.), and information about the recording era of each song (e.g., the '70s, the '80s, etc.). These semantic tag features provide meaningful semantic information about the reference audio content that can help the system 400 make an appropriate selection when automatically selecting target profiles for the DEQ, as described with reference to fig. 4. For example, if a song has a prominent low-frequency peak and the song is identified as hip-hop, the peak is more likely to correspond to the aesthetics of the hip-hop style than the same prominent peak in a classical piece, where it would instead be undesirable and would therefore have to be corrected.
In an embodiment, semantic tag features are used alone for clustering. For example, each song in the reference audio content is tagged by genre, and songs sharing the same tag are clustered together. If a song is associated with more than one tag (e.g., pop and rock), the song may be used in two different clusters. In addition, instrument tags may be used to further subdivide each cluster into additional clusters, for example, dividing the jazz cluster into a cluster of jazz songs with vocals and a cluster of jazz songs without vocals.
The process of "clustering" based on the semantic tags described above involves grouping content that shares the same style tags. For example, all songs with the tag "rock" are averaged to calculate a target profile called "rock." Then, if an input song has the tag "rock," the "rock" target profile is selected as the target profile for the DEQ. Semantic tags do not have to be provided by the user or embedded in the metadata; it is sufficient to run a tagger (e.g., an audio style classifier) on the input audio prior to processing, using the same tagging process used to tag the reference content.
The number k of clusters may be specified manually or obtained automatically. In an embodiment, k is determined by the counter 107, which counts how many unique style tags are present in the reference audio content database. For example, if the classifier assigns the tags "jazz," "classical," and "rock" in a set of ten songs, k is set to 3, the expectation being that a reasonable choice of features will cause the clustering algorithm to place songs of the same genre into the same cluster. The unique style tags may be obtained from at least one of an audio content classifier, metadata of the reference audio content items, or a human listener.
FIG. 3 is a block diagram of a system for automatically generating a target profile using multiple clustering stages, according to an embodiment. In an embodiment, a first clustering stage 104-1 is performed based on a first type of feature (e.g., audio characteristic features FL(i) or semantic features FH(i), also referred to as low-level and high-level features, respectively) output by the feature extraction unit 103. Within each generated cluster FCL1(k), a second clustering stage 104-2 is performed based on a second type of feature (e.g., semantic tag features or audio characteristic features). The output FCL2(k) of the second clustering stage is input into the average target profile generator 105, which generates an average target profile and stores it in the target profile database 106. The two clustering stages 104-1, 104-2 are applied sequentially. This allows one type of feature to dominate the first clustering stage 104-1, followed by sub-clustering of the clusters output by stage 104-1: for example, first gathering all rock songs (semantic tag features dominate), then using clustering stage 104-2 to create k clusters among the previously clustered rock songs (audio characteristic features dominate).
In another embodiment, clustering is performed on audio characteristic features and semantic tag features together, where the semantic tag features are appended to the audio feature vectors to create extended feature vectors and a distance between semantic tag features is defined (e.g., distance = 0 if the tags are the same, distance = c if they are not). These extended feature vectors are then fed to the clustering algorithm. The value of c determines the "purity" of each cluster. For example, for very small c, the clusters will ignore the tags (i.e., audio-characteristic-only clusters), while for very large c, the clusters will group by style and ignore the audio characteristic features, as illustrated in the sketch below.
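One way to realize the extended feature vectors is to append a one-hot tag encoding scaled so that two items with different tags incur exactly the extra distance c; the scaling trick is the only non-obvious step, and this sketch is an assumption rather than the patent's construction:

```python
import numpy as np

def extend_with_tags(F, tags, c):
    """Append scaled one-hot tag encodings to audio feature vectors F so
    that the tag contribution to Euclidean distance is 0 for identical
    tags and c for different tags."""
    vocab = sorted(set(tags))
    onehot = np.zeros((len(tags), len(vocab)))
    for i, t in enumerate(tags):
        # two distinct one-hot rows scaled by s are s*sqrt(2) apart,
        # so s = c / sqrt(2) yields a tag distance of exactly c
        onehot[i, vocab.index(t)] = c / np.sqrt(2.0)
    return np.hstack([F, onehot])
```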
In another embodiment, clustering is performed based on the audio characteristic features (e.g., E(f, i), DR(f, i), SS(f, i)), and then a semantic tag (e.g., rock, jazz, hip-hop) is assigned to each cluster based on a majority vote across the individual songs in the cluster. For example, if clustering returns a cluster of 10 songs, 7 of which are labeled "rock" and 3 of which are labeled "jazz," the label "rock" is assigned to the cluster. The underlying assumption in this embodiment is that labels may be needed for the clusters, but the audio characteristics are more trustworthy when creating the clusters.
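The majority vote itself is a one-liner (sketch):

```python
from collections import Counter

def cluster_label(item_tags):
    """E.g., ['rock'] * 7 + ['jazz'] * 3 -> 'rock'."""
    return Counter(item_tags).most_common(1)[0][0]
```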
Automatic selection of a target profile
FIG. 4 is a block diagram of a system 400 for automatically selecting a target profile for a DEQ, according to an embodiment. Once multiple automatically generated target profiles are available, a basic embodiment of automatic selection proceeds as follows.
The input content 401 (e.g., an audio file) is processed by the filter bank 402 and the feature extraction unit 403 to obtain a feature vector FI for the input audio content item. In feature space, a distance metric d(k) (e.g., Euclidean distance) is calculated between the feature vector FI and each feature vector FC(k) stored with its corresponding target profile TP(k) in the database 106, as described with reference to fig. 1.
The target profile STP corresponding to the minimum distance d(k) is selected as the target profile for the DEQ. This is equivalent to selecting the target profile with the highest similarity to the input profile. In an embodiment, if the minimum distance d(k) is greater than a threshold D, the selection is rejected and a default target profile is used to process the input audio content. Such a default target profile may be obtained by averaging at least one of the plurality of target profiles or the plurality of reference audio content items, as a compromise for out-of-distribution input audio content (i.e., audio content not represented in the reference audio content); alternatively, it may be the target profile of another cluster.
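A sketch of the selection logic including the threshold fallback (the names FI, FC, and TP follow the description; the threshold value itself is application-dependent and assumed here):

```python
import numpy as np

def select_target_profile(FI, FC, TP, default_tp, threshold):
    """Return the target profile whose cluster feature vector is nearest
    to the input feature vector FI, or a default profile when even the
    nearest cluster is farther than `threshold`."""
    d = np.linalg.norm(FC - FI, axis=1)  # Euclidean distances d(k)
    k = int(np.argmin(d))
    return default_tp if d[k] > threshold else TP[k]
```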
The above-described target profile generation and selection techniques work well whenever the characteristics of the input audio content are close to the audio characteristics of a target profile, so that only minor adjustments to EQ and dynamic range are required. However, these techniques have limitations when applied to input audio content that requires large EQ adjustments.
For example, consider a recording of hip-hop music in which, due to some defect in the recording process, the low frequencies are too weak and the high frequencies too prominent. Assume also that two target profiles are available, obtained from good recordings of hip-hop music and of acoustic music, respectively. The similarity-based technique described above would assign the "acoustic" target profile to the input audio content, thereby preventing the DEQ from restoring the desired spectral balance of the input audio content. In this example, tagging the input audio content and the target profiles as "hip-hop" or "acoustic" would result in a more appropriate selection of the target profile for the DEQ. Thus, in an embodiment, only tags/labels are used to assign input content to target clusters. These tags/labels may be any combination of semantic tags/labels, such as style, tonality, musical instruments, etc. In other embodiments, the target profile is selected by combining the audio characteristic cluster selection with the semantic tag cluster selection. If the two selections yield different clusters, the input audio content is processed with a target profile representing an average or weighted combination of the audio characteristic target profile and the semantic target profile.
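Under the combined-selection embodiment, the processing profile could simply be a weighted mix of the two candidates (a sketch; the weight w is a tuning choice, not a value from the description):

```python
def combine_profiles(tp_audio, tp_semantic, w=0.5):
    """Weighted combination of the audio-characteristic and semantic
    target profiles when the two selections disagree."""
    return w * tp_audio + (1.0 - w) * tp_semantic
```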
In an embodiment, a partial subset of the features, which may comprise a combination of low-level and high-level features, is used to calculate the distance between the input audio content and the target clusters. This embodiment, which uses different feature sets for the generation and selection stages of the target profile, is motivated by the fact that each stage maximizes a different objective. For example, in the generation stage, as much information as possible should be used to ensure maximum separation between clusters and assignment of similar content to the same cluster. In the selection stage, on the other hand, it may be necessary to prioritize certain features considered more relevant to the application at hand (e.g., more relevant to voice-only applications than to music).
In an embodiment, a portion of the input audio content (e.g., the first 30 seconds) is used to calculate the feature vector instead of the entire input audio content for efficiency reasons.
Example procedure
FIG. 5 is a flowchart of a process 500 for automatically generating a target profile for a DEQ, according to an embodiment. Process 500 may be implemented using, for example, device architecture 700 described with reference to fig. 7.
The process 500 includes the steps of: extracting feature vectors from reference audio content items (501), clustering the feature vectors (502), separating the reference content items according to their cluster assignments (503), calculating an average target profile for each cluster based on the reference audio content assigned to that cluster (504), and storing the average target profiles for use in automatically selecting a target profile for input audio content, as described with reference to fig. 6.
FIG. 6 is a flowchart of a process 600 for automatically selecting a target profile for a DEQ, according to an embodiment. Process 600 may be implemented using, for example, device architecture 700 as described with reference to fig. 7.
The process 600 includes the steps of: extracting feature vectors from the input audio content (601), calculating distances between the feature vectors and the cluster feature vectors associated with the target profiles (602), selecting the average target profile corresponding to the minimum distance (603), and applying the selected target profile to the input audio content during DEQ.
Example system architecture
Fig. 7 illustrates a block diagram of an example system 700 suitable for implementing the example embodiments described with reference to fig. 1-6. The system 700 includes a Central Processing Unit (CPU) 701 capable of executing various processes according to a program stored in, for example, a Read Only Memory (ROM) 702 or a program loaded from, for example, a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, data required when the CPU 701 executes various processes is also stored as needed. The CPU 701, ROM 702, and RAM 703 are connected to each other via a bus 704, and an input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input unit 706, which may include a keyboard, a mouse, etc.; an output unit 707, which may include a display, such as a Liquid Crystal Display (LCD), and one or more speakers; a storage unit 708 comprising a hard disk or other suitable storage device; and a communication unit 709 including a network interface card (e.g., wired or wireless).
In some embodiments, the input unit 706 includes one or more microphones in different locations (depending on the host device) to enable capturing of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats).
In some embodiments, the output unit 707 includes systems with various numbers of speakers. The output unit 707 may reproduce audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats).
The communication unit 709 is configured to communicate with other devices (e.g., via a network). The drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, a flash memory drive, or other suitable removable medium, is installed on the drive 710 so that a computer program read therefrom is installed in the storage unit 708 as needed. Those skilled in the art will appreciate that while system 700 is described as including the components described above, some of these components may be added, removed, and/or replaced in practical applications and all such modifications or alterations are within the scope of the present disclosure.
According to example embodiments of the disclosure, the above-described processes may be implemented as a computer software program or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing a method. In such an embodiment, the computer program may be downloaded and installed from a network via the communication unit 709, and/or installed from a removable medium 711, as shown in fig. 7.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits (e.g., control circuits), software, logic or any combination thereof. For example, the elements discussed above may be performed by control circuitry (e.g., a CPU in combination with other components of fig. 7), and thus, the control circuitry may perform the actions described in this disclosure. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device (e.g., control circuitry). While various aspects of the example embodiments of the disclosure are illustrated and described in block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Additionally, the various blocks shown in the flowcharts may be considered method steps, and/or as operations resulting from the operation of computer program code, and/or as a plurality of coupled logic circuit elements configured to perform the associated functions. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code configured to perform the method as described above.
In the context of this disclosure, a machine-readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may be non-transitory and may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These computer program code may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus with control circuitry, such that the program code, when executed by the processor of the computer or other programmable data processing apparatus, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed over one or more remote computers and/or servers.
Although this document contains many specific embodiment details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination. The logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided or steps may be removed from the described flows, and other components may be added to or removed from the described systems. Accordingly, other embodiments are within the scope of the following claims.

Claims (20)

1. A method of automatically generating a target profile for dynamic equalization of audio content, the method comprising:
obtaining, with at least one processor, a reference audio content item;
filtering, with the at least one processor, the reference audio content item to separate the reference audio content item into different frequency bands of a frequency spectrum of the reference audio content item;
for each of the frequency bands,
extracting, with the at least one processor, a first feature vector from at least a portion of each of the reference audio content items, wherein the first feature vector includes at least one audio characteristic of the reference audio content item;
obtaining, with the at least one processor, at least one semantic tag describing the reference audio content item from at least a portion of each of the reference audio content items;
obtaining, with the at least one processor, a second feature vector consisting of the first feature vector and the at least one semantic tag for each frequency band;
generating, with the at least one processor and based on the second feature vector, a cluster feature vector representing a centroid of a cluster, wherein each reference audio content item is assigned to at least one cluster;
separating, with the at least one processor, the reference audio content items according to the cluster feature vector;
calculating, with the at least one processor, an average target profile for each cluster based on the reference audio content items in the cluster; and
storing, with the at least one processor, the average target profile and corresponding cluster feature vector for each cluster in a storage device.
2. The method of claim 1, wherein generating a cluster feature vector representing a centroid of a cluster comprises generating a cluster feature vector using only the at least one audio characteristic of the first feature vector.
3. The method of claim 1, wherein generating a cluster feature vector representing a centroid of a cluster comprises generating a cluster feature vector using only the at least one semantic tag of the second feature vector.
4. A method of automatically generating a target profile for dynamic equalization of audio content, the method comprising:
obtaining, with at least one processor, a first set of reference audio content items;
filtering, with the at least one processor, the first set of reference audio content items to separate the first set of reference audio content items into different frequency bands of a frequency spectrum of the reference audio content items;
extracting, with the at least one processor, a semantic tag feature vector from at least a portion of each of the first set of reference audio content items, wherein the semantic tag feature vector comprises a semantic tag describing the reference audio content item;
generating, with the at least one processor and based on the semantic tag feature vectors, a first set of cluster feature vectors representing a first set of centroids of the first set of clusters;
separating, with the at least one processor, the first set of reference audio content items into a first set of clusters according to the first set of cluster feature vectors;
for each cluster in the first set of clusters:
for each frequency band:
extracting, with the at least one processor, an audio characteristic feature vector from the reference audio content items assigned to the cluster, wherein the audio characteristic feature vector comprises audio characteristics of the first set of reference audio content items;
generating, with the at least one processor and based on the audio characteristic feature vectors, a second set of cluster feature vectors representing a second set of centroids of a second set of clusters;
separating, with the at least one processor, the reference audio content items into a second set of clusters according to the second set of cluster feature vectors;
calculating, with the at least one processor, an average target profile for each cluster in the second set of clusters based on the reference audio content items in the clusters; and
storing, with the at least one processor, the average target profile and the corresponding second set of cluster feature vectors in a storage device.
5. A method of automatically generating a target profile for dynamic equalization of audio content, the method comprising:
obtaining, with at least one processor, a reference audio content item;
filtering, with the at least one processor, the reference audio content item to separate the reference audio content item into different frequency bands of a frequency spectrum of the reference audio content item;
for each of the frequency bands,
extracting, with the at least one processor, an audio characteristic feature vector from at least a portion of each of the reference audio content items, wherein the audio characteristic feature vector comprises audio characteristics of the reference audio content items;
generating, with the at least one processor and based on the audio characteristic feature vector, a cluster feature vector representing a centroid of a cluster, wherein each reference audio content item is assigned to at least one cluster;
separating, with the at least one processor, the reference audio content items according to the cluster feature vector;
assigning a semantic tag to each cluster feature vector based on semantic tags associated with respective reference audio content items in the cluster;
calculating, with the at least one processor, an average target profile for each cluster based on the semantic tags assigned to the clusters; and
storing, with the at least one processor, the average target profile and corresponding cluster feature vector for each cluster in a storage device.
6. The method of any of the preceding claims 1-5, wherein the at least one audio characteristic is average energy.
7. The method of any of the preceding claims 1-5, wherein the at least one audio characteristic is a dynamic range based on a difference between two percentiles in an energy distribution of the reference audio content item.
8. The method of any of the preceding claims 1-5, wherein the at least one audio characteristic is a spectral slope comprising a line fitting average energy between two frequency bands.
9. The method of any of the preceding claims 1-5, wherein the at least one audio characteristic is spectral flux or a crest factor.
10. The method of any of the preceding claims 1-5, wherein the at least one audio characteristic is a zero-crossing rate.
11. The method of any of the preceding claims 1-10, wherein cluster feature vectors are generated using k-means clustering.
12. The method of any of the preceding claims 1-11, further comprising:
obtaining, with the at least one processor, a number of unique style tags represented in the reference audio content; and
setting the minimum number of clusters equal to the number of styles represented in the reference audio content.
13. The method of claim 12, wherein the number of unique style tags is obtained from at least one of an audio content classifier, metadata of the reference audio content item, or from a human listener.
14. The method of any of the preceding claims 1-13, wherein the semantic tags are obtained from at least one of an audio content classifier, metadata of the reference audio content item, or a human listener.
15. A method of automatically selecting a target profile for dynamic equalization of audio content, the method comprising:
obtaining, with at least one processor, an input audio content item;
filtering, with the at least one processor, the input audio content item to separate the input audio content item into different frequency bands of a frequency spectrum of the input audio content item;
for each of the frequency bands,
extracting, with the at least one processor, a first feature vector from at least a portion of the input audio content item, wherein the first feature vector comprises at least one audio characteristic of the input audio content item;
obtaining at least one semantic tag describing the input audio content item from at least a portion of the input audio content item;
obtaining, with the at least one processor, a second feature vector consisting of the first feature vector and the at least one semantic tag for each frequency band;
calculating, with the at least one processor, a distance measure between the second feature vector and a plurality of cluster feature vectors corresponding to a plurality of target profiles, wherein the plurality of cluster feature vectors are each associated with a cluster of reference audio content items and include at least one audio characteristic of the reference audio content items and at least one semantic tag describing the reference audio content items;
selecting, with the at least one processor, a particular target profile from the plurality of target profiles that corresponds to a smallest distance metric of the calculated distance metrics; and
applying, with the at least one processor, dynamic equalization to the input audio content item using the particular target profile.
16. The method of claim 15, wherein the distance metric is a euclidean distance metric.
17. The method of claim 15 or 16, further comprising:
determining, with the at least one processor, that the minimum distance is greater than a threshold;
rejecting the selected target profile;
selecting, with the at least one processor, a default target profile; and
applying, with the at least one processor, dynamic equalization to the input audio content item using the default target profile.
18. The method according to any of the preceding claims 15-17, wherein the default target profile is an average target profile calculated by averaging at least one of the plurality of target profiles or the plurality of reference audio content items, or a target profile for another cluster.
19. A system for processing audio, comprising:
one or more processors; and
a non-transitory computer-readable medium storing instructions, wherein the instructions, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with any one of claims 1-18.
20. A non-transitory computer-readable medium storing instructions, wherein the instructions, when executed by one or more processors, cause the one or more processors to perform the operations of any of claims 1-18.
CN202180079841.4A 2020-11-27 2021-11-18 Automatic generation and selection of target profiles for dynamic equalization of audio content Pending CN116530009A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
ESP202031189 2020-11-27
US202163145017P 2021-02-03 2021-02-03
US63/145,017 2021-02-03
PCT/US2021/059827 WO2022115303A1 (en) 2020-11-27 2021-11-18 Automatic generation and selection of target profiles for dynamic equalization of audio content

Publications (1)

Publication Number Publication Date
CN116530009A (en) 2023-08-01

Family

ID=87398058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180079841.4A Pending CN116530009A (en) 2020-11-27 2021-11-18 Automatic generation and selection of target profiles for dynamic equalization of audio content

Country Status (1)

Country Link
CN (1) CN116530009A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination