CN105469807A - Multi-fundamental frequency extraction method and multi-fundamental frequency extraction device - Google Patents


Info

Publication number: CN105469807A
Authority: CN (China)
Legal status: Granted
Application number: CN201511023725.3A
Other languages: Chinese (zh)
Other versions: CN105469807B (en)
Inventor
刘文举
江巍
王天正
李�杰
梁基重
李艳鹏
乔利玮
刘元华
Current Assignee: Shanxi Zhenzhong Electric Power Co ltd; Institute of Automation of Chinese Academy of Science; Electric Power Research Institute of State Grid Shanxi Electric Power Co Ltd
Original Assignee: Shanxi Zhenzhong Electric Power Co ltd; Institute of Automation of Chinese Academy of Science; Electric Power Research Institute of State Grid Shanxi Electric Power Co Ltd
Application filed by Shanxi Zhenzhong Electric Power Co ltd, Institute of Automation of Chinese Academy of Science, and Electric Power Research Institute of State Grid Shanxi Electric Power Co Ltd
Priority to CN201511023725.3A
Publication of CN105469807A
Application granted
Publication of CN105469807B
Current legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques where the extracted parameters are spectral information of each sub-band
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142: Hidden Markov Models [HMMs]


Abstract

The invention discloses a multi-fundamental-frequency extraction method and device based on empirical mode decomposition and a hidden Markov model. The method comprises the following steps: a speech signal is filtered with an auditory filterbank and the filtered signal is divided into frames; an autocorrelation function is computed for each time-frequency unit of the auditory spectrum; the instantaneous frequency of the dominant sound source in each time-frequency unit is computed from the intrinsic mode functions obtained by empirical mode decomposition; a frequency matching function is computed from each instantaneous frequency; the frequency matching function is used to build the likelihood probability of each fundamental-frequency state, and a corpus is used to estimate the transition probabilities between the fundamental-frequency states and fundamental-frequency values; finally, the likelihood probability of each fundamental-frequency state is enhanced, the enhanced likelihood probabilities are combined with the corresponding transition probabilities, and a hidden Markov model is used to extract the multiple fundamental-frequency contours of the speech signal.

Description

Multi-fundamental-frequency extraction method and device
Technical field
The present invention relates to empirical mode decomposition in digital signal processing, filter-bank analysis of speech signals, fundamental-frequency extraction from speech, and the construction of likelihood probabilities and transition probabilities for hidden Markov models.
Background technology
The extraction of the fundamental frequency (pitch) and the tracking of its contour are significant for many speech and audio signal processing technologies, such as audio retrieval and classification, Chinese tone recognition, and single-channel speech separation. Several algorithms now perform well at extracting a single fundamental frequency from clean speech or speech containing a small amount of noise. However, the single-pitch assumption makes such algorithms unusable when multiple fundamental frequencies are present in the speech simultaneously, for example when two speakers talk at the same time or when music plays in the speaker's background. In computational auditory scene analysis (CASA), which aims to solve the cocktail-party problem, multi-fundamental-frequency extraction and tracking is an important foundation for effective speech segmentation and organization.
Hidden Markov models (HMMs) were applied very early to exploit the continuity of pitch contours. In the 1980s, an HMM was already used to decide the number of fundamental frequencies in each frame of mixed speech. In 2003, the distance between the true pitch period and the peaks of the autocorrelation function was used to model the likelihood function of the fundamental-frequency state, and a multi-pitch tracking algorithm based on a hidden Markov model appeared. In 2013, the heights of the autocorrelation peaks were used directly to build a potential function over the fundamental-frequency distribution, further improving the performance of multi-fundamental-frequency extraction. These algorithms share a common feature: the mid-level representations they use during multi-pitch tracking are all based on the auditory spectrogram (cochleagram). Specifically, the likelihood probability functions of the fundamental-frequency states are all extracted from the autocorrelation spectrogram (correlogram), and they mainly use local features near its peaks.
The principal feature of the auditory spectrogram is that its frequency resolution is high at low frequencies and low at high frequencies. A high-frequency channel of the auditory filterbank usually responds to several harmonics of the speech simultaneously, so the responses of the high-frequency channels are typically affected by amplitude modulation (AM). In single-pitch extraction, both the peaks of the amplitude envelope produced by AM in the high-frequency channels and the envelope itself provide information about the original fundamental frequency. In multi-fundamental-frequency extraction, however, higher harmonics belonging to different fundamentals but with similar energies may coexist in a single time-frequency unit, so that the resulting amplitude modulation rate belongs to no fundamental frequency at all. This introduces errors into the peak heights and peak positions of the corresponding autocorrelation function and thus has a negative impact on multi-fundamental-frequency extraction.
Summary of the invention
In view of this, and to overcome the recurrent period-doubling errors and the other problems mentioned above in the fundamental-frequency extraction process, the present invention proposes a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model.
According to one aspect of the present invention, a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model is provided, characterized in that it comprises the following steps:
Step 1: filter the speech signal with an auditory filterbank and divide the filtered signal into frames, obtaining a two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: perform empirical mode decomposition on the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
Step 4: compute a frequency matching function from each instantaneous frequency;
Step 5: build the likelihood probability of each fundamental-frequency state from the frequency matching function, and use a corpus to estimate the transition probabilities between the fundamental-frequency states and fundamental-frequency values; the fundamental-frequency states include a single-fundamental-frequency state and a dual-fundamental-frequency state;
Step 6: enhance the likelihood probability of each fundamental-frequency state, combine the enhanced likelihood probabilities with the corresponding transition probabilities, and use a hidden Markov model to extract the multiple fundamental-frequency contours of the speech signal.
According to a further aspect of the present invention, a multi-fundamental-frequency extraction device based on empirical mode decomposition and a hidden Markov model is provided, characterized in that it comprises:
a preprocessing module, which filters the speech signal with an auditory filterbank and divides the filtered signal into frames, obtaining a two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
an autocorrelation function computing module, which computes an autocorrelation function in each time-frequency unit of the auditory spectrum;
an instantaneous frequency computing module, which performs empirical mode decomposition on the autocorrelation function of each time-frequency unit and computes the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
a frequency matching function computing module, which computes a frequency matching function from each instantaneous frequency;
a likelihood probability and transition probability computing module, which builds the likelihood probability of each fundamental-frequency state from the frequency matching function and uses a corpus to estimate the transition probabilities between the fundamental-frequency states and fundamental-frequency values; the fundamental-frequency states include a single-fundamental-frequency state and a dual-fundamental-frequency state;
a trajectory extraction module, which enhances the likelihood probability of each fundamental-frequency state, combines the enhanced likelihood probabilities with the corresponding transition probabilities, and uses a hidden Markov model to extract the multiple fundamental-frequency contours of the speech signal.
In the scheme proposed by the present invention, to suppress the unfavourable amplitude modulation effect that occurs in the high-frequency channels of the gammatone filterbank during multi-fundamental-frequency extraction, the frequency matching function replaces the autocorrelation function when computing the pitch-state likelihood probabilities in the hidden Markov model. Moreover, compared with the positions and heights of autocorrelation peaks, the average instantaneous frequency of a time-frequency unit is less affected by noise and by the amplitude modulation effect, so the frequency matching function that the present invention extracts from the average instantaneous frequency behaves more reliably during multi-fundamental-frequency extraction, ultimately improving the performance of the multi-fundamental-frequency extraction algorithm.
In addition, the period-doubling error is a mistake that frequently occurs during fundamental-frequency extraction. To address it, the proposed method applies an enhancement procedure that reduces the peak height of the frequency matching function at period-doubled positions, lowering the likelihood probability of period-doubled candidate points and thus reducing the probability that period-doubling errors occur.
In summary, by suppressing the unfavourable amplitude modulation effect and the probability of period-doubling errors, and combining this with fundamental-frequency state transition probabilities estimated from a corpus, the present invention obtains the contours of two simultaneous fundamental frequencies by hidden Markov model decoding.
Brief description of the drawings
Further features and advantages of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model proposed by the present invention;
Fig. 2 is a flowchart of computing the autocorrelation function of each time-frequency unit in the prior art;
Fig. 3 is a flowchart of computing the frequency matching function of each time-frequency unit in the present invention;
Fig. 4 is a flowchart of building the fundamental-frequency state likelihood probabilities from the frequency matching function;
Fig. 5 is a flowchart of enhancing the fundamental-frequency state likelihood functions;
Fig. 6 is a flowchart of multi-fundamental-frequency extraction with the hidden Markov model.
Embodiments
It should be appreciated that the following detailed description of different examples and the accompanying drawings are not intended to limit the present invention to particular illustrative embodiments; the described illustrative embodiments merely illustrate the steps of the present invention, whose scope is defined by the appended claims.
The present invention performs empirical mode decomposition on the autocorrelation function of each time-frequency unit in the two-dimensional auditory spectrogram of speech to obtain the dominant instantaneous frequency, and computes the frequency matching function from it. Compared with the autocorrelation function, the frequency matching function overcomes the unfavourable amplitude modulation effect in the high-frequency gammatone filterbank channels during multi-fundamental-frequency extraction, so the fundamental-frequency state likelihood functions built on the frequency matching function are more stable and reliable. Using the constructed fundamental-frequency state likelihood functions together with the fundamental-frequency state transition functions estimated from a pitch corpus, the multiple pitch contours can be extracted with a hidden Markov model by means of Viterbi decoding.
As shown in Fig. 1, the present invention proposes a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model, whose concrete steps are as follows:
Step 1: filter the speech signal with an auditory filterbank and divide the filtered speech signal into frames, obtaining a two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: perform empirical mode decomposition on the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the first intrinsic mode function obtained by the decomposition;
Step 4: compute a frequency matching function from the instantaneous frequency;
Step 5: build the likelihood probability of the fundamental-frequency state of each frame from the frequency matching function, and use a corpus to estimate the transition probabilities between the per-frame fundamental-frequency states and fundamental-frequency values;
Step 6: enhance the likelihood probabilities to reduce period-doubling errors, then combine the enhanced likelihood probabilities with the transition probabilities, and use a hidden Markov model to extract the multiple fundamental-frequency contours of the current speech.
In step 1, the speech signal is filtered by the auditory filterbank to obtain the two-dimensional time-frequency representation, the auditory spectrum: the one-dimensional speech signal is filtered by the auditory filterbank and divided into frames by windowing, yielding a two-dimensional time-frequency representation in which one dimension represents the time dimension of the speech signal (the speech frame index) and the other represents the frequency dimension (the channel index).
The auditory filterbank is a model that imitates the auditory perception mechanism of the cochlea. The time-domain impulse response of each filter is given by formula (1), with the filter centre frequencies distributed between 0 Hz and 3000 Hz.
g(t,f) = \begin{cases} t^{l-1} \exp(-2\pi b t)\cos(2\pi f t), & t > 0 \\ 0, & \text{else} \end{cases} \quad (1)
Wherein t represents time, the filter order is l = 4, f is the filter centre frequency, and b is the equivalent rectangular bandwidth.
As shown in Fig. 2, the speech signal is filtered by the above filterbank, and the output of each filter in the bank is a time-domain signal with the same length as the original speech signal. Windowing is applied to the output of each filter channel, with a typical window length of 20 ms. This yields the two-dimensional time-frequency representation of the original speech signal, the auditory spectrogram (cochleagram), which can be denoted C(c, m), where c is the filter channel index and m is the speech frame index.
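As a concrete illustration of this preprocessing step, the filterbank of formula (1) can be sketched in a few lines of numpy. This is a minimal sketch, not the patented implementation: the 16 kHz sampling rate, the 50 ms impulse-response length, the ERB-based choice of the bandwidth b, and all function names are assumptions made for illustration.

```python
import numpy as np

def gammatone_ir(f, fs=16000, l=4, duration=0.05):
    """Impulse response of one gammatone filter, following Eq. (1):
    g(t, f) = t^(l-1) * exp(-2*pi*b*t) * cos(2*pi*f*t) for t > 0.
    Here b is taken as an ERB-scaled bandwidth at centre frequency f
    (a common convention; the patent does not specify b)."""
    t = np.arange(1, int(duration * fs)) / fs           # t > 0 only
    b = 1.019 * (24.7 * (4.37 * f / 1000.0 + 1.0))      # ERB-scaled bandwidth
    return t ** (l - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)

def gammatone_filterbank(x, centre_freqs, fs=16000):
    """Filter signal x through a bank of gammatone filters; returns one
    time-domain channel per centre frequency, same length as x."""
    return np.stack([np.convolve(x, gammatone_ir(f, fs), mode="same")
                     for f in centre_freqs])
```

Framing each channel output with a 20 ms window then gives the cochleagram C(c, m) described above.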
The autocorrelation function of step 2 is computed in each time-frequency unit of the auditory spectrum by formula (2).
A_H(c,m,\tau) = \frac{\sum_{n=0}^{W} h(c, mT+n)\, h(c, mT+n+\tau)}{\sqrt{\sum_{n=0}^{W} h^2(c, mT+n) \sum_{n=0}^{W} h^2(c, mT+n+\tau)}} \quad (2)
Wherein h(c) is the output of the corresponding gammatone filter in channel c, m is the speech frame index, n is the discrete time index, τ is a time delay, T is the number of samples per speech frame, and W is the number of discrete points in the window.
Because the filters of the different channels differ, the delays introduced into the output signals of the channel filters differ as well; computing the autocorrelation function has the effect of aligning the phases across channels.
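The per-unit normalised autocorrelation of formula (2) can be sketched directly. The function name and the convention that h_c is one channel's full-length time-domain output are illustrative assumptions; T is the frame hop and W the window length in samples, as in the text.

```python
import numpy as np

def frame_acf(h_c, m, T, W, max_lag):
    """Normalised autocorrelation A_H(c, m, tau) of Eq. (2) for one
    channel output h_c, frame index m, frame hop T samples, and a
    summation window of W+1 samples (n = 0..W)."""
    start = m * T
    acf = np.empty(max_lag)
    for tau in range(max_lag):
        a = h_c[start:start + W + 1]
        b = h_c[start + tau:start + tau + W + 1]
        num = np.sum(a * b)
        den = np.sqrt(np.sum(a * a) * np.sum(b * b))
        acf[tau] = num / den if den > 0 else 0.0
    return acf
```

The normalisation keeps every lag value in [-1, 1], with a value near 1 at lags equal to the period of a periodic channel output.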
In step 3, empirical mode decomposition is performed on the autocorrelation function of each time-frequency unit, and the instantaneous frequency of the dominant sound source in each time-frequency unit is computed from the first intrinsic mode function obtained by the decomposition. Concretely:
The Hilbert-Huang transform is used to decompose the original autocorrelation function into a series of intrinsic mode functions and, in accordance with the auditory masking effect, the frequency of the first intrinsic mode function obtained is chosen as the instantaneous frequency of the dominant sound source in the time-frequency unit.
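A rough sketch of this step follows. It is a deliberate simplification of the Hilbert-Huang transform named in the text: standard EMD interpolates the extrema with cubic splines, while this sketch uses linear envelopes, and the analytic signal is built with an FFT-based Hilbert transform; all names are illustrative.

```python
import numpy as np

def first_imf(x, n_sift=8):
    """Crude sifting for the first intrinsic mode function (IMF).
    Envelopes are linear interpolations of the extrema, a simplification
    of the cubic-spline envelopes used in standard EMD."""
    h = x.astype(float).copy()
    idx = np.arange(len(h))
    for _ in range(n_sift):
        d = np.diff(h)
        maxima = np.where((np.hstack([d, -1]) < 0) & (np.hstack([1, d]) > 0))[0]
        minima = np.where((np.hstack([d, 1]) > 0) & (np.hstack([-1, d]) < 0))[0]
        if len(maxima) < 2 or len(minima) < 2:
            break
        upper = np.interp(idx, maxima, h[maxima])
        lower = np.interp(idx, minima, h[minima])
        h = h - (upper + lower) / 2.0        # subtract the envelope mean
    return h

def mean_instantaneous_frequency(imf, fs):
    """Mean instantaneous frequency of an IMF via the FFT-based analytic
    signal (Hilbert transform) and the derivative of the unwrapped phase."""
    n = len(imf)
    spec = np.fft.fft(imf)
    w = np.zeros(n)
    w[0] = 1
    if n % 2 == 0:
        w[n // 2] = 1
        w[1:n // 2] = 2
    else:
        w[1:(n + 1) // 2] = 2
    analytic = np.fft.ifft(spec * w)
    phase = np.unwrap(np.angle(analytic))
    inst_f = np.diff(phase) * fs / (2 * np.pi)
    return float(np.mean(inst_f))
```

For a time-frequency unit dominated by one periodic source, the first IMF is close to that source's oscillation, and its mean instantaneous frequency approximates the dominant frequency.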
As shown in Fig. 3, in step 4 the frequency matching function is computed from the instantaneous frequency. This function serves as the mid-level representation in the fundamental-frequency extraction process: it describes how well the average instantaneous frequency of the current time-frequency unit matches each candidate fundamental frequency. Its computing formula is formula (3).
F(c,m,\tau) = 1 - 2\left|\bar{f}(c,m)\cdot\tau - \mathrm{int}\!\left(\bar{f}(c,m)\cdot\tau\right)\right| \quad (3)
Wherein \bar{f}(c, m) denotes the average instantaneous frequency of the time-frequency unit in channel c of frame m, τ denotes a candidate pitch period (i.e. a time delay within the considered range), and int(·) is the rounding function, returning the nearest integer.
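Formula (3) is cheap to evaluate once the average instantaneous frequency is known. In the sketch below the candidate period τ is given in samples and converted to seconds with an assumed sampling rate; the absolute value reflects the reading of int(·) as rounding to the nearest integer, so that F lies in [0, 1] and peaks when τ is an integer multiple of the dominant period.

```python
import numpy as np

def frequency_matching(mean_f, tau_samples, fs):
    """Frequency matching function of Eq. (3):
    F = 1 - 2 * |f_bar * tau - round(f_bar * tau)|,
    where f_bar (Hz) is the mean instantaneous frequency of the unit and
    tau is a candidate pitch period expressed in samples."""
    x = mean_f * (tau_samples / fs)
    return 1.0 - 2.0 * np.abs(x - np.round(x))
```

F equals 1 when the candidate period contains a whole number of cycles of the dominant frequency, and falls to 0 half a cycle away.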
As shown in Fig. 4, in step 5 the likelihood probability of the fundamental-frequency state of each frame is built from the frequency matching function, and a corpus is used to estimate the transition probabilities between the fundamental-frequency states and fundamental-frequency values. Concretely:
First, the likelihood probability of each fundamental-frequency state, i.e. the observation probability, is built from the frequency matching function. The likelihood probability that a single pitch period τ₁ exists in time-frequency unit u(c, m) is given by formula (4); the likelihood probability that two pitch periods τ₁ and τ₂ exist simultaneously in time-frequency unit u(c, m) is given by formula (5).
p(x \mid \omega_1) = \frac{\sum_{c \in \Phi_c} F(c,m,\tau_1)\, L(c,m)}{\sum_{c \in \Phi_c} L(c,m)} \quad (4)
g(x \mid \omega_2) = \frac{\sum_{c \in \Phi_c} \max\{F(c,m,\tau_1), F(c,m,\tau_2)\}\, L(c,m)}{\sum_{c \in \Phi_c} L(c,m)} \quad (5)
Wherein x denotes the observed speech signal; ω₁ and ω₂ denote the single-fundamental-frequency state and the dual-fundamental-frequency state respectively; L(c, m) is the normalised loudness of each time-frequency unit; and Φ_c is the set of channel positions in the two-dimensional time-frequency representation.
The normalised loudness L(c, m) of each time-frequency unit is computed by the following formula:
L_n(c,m) = \frac{\log_2 E(c,m)}{\sum_{c=1}^{N} \log_2 E(c,m)} \quad (6)
Wherein, E (c, m) represents the energy of time frequency unit u (c, m), and N is the channel number of bank of filters.
Secondly, the fundamental-frequency state of each frame lies in one of three spaces, namely the zero-, single- and dual-fundamental-frequency spaces:
Ω = Ω₀ ∪ Ω₁ ∪ Ω₂
The transition probabilities among the three fundamental-frequency states are obtained by statistics over a pitch-labelled database:
Wherein Ω_i is a fundamental-frequency state space, and p_ij denotes the transition probability from fundamental-frequency state space Ω_i to fundamental-frequency state space Ω_j.
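Estimating the transition probabilities p_ij from a pitch-labelled corpus is a simple counting exercise. The sketch below assumes each training utterance has been reduced to a per-frame state sequence with states 0, 1, 2 for the zero-, single- and dual-fundamental-frequency spaces; the function name is illustrative.

```python
import numpy as np

def transition_matrix(state_sequences, n_states=3):
    """Estimate p_ij, the probability of moving from state space Omega_i
    to Omega_j, by counting frame-to-frame transitions in pitch-labelled
    state sequences (0 = zero, 1 = single, 2 = dual fundamental)."""
    counts = np.zeros((n_states, n_states))
    for seq in state_sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1
    row = counts.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0        # avoid division by zero for unseen states
    return counts / row
```

Each row of the result is a conditional distribution over the next frame's state and sums to one.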
In step 6, the likelihood probabilities are enhanced to reduce period-doubling errors, then the enhanced likelihood probabilities are combined with the transition probabilities, and the multiple fundamental-frequency contours of the current speech are extracted with a hidden Markov model. As shown in Fig. 5, the steps comprise:
First, the single-fundamental-frequency likelihood probability function is enhanced by formula (7):
p_{en}(x \mid \{\tau_1\}) = p(x \mid \{\tau_1\}) - \alpha \cdot \max_{m=2,\dots,7} p(x \mid \{\tau_1/m\}) \quad (7)
Wherein m ranges over 2 to 7, meaning that the enhancement only targets period-doubling errors caused by the 2nd to 7th harmonics, and α is a preset coefficient taking a value between 0.6 and 0.8.
The meaning of this formula is to enhance the likelihood probability of the single-fundamental-frequency state: among the function values of formula (4) at the positions τ₁/m of the candidate pitch period τ₁, the maximum is found; this value is multiplied by a coefficient (between 0.6 and 0.8) and the product is used as a frequency-matching adjustment; subtracting the adjustment from the frequency matching value at the original candidate pitch period point yields the enhanced frequency matching value.
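Formula (7) can be sketched as a lookup over a table of single-pitch likelihoods. Representing p(x|{τ}) as a dict keyed by integer candidate periods (in samples) is an illustrative choice, as is the default α = 0.7 taken from the stated 0.6-0.8 range.

```python
def enhance_single_pitch(p, tau1, alpha=0.7):
    """Eq. (7): suppress period-doubling errors by subtracting alpha times
    the largest likelihood found at the sub-periods tau1/m, m = 2..7.
    p maps an integer candidate period (samples) to p(x|{tau})."""
    competitors = [p.get(round(tau1 / m), 0.0) for m in range(2, 8)]
    return p.get(tau1, 0.0) - alpha * max(competitors)
```

If a long candidate period τ₁ is merely a multiple of the true period, one of the sub-periods τ₁/m scores highly and the subtraction pushes the spurious candidate down.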
Secondly, the dual-fundamental-frequency likelihood probability function is enhanced. To this end, the dual-fundamental-frequency likelihood probability is first written as the sum of two functions, as in formula (8):
g(x \mid \{\tau_1, \tau_2\}) = p(x \mid \{\tau_1\}) + p_r(\tau_1, \tau_2) \quad (8)
Wherein,
p_r(\tau_1, \tau_2) = \frac{\sum_{c \in \Phi_c} \bigl(F(c,m,\tau_2) - F(c,m,\tau_1)\bigr)\, L(c,m)}{\sum_{c \in \Phi_c} L(c,m)} \quad (9)
In the above formulas, τ₁ and τ₂ are two candidate pitch period points, and g(x|{τ₁, τ₂}) denotes the likelihood probability of observing the speech signal x at these two candidate pitch period points.
Then, the two functions are each enhanced by the method of formula (7), giving the result of formula (10):
g_{en}(x \mid \{\tau_1, \tau_2\}) = p_{en}(x \mid \{\tau_1\}) + p_{r\_en}(\tau_1, \tau_2) \quad (10)
Wherein p_en(x|{τ₁}) denotes the result of enhancing p(x|{τ₁}), and p_r_en(τ₁, τ₂) denotes the result of enhancing p_r(τ₁, τ₂).
As shown in Fig. 6, the enhanced likelihood probabilities obtained in this step are combined with the transition probabilities among the three fundamental-frequency states obtained in step 5, and the multiple pitch contours are obtained by the Viterbi decoding procedure of the hidden Markov model. Note that although the fundamental-frequency state space contains three states, this step only computes the likelihood probabilities of two of them: the likelihood probability of the zero-fundamental-frequency state (i.e. the state with no fundamental frequency) is a preset constant and requires neither computation nor enhancement.
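The decoding step is standard Viterbi over per-frame state likelihoods and the transition matrix of step 5. The sketch below is the generic log-domain algorithm, not the patent's exact state space (which couples the three states with concrete pitch values); array shapes and names are assumptions.

```python
import numpy as np

def viterbi(log_lik, log_trans, log_init):
    """Standard Viterbi decoding. log_lik is (T, S): per-frame state
    log-likelihoods; log_trans is (S, S): log transition probabilities;
    log_init is (S,): log initial probabilities. Returns the most
    probable state sequence."""
    T, S = log_lik.shape
    delta = log_init + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans           # (from-state, to-state)
        back[t] = np.argmax(cand, axis=0)           # best predecessor
        delta = cand[back[t], np.arange(S)] + log_lik[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                   # trace back
        path[t - 1] = back[t, path[t]]
    return path
```

With the enhanced likelihoods of formulas (7) and (10) as the per-frame observation scores, the decoded path yields the frame-by-frame fundamental-frequency states and hence the multiple pitch contours.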
Further modifications and variations of the present invention will be apparent to those skilled in the art from this specification. This description is therefore to be regarded as illustrative, and its purpose is to teach those of ordinary skill in the art how to carry out the invention; the forms of the invention shown and described herein are to be taken as the presently preferred embodiments.

Claims (8)

1. A multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model, characterized in that it comprises the following steps:
Step 1: filter the speech signal with an auditory filterbank and divide the filtered signal into frames, obtaining a two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: perform empirical mode decomposition on the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
Step 4: compute a frequency matching function from each instantaneous frequency;
Step 5: build the likelihood probability of each fundamental-frequency state from the frequency matching function, and use a corpus to estimate the transition probabilities between the fundamental-frequency states and fundamental-frequency values; the fundamental-frequency states include a single-fundamental-frequency state and a dual-fundamental-frequency state;
Step 6: enhance the likelihood probability of each fundamental-frequency state, combine the enhanced likelihood probabilities with the corresponding transition probabilities, and use a hidden Markov model to extract the multiple fundamental-frequency contours of the speech signal.
2. The method of claim 1, characterized in that an auditory filterbank is used to filter the speech signal in step 1, the output of each filter in the auditory filterbank being a time-domain signal with the same length as the speech signal; windowing and framing are applied to the output of each filter, obtaining the two-dimensional time-frequency representation of the speech signal.
3. The method of claim 1, characterized in that the autocorrelation function in each time-frequency unit of the auditory spectrum in step 2 is computed by the following formula:
A_H(c,m,\tau) = \frac{\sum_{n=0}^{W} h(c, mT+n)\, h(c, mT+n+\tau)}{\sqrt{\sum_{n=0}^{W} h^2(c, mT+n) \sum_{n=0}^{W} h^2(c, mT+n+\tau)}} \quad (2)
Wherein h(c) is the output of the corresponding filter of the auditory filterbank in filter channel c, m is the speech frame index, n is the discrete time index, τ is a time delay, T is the number of samples per speech frame, and W is the number of discrete points in the window.
4. The method of claim 1, characterized in that the empirical mode decomposition performed on the autocorrelation function of each time-frequency unit in step 3 comprises:
using the Hilbert-Huang transform to decompose the autocorrelation function into a series of intrinsic mode functions and, in accordance with the auditory masking effect, taking the frequency of the first intrinsic mode function obtained as the instantaneous frequency of the dominant sound source in the time-frequency unit.
5. The method of claim 1, characterized in that the frequency matching function in step 4 describes how well the average instantaneous frequency of the current time-frequency unit matches each candidate fundamental frequency, and its computing formula is as follows:
F(c,m,\tau) = 1 - 2\left|\bar{f}(c,m)\cdot\tau - \mathrm{int}\!\left(\bar{f}(c,m)\cdot\tau\right)\right| \quad (3)
Wherein \bar{f}(c, m) denotes the average instantaneous frequency of the time-frequency unit in channel c of frame m, τ denotes a candidate pitch period, and int(·) is the rounding function, returning the nearest integer.
6. The method of claim 1, characterized in that step 5 specifically comprises:
First, the likelihood probability of each fundamental-frequency state is built from the frequency matching function. The likelihood probability that a single pitch period τ₁ exists in the time-frequency unit u(c, m) of channel c of frame m is as follows:
p(x \mid \omega_1) = \frac{\sum_{c \in \Phi_c} F(c,m,\tau_1)\, L(c,m)}{\sum_{c \in \Phi_c} L(c,m)} \quad (4)
The likelihood probability that two pitch periods τ₁ and τ₂ exist simultaneously in time-frequency unit u(c, m) is as follows:
g(x \mid \omega_2) = \frac{\sum_{c \in \Phi_c} \max\{F(c,m,\tau_1), F(c,m,\tau_2)\}\, L(c,m)}{\sum_{c \in \Phi_c} L(c,m)} \quad (5)
Wherein x denotes the speech signal; ω₁ and ω₂ denote the single-fundamental-frequency state and the dual-fundamental-frequency state respectively; L(c, m) is the normalised loudness of each time-frequency unit; Φ_c is the set of channel positions in the two-dimensional time-frequency representation; and F(c, m, τ₁) is the frequency matching function.
The normalised loudness L(c, m) of each time-frequency unit is computed by the following formula:
L n ( c , m ) = 2 log E ( c , m ) Σ c = 1 N 2 log E ( c , m ) - - - ( 6 )
Wherein, E (c, m) represents the energy of time frequency unit u (c, m), and N is the channel number of bank of filters;
Secondly, the state of the fundamental frequency of every frame may be present among three kinds of spaces, i.e. zero-base frequency, single fundamental frequency and double-basis space frequently:
Ω=Ω 0∪Ω 1∪Ω 2
Transition probability between three fundamental frequency states is that the statistics of database by marking with fundamental frequency obtains:
Wherein, Ω ifundamental frequency state space, p ijrepresent from fundamental frequency state space Ω ito fundamental frequency state space Ω jtransition probability.
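Illustrative sketch (not part of the claimed subject matter): formulas (4)-(6) applied to one hypothetical frame; the channel frequencies, energies, and candidate pitch periods are invented for the example.

```python
import numpy as np

def frequency_match(f_avg, tau):
    """Formula (3)."""
    prod = f_avg * tau
    return 1.0 - 2.0 * abs(prod - round(prod))

def normalized_loudness(energy):
    """Formula (6): log2-compressed channel energies, normalized to sum to 1."""
    loud = np.log2(energy)
    return loud / loud.sum()

def single_pitch_likelihood(f_avg, loud, tau):
    """Formula (4): likelihood of a single pitch period tau in the frame."""
    match = np.array([frequency_match(f, tau) for f in f_avg])
    return float((match * loud).sum() / loud.sum())

def double_pitch_likelihood(f_avg, loud, tau1, tau2):
    """Formula (5): each channel is credited to whichever period fits it better."""
    match = np.array([max(frequency_match(f, tau1), frequency_match(f, tau2))
                      for f in f_avg])
    return float((match * loud).sum() / loud.sum())

# Hypothetical frame: two channels follow harmonics of 100 Hz, two of 140 Hz.
f_avg = np.array([100.0, 200.0, 140.0, 280.0])   # Hz
loud = normalized_loudness(np.array([8.0, 4.0, 8.0, 4.0]))
p_one = single_pitch_likelihood(f_avg, loud, 1.0 / 100.0)
p_two = double_pitch_likelihood(f_avg, loud, 1.0 / 100.0, 1.0 / 140.0)
```

Because half the channels are harmonics of 140 Hz, the two-pitch hypothesis {100 Hz, 140 Hz} explains this frame better than 100 Hz alone, so p_two exceeds p_one.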
7. the method for claim 1, is characterized in that, strengthen the likelihood probability of each fundamental frequency state in step 6, concrete steps are as follows:
First, strengthen single fundamental frequency likelihood probability function, formula is:
p e n ( x | { τ 1 } ) = p ( x | { τ 1 } ) - α · m a x m = 2 , ... , 7 p ( x | { τ 1 / m } ) - - - ( 7 )
Wherein, the scope of m is 2 ~ 7, and represent that the doubling time mistake only caused for the harmonic wave of 2 ~ 7 times strengthens, α is pre-determined factor, p (x|{ τ 1) expression pitch period is τ 1time observe single fundamental frequency likelihood probability of current speech signal x; p en(x|{ τ 1) be the single fundamental frequency likelihood probability after enhancing;
Secondly, the likelihood probability function of double-basis frequency state is strengthened, for this reason, first the likelihood probability of double-basis frequency state is written as two function p (x|{ τ 1) and p r1, τ 2) add and form:
g(x|{τ 1,τ 2})=p(x|{τ 1})+p r1,τ 2)(8)
p r ( τ 1 , τ 2 ) = Σ c ∈ Γ ( F ( c , m , τ 2 ) - F ( c , m , τ 1 ) ) · L ( c , m ) Σ c ∈ Γ L ( c , m ) - - - ( 9 )
Wherein, p (x|{ τ 1) be the likelihood probability of single fundamental frequency state, F (c, m, τ i) be frequency matching function, L (c, m) is the normalization loudness of each time frequency unit, and c is channel number;
Then, to two function p (x|{ τ 1) and p r1, τ 2) strengthen by the method for formula (7) respectively, and obtain the likelihood probability of double-basis frequency state:
g en(x|{τ 1,τ 2})=p en(x|{τ 1})+p r_en1,τ 2)(10)
Wherein, g en(x|{ τ 1, τ 2) be the likelihood probability of the frequently state of the double-basis after strengthening, p en(x|{ τ 1) and p r_en1, τ 2) be respectively p (x|{ τ 1) and p r1, τ 2) value after enhancing.
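Illustrative sketch (not part of the claimed subject matter): formula (7) applied to a toy frame whose channels all carry harmonics of a 100 Hz source, showing how subtracting the best sub-period score breaks the tie between the true period and its octave-error double; alpha = 0.2 is an arbitrary choice for the predetermined coefficient.

```python
import numpy as np

def frequency_match(f_avg, tau):
    """Formula (3)."""
    prod = f_avg * tau
    return 1.0 - 2.0 * abs(prod - round(prod))

# Hypothetical frame: every channel is a harmonic of a 100 Hz source.
F_AVG = np.array([100.0, 200.0, 300.0])   # Hz
LOUD = np.array([0.5, 0.3, 0.2])          # normalized loudness

def p_single(tau):
    """Single-pitch likelihood of this frame, formula (4)."""
    match = np.array([frequency_match(f, tau) for f in F_AVG])
    return float((match * LOUD).sum() / LOUD.sum())

def p_single_enhanced(tau, alpha=0.2):
    """Formula (7): subtract alpha times the best score among the 2x..7x
    sub-periods, penalising candidates that only win via octave errors."""
    return p_single(tau) - alpha * max(p_single(tau / m) for m in range(2, 8))

tau_true = 1.0 / 100.0    # true pitch period
tau_octave = 2.0 / 100.0  # octave-error candidate (half the true frequency)
```

Before enhancement both candidates score 1.0, because every harmonic of 100 Hz is also a harmonic of 50 Hz; after enhancement the true period keeps the higher score, since tau_octave/2 itself scores 1.0 and is penalised in full.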
8. A multi-fundamental frequency extraction device based on empirical mode decomposition and a hidden Markov model, characterized by comprising:
a preprocessing module, which filters a speech signal through an auditory filter bank and frames the filtered signal to obtain a two-dimensional time-frequency representation and an auditory spectrum of the speech signal;
an autocorrelation function computation module, which computes an autocorrelation function in each time-frequency unit of the auditory spectrum;
an instantaneous frequency computation module, which performs empirical mode decomposition on the autocorrelation function of each time-frequency unit and, on the basis of the intrinsic mode functions obtained from the empirical mode decomposition, computes the instantaneous frequency of the dominant sound source in each time-frequency unit;
a frequency-matching function computation module, which computes the frequency-matching function on the basis of each instantaneous frequency;
a likelihood probability and transition probability computation module, which constructs the likelihood probability of each fundamental frequency state from the frequency-matching function and uses a corpus to estimate the transition probabilities among the fundamental frequency states and fundamental frequency values, the fundamental frequency states comprising a single fundamental frequency state and a double fundamental frequency state; and
a trajectory extraction module, which enhances the likelihood probability of each fundamental frequency state, combines the enhanced likelihood probabilities with the corresponding transition probabilities, and extracts multiple pitch contours of the speech signal using a hidden Markov model.
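Illustrative sketch (not part of the claimed subject matter): a minimal Viterbi decoder over the three fundamental frequency state spaces, standing in for the hidden Markov tracking in the trajectory extraction module. The per-frame likelihoods and the transition matrix p_ij below are hypothetical values; in the device they would come from formulas (4)-(5) after enhancement and from a pitch-annotated corpus.

```python
import numpy as np

# Hypothetical per-frame likelihoods of the three state spaces
# (zero-, single-, double-pitch), one row per frame.
obs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
    [0.7, 0.2, 0.1],
])

# Hypothetical transition matrix P[i, j] = p_ij between the state spaces.
P = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.25, 0.70],
])

def viterbi(obs, trans, prior=None):
    """Most probable sequence of pitch-state spaces (0, 1 or 2 pitches/frame)."""
    n_frames, n_states = obs.shape
    if prior is None:
        prior = np.full(n_states, 1.0 / n_states)
    log_delta = np.log(prior) + np.log(obs[0])
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = log_delta[:, None] + np.log(trans)  # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(obs[t])
    path = [int(log_delta.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

states = viterbi(obs, P)  # smoothed state-space sequence: [0, 1, 1, 1, 1]
```

Note how the transition prior smooths away both the isolated double-pitch frame and the final zero-pitch frame: the decoded sequence stays in the single-pitch space once entered, which is exactly the temporal-continuity effect the hidden Markov tracking is meant to provide.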
CN201511023725.3A 2015-12-30 2015-12-30 Multi-fundamental frequency extraction method and device Expired - Fee Related CN105469807B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511023725.3A CN105469807B (en) 2015-12-30 2015-12-30 Multi-fundamental frequency extraction method and device

Publications (2)

Publication Number Publication Date
CN105469807A true CN105469807A (en) 2016-04-06
CN105469807B CN105469807B (en) 2019-04-02

Family

ID=55607432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511023725.3A Expired - Fee Related CN105469807B (en) Multi-fundamental frequency extraction method and device

Country Status (1)

Country Link
CN (1) CN105469807B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001282267A (en) * 2000-03-29 2001-10-12 Mega Chips Corp Speech processing system and speech processing method
US7092881B1 (en) * 1999-07-26 2006-08-15 Lucent Technologies Inc. Parametric speech codec for representing synthetic speech in the presence of background noise
CN101567188A (en) * 2009-04-30 2009-10-28 Shanghai University Multi-pitch estimation method for mixed audio signals with combined long frame and short frame
CN104036785A (en) * 2013-03-07 2014-09-10 Sony Corporation Speech signal processing method, speech signal processing device and speech signal analyzing system

Non-Patent Citations (3)

Title
YANG SHAO, DELIANG WANG: "Co-channel speaker identification using usable speech extraction based on multi-pitch tracking", IEEE International Conference on Acoustics, Speech and Signal Processing *
LI SHITAO: "Research on Multi-Pitch Detection Algorithms", China Master's Theses Full-text Database *
LI PENG, GUAN YONG, LIU WENJU, XU BO: "Monaural mixed speech separation based on multi-pitch tracking", Application Research of Computers *

Cited By (14)

Publication number Priority date Publication date Assignee Title
CN107316653B (en) * 2016-04-27 2020-06-26 Nanjing University of Science and Technology Improved empirical wavelet transform-based fundamental frequency detection method
CN107316653A (en) * 2016-04-27 2017-11-03 Nanjing University of Science and Technology Fundamental frequency detection method based on an improved empirical wavelet transform
CN106205638B (en) * 2016-06-16 2019-11-08 Tsinghua University Two-layer pitch feature extraction method for audio event detection
CN106205638A (en) * 2016-06-16 2016-12-07 Tsinghua University Two-layer pitch feature extraction method for audio event detection
CN106448630A (en) * 2016-09-09 2017-02-22 Tencent Technology (Shenzhen) Co., Ltd. Method and device for generating a digital music score file of a song
US10923089B2 (en) 2016-09-09 2021-02-16 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating digital score file of song, and storage medium
CN106448630B (en) * 2016-09-09 2020-08-04 Tencent Technology (Shenzhen) Co., Ltd. Method and device for generating a digital music score file of a song
CN111048110A (en) * 2018-10-15 2020-04-21 Hangzhou NetEase Cloud Music Technology Co., Ltd. Musical instrument identification method, medium, apparatus and computing device
CN109036376A (en) * 2018-10-17 2018-12-18 Nanjing University of Science and Technology Speech synthesis method for the Minnan (Southern Fujian) dialect
CN109839272A (en) * 2019-03-25 2019-06-04 Hunan University of Technology Bearing fault diagnosis method based on fault impact extraction and autocorrelation ensemble averaging
CN109839272B (en) * 2019-03-25 2021-01-08 Hunan University of Technology Bearing fault diagnosis method based on fault impact extraction and self-correlation ensemble averaging
CN111312258A (en) * 2019-12-16 2020-06-19 Suishou (Beijing) Information Technology Co., Ltd. User identity authentication method, apparatus, server and storage medium
CN114897236A (en) * 2022-05-09 2022-08-12 Central South University Hidden Markov inference method for magma channel entrance under survey data constraints
CN114897236B (en) * 2022-05-09 2024-06-07 Central South University Hidden Markov inference method for magma channel entrance under investigation data constraints

Also Published As

Publication number Publication date
CN105469807B (en) 2019-04-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190402

Termination date: 20211230
