CN105469807A - Multi-fundamental frequency extraction method and multi-fundamental frequency extraction device - Google Patents
Multi-fundamental frequency extraction method and multi-fundamental frequency extraction device
- Publication number: CN105469807A
- Application number: CN201511023725.3A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
Abstract
The invention discloses a multi-fundamental frequency extraction method and device based on empirical mode decomposition and a hidden Markov model. The method comprises the steps of: filtering a speech signal with an auditory filter bank and framing the filtered signal; computing an autocorrelation function in each time-frequency unit of the auditory spectrum; computing the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by empirical mode decomposition; computing a frequency matching function from each instantaneous frequency; building the likelihood probability of each fundamental frequency state from the frequency matching function, and using a corpus to estimate the transition probabilities between fundamental frequency states; and enhancing the likelihood probability of each fundamental frequency state, combining the enhanced likelihood probabilities with the corresponding transition probabilities, and extracting the multi-fundamental-frequency track of the speech signal with the hidden Markov model.
Description
Technical field
The present invention relates to empirical mode decomposition in digital signal processing, auditory filter-bank analysis of speech signals, fundamental frequency extraction from speech, and the construction of likelihood and transition probabilities for hidden Markov models.
Background art
The extraction of the fundamental frequency (pitch) and the tracking of its contour are important for many speech and audio signal processing technologies, such as audio retrieval and classification, Chinese tone recognition, and single-channel speech separation. A number of algorithms now perform well at extracting a single fundamental frequency from clean speech or speech with a small amount of noise. However, the single-fundamental-frequency assumption makes such algorithms unusable when several fundamental frequencies are present in the signal at the same time, for example when two speakers talk simultaneously or when music plays in the speaker's background. In computational auditory scene analysis (CASA), which is devoted to the cocktail-party problem, multi-fundamental-frequency extraction and tracking is an important basis for effective speech segmentation and organization.
Hidden Markov models (HMMs) have long been used to track the continuity of pitch contours. As early as the 1980s, HMMs were used to decide the number of fundamental frequencies in each frame of mixed speech. In 2003, the distance between the true pitch period and the peaks of the autocorrelation function was used to model the likelihood function of the fundamental frequency states, yielding a multi-pitch tracking algorithm based on a hidden Markov model. In 2013, the height of the autocorrelation peaks was used directly to build the potential function of the fundamental frequency distribution, further improving the performance of multi-fundamental-frequency extraction. A common feature of these algorithms is that the mid-level representations used during multi-pitch tracking are all based on the auditory spectrogram (cochleagram); specifically, the likelihood functions of the fundamental frequency states are all extracted from the autocorrelation spectrogram (correlogram), and they mainly use local features near its peaks.
The main characteristic of the auditory spectrogram is that its frequency resolution is high at low frequencies and low at high frequencies; a high-frequency channel of the auditory filter bank usually responds to several harmonics of the speech at once, so its response is typically affected by amplitude modulation (AM). In single-fundamental-frequency extraction, both the amplitude envelope produced by the AM effect in the high-frequency channels and the peaks of that envelope carry information about the original fundamental frequency. In multi-fundamental-frequency extraction, however, higher harmonics of similar energy that belong to different fundamental frequencies may coexist in one time-frequency unit; the resulting amplitude modulation rate then corresponds to no harmonic of either fundamental frequency, which shifts the peak positions and distorts the peak heights of the corresponding autocorrelation function and thus harms the extraction of multiple fundamental frequencies.
Summary of the invention
In view of this, in order to overcome the recurrent period-doubling errors in fundamental frequency extraction and the other problems mentioned above, the present invention proposes a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model.
According to one aspect of the present invention, a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model is provided, characterized by comprising the following steps:
Step 1: filter the speech signal with an auditory filter bank and frame the filtered signal, obtaining the two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: apply empirical mode decomposition to the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
Step 4: compute a frequency matching function from each instantaneous frequency;
Step 5: build the likelihood probability of each fundamental frequency state from the frequency matching function, and use a corpus to estimate the transition probabilities between fundamental frequency states; the fundamental frequency states include single-fundamental-frequency states and double-fundamental-frequency states;
Step 6: enhance the likelihood probability of each fundamental frequency state, combine the enhanced likelihood probabilities with the corresponding transition probabilities, and use the hidden Markov model to extract the multiple pitch contours of the speech signal.
According to another aspect of the present invention, a multi-fundamental-frequency extraction device based on empirical mode decomposition and a hidden Markov model is provided, characterized by comprising:
A preprocessing module, which filters the speech signal with an auditory filter bank and frames the filtered signal, obtaining the two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
An autocorrelation function computation module, which computes an autocorrelation function in each time-frequency unit of the auditory spectrum;
An instantaneous frequency computation module, which applies empirical mode decomposition to the autocorrelation function of each time-frequency unit and computes the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
A frequency matching function computation module, which computes the frequency matching function from each instantaneous frequency;
A likelihood and transition probability computation module, which builds the likelihood probability of each fundamental frequency state from the frequency matching function and uses a corpus to estimate the transition probabilities between fundamental frequency states; the fundamental frequency states include single-fundamental-frequency states and double-fundamental-frequency states;
A trajectory extraction module, which enhances the likelihood probability of each fundamental frequency state, combines the enhanced likelihood probabilities with the corresponding transition probabilities, and uses the hidden Markov model to extract the multiple pitch contours of the speech signal.
In the above scheme, to suppress the adverse amplitude modulation effect that occurs in the high-frequency channels of the gammatone filter bank during multi-fundamental-frequency extraction, the frequency matching function replaces the autocorrelation function when computing the likelihood probabilities of the pitch states in the hidden Markov model. Moreover, compared with the positions and heights of autocorrelation peaks, the average instantaneous frequency of a time-frequency unit is less affected by noise and by the amplitude modulation effect, so the frequency matching function that the present invention extracts from the average instantaneous frequency behaves more reliably during multi-fundamental-frequency extraction, which ultimately improves the performance of the algorithm.
In addition, period doubling is a common error in fundamental frequency extraction. For this problem, the proposed method applies an enhancement procedure that reduces the peak height of the frequency matching function at period-doubled positions, lowering the likelihood probability of period-doubled candidates and thus the probability that period-doubling errors occur.
In summary, the present invention suppresses the adverse amplitude modulation effect and the probability of period-doubling errors, combines the result with pitch-state transition probabilities estimated from a corpus, and obtains the tracks of two simultaneous fundamental frequencies by hidden Markov model decoding.
Brief description of the drawings
Further features and advantages of the present invention are described below with reference to the illustrative drawings.
Fig. 1 is a flow chart of the proposed multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model;
Fig. 2 is a flow chart of computing the autocorrelation function of each time-frequency unit in the prior art;
Fig. 3 is a flow chart of computing the frequency matching function of each time-frequency unit in the present invention;
Fig. 4 is a flow chart of building the fundamental frequency state likelihood probabilities from the frequency matching function;
Fig. 5 is a flow chart of enhancing the fundamental frequency state likelihood functions;
Fig. 6 is a flow chart of multi-fundamental-frequency extraction with the hidden Markov model.
Detailed description of the embodiments
It should be appreciated that the following detailed description of the examples and drawings is not intended to limit the present invention to the particular illustrative embodiments; the described embodiments merely illustrate the steps of the present invention, whose scope is defined by the appended claims.
The present invention applies empirical mode decomposition to the autocorrelation function of each time-frequency unit in the two-dimensional auditory spectrogram of the speech, obtains the dominant instantaneous frequency, and computes the frequency matching function from it. Compared with the autocorrelation function, the frequency matching function overcomes the amplitude modulation effect that is harmful to multi-fundamental-frequency extraction in the high-frequency gammatone filter channels, so the fundamental frequency state likelihood functions built on it are more stable and reliable. Using the constructed likelihood functions together with the fundamental frequency state transition functions estimated from a pitch-labeled corpus, multiple pitch contours can be extracted with the hidden Markov model by Viterbi decoding.
As shown in Figure 1, the present invention proposes a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model, with the following concrete steps:
Step 1: filter the speech signal with an auditory filter bank and frame the filtered signal to obtain the two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: apply empirical mode decomposition to the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the first intrinsic mode function obtained by the decomposition;
Step 4: compute the frequency matching function from the instantaneous frequency;
Step 5: build the likelihood probability of each frame's fundamental frequency states from the frequency matching function, and use a corpus to estimate the transition probabilities between fundamental frequency states;
Step 6: enhance the likelihood probabilities to reduce period-doubling errors, then combine the enhanced likelihood probabilities with the transition probabilities, and extract the multiple pitch contours of the current speech with the hidden Markov model.
In step 1, the one-dimensional speech signal is filtered by the auditory filter bank and then windowed and framed, yielding the two-dimensional time-frequency representation, the auditory spectrum; one dimension of this representation is time (the speech frame index) and the other is frequency (the channel index).
The auditory filter bank is a model that imitates the auditory perception mechanism of the cochlea. The time-domain impulse response of each filter is given by formula (1), with filter center frequencies distributed between 0 Hz and 3000 Hz:
g(t) = t^(l-1) exp(-2πbt) cos(2πft), t ≥ 0 (1)
where t denotes time, the filter order l = 4, f is the filter center frequency, and b is the equivalent rectangular bandwidth.
As shown in Figure 2, the speech signal passes through the above filter bank, and the output of each filter is a time-domain signal of the same length as the original speech signal. The output of each filter channel is then windowed, with a typical window length of 20 ms, which yields the two-dimensional time-frequency representation of the original speech, the auditory spectrogram or cochleagram. It can be written C(c, m), where c is the filter channel index and m the speech frame index.
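As an illustrative sketch of step 1 (the channel count, frequency layout, and ERB formula below are assumptions for demonstration, not taken from the patent), a gammatone filter bank followed by 20 ms framing can be written as:

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.04, order=4):
    """4th-order gammatone impulse response: t^(l-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # Glasberg-Moore equivalent rectangular bandwidth
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, centers, frame_ms=20):
    """Filter x with each gammatone channel, then frame it: C[c, m] is frame m of channel c."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(x) // frame_len
    C = np.empty((len(centers), n_frames, frame_len))
    for c, fc in enumerate(centers):
        y = np.convolve(x, gammatone_ir(fc, fs))[:len(x)]
        C[c] = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    return C

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)           # 220 Hz test tone, 1 second
centers = np.geomspace(80, 3000, 32)      # 32 illustrative channels, 80 Hz - 3000 Hz
C = cochleagram(x, fs, centers)
print(C.shape)                            # (32, 50, 160)
```

Each C[c, m] corresponds to one time-frequency unit u(c, m); a 220 Hz tone produces the strongest response in the channels whose center frequency is nearest 220 Hz.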
In step 2, the autocorrelation function is computed in each time-frequency unit of the auditory spectrum according to formula (2):
A(c, m, τ) = Σ_{n=0}^{W-1} h(c, mT + n) · h(c, mT + n + τ) (2)
where h(c) is the output of the gammatone filter of channel c, m is the speech frame index, n the discrete time index, τ a lag, T the number of samples per speech frame, and W the number of discrete points.
Because each channel has a different filter, the outputs of the channel filters have different delays. Computing the autocorrelation function has the effect of aligning the phases of the channels.
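A minimal sketch of the per-unit autocorrelation of step 2 (the normalization and exact summation limits are assumptions; the patent's formula (2) may differ in detail):

```python
import numpy as np

def unit_acf(h_c, m, frame_len, max_lag):
    """Autocorrelation of channel output h_c over frame m, for lags 0..max_lag-1
    (a common correlogram form: sum over frame_len samples per lag)."""
    start = m * frame_len
    seg = h_c[start:start + frame_len + max_lag]
    return np.array([np.dot(seg[:frame_len], seg[tau:tau + frame_len])
                     for tau in range(max_lag)])

fs = 8000
t = np.arange(fs) / fs
h = np.sin(2 * np.pi * 200 * t)   # stand-in for one channel output; period = 40 samples
acf = unit_acf(h, m=5, frame_len=160, max_lag=60)
peak = np.argmax(acf[20:]) + 20   # first major peak away from lag 0
print(peak)                       # 40
```

The peak at lag 40 corresponds to the 200 Hz period (8000 / 200 samples), which is how pitch-period candidates are read off the correlogram.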
In step 3, empirical mode decomposition is applied to the autocorrelation function of each time-frequency unit, and the instantaneous frequency of the dominant sound source of each unit is computed from the first intrinsic mode function obtained by the decomposition. Specifically:
The Hilbert-Huang transform decomposes the original autocorrelation function into a series of intrinsic mode functions, and, following the auditory masking effect, the frequency of the first intrinsic mode function is chosen as the instantaneous frequency of the sound source dominating the time-frequency unit.
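Step 3 can be illustrated as follows. This is a deliberately simplified sift (fixed iteration count, no formal stopping criterion), not a full EMD implementation; the mean instantaneous frequency is then taken from the Hilbert phase of the first intrinsic mode function:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import hilbert, argrelextrema

def first_imf(x, n_sift=5):
    """Tiny EMD sketch: repeatedly subtract the mean of the upper/lower spline
    envelopes and return the resulting first intrinsic mode function."""
    h = x.copy()
    idx = np.arange(len(x))
    for _ in range(n_sift):
        mx = argrelextrema(h, np.greater)[0]
        mn = argrelextrema(h, np.less)[0]
        if len(mx) < 4 or len(mn) < 4:
            break
        upper = CubicSpline(mx, h[mx])(idx)
        lower = CubicSpline(mn, h[mn])(idx)
        h = h - (upper + lower) / 2.0
    return h

def mean_inst_freq(imf, fs):
    """Mean instantaneous frequency from the Hilbert phase (edges dropped
    to avoid end effects of the analytic signal)."""
    phase = np.unwrap(np.angle(hilbert(imf)))
    inst = np.diff(phase) * fs / (2 * np.pi)
    q = len(inst) // 4
    return float(np.mean(inst[q:-q]))

fs = 8000
t = np.arange(1600) / fs
x = np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)  # fast + slow component
f = mean_inst_freq(first_imf(x), fs)
print(f)
```

The fast 500 Hz component dominates the extrema, so the first IMF isolates it and the mean instantaneous frequency lands near 500 Hz, mirroring how the dominant source frequency is read from the decomposed autocorrelation function.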
As shown in Figure 3, in step 4 the frequency matching function is computed from the instantaneous frequency according to formula (3). This function serves as the mid-level representation of the fundamental frequency extraction process: it describes the degree to which the average instantaneous frequency of the current time-frequency unit matches each candidate pitch frequency. In formula (3), f̄(c, m) denotes the average instantaneous frequency of the time-frequency unit of channel c in frame m, τ denotes a candidate pitch period (a lag within the considered range), and int() is the rounding function returning the nearest integer.
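Formula (3) itself is not reproduced in this text, so the following is only a hypothetical matching score built from the property the description implies: when the dominant frequency is a harmonic of the candidate pitch 1/τ, their product lies near an integer, and int() measures the deviation:

```python
import numpy as np

def freq_match(f_bar, tau):
    """Hypothetical matching score (NOT the patent's formula (3)): f_bar * tau is
    near an integer exactly when f_bar is a harmonic of the candidate pitch 1/tau."""
    d = np.abs(f_bar * tau - np.rint(f_bar * tau))  # distance to nearest integer, in [0, 0.5]
    return 1.0 - 2.0 * d                            # 1 = perfect harmonic match, 0 = worst

# A unit dominated by 600 Hz matches a 200 Hz candidate (3rd harmonic)
# far better than a 170 Hz candidate.
print(freq_match(600.0, 1 / 200), freq_match(600.0, 1 / 170))   # ~1.0 vs ~0.06
```

Whatever the exact functional form, the key design choice is that it depends only on the average instantaneous frequency of the unit, not on autocorrelation peak heights, which is why it resists the amplitude modulation effect.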
As shown in Figure 4, in step 5 the likelihood probability of each frame's fundamental frequency states is built from the frequency matching function, and a corpus is used to estimate the transition probabilities between fundamental frequency states. Specifically:
First, the likelihood probability of each fundamental frequency state, i.e. the observation probability, is built from the frequency matching function. The likelihood probability that a single pitch period τ1 is present in time-frequency unit u(c, m) is given by formula (4), and the likelihood probability that two pitch periods τ1 and τ2 are present simultaneously in u(c, m) is given by formula (5). In these formulas, x denotes the observed speech signal; ω1 and ω2 denote the single-fundamental-frequency and double-fundamental-frequency states respectively; L(c, m) is the normalized loudness of each time-frequency unit; and Φc is the set of channel positions in the two-dimensional time-frequency representation.
The normalized loudness L(c, m) of each time-frequency unit is computed by formula (6), where E(c, m) denotes the energy of time-frequency unit u(c, m) and N is the number of channels of the filter bank.
Second, the fundamental frequency state of each frame may lie in one of three spaces, namely the zero-, single-, and double-fundamental-frequency spaces:
Ω = Ω0 ∪ Ω1 ∪ Ω2
The transition probabilities between the three fundamental frequency states are obtained by statistics over a pitch-labeled database, where Ωi is a fundamental frequency state space and pij denotes the transition probability from state space Ωi to state space Ωj.
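The corpus statistics for the transition probabilities can be sketched as simple transition counting over per-frame pitch-count labels (the toy corpus and the maximum-likelihood estimator below are illustrative assumptions):

```python
import numpy as np

def state_transition_probs(pitch_labels):
    """Estimate the 3x3 transition matrix between per-frame pitch-count states
    (0, 1, or 2 fundamentals) by counting consecutive-frame transitions in a
    labeled corpus and normalizing each row."""
    counts = np.zeros((3, 3))
    for track in pitch_labels:                 # each track: per-frame pitch counts
        for a, b in zip(track[:-1], track[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy corpus: each list is one utterance labeled with the number of simultaneous
# fundamental frequencies per frame.
corpus = [[0, 0, 1, 1, 1, 2, 2, 1, 0],
          [1, 1, 2, 2, 2, 1, 1, 0, 0]]
P = state_transition_probs(corpus)
print(np.round(P, 2))   # each row sums to 1
```

Row i of P corresponds to p_ij, the probability of moving from state space Ω_i to Ω_j between consecutive frames; self-transitions dominate because pitch-count states persist across frames.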
In step 6 the likelihood probabilities are enhanced to reduce period-doubling errors, the enhanced likelihood probabilities are combined with the transition probabilities, and the multiple pitch contours of the current speech are extracted with the hidden Markov model. As shown in Figure 5, the steps are as follows:
First, the single-fundamental-frequency likelihood function is enhanced according to formula (7), in which m ranges over 2 to 7, meaning that only period-doubling errors caused by the 2nd to 7th harmonics are addressed, and α is a preset coefficient taking a value between 0.6 and 0.8.
This formula enhances the likelihood probability of a single-fundamental-frequency state as follows: among the values of the function of formula (4) at the 1/m positions of the candidate pitch period τ1, the maximum is found; this maximum is multiplied by a coefficient (a value between 0.6 and 0.8) and the product is taken as the frequency matching adjustment; the adjustment is then subtracted from the frequency matching value at the original candidate pitch period point to obtain the enhanced frequency matching value.
Second, the double-fundamental-frequency likelihood function is enhanced. To this end, the double-fundamental-frequency likelihood is first written as the sum of two functions, as shown in formula (8):
g(x|{τ1, τ2}) = p(x|{τ1}) + p_r(τ1, τ2) (8)
where p_r(τ1, τ2) is given by formula (9). In these formulas, τ1 and τ2 are the two candidate pitch period points, and g(x|{τ1, τ2}) denotes the likelihood probability of observing the speech signal x at these two candidate pitch period points.
Then, the two functions are each enhanced by the method of formula (7), giving the result of formula (10):
g_en(x|{τ1, τ2}) = p_en(x|{τ1}) + p_r_en(τ1, τ2) (10)
where p_en(x|{τ1}) denotes the result of enhancing p(x|{τ1}) and p_r_en(τ1, τ2) the result of enhancing p_r(τ1, τ2).
As shown in Figure 6, the enhanced likelihood probabilities obtained in this step are combined with the three fundamental frequency state transition probabilities obtained in step 5, and the multiple pitch contours are obtained by the Viterbi decoding procedure of the hidden Markov model. Note that although the fundamental frequency state space contains three kinds of states, this step only computes the likelihood probabilities of two of them, because the likelihood probability of the zero-fundamental-frequency state (the state with no fundamental frequency) is a preset constant and needs neither computation nor enhancement.
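The final decoding over the three fundamental-frequency state types can be sketched with a generic log-domain Viterbi (all probabilities below are toy values, not taken from the patent):

```python
import numpy as np

def viterbi(log_lik, log_P, log_init):
    """Standard Viterbi decoding: log_lik[t, s] is the (enhanced) log likelihood of
    state s at frame t, log_P the log transition matrix; returns the best state path."""
    T, S = log_lik.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_P      # scores[i, j]: come from i, go to j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_lik[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                   # backtrack through the pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy run over the three pitch-count states (0, 1, or 2 fundamentals per frame).
log_lik = np.log(np.array([[0.8, 0.1, 0.1],
                           [0.2, 0.7, 0.1],
                           [0.1, 0.7, 0.2],
                           [0.1, 0.2, 0.7]]))
log_P = np.log(np.array([[0.7, 0.25, 0.05],
                         [0.2, 0.6, 0.2],
                         [0.05, 0.35, 0.6]]))
log_init = np.log(np.array([0.6, 0.3, 0.1]))
print(viterbi(log_lik, log_P, log_init))   # [0, 1, 1, 2]
```

In the patent's setting the decoded state per frame determines how many fundamental frequencies are present, and the winning candidate pitch periods within each state form the extracted multi-pitch contours.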
Further modifications and variations of the present invention will be apparent to those skilled in the art from this specification. The specification is therefore to be regarded as illustrative, its purpose being to teach those skilled in the art the general manner of carrying out the present invention, and the forms of the invention shown and described herein are to be taken as the presently preferred embodiments.
Claims (8)
1. A multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model, characterized by comprising the following steps:
Step 1: filter the speech signal with an auditory filter bank and frame the filtered signal, obtaining the two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: apply empirical mode decomposition to the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
Step 4: compute a frequency matching function from each instantaneous frequency;
Step 5: build the likelihood probability of each fundamental frequency state from the frequency matching function, and use a corpus to estimate the transition probabilities between fundamental frequency states; the fundamental frequency states include single-fundamental-frequency states and double-fundamental-frequency states;
Step 6: enhance the likelihood probability of each fundamental frequency state, combine the enhanced likelihood probabilities with the corresponding transition probabilities, and use the hidden Markov model to extract the multiple pitch contours of the speech signal.
2. the method for claim 1, it is characterized in that, auditoiy filterbank is utilized to carry out filtering to voice signal in step 1, the output of each wave filter in described auditoiy filterbank is the time-domain signal identical with described voice signal length, to the output windowing sub-frame processing of described each wave filter, the two-dimentional time-frequency obtaining described voice signal is expressed.
3. the method for claim 1, is characterized in that, calculates described in step 2 at the autocorrelation function of each time frequency unit of sense of hearing spectrum by following formula:
Wherein, h (c) is the output of respective filter in described auditoiy filterbank in c filter channel, and m is speech frame sequence number, and n represents discrete time point, and τ is some time delay.
4. the method for claim 1, is characterized in that, on the autocorrelation function of each time frequency unit, carry out empirical mode decomposition described in step 3, step comprises:
Utilize Hilbert-Huang transform that described autocorrelation function is decomposed into a series of essential mode function, and according to auditory masking effect, the instantaneous frequency of frequency leading sound source in this time frequency unit of the essential mode function that first is decomposited.
5. the method for claim 1, is characterized in that, the function of frequency matching described in step 4 is for the degree of the average instantaneous frequency and each candidate pitch frequency matching that describe current time frequency unit, and its computing formula is as follows:
Wherein,
represent the average instantaneous frequency being positioned at the time frequency unit of m frame c passage, τ represents the pitch period of candidate, and int () is bracket function, returns nearest round values.
6. The method of claim 1, characterized in that step 5 specifically comprises:
First, constructing the likelihood probability of each fundamental-frequency state on the basis of the frequency matching function; in the time-frequency unit u(c, m) of channel c of frame m, the likelihood probability of a single pitch period τ1 is as follows:
The likelihood probability of two pitch periods τ1 and τ2 existing simultaneously in the time-frequency unit u(c, m) is as follows:
wherein x denotes the speech signal; ω1 and ω2 denote the single-fundamental-frequency state and the double-fundamental-frequency state, respectively; L(c, m) is the normalized loudness of each time-frequency unit; Φc is the set of channel positions in the two-dimensional time-frequency representation; and F(c, m, τ1) is the frequency matching function;
In the above formulas, the normalized loudness L(c, m) of each time-frequency unit is computed as follows:
wherein E(c, m) denotes the energy of the time-frequency unit u(c, m), and N is the number of channels in the filter bank;
Secondly, the fundamental-frequency state of each frame lies in one of three spaces, namely the zero-fundamental-frequency, single-fundamental-frequency, and double-fundamental-frequency spaces:
Ω = Ω0 ∪ Ω1 ∪ Ω2
The transition probabilities among the three fundamental-frequency states are obtained by statistics over a database annotated with fundamental frequencies:
wherein Ωi is a fundamental-frequency state space, and pij denotes the transition probability from state space Ωi to state space Ωj.
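The transition statistics described at the end of claim 6 amount to counting frame-to-frame state changes in an annotated database and normalizing each row. A minimal sketch, with a toy label sequence standing in for real fundamental-frequency annotations:

```python
# Hedged sketch: estimating the 3x3 transition matrix p_ij between the
# zero-, single- and double-fundamental-frequency state spaces by counting
# consecutive-frame transitions in an annotated corpus. The label sequence
# below is a toy stand-in for real per-frame annotations.
from collections import Counter

STATES = (0, 1, 2)  # number of simultaneous fundamentals in a frame

def estimate_transitions(label_seq):
    counts = Counter(zip(label_seq, label_seq[1:]))
    matrix = []
    for i in STATES:
        row_total = sum(counts[(i, j)] for j in STATES)
        matrix.append([counts[(i, j)] / row_total if row_total else 0.0
                       for j in STATES])
    return matrix

labels = [0, 0, 1, 1, 1, 2, 2, 1, 0, 0, 1, 1, 2, 2, 2, 1]
P = estimate_transitions(labels)
# Each row of P sums to 1 (a proper stochastic matrix).
print([round(sum(row), 6) for row in P])
```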
7. The method of claim 1, characterized in that enhancing the likelihood probability of each fundamental-frequency state in step 6 comprises the following concrete steps:
First, enhancing the single-fundamental-frequency likelihood function according to the formula:
wherein m ranges from 2 to 7, indicating that the enhancement counteracts only the period-doubling errors caused by harmonics of order 2 to 7; α is a predetermined coefficient; p(x|{τ1}) denotes the single-fundamental-frequency likelihood probability of observing the current speech signal x when the pitch period is τ1; and p_en(x|{τ1}) is the enhanced single-fundamental-frequency likelihood probability;
Secondly, enhancing the likelihood function of the double-fundamental-frequency state; to this end, the likelihood probability of the double-fundamental-frequency state is first written as the sum of two functions p(x|{τ1}) and p_r(τ1, τ2):
g(x|{τ1, τ2}) = p(x|{τ1}) + p_r(τ1, τ2)    (8)
wherein p(x|{τ1}) is the likelihood probability of the single-fundamental-frequency state, F(c, m, τi) is the frequency matching function, L(c, m) is the normalized loudness of each time-frequency unit, and c is the channel index;
Then, the two functions p(x|{τ1}) and p_r(τ1, τ2) are each enhanced by the method of formula (7), yielding the likelihood probability of the double-fundamental-frequency state:
g_en(x|{τ1, τ2}) = p_en(x|{τ1}) + p_r_en(τ1, τ2)    (10)
wherein g_en(x|{τ1, τ2}) is the enhanced likelihood probability of the double-fundamental-frequency state, and p_en(x|{τ1}) and p_r_en(τ1, τ2) are the values of p(x|{τ1}) and p_r(τ1, τ2) after enhancement, respectively.
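Formula (7) itself is not reproduced in this text, so the sketch below only illustrates the shape of the enhancement step in claim 7 under a stated assumption: likelihood mass at integer multiples m = 2..7 of a candidate period (period-doubling errors) is folded back onto that candidate, weighted by the coefficient α, and the double-fundamental likelihood is then recombined as in formula (10). The fold-back rule is an assumption, not the patent's formula.

```python
# Hedged sketch of the enhancement in claim 7. ASSUMPTION: enhancement adds
# alpha-weighted likelihood mass from integer multiples m = 2..7 of each
# candidate pitch period (octave/doubling errors) back onto the candidate.
def enhance(p, alpha=0.5):
    """p: dict mapping candidate pitch period (samples) -> likelihood."""
    out = {}
    for tau, val in p.items():
        folded = sum(p.get(m * tau, 0.0) for m in range(2, 8))
        out[tau] = val + alpha * folded
    return out

def double_state_likelihood(p_single, p_r, alpha=0.5):
    """g_en(x|{tau1, tau2}) = p_en(x|{tau1}) + p_r_en(tau1, tau2), as in (10)."""
    p_en = enhance(p_single, alpha)
    p_r_en = enhance(p_r, alpha)
    return {tau: p_en.get(tau, 0.0) + p_r_en.get(tau, 0.0)
            for tau in set(p_en) | set(p_r_en)}

p_single = {40: 0.6, 80: 0.3, 120: 0.1}   # 80 and 120 are multiples of 40
p_en = enhance(p_single)
# Mass at the doubled/tripled periods is folded onto tau = 40:
print(round(p_en[40], 3))   # 0.6 + 0.5 * (0.3 + 0.1) = 0.8
```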
8. A multi-fundamental-frequency extraction device based on empirical mode decomposition and a hidden Markov model, characterized by comprising:
a preprocessing module, which filters the speech signal through an auditory filter bank and frames the filtered signal, obtaining the two-dimensional time-frequency representation and the auditory spectrum of the speech signal;
an autocorrelation function computation module, which computes the autocorrelation function in each time-frequency unit of the auditory spectrum;
an instantaneous frequency computation module, which performs empirical mode decomposition on the autocorrelation function of each time-frequency unit, and computes the instantaneous frequency of the dominant sound source of each time-frequency unit on the basis of the intrinsic mode functions obtained by the decomposition;
a frequency matching function computation module, which computes the frequency matching function on the basis of each instantaneous frequency;
a likelihood and transition probability computation module, which constructs the likelihood probability of each fundamental-frequency state with the frequency matching function, and obtains by statistics over a corpus the transition probabilities between the fundamental-frequency states and between fundamental-frequency values; each fundamental-frequency state comprises a single-fundamental-frequency state and a double-fundamental-frequency state;
a trajectory extraction module, which enhances the likelihood probability of each fundamental-frequency state, combines the enhanced likelihood probabilities with the corresponding transition probabilities, and extracts the multiple pitch contours of the speech signal using the hidden Markov model.
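The module chain of claim 8 can be sketched structurally as below. This is a skeleton only: the bodies are deliberately thin placeholders (the channel "filtering" merely replicates frames, and the EMD, matching, enhancement and HMM stages are elided), since the patent's actual modules would use a gammatone auditory filter bank and a Viterbi decoder over the fundamental-frequency states.

```python
# Hedged structural sketch of the claimed device: one function per module,
# chained in the claimed order. Placeholder bodies, not a real implementation.
import numpy as np

def preprocess(signal, n_channels=4, frame_len=160):
    """Filter bank + framing -> units[channel][frame].
    Placeholder: replicates frames per channel instead of band-filtering."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return [[frames[m] for m in range(n_frames)] for _ in range(n_channels)]

def autocorrelation(unit):
    """Autocorrelation of one time-frequency unit, non-negative lags only."""
    ac = np.correlate(unit, unit, mode="full")
    return ac[len(unit) - 1:]

def extract_contours(signal):
    units = preprocess(signal)
    acs = [[autocorrelation(u) for u in channel] for channel in units]
    # ... EMD, frequency matching, likelihood enhancement, HMM decoding ...
    return acs

fs = 8000
t = np.arange(fs) / fs
acs = extract_contours(np.sin(2 * np.pi * 100 * t))
print(len(acs), len(acs[0]))   # channels x frames
```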
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511023725.3A CN105469807B (en) | 2015-12-30 | 2015-12-30 | A kind of more fundamental frequency extracting methods and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105469807A true CN105469807A (en) | 2016-04-06 |
CN105469807B CN105469807B (en) | 2019-04-02 |
Family
ID=55607432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511023725.3A Expired - Fee Related CN105469807B (en) | 2015-12-30 | 2015-12-30 | A kind of more fundamental frequency extracting methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105469807B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7092881B1 (en) * | 1999-07-26 | 2006-08-15 | Lucent Technologies Inc. | Parametric speech codec for representing synthetic speech in the presence of background noise |
JP2001282267A (en) * | 2000-03-29 | 2001-10-12 | Mega Chips Corp | Speech processing system and speech processing method |
CN101567188A (en) * | 2009-04-30 | 2009-10-28 | 上海大学 | Multi-pitch estimation method for mixed audio signals with combined long frame and short frame |
CN104036785A (en) * | 2013-03-07 | 2014-09-10 | 索尼公司 | Speech signal processing method, speech signal processing device and speech signal analyzing system |
Non-Patent Citations (3)
Title |
---|
YANG SHAO,DELIANG WANG: "Co-channel speaker identification using usable speech extraction based on multi-pitch tracking", 《IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING》 * |
李仕涛: "Research on Multi-Pitch Detection Algorithms" (in Chinese), 《China Masters' Theses Full-text Database》 * |
李鹏, 关勇, 刘文举, 徐波: "Monaural Mixed Speech Separation Based on Multi-Pitch Tracking" (in Chinese), 《Application Research of Computers》 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107316653B (en) * | 2016-04-27 | 2020-06-26 | 南京理工大学 | Improved empirical wavelet transform-based fundamental frequency detection method |
CN107316653A (en) * | 2016-04-27 | 2017-11-03 | 南京理工大学 | A kind of fundamental detection method based on improved experience wavelet transformation |
CN106205638B (en) * | 2016-06-16 | 2019-11-08 | 清华大学 | A kind of double-deck fundamental tone feature extracting method towards audio event detection |
CN106205638A (en) * | 2016-06-16 | 2016-12-07 | 清华大学 | A kind of double-deck fundamental tone feature extracting method towards audio event detection |
CN106448630A (en) * | 2016-09-09 | 2017-02-22 | 腾讯科技(深圳)有限公司 | Method and device for generating digital music file of song |
US10923089B2 (en) | 2016-09-09 | 2021-02-16 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating digital score file of song, and storage medium |
CN106448630B (en) * | 2016-09-09 | 2020-08-04 | 腾讯科技(深圳)有限公司 | Method and device for generating digital music score file of song |
CN111048110A (en) * | 2018-10-15 | 2020-04-21 | 杭州网易云音乐科技有限公司 | Musical instrument identification method, medium, device and computing equipment |
CN109036376A (en) * | 2018-10-17 | 2018-12-18 | 南京理工大学 | A kind of the south of Fujian Province language phoneme synthesizing method |
CN109839272A (en) * | 2019-03-25 | 2019-06-04 | 湖南工业大学 | It is extracted and the average Method for Bearing Fault Diagnosis of auto-correlated population based on failure impact |
CN109839272B (en) * | 2019-03-25 | 2021-01-08 | 湖南工业大学 | Bearing fault diagnosis method based on fault impact extraction and self-correlation ensemble averaging |
CN111312258A (en) * | 2019-12-16 | 2020-06-19 | 随手(北京)信息技术有限公司 | User identity authentication method, device, server and storage medium |
CN114897236A * | 2022-05-09 | 2022-08-12 | 中南大学 | Hidden Markov inference method for magma channel entrance under survey data constraint |
CN114897236B (en) * | 2022-05-09 | 2024-06-07 | 中南大学 | Hidden Markov inference method for magma channel entrance under investigation data constraint |
Also Published As
Publication number | Publication date |
---|---|
CN105469807B (en) | 2019-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105469807A (en) | Multi-fundamental frequency extraction method and multi-fundamental frequency extraction device | |
US11030998B2 (en) | Acoustic model training method, speech recognition method, apparatus, device and medium | |
CN109147796B (en) | Speech recognition method, device, computer equipment and computer readable storage medium | |
Wang et al. | Channel pattern noise based playback attack detection algorithm for speaker recognition | |
CN104835498B (en) | Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter | |
Hu et al. | Pitch‐based gender identification with two‐stage classification | |
CN108520753B (en) | Voice lie detection method based on convolution bidirectional long-time and short-time memory network | |
Mitra et al. | Medium-duration modulation cepstral feature for robust speech recognition | |
Umesh et al. | Scale transform in speech analysis | |
CN104900235A (en) | Voiceprint recognition method based on pitch period mixed characteristic parameters | |
Dua et al. | Performance evaluation of Hindi speech recognition system using optimized filterbanks | |
Wanli et al. | The research of feature extraction based on MFCC for speaker recognition | |
CN103077728B (en) | A kind of patient's weak voice endpoint detection method | |
CN102436809A (en) | Network speech recognition method in English oral language machine examination system | |
Müller et al. | Contextual invariant-integration features for improved speaker-independent speech recognition | |
CN106373559A (en) | Robustness feature extraction method based on logarithmic spectrum noise-to-signal weighting | |
CN111798846A (en) | Voice command word recognition method and device, conference terminal and conference terminal system | |
GROZDIĆ et al. | Comparison of Cepstral Normalization Techniques in Whispered Speech Recognition. | |
Adam et al. | Wavelet cesptral coefficients for isolated speech recognition | |
Singhal et al. | Automatic speech recognition for connected words using DTW/HMM for English/Hindi languages | |
CN104064197A (en) | Method for improving speech recognition robustness on basis of dynamic information among speech frames | |
Patel et al. | Development and implementation of algorithm for speaker recognition for gujarati language | |
Zouhir et al. | Speech Signals Parameterization Based on Auditory Filter Modeling | |
Bharali et al. | Zero crossing rate and short term energy as a cue for sex detection with reference to Assamese vowels | |
Seman et al. | Evaluating endpoint detection algorithms for isolated word from Malay parliamentary speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant ||
CF01 | Termination of patent right due to non-payment of annual fee ||
Granted publication date: 20190402 Termination date: 20211230 |