CN109308894A - Voice modeling method based on Bloomfield's model

Info

Publication number: CN109308894A
Application number: CN201811122154.2A
Authority: CN
Inventors: 王磊; 姚昌华; 贾永兴; 潘晨; 徐煜华; 余晓晗; 张广纯; 张晓博; 缪华; 张宏苏
Original assignee: Army Engineering University of PLA
Current assignee: Army Engineering University of PLA
Priority date: 2018-09-26
Filing date: 2018-09-26
Publication date: 2019-02-05

Abstract

The invention discloses a speech modeling method based on Bloomfield's model. The method comprises the following steps: first, establishing a time-domain Bloomfield's model and analyzing its time-domain characteristics; then establishing a Bloomfield's model of the speech signal and extracting the characteristic parameters of the model; finally, processing the speech signal using the Bloomfield's model. The invention introduces the Bloomfield mathematical model into the speech processing field, establishes a new speech analysis method around the model, represents speech signal characteristics with fewer parameters, and has broad application prospects in speech signal analysis and recognition, especially in speech coding.

Description

A speech modeling method based on Bloomfield's model

Technical field

The present invention relates to the field of computer applications, and in particular to a speech modeling method based on Bloomfield's model.

Background art

Speech signal processing is one of the core technologies in emerging application fields such as the information superhighway, multimedia technology, office automation, modern communications, and intelligent systems. It is an emerging discipline and, at the same time, a broad interdisciplinary field. The Fourier analysis proposed by the French mathematician Fourier in 1822 is widely used in many scientific fields, but its algorithmic complexity long made it hard to apply. A body of digital signal processing theory and algorithms took shape in the mid-1960s; for example, in 1965 Cooley and Tukey proposed the fast Fourier transform algorithm, which greatly reduced the computational cost of the DFT and brought Fourier analysis into truly widespread use. These results form the theoretical and technical basis of digital speech signal processing. With the rapid development of information science and technology, speech signal processing has made great progress. In 1960, G. Fant proposed the well-known linear system model of speech production, also called the glottal flow model or the source-excitation/vocal-tract-filter model; it is still in use today and is the most successful speech production model. In the 1970s, linear prediction (LPC) was proposed for information compression and feature extraction of speech signals; it has become one of the most powerful tools in speech signal processing and is widely used in speech analysis, synthesis, and other applications. The same period saw dynamic programming methods for time alignment between input speech and reference samples. In the early 1980s, vector quantization (VQ), a new and efficient data compression technique based on clustering, was applied to speech signal processing. Describing the generation of speech signals with hidden Markov models (HMM) was a major development of the 1980s, and HMMs have become an important cornerstone of modern speech recognition research. Since the 1990s, new techniques such as artificial neural networks (ANN), wavelet analysis, fractals, and chaos have been applied rapidly in speech signal processing. Today, the development of artificial intelligence and big data technology has raised speech signal processing research to a new level.

There are two ways to build a mathematical model of a speech signal. The first is mechanism analysis, which models the signal according to the principles of speech production, as in the linear prediction (LP) model. The modeling process is clear and the physical meaning is explicit, but the many assumptions and constraints introduced during modeling may bias the model. The second is data analysis. A speech signal processed by computer is in fact a set of data recording sound-pressure variations, so a model can be built by fitting the data directly. This approach relies only on the data themselves; the model fits the data more accurately, but it is less convincing physically, and its generality may have theoretical weaknesses.

Summary of the invention

The purpose of the present invention is to provide a speech modeling method based on Bloomfield's model that has concise speech signal model parameters and can accurately characterize the amplitude spectrum of the speech signal.

The technical solution that achieves this purpose is a speech modeling method based on Bloomfield's model, comprising the following steps:

Step 1: establish the time-domain Bloomfield's model;

Step 2: analyze the time-domain characteristics of the Bloomfield's model;

Step 3: establish the Bloomfield's model of the speech signal;

Step 4: extract the characteristic parameters of the Bloomfield's model;

Step 5: process the speech signal using the Bloomfield's model.

Further, establishing the time-domain Bloomfield's model in step 1 is specifically as follows:

Let the zero-mean stationary time series $X_t$ have the spectral density:

$S(\omega) = (2\pi)^{-1}\sigma^2\,|f(e^{-i\omega})|^2$

Then $X_t$ can be expressed as $X_t = f(z)\varepsilon_t$ or $f^{-1}(z)X_t = \varepsilon_t$;

where $\varepsilon_t$ is white noise;

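Consider the model

where $\varepsilon_t$ is white noise, $p$ is the chosen order, and $\gamma_1, \gamma_2, \ldots, \gamma_p, \sigma^2$ are unknown real constants with $\sigma^2 \ge 0$;

and suppose the following conditions are satisfied:

(I) $f(0) = 1$;

(II) $f(z) \neq 0$ and $f(z) \neq \infty$ for $|z| \le 1$;

(III) $f(z)$ is analytic in $|z| \le 1$, and $f(e^{i\lambda}) \in L_1$, where $L_1$ is a domain in the complex plane;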

Then, on $|z| \le 1$, the Taylor expansion of $f(z)$ is:

where $z$ is the backward shift operator, and the shift operator $f(z)$ is defined as:

where $X_t$ is a stationary time series;

(IV) $(1/n!)\,f^{(n)}(0) = o(n^{-s})$, $s > 1$; then the above infinite series $f(z)$ converges for every sample path of $X_t$. Therefore $X_t$ is a stationary linear zero-mean time series whose spectral density is identical to that of the BF model, and the model

is equivalent to the BF model; that is, for the same white noise sequence $\varepsilon_t$ the two determine the same time series $X_t$;

Therefore the BF model in the time domain is:

The equivalent model is:

Further, establishing the Bloomfield's model of the speech signal in step 3 is specifically as follows:

Let $X_1, X_2, \ldots, X_N$ be $N$ samples of the speech sequence $X_t$. A Bloomfield's model is fitted to the speech sequence $X_t$, i.e. the parameters $\gamma_1, \gamma_2, \ldots, \gamma_p, \sigma^2$ are estimated;

When the order $p$ is known, compute the periodogram of the samples:

The estimate of $\gamma_j$ is:

where $N_0 = [(N-1)/2]$;

and the estimate of $\sigma^2$ is:

where 0.57722 is Euler's constant;

When the order $p$ is unknown, minimize AIC(s):

The minimizer $s_0$ is an estimate of the order $p$; $I_N(\cdot)$ is the periodogram of the samples, $\lambda$ is the phase, and $s$ is a variable.
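The parameter estimation in step 3 can be sketched as follows. This is a minimal illustration, not the patent's own formulas: it assumes the standard log-periodogram estimator for an exponential spectral model, a periodogram normalized as $|\sum_t X_t e^{-it\lambda}|^2 / (2\pi N)$, and Fourier frequencies $\lambda_k = 2\pi k/N$, $k = 1, \ldots, N_0$. The function names `periodogram` and `estimate_bf` are hypothetical.

```python
import numpy as np

EULER_GAMMA = 0.57722  # Euler's constant, as quoted in the text


def periodogram(x):
    """Periodogram at lambda_k = 2*pi*k/N, k = 1..N0 (normalization is an assumption)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # zero-mean series, as the model requires
    n = len(x)
    n0 = (n - 1) // 2                     # N0 = [(N-1)/2]
    spectrum = np.fft.fft(x)[1:n0 + 1]
    lam = 2.0 * np.pi * np.arange(1, n0 + 1) / n
    return lam, np.abs(spectrum) ** 2 / (2.0 * np.pi * n)


def estimate_bf(x, p):
    """Estimate gamma_1..gamma_p and sigma^2 from the log-periodogram (assumed estimator)."""
    lam, i_n = periodogram(x)
    log_i = np.log(i_n)
    j = np.arange(1, p + 1)[:, None]
    # gamma_j ~ (1/N0) * sum_k log I_N(lambda_k) * cos(j * lambda_k)
    gamma = (log_i * np.cos(j * lam)).mean(axis=1)
    # The mean log-periodogram is biased by -0.57722; the constant corrects for it.
    sigma2 = 2.0 * np.pi * np.exp(log_i.mean() + EULER_GAMMA)
    return gamma, sigma2
```

For a first-order autoregressive series with coefficient $\varphi$, the log spectrum expands as $2\sum_j (\varphi^j/j)\cos(j\omega)$ plus a constant, so the estimates should come out near $\gamma_j = \varphi^j / j$, which gives a simple sanity check.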

Further, extracting the characteristic parameters of the Bloomfield's model in step 4 is specifically as follows:

The characteristic parameters of a speech signal include the linear prediction cepstral coefficients (LPCC); the partial correlation coefficients, also called reflection coefficients (REFL), $k_i$, $i = 1, 2, \ldots, p$; the log-area ratio coefficients (LAR), $g_i$, $i = 1, 2, \ldots, p$; the arcsine coefficients (ARCSIN); and the line spectral frequencies (LSF);

LP analysis of one frame of speech yields a set of LP parameters, from which the characteristic parameters of the speech signal are derived;

Bloomfield's model estimation is a parametric method based on speech sample data, and its parameters cannot characterize speech features directly. The predictive coefficients $\{a_1, a_2, \ldots, a_p\}$ equivalent to those of LP, together with the other characteristic parameters, are derived from the model parameters $\gamma_i, \sigma^2$;

Expand the BF model as a Taylor series:

where $z$ is the backward shift operator. The expansion is an infinite-order linear time series model $X_t + A_1X_{t-1} + A_2X_{t-2} + A_3X_{t-3} + \cdots + A_nX_{t-n} + \cdots = \varepsilon_t$, where $\varepsilon_t$ is white noise; once the parameters $\gamma_1, \gamma_2$ are determined, $A_1, A_2, A_3, \ldots, A_n, \ldots$ are determined;

The LP model obtains the predictive coefficients $\{a_1, a_2, \ldots, a_p\}$ by linear prediction analysis under the minimum mean square error criterion;

The prediction error $\varepsilon(n)$ is:

where $s(n)$ is the time series;

Comparing the BF model with the LP model: the LP model obtains the linear predictive coefficients $\{a_1, a_2, \ldots, a_p\}$ under the minimum mean square criterion, whereas when the BF model is expanded into an LP-like form, the predictive coefficients are obtained from the BF model parameters $\gamma_j$. That is, the predictive coefficients are derived from $\gamma_j$, and the criterion used is that of the model itself;

With a suitable time span and model order, and within the allowed error range, the predictive coefficients obtained under the two criteria tend to agree, giving:

a₁=γ₁,

Further, processing the speech signal using the Bloomfield's model in step 5 is specifically as follows:

Step 5.1, speech restoration:

The sampling frequency of the speech signal is 8 kHz. The input speech is pre-emphasized with the filter $1 - 0.975z^{-1}$, windowed, and endpoint-detected in turn; the frame length is 128 points and the frame shift is 64 points; the speech signal is then reconstructed frame by frame according to the BF model and the LP model;

Step 5.2, pitch detection based on the improved BF model:

A numerical filter is introduced in front of the inverse filter. The original signal is first preprocessed by framing, and each preprocessed frame is passed to a low-pass filter, which removes the third and fourth high-frequency formants and high-frequency noise. The low-pass filtered signal is fed to the numerical filter and then through the inverse filter based on the BF model. Autocorrelation is then computed: the autocorrelation function $R(n)$ of each frame is obtained, and the first maximum peak of the autocorrelation function other than at the origin is located; the position of this peak is the pitch period of the frame. Next, $R_{max}/R(0)$ is computed from the autocorrelation function, where $R_{max}$ is the peak of the autocorrelation function excluding the origin and $R(0)$ is the value of $R(n)$ at $n = 0$, and the voiced/unvoiced decision is obtained according to the decision rule. According to the decision, the pitch period is output if the frame is voiced, and the pitch period is set to zero if the frame is unvoiced;

Step 5.3, formant extraction of the speech signal:

Formant peaks are estimated by peak detection on the log spectrum of the speech signal: after the model spectrum is obtained, its logarithm is taken to produce the log-spectrum plot; the formants of the speech signal are extracted using the phase-frequency characteristic and the log magnitude-frequency characteristic;

The LPC coefficients are replaced by the BF model coefficients, expressed as $\{a_k;\ k = 1, 2, \ldots, p\}$;

The formants of the speech signal are estimated from the digital transfer function $H(z)$: the roots of the polynomial of $H(z)$ are found, each root is judged to be a formant pole or a spectral-shaping pole, and the corresponding formant frequencies are determined from the phase angles of these poles. The frequency of the $k$-th formant is $F_k = \theta_k/(2\pi T)$, where $\theta_k$ is the phase angle of the $k$-th pole and $T$ is the sampling period.

Compared with the prior art, the present invention has the following notable advantages: (1) the Bloomfield model is introduced into the speech processing field and a new speech analysis method is built around it; the new model has concise speech signal model parameters and can characterize speech signal features with fewer parameters; (2) the method accurately characterizes the amplitude spectrum of the speech signal and can be applied to speech recognition, speech synthesis, speech coding, and speaker recognition, giving it broad application prospects.

Brief description of the drawings

Fig. 1 is a flow chart of the speech signal pitch detection algorithm of the present invention based on Bloomfield's model.

Fig. 2 shows the results of speech restoration with a 10th-order LP model and the 1st- and 2nd-order BF models of the present invention.

Fig. 3 shows the performance of the algorithm of the present invention, where (a) shows the speech spectrogram and a comparison of the voiced/unvoiced decisions of the two methods, and (b) shows the speech time-domain waveform and the pitch contours extracted by the two methods.

Fig. 4 shows the results of log-spectrum peak detection in the formant extraction experiment, where (a) compares the 12th-order LP model and 12th-order CBF model log-magnitude spectra for a male voice, (b) compares the 12th-order LP model and 16th-order CBF model log-magnitude spectra for a male voice, (c) compares the 12th-order LP model and 12th-order CBF model log-magnitude spectra for a female voice, and (d) compares the 12th-order LP model and 16th-order CBF model log-magnitude spectra for a female voice.

Fig. 5 shows the results of the phase-frequency detection method in the formant extraction experiment.

Detailed description of the embodiments

The present invention is further illustrated with reference to the accompanying drawings and examples.

LP analysis based on the ARMA model is a core technique in speech signal processing. It has been applied successfully in speech recognition, speech synthesis, speech coding, and speaker identification, and has also been used widely and successfully in voice conversion, a topic of recent research. Its importance lies in providing a concise set of speech signal model parameters that characterize the amplitude spectrum of the speech signal fairly accurately, while the computation needed to obtain them is modest. Practical work has shown that, apart from the ARMA model, there is another valuable class of linear time series models independent of it, namely the Bloomfield's model spectral estimation method. The present invention provides a speech modeling method based on Bloomfield's model, in which the speech signal is reconstructed frame by frame using the BF model and the LP model.

The speech modeling method of the present invention based on Bloomfield's model uses the Bloomfield's spectral features of the speech signal to establish the time-domain Bloomfield's model. From the Bloomfield's model parameters estimated for the speech signal, it derives the LP and other speech parameters, models the speech signal, extracts its characteristic parameters, and applies them in speech signal processing. It specifically comprises the following steps:

Step 1: establish the time-domain Bloomfield's model;

Step 2: analyze the time-domain characteristics of the Bloomfield's model;

Step 3: establish the Bloomfield's model of the speech signal;

Step 4: extract the characteristic parameters of the Bloomfield's model;

Step 5: process the speech signal using the Bloomfield's model.
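For orientation before the derivation, the exponential spectral model commonly attributed to Bloomfield is usually written in the form below. This textbook form is stated here as an assumption about what the text calls the BF model. It is consistent with a spectral density of the form $(2\pi)^{-1}\sigma^2|f(e^{-i\omega})|^2$, because $|f(e^{-i\omega})|^2 = \exp\{2\,\mathrm{Re}\sum_j \gamma_j e^{-ij\omega}\}$ for real $\gamma_j$.

$$
f(z) = \exp\Big\{\sum_{j=1}^{p}\gamma_j z^{j}\Big\},
\qquad
S(\omega) = \frac{\sigma^{2}}{2\pi}\exp\Big\{2\sum_{j=1}^{p}\gamma_j\cos(j\omega)\Big\}
$$

With this choice $f(0) = 1$, and $f(z)$ is neither zero nor infinite for $|z| \le 1$, so an exponential $f$ of this kind satisfies the regularity conditions (I) and (II) imposed on the model.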

Let the zero-mean stationary time series $X_t$ have the spectral density:

$S(\omega) = (2\pi)^{-1}\sigma^2\,|f(e^{-i\omega})|^2$ (1)

Then $X_t$ can be expressed as:

$X_t = f(z)\varepsilon_t$ or $f^{-1}(z)X_t = \varepsilon_t$ (2)

where $\varepsilon_t$ is white noise;

Consider the following model

and suppose the following conditions are satisfied:

(I) $f(0) = 1$;

(II) $f(z) \neq 0$ and $f(z) \neq \infty$ for $|z| \le 1$;

Then, on $|z| \le 1$, the Taylor expansion of $f(z)$ is:

where $z$ is the backward shift operator, and the shift operator $f(z)$ can be defined as:

where $X_t$ is a stationary time series;

(IV) $(1/n!)\,f^{(n)}(0) = o(n^{-s})$, $s > 1$; then the above infinite series $f(z)$ converges for every sample path of $X_t$. Therefore the $X_t$ determined by formula (3) is a stationary linear zero-mean time series whose spectral density is identical to that of the BF model, and the $X_t$ model determined by formula (3) is equivalent to the BF model; that is, for the same white noise sequence $\varepsilon_t$ the two determine the same time series $X_t$. The $X_t$ model determined by formula (3) is therefore called the BF model in the time domain. The equivalent model is:

When the order $p$ is known, the periodogram of the samples can be computed:

The estimate of $\gamma_j$ is:

where $N_0 = [(N-1)/2]$;

and the estimate of $\sigma^2$ can be taken as:

where 0.57722 is Euler's constant;

When the order $p$ is unknown, minimize AIC(s):

The minimizer $s_0$ is an estimate of the order $p$; $I_N(\cdot)$ is the periodogram of the samples, $\lambda$ is the phase, and $s$ is a variable;

Using the Green's function formula, any prediction formula can be established; the formula used in the present invention is:

where the predictor denotes the $l$-step-ahead prediction made from the data up to time $t$, and $a_j$ can be computed from the Green's function by the above formula.

The characteristic parameters of a speech signal include the linear prediction cepstral coefficients (LPCC); the partial correlation coefficients, also called reflection coefficients (REFL), $k_i$, $i = 1, 2, \ldots, p$; the log-area ratio coefficients (LAR), $g_i$, $i = 1, 2, \ldots, p$; the arcsine coefficients (ARCSIN); and the line spectral frequencies (LSF);

LP analysis of one frame of speech yields a set of LP parameters, from which these characteristic parameters are derived for use in different speech signal processing tasks.

Bloomfield's model estimation is a parametric method based on speech sample data, and its parameters cannot characterize speech features directly. The predictive coefficients $\{a_1, a_2, \ldots, a_p\}$ equivalent to those of LP, together with the other characteristic parameters, can be derived from the model parameters $\gamma_i, \sigma^2$;

Comparing the time-domain BF model of formula (6) with the $\{a_1, a_2, \ldots, a_p\}$ obtained under the mean square error criterion of formula (12) shows that in both the time-domain BF model and the LP model the current sample value is related to past sample values. The LP model is linear, whereas the BF model is of exponential type; if the BF model is expanded as a Taylor series at the current sample value, it takes a form similar to the LP model;

Take the BF model:

and expand it as a Taylor series:

where $z$ is the backward shift operator. The expansion is an infinite-order linear time series model:

$X_t + A_1X_{t-1} + A_2X_{t-2} + A_3X_{t-3} + \cdots + A_nX_{t-n} + \cdots = \varepsilon_t$

where $\varepsilon_t$ is white noise; once the parameters $\gamma_1, \gamma_2$ are determined, $A_1, A_2, A_3, \ldots, A_n, \ldots$ are determined;
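The step from the $\gamma_j$ to the expansion coefficients $A_n$ can be sketched with a short recursion. This is an illustration under stated assumptions, not the patent's own derivation: it assumes $f(z) = \exp\{\sum_j \gamma_j z^j\}$ and expands $f^{-1}(z) = \exp\{-\sum_j \gamma_j z^j\} = \sum_n A_n z^n$. Differentiating gives $n A_n = -\sum_{k=1}^{\min(n,p)} k\,\gamma_k A_{n-k}$ with $A_0 = 1$. The function name `bf_expansion_coefficients` is hypothetical.

```python
import numpy as np


def bf_expansion_coefficients(gamma, n_terms):
    """Coefficients A_0..A_n of exp(-sum_j gamma_j z^j), assuming f(z) = exp(sum_j gamma_j z^j)."""
    gamma = np.asarray(gamma, dtype=float)
    coeffs = np.zeros(n_terms + 1)
    coeffs[0] = 1.0                                   # A_0 = 1 because f(0) = 1
    for n in range(1, n_terms + 1):
        k_max = min(n, len(gamma))
        k = np.arange(1, k_max + 1)
        # n * A_n = -sum_k k * gamma_k * A_{n-k}
        coeffs[n] = -np.sum(k * gamma[:k_max] * coeffs[n - k]) / n
    return coeffs
```

Under this sign convention, LP-style predictive coefficients are $a_n = -A_n$, so the first one is $a_1 = \gamma_1$ and the second is $a_2 = \gamma_2 - \gamma_1^2/2$. Truncating the series at order $p$ is a design choice that the expansion alone does not fix.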

The core idea of the LP model is that adjacent sample values are strongly correlated, so the signal at a given time can largely be predicted from past sample values; that is, each sample can be approximated by a linear combination of several past samples;

The purpose of linear prediction analysis is to find the predictive coefficients $\{a_1, a_2, \ldots, a_p\}$ under the minimum mean square error criterion.

The prediction error $\varepsilon(n)$ is:

where $s(n)$ is the time series;
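Linear prediction under the minimum mean square error criterion can be sketched as below. The sketch assumes the usual form of the prediction error, $\varepsilon(n) = s(n) - \sum_{i=1}^{p} a_i s(n-i)$, and solves the autocorrelation normal equations directly; the function names `lp_coefficients` and `prediction_error` are hypothetical, and a production implementation would more likely use the Levinson-Durbin recursion.

```python
import numpy as np


def lp_coefficients(s, p):
    """Solve the autocorrelation normal equations R a = r for a_1..a_p."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    # Autocorrelation at lags 0..p
    r = np.array([np.dot(s[:n - k], s[k:]) for k in range(p + 1)])
    # Toeplitz matrix with entries r(|i - j|)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])


def prediction_error(s, a):
    """eps(n) = s(n) - sum_i a_i * s(n - i); samples before the start count as zero."""
    s = np.asarray(s, dtype=float)
    err = s.copy()
    for i in range(1, len(a) + 1):
        err[i:] -= a[i - 1] * s[:-i]
    return err
```

On a synthetic first-order autoregressive signal the first coefficient should be close to the generating coefficient, and the prediction error should have lower variance than the signal.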

Comparing the BF model with the LP model: the LP model obtains the linear predictive coefficients $\{a_1, a_2, \ldots, a_p\}$ under the minimum mean square criterion, whereas when the BF model is expanded into an LP-like form, the predictive coefficients are obtained from the BF model parameters $\gamma_j$. That is, the predictive coefficients are derived from $\gamma_j$, and the criterion used is that of the model itself;

Speech signals are correlated over short time spans. With a suitable time span and model order, and within the allowed error range, the predictive coefficients obtained under the two criteria tend to agree, and derivation gives:

Step 5.1, speech restoration:

In the experiment the speech sample is a female voice saying "1234567890", and the sampling frequency of the speech signal is 8 kHz. The input speech is pre-emphasized with the filter $1 - 0.975z^{-1}$, windowed, and endpoint-detected in turn; the frame length is 128 points and the frame shift is 64 points; the speech signal is then reconstructed frame by frame according to the BF model and the LP model;
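The preprocessing in step 5.1 can be sketched as follows, using the stated pre-emphasis filter $1 - 0.975z^{-1}$, a frame length of 128 points, and a frame shift of 64 points. The Hamming window is an assumption (the text only says the signal is windowed), endpoint detection is omitted, and the function names `pre_emphasis` and `frame_signal` are hypothetical.

```python
import numpy as np


def pre_emphasis(x, alpha=0.975):
    """Apply y(n) = x(n) - alpha * x(n - 1)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= alpha * x[:-1]
    return y


def frame_signal(x, frame_len=128, frame_shift=64):
    """Split x into overlapping frames and apply a Hamming window (assumed window type)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

At 8 kHz, one second of speech (8000 samples) gives 124 frames of 16 ms each with 50% overlap; each row of the result can then be passed to the BF or LP fit.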

Step 5.2, pitch detection based on the improved BF model:

A numerical filter is introduced in front of the inverse filter to enhance the periodicity of the voiced speech signal and to overcome the pure BF model's inability to correct pitch-halving errors. The original signal is first preprocessed (framing and so on), and each processed frame is passed to an 800 Hz low-pass filter, which removes the third and fourth high-frequency formants and high-frequency noise. The low-pass filtered signal is fed to the numerical filter, which highlights the periodicity of the voiced speech signal. By the definition of the short-time autocorrelation function, for a quasi-periodic signal the short-time autocorrelation function has large peaks at integer multiples of the pitch period. The signal is then passed through the inverse filter based on the BF model and autocorrelated: the autocorrelation function $R(n)$ of each frame is obtained, and the first maximum peak of the autocorrelation function other than at the origin is located; the position of this peak corresponds to the pitch period of the frame. Next, $R_{max}/R(0)$ is computed from the autocorrelation function, where $R_{max}$ is the peak of the autocorrelation function excluding the origin, and the voiced/unvoiced decision is obtained according to the decision rule. According to the decision, the pitch period is output if the frame is voiced, and the pitch period is set to zero if the frame is unvoiced;

With reference to Fig. 1, the speech signal pitch detection algorithm of the present invention based on Bloomfield's model comprises the following steps:

(1) Remove the mean;

(2) Divide the speech into frames;

(3) Apply low-pass filtering to reduce the influence of high-frequency formants and external high-frequency noise;

(4) Apply the numerical filter, which highlights the periodicity of the voiced speech signal and makes pitch estimation reliable;

(5) Apply the inverse filter, which strengthens the periodic structure of the voiced signal and smooths the envelope of its damped sinusoidal waveform, further enhancing the periodicity of the speech signal and aiding pitch detection;

(6) Make the voiced/unvoiced decision and perform pitch detection.
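The final autocorrelation stage of this flow, steps (1) and (6), can be sketched as below. The low-pass filter, numerical filter, and BF inverse filter of steps (3) to (5) are not reproduced. The search range of 80 to 400 Hz and the decision threshold of 0.4 on $R_{max}/R(0)$ are illustrative assumptions, since the text does not give the decision rule, and the function name `detect_pitch` is hypothetical.

```python
import numpy as np


def detect_pitch(frame, fs=8000, f_min=80.0, f_max=400.0, threshold=0.4):
    """Return the pitch period in samples, or 0 for an unvoiced frame.

    The search range and threshold are illustrative assumptions.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()                           # step (1): remove the mean
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:]    # R(0), R(1), ...
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), n - 1)
    if r[0] <= 0.0 or lag_max < lag_min:
        return 0
    # Largest autocorrelation peak away from the origin
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    # Voiced/unvoiced decision on R_max / R(0)
    return lag if r[lag] / r[0] > threshold else 0
```

Because the plain (biased) autocorrelation tapers with lag, the detected period of a short frame can be off by a sample or two; a 200 Hz tone at 8 kHz should give a lag near 40, while white noise should be classed as unvoiced.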

Step 5.3, formant extraction of the speech signal:

The theoretical basis of speech processing is usually its mathematical model. In terms of the model, speech is produced when, under the action of an excitation signal, a sound wave passes through a resonant cavity, i.e. the vocal tract, and is radiated from the mouth or nose. The poles of the vocal tract frequency response are called formants, and the formant frequencies of speech, i.e. the distribution of the pole frequencies, determine its timbre. One way to extract formants is to estimate the formant peaks by peak detection on the log spectrum of the speech signal. This method computes the model spectrum and then takes its logarithm, giving a log-spectrum plot that displays the formant distribution well. Studies of formant extraction by LPC (linear predictive coding) show that the phase-frequency characteristic can be used to extract speech formants just as the log magnitude-frequency characteristic can;
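Log-spectrum peak detection on a BF model spectrum can be sketched as follows. The sketch assumes the standard exponential spectral form $\log S(\omega) = \log(\sigma^2/2\pi) + 2\sum_j \gamma_j\cos(j\omega)$, which the text does not state explicitly, and uses a plain local-maximum test for the peaks. The function names `bf_log_spectrum` and `spectral_peaks` are hypothetical.

```python
import numpy as np


def bf_log_spectrum(gamma, sigma2, n_freq=513):
    """Log spectrum on a grid over [0, pi], assuming the exponential spectral form."""
    gamma = np.asarray(gamma, dtype=float)
    omega = np.linspace(0.0, np.pi, n_freq)
    j = np.arange(1, len(gamma) + 1)[:, None]
    log_s = np.log(sigma2 / (2.0 * np.pi)) + 2.0 * (gamma[:, None] * np.cos(j * omega)).sum(axis=0)
    return omega, log_s


def spectral_peaks(omega, log_s, fs=8000.0):
    """Frequencies in Hz of the interior local maxima of the log spectrum."""
    inner = log_s[1:-1]
    is_peak = (inner > log_s[:-2]) & (inner > log_s[2:])
    return omega[1:-1][is_peak] * fs / (2.0 * np.pi)
```

Because the log spectrum is a finite cosine series in the $\gamma_j$, a $p$-th order BF model can represent only a limited number of distinct peaks, which is one way to read the order dependence reported in the experiments.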

In the LPC model of the speech signal, the speech sample $s(n)$ can be expressed by formula (13), and the corresponding digital filter transfer function $H(z)$ is:

Formula (15) can be expressed as a cascade of $p$ poles:

where $z_k = r_k\exp(j\theta_k)$ is the $k$-th pole of $H(z)$ in the z-plane. If the filter is stable, all of its poles lie inside the unit circle of the z-plane.

When used for speech analysis, the BF model serves as a kind of linear prediction model. According to the above formula, the LPC coefficients are replaced by the BF model coefficients, expressed as $\{a_k;\ k = 1, 2, \ldots, p\}$.

The formants of the speech signal can be estimated from the digital transfer function $H(z)$ by finding the roots of its polynomial; each root is judged to be a formant pole or a spectral-shaping pole, and the corresponding formant frequencies are derived from the phase angles of these poles. Using formula (16), the frequency of the $k$-th formant is $F_k = \theta_k/(2\pi T)$, where $\theta_k$ is the phase angle of the $k$-th pole and $T$ is the sampling period.
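The root-finding step can be sketched as below. It assumes the all-pole form $H(z) = 1/(1 - \sum_{k=1}^{p} a_k z^{-k})$, which is the usual LPC convention but is not legible in this text, and it returns the frequency of every pole in the upper half-plane without separating formant poles from spectral-shaping poles. The function name `formant_frequencies` is hypothetical.

```python
import numpy as np


def formant_frequencies(a, fs=8000.0):
    """Pole frequencies F_k = theta_k / (2*pi*T) with T = 1/fs.

    Assumes H(z) = 1 / (1 - sum_k a_k z^-k).
    """
    # z^p - a_1 z^(p-1) - ... - a_p has the poles of H(z) as its roots
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    poles = np.roots(poly)
    poles = poles[np.imag(poles) > 0]          # keep one pole of each conjugate pair
    theta = np.angle(poles)
    return np.sort(theta * fs / (2.0 * np.pi))
```

For a single pole pair at radius $r$ and angle $\theta$, the coefficients are $a_1 = 2r\cos\theta$ and $a_2 = -r^2$, so a pair placed at 500 Hz should be recovered exactly. In practice a bandwidth test on the pole radius is a common way to discard spectral-shaping poles.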

Embodiment

The effectiveness of the invention is verified below by simulation examples.

1. Speech restoration experiment. Starting from the sample values $X_1, X_2, \ldots, X_N$ of a known stationary time series $x(t)$, a Bloomfield model is fitted to the sequence. Five speech segments were chosen arbitrarily, and the sums of squared fitting errors after curve fitting of the speech signal with a 12th-order LP model and with 3rd-, 5th-, and 10th-order Bloomfield models were compared; the results are shown in Table 1.

Table 1. Comparison of data fitting errors

A defect of the LP model that is hard to overcome is that the optimal predictive coefficients are sought by minimizing the prediction error during modeling; because existing algorithms easily fall into a local optimum, the optimal predictive coefficients of linear prediction are not necessarily globally optimal. The fitting errors in Table 1 show that the model fits the signal best, with the smallest fitting error, when the order is 3. Table 1 also shows that the fitting error does not decrease as the Bloomfield model order increases, and that for all the selected orders the error is smaller than that of the 12th-order LP model. Fig. 2 shows the "1234567890" speech signal reconstructed by the 10th-order LP model and by the 1st- and 2nd-order BF models. The results show that, compared with the traditional LP model, the BF model restores the speech signal with fewer parameters while keeping the distortion low.

2. Pitch detection experiment. Fig. 3(a) and (b) show the pitch contours extracted by the two methods. The upper part of Fig. 3(a) is the speech spectrogram, and the lower part compares the voiced/unvoiced decisions of the two methods; the upper part of Fig. 3(b) is the speech time-domain waveform, and the lower part shows the pitch contours extracted by the two methods. Overall, the pitch contours extracted by the two methods are broadly similar, but the two contours generally differ where the time-domain waveform is a voiced/unvoiced transition with indistinct pitch. For the 200th speech frame, where the detection results differ most clearly, manual measurement of the segments where the two voiced/unvoiced decisions differ shows that the frame is a typical unvoiced signal. The other differing segments are mainly non-voiced parts: transitions from the tail of a voiced sound to an unvoiced sound, transitions from an unvoiced sound to a voiced sound, and purely unvoiced sounds. By manual measurement, the great majority of the speech in these parts is more appropriately judged non-voiced. As Table 2 shows, compared with the conventional autocorrelation method, speech synthesized from the pitch frequency extracted by the method of the present invention generally has a smaller error relative to the original speech, which further demonstrates that the improved pitch detection method gives somewhat better detection results than the pure BF method.

Table 2. Performance comparison of the two algorithms

	Improved BF algorithm	BF algorithm
EDR (%)	1.71	2.23
Algorithm complexity (about 1500 frames)	13.6 s	11.45 s

3. Formant detection experiment. Formants were extracted by log-spectrum peak detection and by the phase-frequency method. One speech segment each from a male and a female speaker was selected. The speech signal was first pre-emphasized with $H(z) = 1 - \alpha z^{-1}$ to facilitate spectral analysis or vocal tract parameter analysis. Each speech segment was then divided into frames with a frame length of 32 ms and a step of 16 ms, a voiced/unvoiced decision was made for each frame, and the log spectrum was computed for the voiced frames. Fig. 4 compares the log spectra obtained with the LP model and the BF model for the male and female voices. As Fig. 4 shows, for the male voice the BF model log-magnitude spectrum describes the 2nd and 3rd formants well but represents the 1st formant more weakly; this problem becomes less pronounced as the model order increases. For the female voice the problem is less pronounced than for the male voice, and the BF model represents the 3rd and 4th formants better than the LP model. Fig. 5 shows the result of formant extraction with the phase-frequency method.

The above simulations verify the validity and reasonableness of the algorithm proposed by the present invention. The present invention introduces the Bloomfield mathematical model into the speech processing field and builds a new speech analysis method around it. The method characterizes speech signal features with fewer parameters and has broad application prospects in speech signal analysis and recognition, especially in speech coding.

Claims

1. A speech modeling method based on Bloomfield's model, characterized by comprising the following steps:

Step 1: establish the time-domain Bloomfield's model;

Step 2: analyze the time-domain characteristics of the Bloomfield's model;

Step 3: establish the Bloomfield's model of the speech signal;

Step 4: extract the characteristic parameters of the Bloomfield's model;

Step 5: process the speech signal using the Bloomfield's model.

2. The speech modeling method based on Bloomfield's model according to claim 1, characterized in that establishing the time-domain Bloomfield's model in step 1 is specifically as follows:

Let the zero-mean stationary time series $X_t$ have the spectral density:

$S(\omega) = (2\pi)^{-1}\sigma^2\,|f(e^{-i\omega})|^2$

Then $X_t$ can be expressed as $X_t = f(z)\varepsilon_t$ or $f^{-1}(z)X_t = \varepsilon_t$;

where $\varepsilon_t$ is white noise;

Consider the model

and suppose the following conditions are satisfied:

(I) $f(0) = 1$;

(II) $f(z) \neq 0$ and $f(z) \neq \infty$ for $|z| \le 1$;

Then, on $|z| \le 1$, the Taylor expansion of $f(z)$ is:

where $X_t$ is a stationary time series;

(IV) $(1/n!)\,f^{(n)}(0) = o(n^{-s})$, $s > 1$; then the above infinite series $f(z)$ converges for every sample path of $X_t$. Therefore $X_t$ is a stationary linear zero-mean time series whose spectral density is identical to that of the BF model, and the model

Therefore the BF model in the time domain is:

The equivalent model is:

3. the pronunciation modeling method according to claim 2 based on Bloomfield ' s model, which is characterized in that step 3 Bloomfield ' the s model for establishing voice signal, specific as follows:

Let X₁, X₂, …, X_N be N samples of the speech sequence X_t; the Bloomfield's model is fitted to the speech sequence X_t to estimate the parameters γ₁, γ₂, …, γ_p, σ²;

When the order p is known, the periodogram is computed:

The estimate of γ_j is:

where N₀ = [(N-1)/2];

and the estimate of σ² is:

where 0.57722 is Euler's constant;

When the order p is unknown, it is determined by minimizing AIC(s):
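
The estimation procedure of this claim can be sketched as follows, assuming the standard log-periodogram estimator for the exponential model: γ_j is taken as the average of log I(w_k)·cos(j·w_k) over the N₀ Fourier frequencies, and σ² is recovered from the mean log periodogram corrected by Euler's constant. The function name `fit_bf_model`, the 1/(2πN) periodogram normalization, and the resulting factor 2π in σ² are assumptions made for illustration. Order selection by AIC is not shown.

```python
import numpy as np

EULER_GAMMA = 0.57722  # Euler's constant, as quoted in the claim


def fit_bf_model(x, p):
    """Estimate (gamma_1..gamma_p, sigma2) of an order-p BF model from samples x.

    Hypothetical helper: uses the log-periodogram estimator, which is assumed
    (not confirmed) to match the formulas of the claim.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                          # the model assumes a zero-mean series
    n = x.size
    n0 = (n - 1) // 2                         # N0 = [(N-1)/2]
    k = np.arange(1, n0 + 1)
    w = 2.0 * np.pi * k / n                   # Fourier frequencies w_k
    spec = np.fft.fft(x)[1:n0 + 1]
    periodogram = np.abs(spec) ** 2 / (2.0 * np.pi * n)
    periodogram = np.maximum(periodogram, np.finfo(float).tiny)  # guard log(0)
    log_i = np.log(periodogram)
    j = np.arange(1, p + 1)[:, None]
    # gamma_j ~ (1/N0) * sum_k log I(w_k) cos(j w_k)
    gammas = (np.cos(j * w) * log_i).sum(axis=1) / n0
    # log(sigma2 / 2pi) ~ mean(log I) + Euler's constant
    sigma2 = 2.0 * np.pi * np.exp(log_i.mean() + EULER_GAMMA)
    return gammas, sigma2
```

For a white-noise input the estimated γ_j should be close to zero and the estimated σ² close to the noise variance, which gives a quick sanity check before applying the fit to speech frames.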

4. The speech modeling method based on Bloomfield's model according to claim 1, characterized in that extracting the characteristic parameters of the Bloomfield's model in step 4 is specifically as follows:

The estimation of the Bloomfield's model is a parametric method based on speech sample data, and its parameters cannot directly characterize speech features; the prediction coefficients {a₁, a₂, …, a_p} equivalent to those of LP, and other characteristic parameters, are therefore derived from the model features γ_i, σ²;

A Taylor expansion is applied to the BF model:

where z is the backward shift operator; the expansion of the above formula is then an infinite-order linear time series model X_t + A₁X_{t+1} + A₂X_{t+2} + A₃X_{t+3} + … + A_nX_{t+n} + … = ε_t, where ε_t is white noise; once the parameters γ₁, γ₂ are determined, A₁, A₂, A₃, …, A_n, … are determined;

The prediction error ε(n) is:

where s(n) is the time series;

Comparing the BF model with the LP model: the LP model obtains the linear prediction coefficients {a₁, a₂, …, a_p} under the least-mean-square criterion; when the BF model is expanded into a form similar to the LP model, the prediction coefficients are obtained from the BF model parameters γ_j, that is, the prediction coefficients are derived from γ_j, and the criterion used is that of the model itself;

By selecting the time and the model order, the prediction coefficients obtained under the two criteria agree within the allowed error range, giving:

a₁=γ₁,
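
The expansion step can be sketched numerically as follows, assuming the equivalent model has the form exp(−Σγ_j z^j)X_t = ε_t. The series coefficients A_n then follow from the recurrence n·g_n = Σ_k k·c_k·g_{n−k}, which comes from differentiating g(z) = exp(h(z)). The function name `bf_to_series` and its `sign` parameter are hypothetical. Under the common LP convention s(n) ≈ Σ a_i s(n−i), which is also an assumption here, a_i = −A_i, so the first coefficient reproduces a₁ = γ₁.

```python
import numpy as np


def bf_to_series(gammas, n_terms, sign=-1.0):
    """Power-series coefficients g_0..g_{n_terms-1} of exp(sign * sum_j gamma_j z^j).

    Hypothetical helper; sign=-1 gives the coefficients A_n of the assumed
    equivalent model, sign=+1 gives the coefficients of f(z) itself.
    """
    c = sign * np.asarray(gammas, dtype=float)  # c[k-1] multiplies z^k in the exponent
    g = np.zeros(n_terms)
    g[0] = 1.0                                  # exp(0) = 1, i.e. f(0) = 1
    for n in range(1, n_terms):
        kmax = min(n, c.size)
        k = np.arange(1, kmax + 1)
        # From g'(z) = h'(z) g(z):  n * g_n = sum_k k * c_k * g_{n-k}
        g[n] = np.dot(k * c[:kmax], g[n - k]) / n
    return g


# Example: LP-style coefficients a_i = -A_i for an order-2 BF model.
A = bf_to_series([0.3, 0.2], n_terms=6)
a = -A[1:]                                      # a[0] equals gamma_1 = 0.3
```

Truncating the series at a finite number of terms is what makes the comparison with a finite-order LP model possible; how many terms are needed depends on how quickly the A_n decay.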

5. The speech modeling method based on Bloomfield's model according to claim 1, characterized in that processing the speech signal using the Bloomfield's model in step 5 is specifically as follows:

Step 5.1, speech reconstruction:

The sampling frequency of the speech signal is 8 kHz; the input speech is successively pre-emphasized with the filter 1 - 0.975z^{-1}, windowed, and subjected to endpoint detection; a frame length of 128 points and a frame shift of 64 points are chosen, and the speech signal is reconstructed frame by frame according to the BF model and the LP model;

Step 5.2, pitch detection based on the improved BF model:

A numerical filter is introduced at the front end of the inverse filter; the original signal is first pre-processed by framing, and each pre-processed frame is passed to a low-pass filter, which filters the speech signal to remove the third and fourth high-frequency formants and high-frequency noise; the low-pass filtered signal is fed into the numerical filter; it is then passed through the inverse filter based on the BF model, and an autocorrelation operation is performed on the signal to obtain the autocorrelation function R(n) of each frame; the first maximum peak of the autocorrelation function other than the zero point is found, and the position of this peak is the pitch period of the corresponding frame; R_max/R(0) is then computed from the autocorrelation function, where R_max is the peak value of the autocorrelation function other than the zero point and R(0) is the value of R(n) at n = 0, and a voiced/unvoiced decision is obtained according to the decision rule; according to the decision, the pitch period is output if the frame is voiced, and the pitch period is set to zero and output if the frame is unvoiced;

Step 5.3, formant extraction of the speech signal:

Formant peaks are estimated by peak detection on the log spectrum of the speech signal: after the model spectrum is obtained, a logarithm operation is performed to obtain the log spectrum distribution; the formants of the speech signal are extracted using the phase-amplitude characteristic and the log amplitude-frequency characteristic;

The formants of the speech signal are estimated from the digital transfer function H(z): polynomial root finding is performed on H(z), the roots obtained are classified as formant poles or spectral-shaping poles, and the corresponding formant frequencies are determined from the phase distribution of these poles; the frequency of the k-th formant is F_k = θ_k/(2πT), where θ_k is the phase angle of the k-th pole and T is the sampling period.
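
The signal-processing steps of this claim can be sketched as follows. This is a minimal illustration under stated assumptions: the Hamming window, the pitch search range of 60 to 500 Hz, and the voicing threshold of 0.3 on R_max/R(0) are not given in the claim and are chosen only for the example. The low-pass filter, numerical filter, BF-based inverse filter, and the classification of poles into formant and spectral-shaping poles are omitted. All function names are hypothetical.

```python
import numpy as np

FS = 8000          # sampling frequency in Hz (step 5.1)
FRAME_LEN = 128    # frame length in samples (step 5.1)
FRAME_SHIFT = 64   # frame shift in samples (step 5.1)


def preemphasize(x, mu=0.975):
    """Apply the pre-emphasis filter 1 - mu * z^-1."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - mu * x[:-1])


def split_frames(x, frame_len=FRAME_LEN, shift=FRAME_SHIFT):
    """Cut x into overlapping frames; the Hamming window is an assumption."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)


def pitch_from_autocorr(frame, fs=FS, fmin=60.0, fmax=500.0, threshold=0.3):
    """Return the pitch period in samples, or 0 if the frame is judged unvoiced."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # R(0), R(1), ...
    if r[0] <= 0.0:
        return 0                                   # silent frame
    lo, hi = int(fs / fmax), min(int(fs / fmin), len(r) - 1)
    lag = lo + int(np.argmax(r[lo:hi + 1]))        # peak away from the zero point
    # Voiced/unvoiced decision on R_max / R(0); the threshold is assumed.
    return lag if r[lag] / r[0] > threshold else 0


def formant_frequencies(poly_coeffs, fs=FS):
    """Candidate formants F_k = theta_k / (2 pi T), T = 1/fs, from polynomial roots."""
    roots = np.roots(poly_coeffs)
    roots = roots[np.imag(roots) > 0]              # keep one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))
```

A typical use would pre-emphasize the signal, call `split_frames`, and apply `pitch_from_autocorr` to each frame, while `formant_frequencies` takes the coefficients of the denominator polynomial of H(z) in descending powers.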