CN109308894A - Voice modeling method based on Bloomfield's model - Google Patents


Info

Publication number
CN109308894A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811122154.2A
Other languages
Chinese (zh)
Inventor
王磊
姚昌华
贾永兴
潘晨
徐煜华
余晓晗
张广纯
张晓博
缪华
张宏苏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Application filed by Army Engineering University of PLA
Priority to CN201811122154.2A
Publication of CN109308894A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a speech modeling method based on Bloomfield's model. The method comprises the following steps: first, establishing a time-domain Bloomfield's model and analyzing its time-domain characteristics; then establishing a Bloomfield's model of the speech signal and extracting the characteristic parameters of the model; finally, processing the speech signal using the Bloomfield's model. The invention introduces the Bloomfield mathematical model into the field of speech processing, establishes a new speech analysis method around the model, represents speech signal characteristics with fewer parameters, and has broad application prospects in speech signal analysis and recognition, especially in speech coding.

Description

A speech modeling method based on Bloomfield's model
Technical field
The present invention relates to the field of computer applications, and in particular to a speech modeling method based on Bloomfield's model.
Background
Speech signal processing is one of the core technologies in emerging application fields such as the information superhighway, multimedia technology, office automation, modern communications and intelligent systems. It is an emerging discipline and, at the same time, a comprehensive and wide-ranging interdisciplinary field. The Fourier analysis proposed by the French mathematician Fourier in 1822 is widely used in many scientific fields, but the complexity of the algorithm made the threshold for applying it high. The theory and algorithms of digital signal processing that took shape in the mid-1960s, such as the fast Fourier transform algorithm proposed by Cooley and Tukey in 1965, greatly reduced the computational cost of the DFT, so that Fourier analysis came into genuinely wide use; these results are the theoretical and technical basis of digital speech signal processing. With the rapid development of information science and technology, speech signal processing has made great progress. In 1960 G. Fant proposed the well-known linear system model of speech production, also called the source-filter model (source excitation followed by vocal-tract filtering); this model is still in use today and is the most successful model of speech production. In the 1970s, linear prediction (LPC) was proposed for information compression and feature extraction of speech signals; it became the most powerful tool in speech signal processing and is widely used in the analysis and synthesis of speech signals and in every application area, together with dynamic programming methods for time alignment between input speech and reference samples. In the early 1980s, vector quantization (VQ), a new and efficient data compression technique based on clustering, was applied to speech signal processing. The use of hidden Markov models (HMM) to describe the speech production process was a major development of speech processing in the 1980s, and HMMs have become an important cornerstone of modern speech recognition research. Since the 1990s, new techniques such as artificial neural networks (ANN), wavelet analysis, fractals and chaos have been rapidly applied to speech signal processing. At present, the development of artificial intelligence and big data technology has raised research on speech signal processing to a new level.
There are two approaches to building a mathematical model of the speech signal. The first is mechanism analysis, which models the signal mathematically according to the principle of speech production, as in the linear prediction (LP) model; the modeling process is clear and the physical meaning is explicit, but the many assumptions and constraints made during modeling may bias the model. The second is data analysis: the speech signal processed by a computer is in fact a set of recorded sound-pressure data, and the model is built directly by data fitting. This approach relies only on the data itself and fits the data more accurately, but it is less convincing in physical terms, and the generality of the model may have theoretical defects.
Summary of the invention
The object of the invention is to provide a speech modeling method based on Bloomfield's model that has concise speech signal model parameters and can accurately characterize the amplitude spectrum of the speech signal.
The technical solution that achieves this object is a speech modeling method based on Bloomfield's model, comprising the following steps:
Step 1: establish the time-domain Bloomfield's model;
Step 2: analyze the time-domain characteristics of the Bloomfield's model;
Step 3: establish the Bloomfield's model of the speech signal;
Step 4: extract the characteristic parameters of the Bloomfield's model;
Step 5: process the speech signal using the Bloomfield's model.
Further, establishing the time-domain Bloomfield's model in step 1 is specifically as follows:
Let the zero-mean stationary time series $X_t$ have the spectral density:
$S(w) = (2\pi)^{-1} \sigma^2 |f(e^{-iw})|^2$
Then $X_t$ can be expressed as $X_t = f(z)\varepsilon_t$ or $f^{-1}(z) X_t = \varepsilon_t$,
where $\varepsilon_t$ is white noise;
Consider the model:
where $\varepsilon_t$ is white noise, $p$ is the chosen order, and $\gamma_1, \gamma_2, \ldots, \gamma_p, \sigma^2$ are unknown real constant parameters with $\sigma^2 \ge 0$;
and the following conditions are satisfied:
(I) $f(0) = 1$;
(II) $f(z) \neq 0$ and $f(z) \neq \infty$ for $|z| \le 1$;
(III) $f(z)$ is analytic in $|z| \le 1$ and $f(e^{i\lambda}) \in L_1$, where $L_1$ is a domain of the complex plane;
Then, on $|z| \le 1$, the Taylor expansion of $f(z)$ is:
where $z$ is the backward shift operator, and the shift operator $f(z)$ is defined as:
where $X_t$ is a stationary time series;
(IV) $(1/n!) f^{(n)}(0) = o(n^{-s})$, $s > 1$; then the infinite series $f(z)$ converges on every sample path of $X_t$, so $X_t$ is a stationary linear zero-mean time series whose spectral density is identical to that of the BF model, and the model
is equivalent to the BF model, that is, for the same white noise sequence $\varepsilon_t$ it determines the same time series $X_t$;
Therefore the BF model in the time domain is:
The equivalent model is:
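The displayed equations of this step did not survive in this text. As a point of reference only, the exponential model usually attributed to Bloomfield can be written in the notation above as follows. This is an assumption based on the standard literature and on conditions (I) to (IV); it is not a transcription of the patent's own formulas.

```latex
% Assumed standard Bloomfield exponential model (not transcribed from the patent)
X_t = f(z)\,\varepsilon_t, \qquad
f(z) = \exp\Bigl(\sum_{j=1}^{p} \gamma_j z^{j}\Bigr), \qquad f(0) = 1

% Taylor expansion of f(z), with z acting as the backward shift z^{n} X_t = X_{t-n}
f(z) = \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!}\, z^{n}

% Spectral density implied by S(w) = (2\pi)^{-1}\sigma^{2}\,|f(e^{-iw})|^{2}
S(w) = \frac{\sigma^{2}}{2\pi}\exp\Bigl(2\sum_{j=1}^{p} \gamma_j \cos(jw)\Bigr)
```

The last line follows because $|\exp(u)|^2 = \exp(2\,\mathrm{Re}\,u)$ and $\mathrm{Re}\,e^{-ijw} = \cos(jw)$.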
Further, establishing the Bloomfield's model of the speech signal in step 3 is specifically as follows:
Let $X_1, X_2, \ldots, X_N$ be $N$ samples of the speech sequence $X_t$. Fitting the Bloomfield's model to the speech sequence $X_t$ means estimating the parameters $\gamma_1, \gamma_2, \ldots, \gamma_p, \sigma^2$;
When the order $p$ is known, compute the periodogram of $X_t$:
The estimate of $\gamma_j$ is:
where $N_0 = [(N-1)/2]$;
and the estimate of $\sigma^2$ is:
where 0.57722 is Euler's constant;
When the order $p$ is unknown, minimize AIC(s):
The minimizer $s_0$ is an estimate of the order $p$, $I_N(\cdot)$ is the periodogram of $X_t$, $\lambda$ is the phase, and $s$ is the variable.
Further, extracting the characteristic parameters of the Bloomfield's model in step 4 is specifically as follows:
The characteristic parameters of the speech signal include the linear prediction cepstral coefficients (LPCC); the partial correlation coefficients, also called reflection coefficients (REFL), $k_i$, $i = 1, 2, \ldots, p$; the log-area-ratio coefficients (LAR), $g_i$, $i = 1, 2, \ldots, p$; the arcsine coefficients (ARCSIN); and the line spectral frequencies (LSF);
LP analysis is performed on one frame of speech to obtain a set of LP parameters, from which the characteristic parameters of the speech signal are derived;
Bloomfield's model estimation is a parametric method based on the speech sample data, and its parameters do not directly characterize speech features; from the model parameters $\gamma_i, \sigma^2$, prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ equivalent to those of LP, and the other characteristic parameters, are derived;
Apply a Taylor expansion to the BF model:
where $z$ is the backward shift operator; the expansion is an infinite-order linear time series model $X_t + A_1 X_{t+1} + A_2 X_{t+2} + A_3 X_{t+3} + \cdots + A_n X_{t+n} + \cdots = \varepsilon_t$, where $\varepsilon_t$ is white noise; once the parameters $\gamma_1, \gamma_2$ are determined, $A_1, A_2, A_3, \ldots, A_n, \ldots$ are determined;
The LP model obtains the prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ by linear prediction analysis under the minimum mean square error criterion;
The prediction error $\varepsilon(n)$ is:
where $s(n)$ is the time series;
Comparing the BF model with the LP model: the LP model obtains the linear prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ under the minimum mean square criterion, whereas when the BF model is expanded into a form similar to the LP model, the prediction coefficients are obtained from the BF model parameters $\gamma_j$; that is, the prediction coefficients are derived from $\gamma_j$, and the criterion is that of the model itself;
With a suitable choice of time and model order, and within the allowable error range, the prediction coefficients obtained under the two criteria tend to agree, giving:
$a_1 = \gamma_1$,
Further, processing the speech signal with the Bloomfield's model in step 5 is specifically as follows:
Step 5.1, speech restoration:
The sampling frequency of the speech signal is 8 kHz. The input speech is pre-emphasized with the filter $1 - 0.975z^{-1}$, then windowed and passed through endpoint detection; the frame length is 128 points and the frame shift is 64 points; the speech signal is reconstructed frame by frame according to the BF model and the LP model;
Step 5.2, pitch detection based on the improved BF model:
A numerical filter is introduced in front of the inverse filter. The original signal is first preprocessed by framing, and each preprocessed frame is passed to a low-pass filter, which removes the third and fourth high-frequency formants and high-frequency noise; the low-pass-filtered signal is fed to the numerical filter; the signal then passes through the BF-model-based inverse filter and is autocorrelated to obtain the autocorrelation function $R(n)$ of each frame, and the first maximum peak of the autocorrelation function other than the zero-lag point is found; the position of this peak is the pitch period of the frame; the ratio $R_{max}/R(0)$ is then computed, where $R_{max}$ is the peak of the autocorrelation function other than at zero lag and $R(0)$ is the value of $R(n)$ at $n = 0$, and the voiced/unvoiced decision is obtained from a decision rule; according to the decision, the pitch period is output for a voiced frame and is set to zero for an unvoiced frame;
Step 5.3, formant extraction from the speech signal:
Formant peaks are estimated by peak detection on the log spectrum of the speech signal: after the model spectrum is obtained, its logarithm is taken to give the log-spectrum plot; the formants of the speech signal are extracted using the phase-frequency characteristic and the log magnitude-frequency characteristic;
The LPC coefficients are replaced by the BF model coefficients, denoted $\{a_k; k = 1, 2, \ldots, p\}$;
The formants of the speech signal are estimated from the digital transfer function $H(z)$: the roots of the polynomial of $H(z)$ are found, each root is judged to be a formant pole or a spectral-shaping pole, and the corresponding formant frequencies are determined from the phase angles of these poles; the frequency of the $k$-th formant is $F_k = \theta_k / 2\pi T$, where $\theta_k$ is the phase angle of the $k$-th pole and $T$ is the sampling period.
Compared with the prior art, the invention has the following notable advantages: (1) the Bloomfield model is introduced into the field of speech processing and a new speech analysis method is built around it; the new model has concise speech signal model parameters and can characterize speech signal features with fewer parameters; (2) it can accurately characterize the amplitude spectrum of the speech signal and can be applied to speech recognition, speech synthesis, speech coding and speaker recognition, giving it broad application prospects.
Brief description of the drawings
Fig. 1 is a flow chart of the speech signal pitch detection algorithm based on Bloomfield's model according to the invention.
Fig. 2 shows the results of speech restoration with a 10th-order LP model and with 1st- and 2nd-order BF models of the invention.
Fig. 3 shows the performance of the algorithm of the invention, where (a) compares the speech spectrogram with the voiced/unvoiced decisions of the two methods, and (b) shows the speech time-domain waveform and the pitch contours extracted by the two methods.
Fig. 4 shows the results of log-spectrum peak detection in the formant extraction experiment, where (a) compares the log-magnitude spectra of the 12th-order LP model and the 12th-order CBF model for a male voice, (b) compares the log-magnitude spectra of the 12th-order LP model and the 16th-order CBF model for a male voice, (c) compares the log-magnitude spectra of the 12th-order LP model and the 12th-order CBF model for a female voice, and (d) compares the log-magnitude spectra of the 12th-order LP model and the 16th-order CBF model for a female voice.
Fig. 5 shows the results of the phase-frequency detection method in the formant extraction experiment.
Detailed description
The invention is further described below with reference to the drawings and embodiments.
The LP analysis method based on the ARMA model is a core technique in speech signal processing. It has been applied successfully in speech recognition, speech synthesis, speech coding and speaker recognition, and it has also been used widely and successfully in voice conversion, which has been studied in recent years. Its importance lies in providing a concise set of speech signal model parameters that characterize the amplitude spectrum of the speech signal fairly accurately, while the computation needed to obtain them is modest. In practical work it was found that, besides the ARMA model, there is another valuable class of linear time series models independent of it, namely the spectral estimation method of Bloomfield's model. The invention provides a speech modeling method based on Bloomfield's model, in which the speech signal is reconstructed frame by frame with the BF model and the LP model.
The speech modeling method of the invention uses the Bloomfield spectral features of the speech signal to establish a time-domain Bloomfield's model; from the Bloomfield's model parameters estimated on the speech signal it derives the LP parameters and other speech parameters, models the speech signal, extracts its characteristic parameters, and applies them to speech signal processing. It specifically comprises the following steps:
Step 1: establish the time-domain Bloomfield's model;
Step 2: analyze the time-domain characteristics of the Bloomfield's model;
Step 3: establish the Bloomfield's model of the speech signal;
Step 4: extract the characteristic parameters of the Bloomfield's model;
Step 5: process the speech signal using the Bloomfield's model.
Further, establishing the time-domain Bloomfield's model in step 1 is specifically as follows:
Let the zero-mean stationary time series $X_t$ have the spectral density:
$S(w) = (2\pi)^{-1} \sigma^2 |f(e^{-iw})|^2$ (1)
Then $X_t$ can be expressed as:
$X_t = f(z)\varepsilon_t$ or $f^{-1}(z) X_t = \varepsilon_t$ (2)
where $\varepsilon_t$ is white noise;
Consider the following model:
where $\varepsilon_t$ is white noise, $p$ is the chosen order, and $\gamma_1, \gamma_2, \ldots, \gamma_p, \sigma^2$ are unknown real constant parameters with $\sigma^2 \ge 0$;
and the following conditions are satisfied:
(I) $f(0) = 1$;
(II) $f(z) \neq 0$ and $f(z) \neq \infty$ for $|z| \le 1$;
(III) $f(z)$ is analytic in $|z| \le 1$ and $f(e^{i\lambda}) \in L_1$, where $L_1$ is a domain of the complex plane;
Then, on $|z| \le 1$, the Taylor expansion of $f(z)$ is:
where $z$ is the backward shift operator, and the shift operator $f(z)$ can be defined as:
where $X_t$ is a stationary time series;
(IV) $(1/n!) f^{(n)}(0) = o(n^{-s})$, $s > 1$; then the infinite series $f(z)$ converges on every sample path of $X_t$, so the $X_t$ determined by formula (3) is a stationary linear zero-mean time series whose spectral density is identical to that of the BF model. The model for $X_t$ determined by formula (3) is equivalent to the BF model, that is, for the same white noise sequence $\varepsilon_t$ it determines the same time series $X_t$. The model for $X_t$ determined by formula (3) is therefore called the BF model in the time domain, and its equivalent model is:
Further, establishing the Bloomfield's model of the speech signal in step 3 is specifically as follows:
Let $X_1, X_2, \ldots, X_N$ be $N$ samples of the speech sequence $X_t$. Fitting the Bloomfield's model to the speech sequence $X_t$ means estimating the parameters $\gamma_1, \gamma_2, \ldots, \gamma_p, \sigma^2$;
When the order $p$ is known, the periodogram of $X_t$ can be computed:
The estimate of $\gamma_j$ is:
where $N_0 = [(N-1)/2]$;
and the estimate of $\sigma^2$ may be taken as:
where 0.57722 is Euler's constant;
When the order $p$ is unknown, minimize AIC(s):
The minimizer $s_0$ is an estimate of the order $p$, $I_N(\cdot)$ is the periodogram of $X_t$, $\lambda$ is the phase, and $s$ is the variable;
Using the Green's function formula, any prediction formula can be established; the formula used by the invention is:
where the predicted value is the $l$-step-ahead prediction made from the data up to time $t$, and $a_j$ can be computed from the Green's function by the above formula.
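The estimation formulas of this step are not reproduced in this text. The sketch below shows one common way to estimate the parameters of an exponential spectral model from the log periodogram, assuming the spectral density $S(w) = (\sigma^2/2\pi)\exp(2\sum_j \gamma_j \cos(jw))$. The function names `periodogram` and `fit_bloomfield` are hypothetical, and the estimator is a standard log-periodogram form rather than a confirmed copy of the patent's formulas; the constant 0.57722 enters because the mean of the log periodogram is biased downward by Euler's constant.

```python
import cmath
import math

EULER_GAMMA = 0.57722  # Euler's constant, as quoted in the text


def periodogram(x):
    """Return (lambda_k, I_N(lambda_k)) for k = 1..N0, with N0 = (N-1)//2.

    I_N(lambda) = |sum_t x_t * exp(-i*lambda*t)|^2 / (2*pi*N).
    """
    n = len(x)
    n0 = (n - 1) // 2
    values = []
    for k in range(1, n0 + 1):
        lam = 2.0 * math.pi * k / n
        s = sum(xt * cmath.exp(-1j * lam * t) for t, xt in enumerate(x))
        values.append((lam, abs(s) ** 2 / (2.0 * math.pi * n)))
    return values


def fit_bloomfield(x, p):
    """Log-periodogram estimates of gamma_1..gamma_p and sigma^2 (assumed form)."""
    pg = periodogram(x)
    n0 = len(pg)
    # Floor the periodogram to avoid log(0) on degenerate input.
    log_i = [(lam, math.log(max(val, 1e-300))) for lam, val in pg]
    # gamma_j is the j-th cosine coefficient of the log spectrum.
    gammas = [sum(li * math.cos(j * lam) for lam, li in log_i) / n0
              for j in range(1, p + 1)]
    # The mean log periodogram underestimates log(sigma^2 / 2*pi) by Euler's constant.
    mean_log = sum(li for _, li in log_i) / n0
    sigma2 = 2.0 * math.pi * math.exp(mean_log + EULER_GAMMA)
    return gammas, sigma2
```

For unit-variance white noise the estimates of $\gamma_j$ should be close to zero and the estimate of $\sigma^2$ close to one. The direct DFT here costs $O(N^2)$ per frame, which is acceptable for short frames; an FFT would be the usual choice for longer ones.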
Further, extracting the characteristic parameters of the Bloomfield's model in step 4 is specifically as follows:
The characteristic parameters of the speech signal include the linear prediction cepstral coefficients (LPCC); the partial correlation coefficients, also called reflection coefficients (REFL), $k_i$, $i = 1, 2, \ldots, p$; the log-area-ratio coefficients (LAR), $g_i$, $i = 1, 2, \ldots, p$; the arcsine coefficients (ARCSIN); and the line spectral frequencies (LSF);
LP analysis of one frame of speech yields a set of LP parameters, from which these characteristic parameters of the speech signal are derived for use in different areas of speech signal processing.
Bloomfield's model estimation is a parametric method based on the speech sample data, and its parameters do not directly characterize speech features; from the model parameters $\gamma_i, \sigma^2$, prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ equivalent to those of LP, and the other characteristic parameters, can be derived;
Comparing the time-domain BF model of formula (6) with the $\{a_1, a_2, \ldots, a_p\}$ obtained under the mean square error criterion of formula (12) shows that, in both the time-domain BF model and the LP model, the current sample value is related to past sample values. The LP model is linear whereas the BF model is exponential; if the BF model is Taylor-expanded at the current sample value, it takes a form similar to the LP model;
Take the BF model:
and apply a Taylor expansion:
where $z$ is the backward shift operator; the expansion is an infinite-order linear time series model:
$X_t + A_1 X_{t+1} + A_2 X_{t+2} + A_3 X_{t+3} + \cdots + A_n X_{t+n} + \cdots = \varepsilon_t$
where $\varepsilon_t$ is white noise; once the parameters $\gamma_1, \gamma_2$ are determined, $A_1, A_2, A_3, \ldots, A_n, \ldots$ are determined;
The core idea of the LP model is that adjacent sample values are strongly correlated, so the signal at a given instant can to a large extent be predicted from past samples; that is, each sample can be approximated by a linear combination of several past samples;
The purpose of linear prediction analysis is to find the prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ under the minimum mean square error criterion.
The prediction error $\varepsilon(n)$ is:
where $s(n)$ is the time series;
Comparing the BF model with the LP model: the LP model obtains the linear prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ under the minimum mean square criterion, whereas when the BF model is expanded into a form similar to the LP model, the prediction coefficients are obtained from the BF model parameters $\gamma_j$; that is, the prediction coefficients are derived from $\gamma_j$, and the criterion is that of the model itself;
Speech signals are correlated over short intervals; with a reasonable choice of time and model order, and within the allowable error range, the prediction coefficients obtained under the two criteria tend to agree, and the derivation gives:
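The relation between the BF parameters and LP-style prediction coefficients can be illustrated numerically. The sketch below assumes the exponential form $f(z) = \exp(\sum_j \gamma_j z^j)$ and the usual LP convention $\varepsilon(n) = s(n) - \sum_k a_k s(n-k)$; neither formula is reproduced in this text, so both are assumptions, and the function names are hypothetical. Under these assumptions the coefficients of $1/f(z)$ give the infinite-order expansion, and truncating it yields predictor-style coefficients with $a_1 = \gamma_1$.

```python
def exp_series_coeffs(gammas, n_terms):
    """Taylor coefficients c_0..c_{n_terms-1} of exp(sum_j gammas[j-1] * z**j).

    Differentiating f(z) = exp(g(z)) gives f' = g' * f, hence the recurrence
    n * c_n = sum_{k=1}^{min(n, p)} k * gamma_k * c_{n-k}, with c_0 = 1.
    """
    p = len(gammas)
    c = [1.0]  # c_0 = f(0) = 1, condition (I)
    for n in range(1, n_terms):
        acc = 0.0
        for k in range(1, min(n, p) + 1):
            acc += k * gammas[k - 1] * c[n - k]
        c.append(acc / n)
    return c


def lp_like_coeffs(gammas, order):
    """Predictor-style coefficients a_1..a_order from the expansion of 1/f(z).

    1/f(z) = exp(-sum_j gamma_j z^j) = 1 + A_1 z + A_2 z^2 + ..., and with the
    assumed convention eps(n) = s(n) - sum_k a_k s(n-k) we read off a_n = -A_n.
    """
    inv = exp_series_coeffs([-g for g in gammas], order + 1)
    return [-inv[n] for n in range(1, order + 1)]
```

For example, `lp_like_coeffs([0.5, 0.2], 2)` gives $a_1 = \gamma_1 = 0.5$ and $a_2 = \gamma_2 - \gamma_1^2/2 = 0.075$.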
Further, processing the speech signal with the Bloomfield's model in step 5 is specifically as follows:
Step 5.1, speech restoration:
In the experiment the speech sample is a female voice saying "1234567890", and the sampling frequency of the speech signal is 8 kHz. The input speech is pre-emphasized with the filter $1 - 0.975z^{-1}$, then windowed and passed through endpoint detection; the frame length is 128 points and the frame shift is 64 points; the speech signal is reconstructed frame by frame according to the BF model and the LP model;
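The front end described in step 5.1 can be sketched as follows. The pre-emphasis coefficient 0.975, the frame length of 128 points and the frame shift of 64 points come from the text; the Hamming window is an assumption, since the text does not name the window type, and the function names are hypothetical. Endpoint detection is omitted.

```python
import math


def pre_emphasize(x, alpha=0.975):
    """Apply the filter 1 - alpha*z^-1, i.e. y[n] = x[n] - alpha*x[n-1]."""
    y = []
    prev = 0.0  # x[-1] is taken as zero
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y


def hamming(n):
    """Hamming window of length n (assumed window type, requires n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def split_frames(x, frame_len=128, hop=64):
    """Cut x into overlapping windowed frames; a trailing partial frame is dropped."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + i] * window[i] for i in range(frame_len)])
    return frames
```

At 8 kHz a 128-point frame spans 16 ms and the 64-point shift gives 50% overlap between consecutive frames.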
Step 5.2, pitch detection based on the improved BF model:
A numerical filter is introduced in front of the inverse filter to enhance the periodicity of the voiced speech signal and to overcome the inability of the pure BF model to correct pitch-halving errors. The original signal is first preprocessed (framing and so on), and each processed frame is passed to an 800 Hz low-pass filter, which removes the third and fourth high-frequency formants and high-frequency noise. The low-pass-filtered signal is fed to the numerical filter, which emphasizes the periodicity of the voiced speech signal. By the definition of the short-time autocorrelation function, for a quasi-periodic signal the short-time autocorrelation function has large peaks at integer multiples of the pitch period. The signal then passes through the BF-model-based inverse filter and is autocorrelated to obtain the autocorrelation function $R(n)$ of each frame, and the first maximum peak of the autocorrelation function other than the zero-lag point is found; the position of this peak corresponds to the pitch period of the frame. The ratio $R_{max}/R(0)$ is then computed from the autocorrelation function, where $R_{max}$ is the peak of the autocorrelation function other than at zero lag, and the voiced/unvoiced decision is obtained from a decision rule. According to the decision, the pitch period is output for a voiced frame and is set to zero for an unvoiced frame;
With reference to Fig. 1, the speech signal pitch detection algorithm based on Bloomfield's model according to the invention comprises the following steps:
(1) mean removal;
(2) framing of the speech;
(3) low-pass filtering, to reduce the influence of high-frequency formants and external high-frequency noise;
(4) numerical filtering, to emphasize the periodicity of the voiced speech signal and make the pitch estimate reliable;
(5) inverse filtering, to strengthen the periodic structure of the voiced signal: the envelope of the damped sinusoidal waveform of the signal becomes smoother, so the periodicity of the speech signal is further enhanced, which benefits pitch detection;
(6) voiced/unvoiced decision and pitch detection.
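The autocorrelation and decision stages of the steps above can be sketched as follows. This is a minimal illustration that covers only the autocorrelation peak search and the $R_{max}/R(0)$ voicing test; the low-pass filter, the numerical filter and the BF-model inverse filter are not implemented. The lag search range (50 to 400 Hz), the threshold of 0.3 and the function names are assumptions, and the sketch takes the largest peak in the lag range rather than the first maximum peak described in the text.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation R(lag) for lag = 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def detect_pitch(frame, fs=8000, f_min=50.0, f_max=400.0, voicing_threshold=0.3):
    """Return the pitch period in samples, or 0 for an unvoiced frame.

    The search range and the threshold are illustrative assumptions.
    """
    min_lag = int(fs / f_max)
    max_lag = min(int(fs / f_min), len(frame) - 1)
    if max_lag < min_lag:
        return 0  # frame too short to cover the search range
    r = autocorrelation(frame, max_lag)
    if r[0] <= 0.0:
        return 0  # silent frame
    # Largest autocorrelation value away from zero lag.
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])
    # Voicing decision on the ratio R_max / R(0).
    if r[best_lag] / r[0] < voicing_threshold:
        return 0
    return best_lag
```

A frame must be longer than the longest pitch period of interest; a 128-point frame at 8 kHz can only resolve periods below 128 samples, that is, pitch frequencies above about 62.5 Hz.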
Step 5.3, formant extraction from the speech signal:
The theoretical basis of speech processing is usually its mathematical model. In terms of the model, speech is produced when, under the action of an excitation signal, sound waves pass through a resonant cavity, the vocal tract, and are radiated from the mouth or nose. The poles of the vocal-tract transfer frequency response are called formants, and the formant frequencies of speech, that is, the distribution of the pole frequencies, determine its timbre. One method of extracting formants is to estimate the formant peaks by peak detection on the log spectrum of the speech signal: the model spectrum is computed and its logarithm is taken, giving a log-spectrum plot that shows the formant distribution well. Studies of formant extraction with LPC (linear predictive coding) show that the phase-frequency characteristic can be used to extract the formants of the speech signal just as the log magnitude-frequency characteristic can;
In the LPC model of the speech signal, the speech sample $s(n)$ can be expressed by formula (13), and the transfer function $H(z)$ of the corresponding digital filter is:
Formula (15) can be written as a cascade of $p$ poles:
where $z_k = r_k \exp(j\theta_k)$ is the $k$-th pole of $H(z)$ in the z-plane. If the filter is stable, all its poles lie inside the unit circle of the z-plane.
When used for speech analysis, the BF model is treated as a linear prediction model. Following the above formula, the BF model coefficients are used in place of the LPC coefficients and denoted $\{a_k; k = 1, 2, \ldots, p\}$.
The formants of the speech signal can be estimated from the digital transfer function $H(z)$: the roots of the polynomial of $H(z)$ are found, each root is judged to be a formant pole or a spectral-shaping pole, and the corresponding formant frequencies are derived from the phase angles of these poles. By formula (16), the frequency of the $k$-th formant is $F_k = \theta_k / 2\pi T$, where $\theta_k$ is the phase angle of the $k$-th pole and $T$ is the sampling period.
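The pole-angle relation $F_k = \theta_k / 2\pi T$ can be checked on a second-order example. The sketch below assumes the common all-pole form $H(z) = 1/(1 - a_1 z^{-1} - a_2 z^{-2})$, since formula (15) is not reproduced in this text; the function names are hypothetical. A real system of order $p$ would need a general polynomial root finder, which is not shown here.

```python
import cmath
import math


def second_order_poles(a1, a2):
    """Poles of H(z) = 1 / (1 - a1*z^-1 - a2*z^-2), i.e. roots of z^2 - a1*z - a2."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + disc) / 2.0, (a1 - disc) / 2.0


def formant_frequency(pole, fs=8000.0):
    """F_k = theta_k / (2*pi*T) with T = 1/fs and theta_k the pole angle."""
    theta = abs(cmath.phase(pole))
    return theta * fs / (2.0 * math.pi)
```

For a conjugate pole pair at radius 0.95 and angle $2\pi \cdot 500/8000$, the coefficients are $a_1 = 2r\cos\theta$ and $a_2 = -r^2$, and the recovered formant frequency is 500 Hz at an 8 kHz sampling rate.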
Embodiment
The effectiveness of the invention is verified below by simulation examples.
1. Speech restoration experiment. Starting from the sample values $X_1, X_2, \ldots, X_N$ of a known stationary time series $x(t)$, a Bloomfield model is fitted to the series. Five speech segments are chosen arbitrarily, and the sums of squared fitting errors after curve fitting of the speech signal with a 12th-order LP model and with 3rd-, 5th- and 10th-order Bloomfield models are compared; the results are shown in Table 1.
Table 1. Comparison of data fitting errors
A defect of the LP model that is hard to overcome is that the model is built by minimizing the prediction error to obtain the best prediction coefficients; because existing algorithms easily fall into a local optimum, the best prediction coefficients of linear prediction are not necessarily the global optimum. The fitting errors in Table 1 show that the model fits the signal best, with the smallest fitting error, at order 3. Table 1 also shows that the fitting error does not decrease as the Bloomfield model order increases, and that for all the selected model orders the error is smaller than the fitting error of the 12th-order LP model. Fig. 2 shows the "1234567890" speech signal reconstructed by the 10th-order LP model and by the 1st- and 2nd-order BF models. The results show that, compared with the traditional LP model, the BF model restores the speech signal with very few parameters while keeping the distortion small.
2. Pitch detection experiment. Fig. 3(a) and (b) show the results of the two methods. The upper part of Fig. 3(a) is the speech spectrogram and the lower part compares the voiced/unvoiced decisions of the two methods; the upper part of Fig. 3(b) is the speech time-domain waveform and the lower part shows the pitch contours extracted by the two methods. Overall, the pitch contours extracted by the two methods are broadly similar, but where the time-domain waveform corresponds to voiced/unvoiced transitions with indistinct pitch, the two pitch contours generally differ. Taking the speech at frame 200, where the detection results differ most clearly, manual measurement of the part where the two voiced/unvoiced decisions differ shows that this part is a typical unvoiced signal. The other differing parts are mainly non-voiced parts such as transitions from the tail of a voiced sound to an unvoiced sound, transitions from an unvoiced sound to a voiced sound, and pure unvoiced sounds; manual measurement shows that the great majority of the speech in these parts is more appropriately judged non-voiced. Table 2 shows that, compared with the more traditional autocorrelation method, the error between the original speech and the speech synthesized from the pitch frequency extracted by the method of the invention is generally smaller, which further demonstrates that the improved pitch detection method gives somewhat better detection results than the pure BF method.
Table 2. Performance comparison of the two algorithms
Metric                                      Improved BF algorithm    BF algorithm
EDR (%)                                     1.71                     2.23
Algorithm complexity (about 1500 frames)    13.6 s                   11.45 s
3. Formant detection experiment. Formants are extracted by log-spectrum peak detection and by the phase-frequency method, respectively. One speech segment each from a male and a female speaker is selected. The speech signal is first pre-emphasized with $H(z) = 1 - \alpha z^{-1}$ to facilitate spectral analysis or vocal tract parameter analysis; each speech segment is then divided into frames with a frame length of 32 ms and a step of 16 ms, an unvoiced/voiced decision is made for each frame, and the log spectrum is computed for the voiced frames. Fig. 4 compares the log spectra obtained with the LP model and the BF model for the male and female voices. As Fig. 4 shows, for the male voice the BF-model log-magnitude spectrum describes the 2nd and 3rd formants well but represents the 1st formant less well; this problem becomes less prominent as the model order increases. For the female voice the problem is less pronounced than for the male voice, and the BF model represents the 3rd and 4th formants better than the LP model. Fig. 5 shows the formants extracted with the phase-frequency method.
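As an illustration of log-spectrum peak detection, the following Python sketch evaluates a BF-model log spectrum and picks its local maxima. It assumes the exponential spectral form commonly attributed to Bloomfield's model, S(w) = σ²/(2π)·exp(2 Σ γ_j cos(jw)), which is not spelled out in this passage; the function names `bf_log_spectrum` and `spectral_peaks`, the grid size, and the 8 kHz sampling rate are illustrative assumptions as well.

```python
import numpy as np


def bf_log_spectrum(gamma, sigma2, n_freq=256):
    """Log spectrum on w in [0, pi] for the ASSUMED exponential form
    S(w) = sigma2 / (2*pi) * exp(2 * sum_j gamma_j * cos(j*w))."""
    gamma = np.asarray(gamma, dtype=float)
    w = np.linspace(0.0, np.pi, n_freq)
    j = np.arange(1, len(gamma) + 1)[:, None]          # shape (p, 1)
    cos_sum = (gamma[:, None] * np.cos(j * w)).sum(axis=0)
    return w, np.log(sigma2 / (2.0 * np.pi)) + 2.0 * cos_sum


def spectral_peaks(w, log_s, fs=8000):
    """Frequencies in Hz of strict interior local maxima of the log spectrum.

    fs is an assumed sampling rate used only to convert w to Hz.
    """
    mid = log_s[1:-1]
    k = np.flatnonzero((mid > log_s[:-2]) & (mid > log_s[2:])) + 1
    return w[k] * fs / (2.0 * np.pi)
```

A call such as `spectral_peaks(*bf_log_spectrum(gamma, sigma2))` returns the peak frequencies as formant candidates; deciding which peaks are true formants is left out of this sketch.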
The above simulations demonstrate the validity and reasonableness of the algorithm proposed by the present invention. The invention introduces Bloomfield's mathematical model into the field of speech processing and builds a new speech analysis method around it that characterizes speech signal features with fewer parameters; it has broad application prospects in speech signal analysis and recognition, and especially in speech coding.

Claims (5)

1. A voice modeling method based on Bloomfield's model, characterized by comprising the following steps:
Step 1, establishing the time-domain Bloomfield's model;
Step 2, analyzing the time-domain characteristics of Bloomfield's model;
Step 3, establishing the Bloomfield's model of the speech signal;
Step 4, extracting the characteristic parameters of Bloomfield's model;
Step 5, processing the speech signal with Bloomfield's model.
2. The voice modeling method based on Bloomfield's model according to claim 1, characterized in that establishing the time-domain Bloomfield's model in step 1 is specifically as follows:
Let the zero-mean stationary time series $X_t$ have the spectral density:
$S(w) = (2\pi)^{-1}\sigma^2 |f(e^{-iw})|^2$
Then $X_t$ can be expressed as $X_t = f(z)\varepsilon_t$ or $f^{-1}(z)X_t = \varepsilon_t$,
where $\varepsilon_t$ is white noise;
Model
where $\varepsilon_t$ is white noise, $p$ is the set order, and $\gamma_1, \gamma_2, \ldots, \gamma_p, \sigma^2$ are unknown real parameter constants with $\sigma^2 \ge 0$;
and the following conditions are satisfied:
(I) $f(0) = 1$;
(II) $f(z) \neq 0$, $f(z) \neq \infty$, $|z| \le 1$;
(III) $f(z)$ is analytic in $|z| \le 1$, and $f(e^{-iw}) \in L_1$, $L_1$ being a domain of the complex plane;
Then on $|z| \le 1$, the Taylor expansion of $f(z)$ is:
where $z$ is the backward shift operator, and the shift operator $f(z)$ is defined as:
where $X_t$ is a stationary time series;
(IV) $(1/n!)f^{(n)}(0) = o(n^{-s})$, $s > 1$; then the above infinite series $f(z)$ converges on every sample path of $X_t$, so $X_t$ is a stationary linear zero-mean time series whose spectral density is identical to that of the BF model, and the model
is equivalent to the BF model, that is, for the same white noise sequence $\varepsilon_t$ they determine the same time series $X_t$;
Therefore the BF model in the time domain is:
and the equivalent model is:
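For orientation, the worked equation below shows how the spectral density follows if $f(z)$ takes the exponential form usually associated with Bloomfield's model. The formulas of this claim are not reproduced in the text, so the exponential form is an assumption rather than a quotation; the symbols $\gamma_j$, $p$, $\sigma^2$ and $w$ are those defined above.

```latex
% Assumption (not quoted from the claim): f(z) = exp(sum_j gamma_j z^j), gamma_j real
\begin{align*}
f(z) &= \exp\Bigl(\sum_{j=1}^{p} \gamma_j z^{j}\Bigr), \qquad f(0) = 1, \\
|f(e^{-iw})|^{2} &= f(e^{-iw})\,\overline{f(e^{-iw})}
  = \exp\Bigl(\sum_{j=1}^{p} \gamma_j \bigl(e^{-ijw} + e^{ijw}\bigr)\Bigr)
  = \exp\Bigl(2\sum_{j=1}^{p} \gamma_j \cos(jw)\Bigr), \\
S(w) &= (2\pi)^{-1}\sigma^{2}\,|f(e^{-iw})|^{2}
  = \frac{\sigma^{2}}{2\pi}\exp\Bigl(2\sum_{j=1}^{p} \gamma_j \cos(jw)\Bigr).
\end{align*}
```

Under this assumption the logarithm of the spectral density is a finite cosine series in $w$ with coefficients $2\gamma_j$, and condition (I) holds automatically.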
3. The voice modeling method based on Bloomfield's model according to claim 2, characterized in that establishing the Bloomfield's model of the speech signal in step 3 is specifically as follows:
Let $X_1, X_2, \ldots, X_N$ be $N$ samples of the speech sequence $X_t$; Bloomfield's model is fitted to the speech sequence $X_t$ and the parameters $\gamma_1, \gamma_2, \ldots, \gamma_p, \sigma^2$ are estimated;
When the order $p$ is known, compute the periodogram of the samples:
The estimate of $\gamma_j$ is:
where $N_0 = [(N-1)/2]$;
and the estimate of $\sigma^2$ is:
where 0.57722 is the Euler constant;
When the order $p$ is unknown, minimize AIC(s):
The minimizer $s_0$ is an estimate of the order $p$, where $I_N(\lambda)$ is the periodogram of the samples, $\lambda$ is the phase, and $s$ is a variable.
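The estimation step can be sketched in Python with standard log-periodogram regression. This is a minimal sketch under stated assumptions, not the patent's own formulas, which are not reproduced in the text: the periodogram normalization $I_N(\lambda) = |\sum_t X_t e^{-it\lambda}|^2 / (2\pi N)$, the estimator forms, and the names `periodogram` and `fit_bf` are assumptions. Order selection by AIC is omitted because its expression is not given here.

```python
import numpy as np

EULER_GAMMA = 0.57722  # Euler constant as quoted in the text


def periodogram(x):
    """Periodogram at lambda_k = 2*pi*k/N for k = 1..N0, N0 = [(N-1)/2].

    Assumed normalization: I_N = |sum_t x_t e^{-i t lambda}|^2 / (2*pi*N).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    n0 = (n - 1) // 2
    spec = np.fft.rfft(x)[1:n0 + 1]            # drop k = 0, keep k = 1..N0
    lam = 2.0 * np.pi * np.arange(1, n0 + 1) / n
    return lam, np.abs(spec) ** 2 / (2.0 * np.pi * n)


def fit_bf(x, p):
    """Log-periodogram estimates of (gamma_1..gamma_p, sigma^2).

    Assumes log S(w) = log(sigma^2 / (2*pi)) + 2 * sum_j gamma_j * cos(j*w).
    """
    lam, i_n = periodogram(x)
    log_i = np.log(i_n)
    j = np.arange(1, p + 1)[:, None]           # shape (p, 1)
    # Average of log I_N(lambda_k) * cos(j * lambda_k) over k = 1..N0.
    gamma = (np.cos(j * lam) * log_i).mean(axis=1)
    # The Euler constant corrects the bias of the log periodogram.
    sigma2 = 2.0 * np.pi * np.exp(log_i.mean() + EULER_GAMMA)
    return gamma, sigma2
```

For white noise the sketch should return `gamma` close to zero and `sigma2` close to the noise variance, which gives a quick sanity check of the assumed normalization.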
4. The voice modeling method based on Bloomfield's model according to claim 1, characterized in that extracting the characteristic parameters of Bloomfield's model in step 4 is specifically as follows:
The characteristic parameters of the speech signal include the linear prediction cepstral coefficients LPCC; the partial correlation coefficients, also called reflection coefficients, REFL, $k_i$, $i = 1, 2, \ldots, p$; the log area ratio coefficients LAR, $g_i$, $i = 1, 2, \ldots, p$; the arcsine coefficients ARCSIN; and the line spectral frequencies LSF;
LP analysis is performed on a frame of speech to obtain a set of LP parameters, from which the characteristic parameters of the speech signal are derived;
Estimation of Bloomfield's model is a parametric method based on the speech sample data, and its parameters cannot directly characterize speech features; the model parameters $\gamma_i$, $\sigma^2$ are therefore used to derive prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ equivalent to those of LP, together with the other characteristic parameters;
The BF model is expanded as a Taylor series:
where $z$ is the backward shift operator; the expansion above is then an infinite-order linear time series model $X_t + A_1X_{t+1} + A_2X_{t+2} + A_3X_{t+3} + \cdots + A_nX_{t+n} + \cdots = \varepsilon_t$, where $\varepsilon_t$ is white noise; once the BF model parameters are determined, $A_1, A_2, A_3, \ldots, A_n, \ldots$ are determined;
The LP model obtains the prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ by linear prediction analysis under the minimum mean square error criterion;
Its prediction error $\varepsilon(n)$ is:
where $s(n)$ is the time series;
Comparing the BF model with the LP model: the LP model obtains the linear prediction coefficients $\{a_1, a_2, \ldots, a_p\}$ under the least-mean-square criterion, whereas when the BF model is expanded into a form similar to the LP model, the prediction coefficients are obtained from the BF model parameters $\gamma_j$, that is, the prediction coefficients are derived from $\gamma_j$, and the criterion used is that of the model itself;
With a suitable choice of time and model order, and within the allowed error range, the prediction coefficients obtained under the two criteria tend to agree, giving:
$a_1 = \gamma_1$,
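The relation between the BF parameters and LP-style prediction coefficients can be illustrated with a short Python sketch. It assumes the inverse filter has the exponential form $f^{-1}(z) = \exp(-\sum_j \gamma_j z^j) = 1 + A_1 z + A_2 z^2 + \cdots$ and the sign convention $\varepsilon(n) = s(n) - \sum_i a_i s(n-i)$, so that $a_i = -A_i$; neither formula is reproduced in the text, and the function name `bf_to_lp` is hypothetical.

```python
def bf_to_lp(gamma, n_coef):
    """Expand exp(-sum_j gamma_j z^j) = 1 + A_1 z + A_2 z^2 + ... and
    return the LP-style coefficients a_i = -A_i for i = 1..n_coef.

    Both the exponential form and the sign convention are assumptions.
    """
    p = len(gamma)
    A = [1.0]                                   # A_0 = 1
    for n in range(1, n_coef + 1):
        # With g(z) = -sum_j gamma_j z^j and F = exp(g), F' = g' F gives
        #   n * A_n = -sum_{j=1}^{min(n, p)} j * gamma_j * A_{n-j}
        acc = sum(j * gamma[j - 1] * A[n - j]
                  for j in range(1, min(n, p) + 1))
        A.append(-acc / n)
    return [-a for a in A[1:]]
```

Under these assumptions the first coefficient equals $\gamma_1$ exactly, in line with the relation stated above, while higher coefficients pick up products of the $\gamma_j$ (for example $a_2 = \gamma_2 - \gamma_1^2/2$).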
5. The voice modeling method based on Bloomfield's model according to claim 1, characterized in that processing the speech signal with Bloomfield's model in step 5 is specifically as follows:
Step 5.1, speech restoration:
The sampling frequency of the speech signal is 8 kHz; the input speech is successively pre-emphasized with the filter $1 - 0.975z^{-1}$, windowed, and endpoint-detected; with a frame length of 128 points and a frame shift of 64 points, the speech signal is reconstructed frame by frame according to the BF model and the LP model;
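The front end of step 5.1 can be sketched as follows. The pre-emphasis coefficient 0.975, the 128-point frame length and the 64-point frame shift come from the text; the Hamming window and the names `preemphasize` and `frame_signal` are assumptions, and endpoint detection and the model-based reconstruction are left out of this sketch.

```python
import numpy as np


def preemphasize(x, alpha=0.975):
    """Apply the filter 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= alpha * x[:-1]
    return y


def frame_signal(x, frame_len=128, hop=64):
    """Split x into overlapping frames and apply a Hamming window.

    The window type is an assumption; a trailing partial frame is dropped.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

At 8 kHz, one second of speech yields 124 frames of 16 ms each with 50% overlap.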
Step 5.2, pitch detection based on the improved BF model:
A numerical filter is introduced ahead of the inverse filter; the original signal is first pre-processed by framing, and each pre-processed frame is passed to a low-pass filter, which filters the speech signal to remove the third and fourth high-frequency formants and high-frequency noise; the low-pass-filtered signal is fed to the numerical filter and then passed through the inverse filter based on the BF model; an autocorrelation operation is performed on the signal to obtain the autocorrelation function $R(n)$ of each frame, and the first maximum peak of the autocorrelation function other than at the zero point is found, the position of which is the pitch period of that frame; a further operation on the autocorrelation function gives $R_{max}/R(0)$, where $R_{max}$ is the peak value of the autocorrelation function other than at the zero point and $R(0)$ is the value of $R(n)$ at $n = 0$, and the unvoiced/voiced decision is obtained according to the decision rule; according to the decision, the pitch period is output if the frame is voiced, and the pitch period is set to zero and output if the frame is unvoiced;
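The autocorrelation stage of step 5.2 can be sketched as below. This is a simplified illustration, not the full pipeline: the low-pass filter, the numerical filter and the BF-based inverse filter are omitted, the peak is searched within an assumed pitch range (60 to 400 Hz) rather than taken as the first peak, and the decision threshold 0.3 on $R_{max}/R(0)$ is a hypothetical value, since the decision rule is not specified in the text.

```python
import numpy as np


def pitch_autocorr(frame, fs=8000, f_min=60.0, f_max=400.0, threshold=0.3):
    """Return (pitch_period_in_samples, voiced) for one frame.

    f_min, f_max and threshold are illustrative assumptions.
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    n = len(x)
    # R(0), R(1), ..., R(n-1): non-negative lags of the autocorrelation.
    r = np.correlate(x, x, mode="full")[n - 1:]
    if r[0] <= 0.0:
        return 0, False                        # silent frame: unvoiced
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), n - 1)
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    voiced = bool(r[lag] / r[0] > threshold)   # decision on Rmax / R(0)
    # Unvoiced frames report a pitch period of zero, as in the text.
    return (lag if voiced else 0), voiced
```

The pitch frequency of a voiced frame is then `fs / lag`; note that a 128-point frame at 8 kHz cannot resolve pitch periods longer than 127 samples.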
Step 5.3, formant extraction of the speech signal:
Formant peaks are estimated by peak detection on the log spectrum of the speech signal: after the model spectrum is obtained, its logarithm is taken to give the log-spectrum distribution; the formants of the speech signal are extracted using the phase-frequency characteristic and the log amplitude-frequency characteristic;
The LPC coefficients are replaced by the BF model coefficients, denoted $\{a_k;\ k = 1, 2, \ldots, p\}$;
The formants of the speech signal are estimated from the digital transfer function $H(z)$: the polynomial of $H(z)$ is solved for its roots, each root is judged to be either a formant pole or a spectral-shaping pole, and the corresponding formant frequencies are determined from the phases of these poles; the frequency of the k-th formant is $F_k = \theta_k / (2\pi T)$, where $\theta_k$ is the phase of the k-th pole and $T$ is the sampling period.
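The root-finding step can be sketched in Python as follows. It assumes the all-pole form $H(z) = 1 / (1 - \sum_k a_k z^{-k})$, which the text does not state explicitly, and it returns every complex pole as a candidate; separating formant poles from spectral-shaping poles (for example by bandwidth) is omitted. The function name `formants_from_lp` is hypothetical.

```python
import numpy as np


def formants_from_lp(a, fs=8000):
    """Formant candidates in Hz from prediction coefficients a_1..a_p.

    Assumes H(z) = 1 / (1 - sum_k a_k z^-k), whose poles are the roots of
    z^p - a_1 z^(p-1) - ... - a_p.
    """
    a = np.asarray(a, dtype=float)
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[poles.imag > 0]              # one pole per conjugate pair
    theta = np.angle(poles)                    # pole phase theta_k
    # F_k = theta_k / (2*pi*T) with sampling period T = 1/fs.
    return np.sort(theta * fs / (2.0 * np.pi))
```

For a single conjugate pole pair at radius 0.95 and phase π/4, the sketch returns one candidate at 1000 Hz for an 8 kHz sampling rate.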
CN201811122154.2A 2018-09-26 2018-09-26 Voice modeling method based on Bloomfield's model Pending CN109308894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811122154.2A CN109308894A (en) 2018-09-26 2018-09-26 Voice modeling method based on Bloomfield's model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811122154.2A CN109308894A (en) 2018-09-26 2018-09-26 Voice modeling method based on Bloomfield's model

Publications (1)

Publication Number Publication Date
CN109308894A true CN109308894A (en) 2019-02-05

Family

ID=65224868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811122154.2A Pending CN109308894A (en) 2018-09-26 2018-09-26 Voice modeling method based on Bloomfield's model

Country Status (1)

Country Link
CN (1) CN109308894A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270934A (en) * 2020-09-29 2021-01-26 天津联声软件开发有限公司 Voice data processing method of NVOC low-speed narrow-band vocoder

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1412742A (en) * 2002-12-19 2003-04-23 北京工业大学 Speech signal base voice period detection method based on wave form correlation method
CN1496559A (en) * 2001-01-12 2004-05-12 艾利森电话股份有限公司 Speech bandwidth extension
CN1758678A (en) * 2005-10-26 2006-04-12 熊猫电子集团有限公司 Voice recognition and voice tag recoding and regulating method of mobile information terminal
CN1815552A (en) * 2006-02-28 2006-08-09 安徽中科大讯飞信息科技有限公司 Frequency spectrum modelling and voice reinforcing method based on line spectrum frequency and its interorder differential parameter
CN104200804A (en) * 2014-09-19 2014-12-10 合肥工业大学 Various-information coupling emotion recognition method for human-computer interaction
CN105810198A (en) * 2016-03-23 2016-07-27 广州势必可赢网络科技有限公司 Channel robust speaker identification method and device based on characteristic domain compensation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张广纯: "Bloomfield’s模型在语音信号处理中的应用研究", 《军事通信技术》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270934A (en) * 2020-09-29 2021-01-26 天津联声软件开发有限公司 Voice data processing method of NVOC low-speed narrow-band vocoder
CN112270934B (en) * 2020-09-29 2023-03-28 天津联声软件开发有限公司 Voice data processing method of NVOC low-speed narrow-band vocoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190205