CA1336212C - Distance measurement control of a multiple detector system - Google Patents

Distance measurement control of a multiple detector system

Info

Publication number
CA1336212C
CA1336212C CA000562766A
Authority
CA
Canada
Prior art keywords
value
frames
voiced
calculating
present
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA000562766A
Other languages
French (fr)
Inventor
David Lynn Thomson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
American Telephone and Telegraph Co Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by American Telephone and Telegraph Co Inc filed Critical American Telephone and Telegraph Co Inc
Application granted granted Critical
Publication of CA1336212C publication Critical patent/CA1336212C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Time-Division Multiplex Systems (AREA)
  • Testing Or Calibration Of Command Recording Devices (AREA)

Abstract

Apparatus for detecting a fundamental frequency in speech utilizing a plurality of voiced detectors and selecting one of those detectors to make the voicing decision utilizing distance measurement values with each value generated by one of the voiced detectors. The voiced detector selected is the one which generated the best distance measurement value. The distance measurement value may be the Mahalanobis distance value or Hotelling's two-sample T2 statistic. Two types of voiced detectors are disclosed: statistical voiced detectors and discriminant voiced detectors. The disclosed statistical voiced detector adapts to changing speech environments by detecting changes in the voice environment in response to classifiers that define certain attributes of the speech.

Description

DISTANCE MEASUREMENT CONTROL OF
A MULTIPLE DETECTOR SYSTEM

Technical Field

This invention relates to determining whether or not speech has a fundamental frequency present. This is also referred to as a voicing decision. More particularly, the invention is directed to selecting one of a plurality of voiced detectors which are concurrently processing speech samples for making the voicing decision, with the selection being based on a distance measurement calculation.
Background and Problem

In low bit rate voice coders, degradation of voice quality is often due to inaccurate voicing decisions. The difficulty in correctly making these voicing decisions lies in the fact that no single speech classifier can reliably distinguish voiced speech from unvoiced speech. The use of multiple voiced detectors and the selection of one of these detectors to determine whether the speech is voiced or unvoiced is disclosed in the paper of J. P. Campbell, et al., "Voiced/Unvoiced Classification of Speech with Applications to the U.S. Government LPC-10E Algorithm," IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986, Tokyo, Vol. 9.11.4, pp. 473-476. This paper discloses the utilization of multiple linear discriminant voiced detectors, each utilizing different weights and threshold values to process the same speech classifiers for each frame of speech. The weights and thresholds for each detector are determined by utilizing training data. For each detector, a different level of white noise is added to the training data. During the processing of actual speech, the detector to be utilized to make the voicing decision is determined by examining the signal-to-noise ratio, SNR. The range of possible values that the SNR can have is subdivided into subranges, with each subrange being assigned to one of the detectors. For each frame, the SNR is calculated, the subrange is determined, and the detector associated with this subrange is selected to make the voicing decision.
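For concreteness, a minimal Python sketch of this prior-art selection scheme; the subrange boundaries and the number of detectors are illustrative assumptions, not values from Campbell's paper:

```python
import bisect

# Hypothetical SNR subrange boundaries in dB; each subrange is assigned to
# the detector whose training data carried the matching white-noise level.
SNR_BOUNDARIES = [5.0, 15.0, 25.0]   # assumed values, for illustration only

def select_detector(snr_db, detectors):
    """Prior-art style selection: choose the detector assigned to the
    subrange into which the measured frame SNR falls."""
    index = bisect.bisect_left(SNR_BOUNDARIES, snr_db)
    return detectors[min(index, len(detectors) - 1)]
```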
A problem with the prior art approach is that it does not perform well with respect to a speech environment in which characteristics of the speech itself have been altered. In addition, the method used by Campbell is only adapted to white noise and cannot adjust for colored noise. Therefore, there exists a need for a method of selecting between a plurality of voiced detectors that allows detection in a varying speech environment.
Solution

The above described problem is solved and a technical advance is achieved by a voiced decision apparatus that selects between a plurality of voiced detectors by comparing separation or merit values generated by each of the voiced detectors. The separation values are also referred to as distance measurements.
Advantageously, the apparatus comprises different types of voiced detectors, such as discriminant and statistical detectors, each generating a separation value. A comparator within the apparatus selects the voiced detector that is generating the largest separation value to make the determination whether the speech is voiced or unvoiced. Advantageously, the separation value may be a statistical, generalized distance value.
All of the voiced detectors indicate whether a frame is voiced or unvoiced, and each of the detectors first determines a discriminant variable for each one of the present and previous frames. After determining the variable, each of the detectors determines mean values for both voiced and unvoiced ones of the previous and present frames. Each detector determines variance values for voiced and unvoiced ones of the previous and present frames. After calculating the means and the variances, each detector determines the separation value from the mean and variance values for the voiced frames and the mean and variance values for the unvoiced frames.
Advantageously, the determination of the separation values is performed in each detector by combining variance values into a weighted sum. The mean value of each of the unvoiced frames is subtracted from the mean value of each of the voiced frames. This subtracted value is squared for each of the frames, and the weighted sum of the variance values is divided into the resulting squared subtracted value. Advantageously, before forming the weighted sum, each detector multiplies the variance value for the voiced frames by the probability of a voiced frame occurring, and multiplies the variance value for the unvoiced frames by the probability of an unvoiced frame occurring. In addition, before dividing the squared subtracted value by the weighted sum, the squared subtracted value is multiplied by the probabilities of a voiced frame occurring and an unvoiced frame occurring. A sketch of this computation follows.
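A minimal Python sketch of that separation computation, assuming scalar per-class means and variances; the names are illustrative, with p_voiced denoting the probability of a voiced frame:

```python
def separation_value(m_voiced, m_unvoiced, k_voiced, k_unvoiced, p_voiced):
    """Probability-weighted separation between the voiced and unvoiced
    classes: squared difference of the class means, scaled by the class
    probabilities and divided by the weighted sum of the class variances."""
    p_unvoiced = 1.0 - p_voiced
    weighted_variance = p_voiced * k_voiced + p_unvoiced * k_unvoiced
    numerator = p_voiced * p_unvoiced * (m_voiced - m_unvoiced) ** 2
    return numerator / weighted_variance
```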

The method comprises the steps of calculating a first merit value defining the separation between voiced and unvoiced frames by the discriminant detector, calculating a second merit value defining separation between voiced and unvoiced frames by said statistical voiced detector, and selecting the detector that calculated the best merit value to indicate whether a frame is voiced or unvoiced.
In accordance with one aspect of the invention there is provided an apparatus for determining voicing in frames of non-training set speech and each of said frames being unvoiced, voiced or silent and said apparatus having a plurality of detecting means for performing a voicing decision and for indicating the voicing decision in a frame, comprising: each of the detecting means comprises means for calculating a merit value defining the separation between voiced and unvoiced decision regions for present and previous ones of said frames of non-training set speech; and means for selecting one of said detecting means to indicate the voicing decision for said present one of said frames of non-training set speech upon the selected one of said detecting means calculating a merit value better than any other one of said detecting means calculated merit value.
In accordance with another aspect of the invention there is provided a method for determining voicing in frames of non-training set speech having a first and second voiced detectors for performing a voicing decision and for indicating the voicing decision in a frame, comprising the steps of: calculating a first merit value defining the separation between voiced and unvoiced decision regions for present and previous ones of said frames of non-training set speech by said first voiced detector; calculating a second merit value defining separation between voiced and unvoiced decision regions for present and previous frames of non-training set speech by said second voiced detector; and selecting said first voiced detector to indicate the voicing decision upon said first merit value being better than said second value and selecting said second voiced detector to indicate the voicing decision upon said second merit value being better than said first value.
Brief Description of the Drawing

The invention may be better understood from the following detailed description which, when read with reference to the drawing in which:
FIG. 1 is a block diagram illustrating the present invention;
FIG. 2 illustrates, in block diagram form, statistical voiced detector 103 of FIG. 1;


FIGS. 3 and 4 illustrate, in greater detail, the functions performed by statistical voiced detector 103 of FIG. 2; and FIG. 5 illustrates, in greater detail, functions performed by block 340 of FIG. 4.
Detailed Description

FIG. 1 illustrates an apparatus for performing the unvoiced/voiced decision operation by selecting between one of two voiced detectors. It would be obvious to one skilled in the art to use more than two voiced detectors in FIG. 1. The selection between detectors 102 and 103 is based on a distance measurement that is generated by each detector and transmitted to distance comparator 104. Each generated distance measurement represents a merit value indicating the correctness of the generating detector's voicing decision. Distance comparator 104 compares the two distance measurement values and controls a multiplexer 105 such that the detector generating the greatest distance measurement value is selected to make the unvoiced/voiced decision. However, for other types of measurements, the lowest merit value would indicate the detector making the most accurate voicing decision. Advantageously, the distance measurement may be the Mahalanobis distance. Advantageously, detector 102 is a discriminant detector, and detector 103 is a statistical detector. However, it would be obvious to one skilled in the art that the detectors could all be of the same type and that there could be more than two detectors present in the system.
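A minimal Python sketch of this selection logic; the Detector interface with decide() and distance() methods is an assumption of the sketch, not the patent's own API:

```python
from typing import Protocol, Sequence

class Detector(Protocol):
    def decide(self, classifiers: Sequence[float]) -> bool: ...  # True = voiced
    def distance(self) -> float: ...                             # merit value

def voicing_decision(detectors: Sequence[Detector],
                     classifiers: Sequence[float]) -> bool:
    """Each detector makes its own voicing decision; the comparator then
    passes through the decision of the detector reporting the largest
    distance value, mirroring comparator 104 and multiplexer 105."""
    decisions = [d.decide(classifiers) for d in detectors]
    merits = [d.distance() for d in detectors]
    best = max(range(len(detectors)), key=lambda i: merits[i])
    return decisions[best]
```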
Consider now the overall operation of the apparatus illustrated in FIG. 1. Classifier generator 101 is responsive to each frame of speech to generate classifiers which advantageously may be the log of the speech energy, the log of the LPC gain, the log area ratio of the first reflection coefficient, and the squared correlation coefficient of two speech segments one frame long which are offset by one pitch period. The calculation of these classifiers involves digitally sampling analog speech, forming frames of the digital samples, and processing those frames, and is well known in the art. Generator 101 transmits the classifiers to detectors 102 and 103 via path 106.
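A rough numpy sketch of how such classifiers might be computed for one frame. This is an illustrative reconstruction under common signal-processing conventions, not the patent's implementation; in particular, the LPC gain is taken here to be the Levinson-Durbin prediction-error power:

```python
import numpy as np

def classifiers(frame, prev_frame, pitch_period, lpc_order=10):
    """Illustrative computation of the four classifiers named above.
    Assumes len(prev_frame) >= pitch_period."""
    eps = 1e-10
    log_energy = np.log(np.dot(frame, frame) + eps)

    # Autocorrelation followed by the Levinson-Durbin recursion.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(lpc_order + 1)])
    err = r[0] + eps
    a = np.zeros(lpc_order)
    k1 = 0.0                                  # first reflection coefficient
    for i in range(lpc_order):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        if i == 0:
            k1 = k
        a[:i] -= k * a[:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    log_lpc_gain = np.log(err + eps)          # prediction-error power

    # Log area ratio of the first reflection coefficient.
    lar1 = np.log((1.0 + k1) / (1.0 - k1 + eps) + eps)

    # Squared correlation of two frame-long segments offset by one pitch period.
    x = np.concatenate([prev_frame, frame]).astype(float)
    s1 = x[-len(frame):]
    s2 = x[-len(frame) - pitch_period:-pitch_period]
    corr = np.dot(s1, s2) / np.sqrt(np.dot(s1, s1) * np.dot(s2, s2) + eps)
    return np.array([log_energy, log_lpc_gain, lar1, corr ** 2])
```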
Detectors 102 and 103 are responsive to the classifiers received via path 106 to make unvoiced/voiced decisions and transmit these decisions via paths 107 and 110, respectively, to multiplexer 105. In addition, the detectors determine a distance measure between voiced and unvoiced frames and transmit these distances via paths 108 and 109 to comparator 104. Advantageously, these distances may be Mahalanobis distances or other generalized distances.
Comparator 104 is responsive to the distances received via paths 108 and 109 to control multiplexer 105 so that the latter multiplexer selects the output of the detector that is generating the largest distance.
FIG. 2 illustrates, in greater detail, statistical voiced detector 103. For each frame of speech, a set of classifiers, also referred to as a vector of classifiers, is received via path 106 from classifier generator 101. Silence detector 201 is responsive to these classifiers to determine whether or not speech is present in the present frame. If speech is present, detector 201 transmits a signal via path 210.
If no speech (silence) is present in the frame, then only subtractor 207 and U/V determinator 205 are operational for that particular frame. Whether speech is present or not, the unvoiced/voiced decision is made for every frame by determinator 205.
In response to the signal from detector 201, classifier averager 202 maintains an average of the individual classifiers received via path 106 by averaging in the classifiers for the present frame with the classifiers for previous frames. If speech (non-silence) is present in the frame, silence detector 201 signals statistical calculator 203, generator 206, and averager 202 via path 210.
Statistical calculator 203 calculates statistical distributions for voiced and unvoiced frames. In particular, calculator 203 is responsive to the signal received via path 210 to calculate the overall probability that any frame is unvoiced and the probability that any frame is voiced. In addition, statistical calculator 203 calculates the statistical value that each classifier would have if the frame was unvoiced and the statistical value that each classifier would have if the frame was voiced. Further, calculator 203 calculates the covariance matrix of the classifiers. Advantageously, that statistical value may be the mean. The calculations performed by calculator 203 are not only based on the present frame but on previous frames as well. Statistical calculator 203 performs these calculations not only on the basis of the classifiers received for the present frame via path 106 and the average of the classifiers received via path 211 but also on the basis of the weight for each classifier and a threshold value, defining whether a frame is unvoiced or voiced, received via path 213 from weights calculator 204.
Weights calculator 204 is responsive to the probabilities, covariance matrix, and statistical values of the classifiers for the present frame, as generated by calculator 203 and received via path 212, to recalculate the values used as weight vector a for each of the classifiers and the threshold value b for the present frame. Then, these new values of a and b are transmitted back to statistical calculator 203 via path 213.
Also, weights calculator 204 transmits the weights and the statistical values for the classifiers in both the unvoiced and voiced regions via path 214, determinator 205, and path 208 to generator 206. The latter generator is responsive to this information to calculate the distance measure which is subsequently transmitted via path 109 to comparator 104 as illustrated in FIG. 1.
U/V determinator 205 is responsive to the information transmitted via paths 214 and 215 to determine whether or not the frame is unvoiced or voiced and to transmit this decision via path 110 to multiplexer 105 of FIG. 1.
Consider now in greater detail the operation of each block illustrated in FIG. 2, which is now given in terms of vector and matrix mathematics. Averager 202, statistical calculator 203, and weights calculator 204 implement an improved EM algorithm similar to that suggested in the article by N. E. Day entitled "Estimating the Components of a Mixture of Normal Distributions", Biometrika, Vol. 56, no. 3, pp. 463-474, 1969. Utilizing the concept of a decaying average, classifier averager 202 calculates the average for the classifiers for the present and previous frames by calculating the following equations 1, 2, and 3:

n = n + 1,  if n < 2000                      (1)

z = 1/n                                      (2)

X_n = (1 - z) X_{n-1} + z x_n                (3)

x_n is a vector representing the classifiers for the present frame, and n is the number of frames that have been processed, up to 2000. z represents the decaying average coefficient, and X_n represents the average of the classifiers over the present and past frames. Statistical calculator 203 is responsive to receipt of the z, x_n, and X_n information to calculate the covariance matrix, T, by first calculating the matrix of sums of squares and products, Q_n, as follows:

Q_n = (1 - z) Q_{n-1} + z x_n x_n'           (4)

After Q_n has been calculated, T is calculated as follows:

T = Q_n - X_n X_n'                           (5)

The means are subtracted from the classifiers as follows:

x_n = x_n - X_n                              (6)

Next, calculator 203 determines the probability that the frame represented by the present vector x_n is unvoiced by solving equation 7 shown below where, advantageously, the components of vector a are initialized as follows: the component corresponding to the log of the speech energy equals 0.3918606, the component corresponding to the log of the LPC gain equals -0.0520902, the component corresponding to the log area ratio of the first reflection coefficient equals 0.5637082, and the component corresponding to the squared correlation coefficient equals 1.361249;
and b initially equals -8.36454:

P(u|x_n) = 1 / (1 + exp(a'x_n + b))          (7)

After solving equation 7, calculator 203 determines the probability that the classifiers represent a voiced frame by solving the following:

P(v|x_n) = 1 - P(u|x_n)                      (8)

Next, calculator 203 determines the overall probability that any frame will be unvoiced by solving equation 9 for p_n:

p_n = (1 - z) p_{n-1} + z P(u|x_n)           (9)

After determining the probability that a frame will be unvoiced, calculator 203 then determines two vectors, u and v, which give the mean values of each classifier for both unvoiced and voiced type frames. Vectors u and v are the statistical averages for unvoiced and voiced frames, respectively. Vector u, the statistical average unvoiced vector, contains the mean values of each classifier if a frame is unvoiced; and vector v, the statistical average voiced vector, gives the mean value for each classifier if a frame is voiced. Vector u for the present frame is solved by calculating equation 10, and vector v is determined for the present frame by calculating equation 11 as follows:

u_n = (1 - z) u_{n-1} + z x_n P(u|x_n)/p_n - z X_n            (10)

v_n = (1 - z) v_{n-1} + z x_n P(v|x_n)/(1 - p_n) - z X_n      (11)

Calculator 203 now communicates the u and v vectors, the T matrix, and probability p_n to weights calculator 204 via path 212.
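A compact numpy sketch of the decaying-average updates in equations 1 through 11, as one plausible reading of the reconstructed equations (the OCR source is ambiguous about whether equations 10 and 11 use the raw or the mean-subtracted classifier vector; the sketch uses the mean-subtracted one):

```python
import numpy as np

class StatCalculator:
    """Running EM-style statistics (equations 1-11); names are illustrative."""
    def __init__(self, dim, a, b):
        self.n = 0
        self.X_bar = np.zeros(dim)        # running classifier mean (eq. 3)
        self.Q = np.zeros((dim, dim))     # sums of squares/products (eq. 4)
        self.p = 0.5                      # overall P(unvoiced) (eq. 9)
        self.u = np.zeros(dim)            # unvoiced class means (eq. 10)
        self.v = np.zeros(dim)            # voiced class means (eq. 11)
        self.a, self.b = a, b

    def update(self, x):
        if self.n < 2000:                                   # eq. 1
            self.n += 1
        z = 1.0 / self.n                                    # eq. 2
        self.X_bar = (1 - z) * self.X_bar + z * x           # eq. 3
        self.Q = (1 - z) * self.Q + z * np.outer(x, x)      # eq. 4
        T = self.Q - np.outer(self.X_bar, self.X_bar)       # eq. 5
        xc = x - self.X_bar                                 # eq. 6
        p_u = 1.0 / (1.0 + np.exp(self.a @ xc + self.b))    # eq. 7
        p_v = 1.0 - p_u                                     # eq. 8
        self.p = (1 - z) * self.p + z * p_u                 # eq. 9
        self.u = ((1 - z) * self.u + z * xc * p_u / self.p
                  - z * self.X_bar)                         # eq. 10
        self.v = ((1 - z) * self.v + z * xc * p_v / (1 - self.p)
                  - z * self.X_bar)                         # eq. 11
        return T
```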
Weights calculator 204 is responsive to this information to calculate new values for vector a and scalar b. These new values are then transmitted back to statistical calculator 203 via path 213. This allows detector 103 to adapt rapidly to changing environments. Advantageously, if the new values for vector a and scalar b are not transmitted back to statistical calculator 203, detector 103 will continue to adapt to changing environments since vectors u and v are being updated. As will be seen, determinator 205 uses vectors u and v as well as vector a and scalar b to make the voicing decision. If n is greater than, advantageously, 99, vector a and scalar b are calculated as follows. Vector a is determined by solving the following equation:

a = T^-1 (v_n - u_n) / [1 - p_n(1 - p_n)(u_n - v_n)' T^-1 (u_n - v_n)]      (12)

Scalar b is determined by solving the following equation:

b = -a'(u_n + v_n)/2 + log[(1 - p_n)/p_n]                                   (13)

After calculating equations 12 and 13, weights calculator 204 transmits vectors a, u, and v to block 205 via path 214. If the frame contained silence, only equation 6 is calculated.
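A short numpy sketch of the weight and threshold updates in equations 12 and 13 as reconstructed above (the normalizing denominator of equation 12 is my reading of the garbled original):

```python
import numpy as np

def update_weights(T, u, v, p):
    """Recalculate weight vector a (eq. 12) and threshold b (eq. 13)."""
    diff = u - v
    T_inv = np.linalg.inv(T)
    denom = 1.0 - p * (1.0 - p) * (diff @ T_inv @ diff)
    a = T_inv @ (v - u) / denom                       # eq. 12
    b = -0.5 * (a @ (u + v)) + np.log((1.0 - p) / p)  # eq. 13
    return a, b
```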
Determinator 205 is responsive to this transmitted information to decide whether the present frame is voiced or unvoiced. If the element of vector (v_n - u_n) corresponding to power is positive, then a frame is declared voiced if the following equation is true:

a'x_n - a'(u_n + v_n)/2 > 0;                                                (14)

or if the element of vector (v_n - u_n) corresponding to power is negative, then a frame is declared voiced if the following equation is true:

a'x_n - a'(u_n + v_n)/2 < 0.                                                (15)

Equation 14 can also be rewritten as:

a'x_n + b - log[(1 - p_n)/p_n] > 0.

Equation 15 can also be rewritten as:

a'x_n + b - log[(1 - p_n)/p_n] < 0.

If the previous conditions are not met, determinator 205 declares the frame unvoiced. Equations 14 and 15 represent decision regions for making the voicing decision. The log term of the rewritten forms of equations 14 and 15 can be eliminated with some change of performance. Advantageously, in the present example, the element corresponding to power is the log of the speech energy.
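A minimal sketch of determinator 205's decision rule in equations 14 and 15; POWER_INDEX, marking which classifier element is the log speech energy, is an assumption of the sketch:

```python
import numpy as np

POWER_INDEX = 0   # assumed position of the log-energy classifier

def is_voiced(x, a, u, v):
    """Voicing decision per equations 14 and 15: the sense of the
    threshold test flips with the sign of the power element of (v - u)."""
    score = a @ x - a @ (u + v) / 2.0
    if (v - u)[POWER_INDEX] >= 0:
        return score > 0       # eq. 14
    return score < 0           # eq. 15
```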
Generator 206 is responsive to the information received via path 214 from calculator 204 to calculate the distance measure, A, as follows. First, the discriminant variable, d, is calculated by equation 16 as follows:

d = a'x_n + b - log[(1 - p_n)/p_n]                                          (16)

Advantageously, it would be obvious to one skilled in the art to use different types of voicing detectors to generate a value similar to d for use in the following equations. One such detector would be an auto-correlation detector. If the frame is voiced, equations 17 through 20 are solved as follows:

m_1 = (1 - z) m_1 + z d,                                                    (17)

s_1 = (1 - z) s_1 + z d², and                                               (18)

k_1 = s_1 - m_1²                                                            (19)

where m_1 is the mean for voiced frames and k_1 is the variance for voiced frames. The probability, P_d, that determinator 205 will declare a frame unvoiced is calculated by the following equation:

P_d = (1 - z) P_d                                                           (20)

Advantageously, P_d is initially set to 0.5.
If the frame is unvoiced, equations 21 through 24 are solved as follows:

m_0 = (1 - z) m_0 + z d,                                                    (21)

s_0 = (1 - z) s_0 + z d², and                                               (22)

k_0 = s_0 - m_0²                                                            (23)

The probability, P_d, that determinator 205 will declare a frame unvoiced is calculated by the following equation:

P_d = (1 - z) P_d + z                                                       (24)

After calculating equations 16 through 24, the distance measure or merit value is calculated as follows:

A² = P_d(1 - P_d)(m_1 - m_0)² / [(1 - P_d)k_1 + P_d k_0]                    (25)

Equation 25 uses Hotelling's two-sample T² statistic to calculate the distance measure. For equation 25, the larger the merit value the greater the separation.
However, other merit values exist where the smaller the merit value the greater the separation. Advantageously, the distance measure can also be the Mahalanobis distance, which is given in the following equation:

A² = (m_1 - m_0)² / [(1 - P_d)k_1 + P_d k_0]                                (26)

Advantageously, a third technique is given in the following equation:

A² = (m_1 - m_0)²                                                           (27)

Advantageously, a fourth technique for calculating the distance measure is illustrated in the following equation:

A² = a'(v_n - u_n)                                                          (28)

Discriminant detector 102 makes the unvoiced/voiced decision by transmitting information to multiplexer 105 via path 107, indicating a voiced frame if a'x + b > 0. If this condition is not true, then detector 102 indicates an unvoiced frame. The values for vector a and scalar b used by detector 102 are advantageously identical to the initial values of a and b for statistical voiced detector 103.
Detector 102 determines the distance measure in a manner similar to generator 206 by performing calculations similar to those given in equations 16 through 28.
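A plain Python sketch of generator 206's per-frame updates (equations 17 through 24) and two of the candidate merit values (equations 25 and 26); the class layout and the externally supplied decaying-average coefficient z are illustrative:

```python
class DistanceGenerator:
    """Running class statistics and distance measures per equations 16-28."""
    def __init__(self):
        self.m1 = self.s1 = 0.0   # voiced running mean and mean square
        self.m0 = self.s0 = 0.0   # unvoiced running mean and mean square
        self.Pd = 0.5             # running P(frame declared unvoiced)

    def update(self, d, voiced, z):
        """Fold one discriminant value d (eq. 16) into the statistics."""
        if voiced:
            self.m1 = (1 - z) * self.m1 + z * d        # eq. 17
            self.s1 = (1 - z) * self.s1 + z * d * d    # eq. 18
            self.Pd = (1 - z) * self.Pd                # eq. 20
        else:
            self.m0 = (1 - z) * self.m0 + z * d        # eq. 21
            self.s0 = (1 - z) * self.s0 + z * d * d    # eq. 22
            self.Pd = (1 - z) * self.Pd + z            # eq. 24

    def _variances(self):
        return self.s1 - self.m1 ** 2, self.s0 - self.m0 ** 2  # eqs. 19, 23

    def hotelling(self):                               # eq. 25
        k1, k0 = self._variances()
        num = self.Pd * (1 - self.Pd) * (self.m1 - self.m0) ** 2
        return num / ((1 - self.Pd) * k1 + self.Pd * k0)

    def mahalanobis(self):                             # eq. 26
        k1, k0 = self._variances()
        return (self.m1 - self.m0) ** 2 / ((1 - self.Pd) * k1 + self.Pd * k0)
```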
In flow chart form, FIGS. 3 and 4 illustrate, in greater detail, the operations performed by statistical voiced detector 103 of FIG. 2. Blocks 302 and 300 implement blocks 202 and 201 of FIG. 2, respectively. Blocks 304 through 318 implement statistical calculator 203. Blocks 320 and 322 implement weights calculator 204, and blocks 326 through 338 implement block 205 of FIG. 2. Generator 206 of FIG. 2 is implemented by block 340. Subtractor 207 is implemented by block 308 or block 324.
Block 302 calculates the vector which represents the average of the classifiers for the present frame and all previous frames. Block 300 determines whether speech or silence is present in the present frame; and if silence is present in the present frame, the mean for each classifier is subtracted from each classifier by block 324 before control is transferred to decision block 326. However, if speech is present in the present frame, then the statistical and weights calculations are performed by blocks 304 through 322. First, the average vector is found in block 302. Second, the sums of the squares and products matrix is calculated in block 304. The latter matrix, along with the vector X_n representing the mean of the classifiers for the present and past frames, is then utilized to calculate the covariance matrix, T, in block 306. The mean X_n is then subtracted from the classifier vector x_n in block 308.
Block 310 then calculates the probability that the present frame is unvoiced by utilizing the present weight vector a, the present threshold value b, and the classifier vector for the present frame, x_n. After calculating the probability that the present frame is unvoiced, the probability that the present frame is voiced is calculated by block 312. Then, the overall probability, p_n, that any frame will be unvoiced is calculated by block 314.
Blocks 316 and 318 calculate two vectors: u and v. The values contained in vector u represent the statistical average values that each classifier would have if the frame were unvoiced, whereas vector v contains values representing the statistical average values that each classifier would have if the frame were voiced. The actual vectors of classifiers for the present and previous frames are clustered around either vector u or vector v. The vectors representing the classifiers for the previous and present frames are clustered around vector u if these frames are found to be unvoiced; otherwise, the previous classifier vectors are clustered around vector v.
After execution of blocks 316 and 318, control is transferred to decision block 320. If N is greater than 99, control is transferred to block 322; otherwise, control is transferred to block 326. Upon receiving control, block 322 then calculates a new weight vector a and a new threshold value b. The vector a and value b are used in the next sequential frame by the preceding blocks in FIG. 3. Advantageously, if N is required to be greater than infinity, vector a and scalar b will never be changed, and detector 103 will adapt solely in response to vectors v and u as illustrated in blocks 326 through 338.
Blocks 326 through 338 implement U/V determinator 205 of FIG. 2.
Block 326 determines whether the power term of vector v of the present frame is greater than or equal to the power term of vector u. If this condition is true, then decision block 328 is executed. The latter decision block determines whether the test for voiced or unvoiced is met. If the frame is found to be voiced in decision block 328, then the frame is so marked as voiced by block 330; otherwise the frame is marked as unvoiced by block 332. If the power term of vector v is less than the power term of vector u for the present frame, blocks 334 through 338 are executed and function in a similar manner. Finally, block 340 calculates the distance measure.
In flow chart form, FIG. 5 illustrates, in greater detail, the operations performed by block 340 of FIG. 4. Decision block 501 determines whether the frame has been indicated as unvoiced or voiced by examining the calculations of blocks 330, 332, 336, or 338. If the frame has been designated as voiced, path 507 is selected. Block 510 calculates probability P_d, block 502 recalculates the mean, m_1, for the voiced frames, and block 503 recalculates the variance, k_1, for voiced frames. If the frame was determined to be unvoiced, decision block 501 selects path 508. Block 509 recalculates probability P_d, block 504 recalculates the mean, m_0, for unvoiced frames, and block 505 recalculates the variance, k_0, for unvoiced frames. Finally, block 506 calculates the distance measure by performing the calculations indicated.

It is to be understood that the afore-described embodiment is merely illustrative of the principles of the invention and that other arrangements may be devised by those skilled in the art without departing from the spirit and the scope of the invention. In particular, the calculations performed per frame or set could be performed for a group of frames or sets.

Claims (23)

1. An apparatus for determining voicing in frames of non-training set speech and each of said frames being unvoiced, voiced or silent and said apparatus having a plurality of detecting means for performing a voicing decision and for indicating the voicing decision in a frame, comprising:
each of the detecting means comprises means for calculating a merit value defining the separation between voiced and unvoiced decision regions for present and previous ones of said frames of non-training set speech; and means for selecting one of said detecting means to indicate the voicing decision for said present one of said frames of non-training set speech upon the selected one of said detecting means calculating a merit value better than any other one of said detecting means calculated merit value.
2. The apparatus of claim 1 wherein said calculating means of each of said detecting means performs a statistical calculation to determine said merit value.
3. The apparatus of claim 2 wherein said statistical calculations are distance measurement calculations.
4. The apparatus of claim 2 wherein one of said detecting means for indicating a frame is voiced upon detecting said fundamental frequency and indicating a frame is unvoiced upon said fundamental frequency being absent;
said calculating means for said one of said detecting means further comprises means for determining a discriminant variable for each ones of previous and present frames;
means for determining a mean value for voiced ones of said previous and present frames;
means for determining a variance value of said voiced ones of said previous and present frames;
means for determining a mean value of said unvoiced ones of said previous and present frames;

means for determining a variance value of said unvoiced ones of said previous and present frames; and means for determining the merit value of said one of said detecting means from the determined voiced mean and variance values and the determined unvoiced mean and variance values.
5. The apparatus of claim 4 wherein said means for determining the merit value for said one of said detecting means comprises means for summing said variance values;
means for calculating a weighted sum of said variance values;
means for subtracting the mean value of said unvoiced frames from said mean value of said voiced frames;
means for squaring the subtracted value; and means for dividing said weighted sum by the sum of said squared values, thereby generating said merit value for said one of said detecting means.
6. The apparatus of claim 5 wherein said means for calculating said weighted sum comprises means for calculating a first probability that said one of said detecting means indicates the presence of voicing in said present frame;
means for calculating a second probability that said one of said detecting meansindicates non-voicing in said present frame;
means for multiplying said variance of said voiced ones of said previous and present frames by said first probability and said variance of said unvoiced ones of said previous and present frames by said second probability; and means for forming said weighted sum from the results of said multiplications.
7. The apparatus of claim 6 wherein said means for dividing comprises means for multiplying the results of the division of said weighted sum by the sum of said squared values by said first and second probabilities to generate said merit value of said one of said detecting means.
8. The apparatus of claim 7 wherein said one of said detecting means further comprises a means responsive to a set of classifiers defining speech attributes of said present frame of non-training set speech for calculating a set of statistical parameters;
means responsive to the calculated set of parameters for calculating a set of weights each associated with one of said classifiers; and means responsive to the calculated set of weights and classifiers and said set of parameters for performing the voicing decision for said present frame of non-training set speech.
9. The apparatus of claim 8 wherein said means for calculating said set of weights comprises means for calculating a threshold value in response to said set of said parameters;
means for communicating said set of weights and said threshold value to said means for calculating said set of statistical parameters to be used for calculating another set of parameters for another one of said frames of speech; and said means for calculating said set of statistical parameters further responsive to the communicated set of weights and another set of classifiers defining said speech attributes of said other frame for calculating another set of statistical parameters.
10. An apparatus for determining voicing in frames of non-training set speech and each of said frames being unvoiced, voiced or silent, comprising:
first means for generating a first signal indicating voicing in a present one of said frames of non-training set speech;
second means for generating a second signal indicating voicing in said present one of said frames of non-training set speech;
said first means comprises means for calculating a first generalized distance value representing the degree of separation between voiced and unvoiced decision regions as determined by said first means for present and previous ones of said frames;
said second means comprises means for calculating a second generalized distance value representing the degree of separation between voiced and unvoiced decision regions as determined by said second means for present and previous ones of said frames; and means for selecting said first signal to indicate the voicing decision upon said first generalized value being better than said second generalized value and for selecting said second signal to indicate the voicing decision upon said second generalized value being better than said first generalized value.
11. The apparatus of claim 10 wherein said generalized distance values are the Mahalanobis distance values.
12. The apparatus of claim 11 wherein said first means further comprises a means responsive to a set of classifiers defining speech attributes of one frame of speech for calculating a set of statistical parameters;
means responsive to the calculated set of parameters for calculating a set of weights each associated with one of said classifiers; and means responsive to the calculated set of weights and classifiers and said set of parameters for determining the voicing in said present ones of said frames of non-training set speech.
13. The apparatus of claim 12 wherein said means for calculating said first generalized distance value comprises means responsive to said calculated set of parameters and said calculated set of weights for determining said first generalized distance value.
14. The apparatus of claim 13 wherein said second means is a discriminant voiced detector.
15. The apparatus of claim 14 wherein said means for calculating said second generalized distance value comprises means for determining a mean value for voiced ones of said previous and present frames;
means for determining a mean value of said unvoiced ones of said previous and present frames;

means for determining a variance value of said voiced ones of said previous and present frames;
means for determining a variance value of said unvoiced ones of said previous and present frames; and means for determining said second distance measurement value from the determined voiced mean and variance values and the determined unvoiced mean and variance values.
16. The apparatus of claim 15 wherein said means for determining said second distance measurement value comprises:
means for calculating the weighted sum of said variance values;
means for subtracting the mean value of said unvoiced frames from said mean value of said voiced frames;
means for squaring the subtracted value; and means for dividing said weighted sum of said variance values by the sum of said squared values thereby generating said second distance measurement value.
17. A method for determining voicing in frames of non-training set speech having a first and second voiced detectors for performing a voicing decision and for indicating the voicing decision in a frame, comprising the steps of:
calculating a first merit value defining the separation between voiced and unvoiced decision regions for present and previous ones of said frames of non-training set speech by said first voiced detector;
calculating a second merit value defining separation between voiced and unvoiced decision regions for present and previous frames of non-training set speech by said second voiced detector; and selecting said first voiced detector to indicate the voicing decision upon said first merit value being better than said second value and selecting said second voiced detector to indicate the voicing decision upon said second merit value being better than said first value.
18. The method of claim 17 wherein said steps of calculating said first and second values each comprises the step of performing a statistical calculation to determine said first and second values, respectively.
19. The method of claim 18 wherein said statistical calculations are distance measurement calculations.
20. The method of claim 18 wherein:
said step of calculating said first value further comprises the steps of determining a discriminant variable for each ones of previous and present frames;
determining a mean value for voiced ones of said previous and present frames;
determining in response to said mean value for voiced ones of said previous and present frames a variance value of said voiced ones of said previous and present frames;
determining a mean value of said unvoiced ones of said previous and present frames;
determining in response to said mean value for unvoiced ones of said previous and present frames a variance value of said unvoiced ones of said previous and present frames;
and determining said first value from the determined voiced mean and variance values and the determined unvoiced mean and variance values.
21. The method of claim 20 wherein said step of determining said first value comprises the steps of summing said variance values;
calculating the weighted sum of said variance values;
subtracting the mean value of said unvoiced frames from said mean value of said voiced frames;
squaring the subtracted values; and dividing said weighted sum of variance values by the sum of said squared variance values thereby generating said statistical value.
22. The method of claim 21 wherein said step of calculating said weighted sum comprises the steps of calculating a first probability that said step of determining said first value indicates the presence of voicing in said present frame;
calculating a second probability that said step of determining said first value indicates the non-voicing in said present frame;

multiplying said variance of said voiced ones of said previous and present frames by said first probability and said variance of said unvoiced ones of said previous and present frames by said second probability; and forming said weighted sum from the results of said multiplications.
23. The method of claim 22 wherein said step of dividing comprises the step of multiplying the results of the division of said weighted sum by the sum of said squared values by said first and second probabilities to generate said first value.
CA000562766A 1987-04-03 1988-03-29 Distance measurement control of a multiple detector system Expired - Fee Related CA1336212C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US3429787A 1987-04-03 1987-04-03
US034,297 1987-04-03

Publications (1)

Publication Number Publication Date
CA1336212C true CA1336212C (en) 1995-07-04

Family

ID=21875527

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000562766A Expired - Fee Related CA1336212C (en) 1987-04-03 1988-03-29 Distance measurement control of a multiple detector system

Country Status (8)

Country Link
EP (1) EP0310636B1 (en)
JP (1) JPH0795238B2 (en)
AT (1) ATE80488T1 (en)
CA (1) CA1336212C (en)
DE (1) DE3874471T2 (en)
HK (1) HK108993A (en)
SG (1) SG59693G (en)
WO (1) WO1988007740A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU696092B2 (en) * 1995-01-12 1998-09-03 Digital Voice Systems, Inc. Estimation of excitation parameters
JP3670217B2 (en) 2000-09-06 2005-07-13 国立大学法人名古屋大学 Noise encoding device, noise decoding device, noise encoding method, and noise decoding method
JP4517045B2 (en) * 2005-04-01 2010-08-04 独立行政法人産業技術総合研究所 Pitch estimation method and apparatus, and pitch estimation program

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60114900A (en) * 1983-11-25 1985-06-21 松下電器産業株式会社 Voice/voiceless discrimination
JPS60200300A (en) * 1984-03-23 1985-10-09 松下電器産業株式会社 Voice head/end detector
JPS6148898A (en) * 1984-08-16 1986-03-10 松下電器産業株式会社 Voice/voiceless discriminator for voice

Also Published As

Publication number Publication date
HK108993A (en) 1993-10-22
SG59693G (en) 1993-07-09
JPH01502853A (en) 1989-09-28
DE3874471T2 (en) 1993-02-25
JPH0795238B2 (en) 1995-10-11
EP0310636B1 (en) 1992-09-09
AU602957B2 (en) 1990-11-01
EP0310636A1 (en) 1989-04-12
ATE80488T1 (en) 1992-09-15
DE3874471D1 (en) 1992-10-15
WO1988007740A1 (en) 1988-10-06
AU1242988A (en) 1988-11-02

Similar Documents

Publication Publication Date Title
EP0335521B1 (en) Voice activity detection
CA2165229C (en) Method and apparatus for characterizing an input signal
US5276765A (en) Voice activity detection
EP0153787B1 (en) System of analyzing human speech
US6314396B1 (en) Automatic gain control in a speech recognition system
US5046100A (en) Adaptive multivariate estimating apparatus
US5007093A (en) Adaptive threshold voiced detector
US4972490A (en) Distance measurement control of a multiple detector system
CA1336212C (en) Distance measurement control of a multiple detector system
CA1337708C (en) Adaptive multivariate estimating apparatus
CA1336208C (en) Adaptive threshold voiced detector
AU612737B2 (en) A phoneme recognition system
CN117457026A (en) Noise detection system and method for riding equipment
Yamazaki et al. An objective method for evaluating the quality of speech with code errors using pattern matching techniques
CN118149960A (en) Real-time sound field testing method, device, equipment and storage medium
KR20020049764A (en) Apparatus and method for speech detection using multiple sub-detection system

Legal Events

Date Code Title Description
MKLA Lapsed