MXPA98007769A - Voice processing - Google Patents

Voice processing

Info

Publication number
MXPA98007769A
MXPA98007769A MXPA/A/1998/007769A MX9807769A
Authority
MX
Mexico
Prior art keywords
frames
transformation
vector
voice
recognition
Prior art date
Application number
MXPA/A/1998/007769A
Other languages
Spanish (es)
Inventor
Peter Milner Benjamin
Original Assignee
British Telecommunications Public Limited Company
Peter Milner Benjamin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Telecommunications Public Limited Company, Peter Milner Benjamin filed Critical British Telecommunications Public Limited Company
Publication of MXPA98007769A


Abstract

A method and apparatus for generating features for use in speech recognition is described, the method comprising calculating the logarithmic frame energy value of each of a predetermined number n of frames of an input speech signal, and applying a matrix transformation to the logarithmic frame energy values to form a temporal matrix representing the input speech signal. The matrix transformation can be a discrete cosine transformation.

Description

VOICE PROCESSING

DESCRIPTION OF THE INVENTION

This invention relates to speech recognition and in particular to the generation of features to be used in speech recognition. Automatic speech recognition systems are generally designed for a particular use. For example, a service that must be accessed by the general public requires a generic speech recognition system able to recognize the voice of any user. Automatic speech recognition systems associated with data specific to a user's voice may be used to recognize a user, or to verify a user's claimed identity (so-called speaker recognition).

Automatic speech recognition systems receive an input signal from a microphone, either directly or indirectly (for example, via a telecommunications link). The input signal is then processed by speech processing means which normally divide the input signal into successive segments or time frames and produce an appropriate (spectral) representation of the characteristics of the time-varying input signal. Common techniques of spectral analysis are linear predictive coding (LPC) and the Fourier transform. The spectral measurements are then converted into a group or vector of features that describe the broad acoustic properties of the input signal. The features most commonly used in speech recognition are mel-frequency cepstral coefficients (MFCCs). The feature vectors are then compared with a plurality of patterns that represent, or relate in some way to, the words (or parts thereof) or phrases to be recognized. The result of the comparison indicates the word/phrase considered to have been recognized.

The pattern-matching approach to speech recognition generally involves one of two techniques: template matching or statistical modeling. In the former, a template is formed representing the spectral properties of a typical voice signal representing a word; the template is the concatenation of the spectral frames over the duration of the speech. A typical sequence of speech frames for a pattern is produced via an averaging procedure, and an input signal is compared with these templates. A well-known statistical method currently used to characterize the spectral properties of the frames of a pattern is the hidden Markov model (HMM) approach. The underlying assumption of the HMM (or any other type of statistical model) is that the speech signal can be characterized by a parametric random process, and that the parameters of the stochastic process can be determined in a precise, well-defined way.

A well-known deficiency of pattern-matching techniques, especially HMMs, is the lack of an effective mechanism for exploiting the correlation between extracted features. A left-to-right HMM provides a temporal structure for modeling the evolution in time of the spectral characteristics of speech from one state to the next, but within each state the observation vectors are assumed to be independent and identically distributed (IID). The IID assumption states that there is no correlation between successive speech vectors. This implies that within each state the speech vectors are associated with identical probability density functions (PDFs) having the same mean and covariance. It further implies that the spectral-time trajectory within each state is a randomly fluctuating curve with a stationary mean. In reality, however, the spectral-time trajectory clearly has a definite direction as it moves from one speech event to the next.
This violation of the IID assumption by the spectral vectors contributes to the limitation in the performance of HMMs. Including some temporal information in the speech feature can lessen the effect of the assumption that speech is a stationary, independent process, and can be used to improve recognition performance. A conventional method of including temporal information in the feature vector is to augment the feature vector with first- and second-order time derivatives of the cepstrum, and with first- and second-order time derivatives of a logarithmic energy parameter. Such techniques are described by J. G. Wilpon, C. H. Lee and L. R. Rabiner in "Improvements in Connected Digit Recognition Using Higher Order Spectral and Energy Features", Speech Processing 1, Toronto, May 14-17, 1991, Institute of Electrical and Electronic Engineers, pages 349-352.

A more implicit representation of speech dynamics is the cepstral-time matrix, which uses a cosine transform to encode the temporal information, as described in B. P. Milner and S. V. Vaseghi, "An analysis of cepstral-time feature matrices for noise and channel robust speech recognition", Proc. Eurospeech, pp. 519-522, 1995. The cepstral-time matrix is also described by M. Pawlewski et al. in "Advances in telephony based speech recognition", BT Technology Journal, Vol. 14, No. 1. A cepstral-time matrix, c_t(m,n), is obtained either by applying a two-dimensional Discrete Cosine Transformation (2-D DCT) to a spectral-time matrix, or by applying a 1-D DCT to a stack of mel-frequency cepstral coefficient (MFCC) speech vectors. M N-dimensional logarithmic filter-bank vectors are stacked together to form a spectral-time matrix, X_t(f,k), where t indicates the time frame, f the filter-bank channel and k the time vector within the matrix. The matrix is then transformed into a cepstral-time matrix using a two-dimensional DCT. Since the two-dimensional DCT can be divided into two one-dimensional DCTs, an alternative implementation of the cepstral-time matrix is to apply a 1-D DCT along the time axis of a matrix consisting of M conventional MFCC vectors.

According to a first aspect of the invention, a method is provided for generating features to be used with speech-responsive apparatus, the method comprising: calculating the logarithmic frame energy value of each of a predetermined number n of frames of an input speech signal; and multiplying the calculated logarithmic frame energy values, considered as elements of a vector, by a two-dimensional transformation matrix to form a temporal vector corresponding to the predetermined number n of frames of the input speech signal. Transitional speech dynamics are captured implicitly within the temporal vector, in contrast to the explicit representation achieved by a cepstral vector augmented with derivatives. Models trained on such temporal vectors therefore have the advantage that inverse transformations can be applied, allowing transformation back into the linear filter-bank domain for techniques such as parallel model combination (PMC) for improved noise robustness. The transformation can be a discrete cosine transformation. Preferably, the temporal vector is truncated so that it includes fewer than n elements. This has been found to give good performance results while reducing the amount of computation involved.
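For illustration, the cepstral-time construction discussed above (a 1-D DCT along the time axis of a stack of MFCC vectors) can be sketched as follows. This is a minimal sketch only, assuming NumPy and SciPy are available; the function and array names are illustrative and do not come from the patent:

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_time_matrix(mfcc_stack: np.ndarray) -> np.ndarray:
    """1-D DCT along the time axis of M stacked MFCC vectors
    (shape: M frames x d coefficients).  This is equivalent to one
    half of the separable 2-D DCT applied to a spectral-time matrix."""
    return dct(mfcc_stack, type=2, axis=0, norm='ortho')

# Example: 9 frames of 8 MFCCs each yield a 9 x 8 cepstral-time matrix.
ct = cepstral_time_matrix(np.random.randn(9, 8))
```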
The stationary column (m = 0) of the matrix can be omitted, thus removing any distortion of the speech signal introduced by a linear convolutional channel and making the matrix a channel-robust feature.

According to another aspect of the invention there is provided a method of speech recognition comprising: receiving an input signal representing speech, the input signal being divided into frames; generating a feature by calculating the logarithmic frame energy value of each of a predetermined number n of frames of the input speech signal, and multiplying the calculated logarithmic frame energy values, considered as elements of a vector, by a two-dimensional transformation matrix to form a temporal vector corresponding to the predetermined number n of frames of the input speech signal; comparing the generated feature with recognition data representing permitted utterances, the recognition data being related to the feature; and indicating recognition or otherwise on the basis of the comparison step.

In another aspect of the invention, feature-generating apparatus is provided for use with speech-responsive apparatus, the feature-generating apparatus comprising: a processor arranged in operation to calculate the logarithm of the energy of each of a predetermined number of frames of an input speech signal, and to multiply the calculated logarithmic frame energy values, considered as elements of a vector, by a two-dimensional transformation matrix to form a temporal vector corresponding to the predetermined number n of frames of the input speech signal. The feature-generating means of the invention are suitable for use with speech recognition apparatus, and also for generating recognition data for use with such apparatus.

The invention will now be described, by way of example only, with reference to the accompanying drawings, in which: Figure 1 shows schematically the use of a speech recognizer in a telecommunications environment; Figure 2 is a schematic representation of a speech recognizer; Figure 3 shows schematically the components of an embodiment of a feature extractor according to the invention; Figure 4 shows the steps involved in calculating the Karhunen-Loève transformation; Figure 5 shows schematically the components of a conventional speech classifier forming part of the speech recognizer of Figure 2; Figure 6 is a flow chart showing schematically the operation of the classifier of Figure 5; Figure 7 is a block diagram showing schematically the components of a conventional sequencer forming part of the speech recognizer of Figure 2; Figure 8 shows schematically the contents of a field within a store forming part of the sequencer of Figure 7; and Figure 9 is a flow chart showing schematically the operation of the sequencer of Figure 7.

Referring to Figure 1, a telecommunications system including speech recognition generally comprises a microphone 1 (usually forming part of a telephone handset), a telecommunications network 2 (normally a public switched telecommunications network (PSTN)), a speech recognizer 3 connected to receive a voice signal from the network 2, and a utilizing apparatus 4 connected to the speech recognizer 3 and arranged to receive therefrom a speech recognition signal indicating recognition or otherwise of a particular word or phrase, and to take action in response thereto. For example, the utilizing apparatus 4 can be a remotely operated terminal for carrying out banking transactions, an information service, and so on.
In many cases, the utilizing apparatus 4 generates an auditory response for the user, transmitted via the network 2 to a loudspeaker 5 normally forming part of the user's handset. In operation, a user speaks into the microphone 1 and a signal is transmitted from the microphone 1 into the network 2 and on to the speech recognizer 3. The speech recognizer analyzes the speech signal, and a signal indicating recognition or otherwise of a particular word or phrase is generated and transmitted to the utilizing apparatus 4, which then takes appropriate action in the event that the speech is recognized.

Generally, the speech recognizer 3 is ignorant of the route taken by the signal from the microphone 1 to and through the network 2: any one of a wide variety of types or qualities of handset may be used. Likewise, within the network 2, any one of a wide variety of transmission paths may be taken, including radio links, analogue and digital paths, and so on. Accordingly, the voice signal Y reaching the speech recognizer 3 corresponds to the speech signal S received at the microphone 1, convolved with the transfer characteristics of the microphone 1, the link to the network 2, the channel through the network 2, and the link to the speech recognizer 3, which may be lumped together and designated by a single transfer characteristic H.

Normally, the speech recognizer 3 needs to acquire data concerning the speech against which the speech signal is verified, and this data acquisition is performed by the speech recognizer in a training mode of operation, in which the speech recognizer 3 receives a speech signal from the microphone 1 from which to form the recognition data for that word or phrase. However, other methods of acquiring the speech recognition data are also possible.

Referring to Figure 2, a speech recognizer comprises an input 31 for receiving speech in digital form (either from a digital network or from an analogue-to-digital converter); a frame generator 32 for dividing the sequence of digital samples into a succession of frames of contiguous samples; a feature extractor 33 for generating a corresponding feature vector from the frames of samples; a classifier 34 for receiving the succession of feature vectors and generating recognition results; a sequencer 35 for determining the predetermined utterance to which the input signal indicates the greatest similarity; and an output port 38 at which a recognition signal is provided indicating the utterance that has been recognized.

As mentioned above, a speech recognizer usually obtains recognition data during a training phase. During training, speech signals are input to the speech recognizer 3 and a feature is extracted by the feature extractor 33 according to the invention. This feature is stored by the speech recognizer 3 for subsequent recognition. The feature can be stored in any convenient way; for example, it may be modeled by hidden Markov models (HMMs), a technique well known in speech processing, as will be described below. During recognition, the feature extractor extracts a similar feature from an unknown input signal, and the unknown signal is compared with the stored features for each word/phrase to be recognized. For simplicity, the operation of the speech recognizer in the recognition phase is described below; in the training phase, the extracted feature is used to train a suitable classifier 34, as is well known in the art.
Frame Generator 32

The frame generator 32 is arranged to receive speech samples at a rate of, for example, 8,000 samples per second, and to form frames comprising 256 contiguous samples, at a frame rate of 1 frame every 16 ms.
Preferably, each frame is windowed (i.e., the samples towards the edge of the frame are multiplied by predetermined weighting constants), using, for example, a Hamming window to reduce the spurious artifacts generated by the frame's edges. In a preferred embodiment, the frames overlap (for example, by 50%) so that the effects of the windowing are diminished.
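A minimal framing sketch consistent with the figures above (8,000 samples per second, 256-sample frames, 50% overlap giving one frame every 16 ms). NumPy-based; the function name and defaults are illustrative:

```python
import numpy as np

def make_frames(samples: np.ndarray, frame_len: int = 256,
                step: int = 128) -> np.ndarray:
    """Split the sample stream into overlapping 256-sample frames
    (one frame every 16 ms at 8 kHz) and apply a Hamming window to
    reduce spurious artifacts at the frame edges."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // step
    return np.stack([samples[k * step : k * step + frame_len] * window
                     for k in range(n_frames)])
```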
Feature Extractor 33

The feature extractor 33 receives frames from the frame generator 32 and generates, from each frame, a feature vector. Figure 3 shows an embodiment of a feature extractor according to the invention. Additionally, means can be provided to generate other features, for example LPC cepstral coefficients or MFCCs. Each frame j of an incoming speech signal is input to a processor 331 which calculates the average energy of the frame of data, i.e., the energy calculating processor 331 calculates:

E_j = (1/256) * Σ_{i=1}^{256} x_i^2

where x_i is the value of sample i in frame j.
A logarithmic processor 332 then forms the logarithm of this average value for frame j. The logarithmic energy values are input to a buffer 333 having a length sufficient to store the logarithmic energy values for n frames, for example n = 7. Once the values for seven frames have been calculated, the stacked data are output to a transformation processor 334.

To form the frame-energy vector or temporal matrix, the spectral-time vector of stacked logarithmic energy values input to the transformation processor 334 is multiplied by a transformation matrix, i.e.

M H = T

where M is the vector of stacked logarithmic energy values, H is the transformation which encodes the temporal information, and T is the frame-energy vector. The columns of the transformation H are the basis functions for encoding the temporal information. Using this method of encoding temporal information, a wide range of transformations can be used as the temporal transformation matrix H.

The transformation H encodes the temporal information in that it causes the covariance matrix of the stack of logarithmic energy values to be diagonalized; that is, the off-diagonal elements of the covariance matrix (i.e., those off the leading diagonal) of the logarithmic energy values transformed by H tend to zero. The off-diagonal elements of a covariance matrix indicate the degree of correlation between the respective samples. The optimal transformation to achieve this is the Karhunen-Loève (KL) transformation, as described in the book by N. S. Jayant and P. Noll, "Digital coding of waveforms", Prentice-Hall, 1984. To find the optimal KL transformation for encoding the temporal information carried by the feature vectors, statistics concerning the correlation of successive vectors are needed. Using this correlation information, the KL transformation can be calculated.

Figure 4 shows the processing involved in determining the KL transformation from the speech data. To determine the KL transformation accurately over the whole training data set, parameters are first formed from the logarithmic energy values. Vectors x_t containing n logarithmic energy values successive in time are generated:

x_t = [c_t, c_{t+1}, ..., c_{t+n-1}]

From the whole set of these vectors across the training set, a covariance matrix Σ_xx is calculated:

Σ_xx = E{x x^T} - μ_x μ_x^T

where μ_x is the mean vector of the logarithmic energy values. As can be seen, this is closely related to the correlation matrix E{x x^T}, and as such it contains information regarding the temporal dynamics of the speech. The KL transformation is determined from the eigenvectors of the covariance matrix, and can be calculated, for example, using singular value decomposition, where

H Σ_xx H^T = diag(λ_0, λ_1, ..., λ_{n-1}) = Λ

The resulting matrix H is formed from the eigenvectors of the covariance matrix, ordered according to the size of their respective eigenvalues λ_i. This matrix is the KL-derived temporal transformation matrix. Other polynomials can also be used to generate the temporal transformation matrix, such as Legendre, Laguerre, etc. The KL transformation is complicated by the need to calculate the transformation itself for each set of training data. Alternatively, the Discrete Cosine Transformation (DCT) can be used. In this case, the transformation processor 334 calculates the DCT of the stacked data relating to the logarithmic energy values for the n frames.
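A sketch of the KL derivation described above (covariance of stacked log-energy vectors, then an eigendecomposition), assuming `stacks` is a T x n array whose rows are the vectors x_t gathered from the training data; all names are illustrative:

```python
import numpy as np

def kl_transform(stacks: np.ndarray) -> np.ndarray:
    """Compute the KL temporal transformation H from stacked
    log-energy vectors.  Rows of H are eigenvectors of the covariance
    matrix, ordered by decreasing eigenvalue, so that the transformed
    stacks have a (near-)diagonal covariance matrix."""
    mu = stacks.mean(axis=0)
    # Sigma_xx = E{x x^T} - mu mu^T
    cov = stacks.T @ stacks / len(stacks) - np.outer(mu, mu)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    return eigvecs[:, order].T               # rows = basis functions
```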
The one-dimensional DCT is defined as:

T(u) = C(u) * Σ_{i=0}^{n-1} f(i) cos(π u (2i + 1) / 2n)

where f(i) is the logarithmic energy value of frame i, C(u) = 1/√2 for u = 0 and 1 otherwise, and u is an integer from 0 to n-1.

The transformation processor 334 outputs the n DCT coefficients generated from the n frames of data. These coefficients form a frame-energy vector relating to the energy level of the input signal. A frame-energy vector is formed for each successive group of n frames of the input signal, e.g. for frames 0 to 6, 1 to 7, 2 to 8 and so on when n = 7. The frame-energy vector forms part of a feature vector for a speech frame. This feature can be used to augment other features such as MFCCs or differential MFCCs.
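Putting the pieces together, a minimal end-to-end sketch of the DCT-based frame-energy vector for one block of n = 7 frames, with the optional truncation to fewer than n elements. The truncation length and the small epsilon guarding the logarithm are illustrative choices, not specified by the patent:

```python
import numpy as np
from scipy.fftpack import dct

def frame_energy_vector(frames: np.ndarray, keep: int = 5) -> np.ndarray:
    """For a block of n frames (shape: n x 256), compute the log mean
    energy of each frame, then a 1-D DCT across the block, truncated
    to keep < n coefficients."""
    log_e = np.log(np.mean(frames ** 2, axis=1) + 1e-10)  # E_j, then log
    return dct(log_e, type=2, norm='ortho')[:keep]

# Successive overlapping blocks (frames 0-6, 1-7, 2-8, ...) each yield
# one frame-energy vector, which can augment an MFCC feature vector.
```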
Classifier 34

With reference to Figure 5, the classifier 34 is of conventional design and, in this embodiment, comprises an HMM classifying processor 341, an HMM state memory 342, and a mode memory 343. The state memory 342 comprises a state field 3421, 3422, ..., for each of the plurality of speech parts to be recognized. For example, a state field may be provided in the state memory 342 for each phoneme of a word to be recognized. A state field may also be provided for noise/silence.
Each state field in the state memory 342 includes a pointer field 3421b, 3422b, ..., storing a pointer address to a mode field set 361, 362, ... in the mode memory 343. Each mode field set comprises a plurality of mode fields 3611, 3612, ..., each comprising data defining a multidimensional Gaussian distribution of feature coefficient values characterizing the state in question. For example, if there are d coefficients in each feature (for example, the first 8 MFCC coefficients and seven coefficients of the energy matrix of the invention), the data stored in each mode field 3611, 3612, ... characterizing each mode are: a constant C, a set of d feature mean values μ_i, and a set of d feature deviations σ_i; in other words, a total of 2d + 1 numbers. The number N_i of mode fields 3611, 3612, ... in each mode field set 361, 362, ... is variable. The mode fields are generated during the training phase and represent the features derived by the feature extractor.
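The memory layout just described can be pictured with the following sketch. The class and field names are illustrative; the patent specifies only the stored quantities (a constant C plus d means and d deviations per mode):

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class ModeField:
    """One Gaussian mode (3611, 3612, ...): 2d + 1 stored numbers."""
    c: float                # constant C
    means: np.ndarray       # d feature mean values
    deviations: np.ndarray  # d feature deviations

@dataclass
class StateField:
    """One state field (3421, 3422, ...); its pointer is modeled here
    as a direct reference to the corresponding mode field set."""
    modes: List[ModeField] = field(default_factory=list)
```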
During recognition, the classifying processor 341 is arranged to read each state field within the state memory 342 in turn, and to calculate for each, using the current input feature coefficient set output by the feature extractor 33 of the invention, the probability that the input feature set or vector corresponds to the corresponding state. To do so, as shown in Figure 6, the processor 341 is arranged to read the pointer in the state field; to access the mode field set in the mode memory 343 to which it points; and, for each mode field j within the mode field set, to calculate a modal probability P_j. The processor 341 then calculates the state probability by summing the modal probabilities P_j. Accordingly, the output of the classifying processor 341 is a plurality of state probabilities P, one for each state in the state memory 342, indicating the probability that the feature vector corresponds to each state.
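A sketch of the probability calculation of Figure 6, using the structures above and assuming diagonal-covariance Gaussian modes. The exact role of the constant C as a premultiplier is an assumption, since the patent does not spell out the density formula here:

```python
import numpy as np

def mode_probability(x: np.ndarray, mode: 'ModeField') -> float:
    """Modal probability P_j of feature vector x under one mode."""
    z = (x - mode.means) / mode.deviations
    return mode.c * float(np.exp(-0.5 * np.sum(z ** 2)))

def state_probability(x: np.ndarray, state: 'StateField') -> float:
    """State probability: the sum of the modal probabilities P_j,
    as the classifying processor 341 computes for each state field."""
    return sum(mode_probability(x, m) for m in state.modes)
```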
It will be understood that Figure 6 is merely illustrative of the operation of the classifier processor 341. In practice, the mode probabilities may each be calculated once and stored temporarily, for use in calculating all the state probabilities relating to the phoneme to which the modes correspond. The classifying processor 341 may be a suitably programmed digital signal processing (DSP) device, and may in particular be the same digital signal processing device as the feature extractor 33.
Sequencer 35

With reference to Figure 7, the sequencer 35 is of conventional design and, in this embodiment, comprises a state probability memory 353 which stores, for each frame processed, the state probabilities output by the classifying processor 341; a state sequence memory 352; a sequencing processor 351; and a sequencer output buffer 354.
The state sequence memory 352 comprises a plurality of state sequence fields 3521, 3522, ..., each corresponding to a word or phrase sequence to be recognized, consisting in this example of a string of phonemes. Each state sequence in the state sequence memory 352 comprises, as illustrated in Figure 8, a number of states P1, P2, ..., PN and, for each state, two probabilities: a repeat probability (P_i1) and a transition probability to the following state (P_i2). The observed sequence of states associated with a series of frames may therefore comprise several repetitions of each state Pi in each state sequence model 3521, etc.; for example:

Frame number: 1  2  3  4  5  6  7  8  9  ... Z  Z+1
State:        P1 P1 P1 P2 P2 P2 P2 P2 P2 ... Pn Pn

As shown in Figure 9, the sequencing processor 351 is arranged to read, at each frame, the state probabilities output by the classifying processor 341 and the previously stored state probabilities in the state probability memory 353, and to calculate the most likely path of states to date, comparing this with each of the state sequences stored in the state sequence memory 352. The calculation employs the well-known hidden Markov model method described generally in "Hidden Markov Models for Automatic Speech Recognition: theory and applications", S. J. Cox, British Telecom Technology Journal, April 1988, p. 105. Conveniently, the HMM processing performed by the sequencing processor 351 uses the well-known Viterbi algorithm. The sequencing processor 351 may, for example, be a microprocessor such as the Intel(TM) i-486(TM) microprocessor or the Motorola(TM) 68000 microprocessor, or alternatively it may be a DSP device (for example, the same DSP device as used for any of the preceding processors).

Accordingly, for each state sequence (corresponding to a word, phrase or other speech sequence to be recognized), a probability score is output by the sequencing processor 351 at each frame of input speech. For example, the state sequences may comprise the names in a telephone directory. When the end of the utterance is detected, a label signal indicating the most probable state sequence is output from the sequencing processor 351 to the output port 38, to indicate that the corresponding name, word or phrase has been recognized.
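Finally, a compact sketch of the Viterbi scoring that the sequencing processor 351 might perform for one stored state sequence, given per-frame state probabilities. The left-to-right structure with repeat (P_i1) and advance (P_i2) probabilities follows Figure 8; the log-space formulation and all names are illustrative assumptions:

```python
import numpy as np

def viterbi_score(obs: np.ndarray, p_repeat: np.ndarray,
                  p_next: np.ndarray) -> float:
    """Best-path log score of one state sequence against T frames.

    obs      -- T x N matrix of per-frame state probabilities
    p_repeat -- N self-loop probabilities (repeat each state)
    p_next   -- N transition probabilities to the following state
    """
    T, N = obs.shape
    log_obs = np.log(obs + 1e-300)          # log space avoids underflow
    score = np.full(N, -np.inf)
    score[0] = log_obs[0, 0]                # path must start in state P1
    for t in range(1, T):
        stay = score + np.log(p_repeat)
        move = np.full(N, -np.inf)
        move[1:] = score[:-1] + np.log(p_next[:-1])
        score = np.maximum(stay, move) + log_obs[t]
    return float(score[-1])                 # best path ending in state PN
```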

Claims (9)

1. A method of generating features to be used with speech-responsive apparatus, the method comprising: calculating the logarithmic frame energy value of each of a predetermined number n of frames of an input speech signal; and multiplying the calculated logarithmic frame energy values, considered as elements of a vector, by a two-dimensional transformation matrix to form a temporal vector corresponding to the predetermined number n of frames of the input speech signal.
2. The method according to claim 1, wherein successive temporal vectors represent overlapping groups of the n frames of the input signal.
3. The method according to claim 1 or 2, wherein the transformation matrix represents a discrete cosine transformation.
4. The method according to claim 1, 2 or 3, wherein the temporal vector is truncated so that it includes fewer than n elements.
5. A method of speech recognition, comprising: receiving an input signal representing speech, the input signal being divided into frames; generating a feature by calculating the logarithmic frame energy value of each of a predetermined number n of frames of the input speech signal, and multiplying the calculated logarithmic frame energy values, considered as elements of a vector, by a two-dimensional transformation matrix to form a temporal vector corresponding to the predetermined number n of frames of the input speech signal; comparing the generated feature with recognition data representing permitted utterances, the recognition data being related to the feature; and indicating recognition or otherwise on the basis of the comparison step.
6. The speech recognition method according to claim 5, wherein the transformation matrix represents a discrete cosine transformation.
7. Feature-generating apparatus for use with speech-responsive apparatus, the feature-generating apparatus comprising: a processor arranged in operation to calculate the logarithm of the energy of each of a predetermined number of frames of an input speech signal, and to multiply the calculated logarithmic frame energy values, considered as elements of a vector, by a two-dimensional transformation matrix to form a temporal vector corresponding to the predetermined number of frames of the input speech signal.

8. The apparatus according to claim 7, wherein the transformation matrix represents a discrete cosine transformation.
9. Speech recognition apparatus including the feature-generating apparatus according to claim 7 or 8.
MXPA/A/1998/007769A 1996-03-29 1998-09-23 Voice processing MXPA98007769A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP96302236.3 1996-03-29

Publications (1)

Publication Number Publication Date
MXPA98007769A (en) 1999-02-24


Similar Documents

Publication Publication Date Title
US6278970B1 (en) Speech transformation using log energy and orthogonal matrix
AU685788B2 (en) A method and apparatus for speaker recognition
US5167004A (en) Temporal decorrelation method for robust speaker verification
US4908865A (en) Speaker independent speech recognition method and system
EP0575815B1 (en) Speech recognition method
EP0686965B1 (en) Speech recognition apparatus with speaker adaptation using acoustic category mean value calculus
US5528728A (en) Speaker independent speech recognition system and method using neural network and DTW matching technique
EP1886303A1 (en) Method of adapting a neural network of an automatic speech recognition device
JP3189598B2 (en) Signal combining method and signal combining apparatus
Shabani et al. Speech recognition using principal components analysis and neural networks
US4989249A (en) Method of feature determination and extraction and recognition of voice and apparatus therefore
US5737488A (en) Speech recognizer
WO1994022132A1 (en) A method and apparatus for speaker recognition
CN1112670C (en) Method for recognizing speech
JP3039623B2 (en) Voice recognition device
Austin et al. Continuous speech recognition using segmental neural nets
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
MXPA98007769A (en) Processing of
WO2001039179A1 (en) System and method for speech recognition using tonal modeling
Bilmes Joint distributional modeling with cross-correlation based features
Pol et al. USE OF MEL FREQUENCY CEPSTRAL COEFFICIENTS FOR THE IMPLEMENTATION OF A SPEAKER RECOGNITION SYSTEM
Yamamoto et al. Speech recognition under noisy environments using segmental unit input HMM
JPH0424697A (en) Voice recognizing device
Cherifa et al. GMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System