WO1997037345A1 - Speech processing - Google Patents

Speech processing

Info

Publication number
WO1997037345A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
cepstral
speech
coefficient
detected
Application number
PCT/GB1997/000816
Other languages
French (fr)
Inventor
Kevin Joseph Power
Simon Patrick Alexander Ringland
Original Assignee
British Telecommunications Public Limited Company
Application filed by British Telecommunications Public Limited Company filed Critical British Telecommunications Public Limited Company
Priority to AU21669/97A priority Critical patent/AU2166997A/en
Publication of WO1997037345A1 publication Critical patent/WO1997037345A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method and apparatus for extracting a feature from an input signal for use with an automated speech system, e.g. a speech recogniser or a voice activity detector. An input digital signal, divided into frames, is received and a power spectrum of an input frame calculated. The log of the power spectrum is formed and cepstral coefficients are calculated from the log power spectrum of a frame. A pitch detector detects a cepstral coefficient which meets a predetermined criterion, e.g. the highest valued coefficient. A feature is then derived relating to the detected cepstral coefficient, the feature representing whether the input frame includes voiced speech.

Description

SPEECH PROCESSING
This invention relates to a feature extractor for extracting features from an input signal for use by subsequent automated speech systems. Such systems may be used in speech recognition, speaker identification, speaker verification, voice activity detection or the like.
It is becoming more common for humans to interact with machines via a speech interface. To achieve this, automated speech recognition systems are being developed, such systems generally being designed for a particular use. For example, a service that is to be accessed by the general populace requires a generic speech recogniser designed to recognise speech from an unknown user. Automated speech recognisers associated with data specific to a user are used either to recognise a user or to verify a user's claimed identity. Voice activity detectors are used to detect speech in an input signal and affect the operation of associated systems when speech is detected.
Automated speech recognition systems receive an input signal from a microphone, either directly or indirectly (e.g. via a telecommunications link). The input signal is then processed by speech processing means which typically divides the input signal into successive time segments or frames and produces an appropriate (spectral) representation of the characteristics of the time-varying input signal. The most common technique of spectral analysis is Linear Predictive Coding (LPC). Next, the spectral measurements are converted into a set or vector of features that describe the broad acoustic properties of the input signals. The most common features used in speech recognition are mel-frequency cepstral coefficients (MFCCs). Other known features that may be used, alone or in combination, are energy ratios, identification of the first formants in a signal, the nasality of the input signal and many others. The feature vectors are then compared with a plurality of patterns representing or relating in some way to the words (or parts thereof) or phrases to be recognised. The results of the comparison indicate the word/phrase deemed to have been recognised.
Voice activity detectors generally operate on the level of the incoming speech signal. For instance, it is known to compare the average level of samples of an incoming signal with a threshold. If the incoming signal level is higher than the threshold, speech is deemed to be present. Such techniques may however fail for a number of reasons, e.g. if the transmission is noisy or a user speaks clearly but quietly.
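A minimal sketch of that level-threshold scheme in Python/NumPy (the function name and threshold value are illustrative assumptions, not taken from the text):

```python
import numpy as np

def level_vad(frame, threshold=0.01):
    """Level-based voice activity decision: compare the average level
    of the samples with a threshold; if higher, speech is deemed
    present. Assumes floating-point samples scaled to [-1, 1]."""
    return np.mean(np.abs(frame)) > threshold
```

The failure modes noted above follow directly: channel noise raises the average level without any speech, while quiet speech may never cross the threshold.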
Generally, speech is a slowly time-varying signal when examined over a sufficiently short period of time (for example 50 msec). However, over longer periods (0.2 sec or more) the signal characteristics change to reflect the sounds being made. Overlying the speech signal is the pitch of the speaker's voice, which is relatively specific to a speaker when talking normally. Pitch is present in voiced sounds (such as the letter "v" in the English language) and is absent from unvoiced sounds (such as the letter "i" in the English language).
In accordance with a first aspect of the invention there is provided a feature extractor for use with an automated speech system, said feature extractor comprising: an input for receiving an input digital signal, said input signal being divided into frames; a spectrum calculation device to calculate the logarithm of the spectrum of an input frame; a cepstrum calculation device to calculate the cepstrum of the logarithm of the spectrum of a frame; a pitch detector to detect a cepstral coefficient which meets a predetermined criterion; and a feature deriver to derive a feature relating to the detected cepstral coefficient, which feature represents whether the input frame includes voiced speech. The feature may be of a binary nature, indicating either that the input frame includes voiced speech or not. When the feature is intended for use by a speech recognition system, the feature is preferably of a continuous nature and may relate either to the magnitude of the detected cepstral coefficient, if any (and hence indicate the amount of voicing present in the input signal), and/or to the frequency represented by the detected cepstral coefficient, if any, which may be specific to a user.
Preferably the pitch feature is derived from the power spectrum of an input speech signal. According to a preferred embodiment of the invention only those cepstral coefficients within a predetermined limited range of the cepstral coefficients are considered by the pitch detector 335. Advantageously, the cepstral coefficients which lie inside the normal speech frequency range, say the first 20 cepstral coefficients, may be discarded.
Preferably only those cepstral coefficients having a magnitude greater than a predetermined threshold are considered by the pitch detector.
In a preferred embodiment of the invention the pitch detector includes means to examine a detected coefficient to determine if it represents a d.c. component of the input signal. To determine if the coefficient detected represents pitch or the d.c. component, the values of the coefficients neighbouring the detected coefficient may be examined and, if non-zero, the input frame is deemed to include voiced speech.
In accordance with a second aspect of the invention, a method of extracting a feature from an input signal for use in speech recognition comprises: receiving an input signal, said input signal being divided into frames; calculating the logarithm of the spectrum of an input frame; calculating cepstral coefficients from the log of the spectrum; detecting a cepstral coefficient which meets a predetermined criterion; and deriving a feature relating to the detected cepstral coefficient, which feature represents whether the input frame includes voiced speech.
The invention will now be described further by way of example only, with reference to the accompanying drawings in which:
Figure 1 shows schematically the employment of a speech recogniser in a telecommunications environment;
Figure 2 is a schematic representation of a speech recogniser; Figure 3 shows an example of a frame of an input signal produced by the frame generator of a speech recogniser;
Figure 4 shows schematically the components of one embodiment of a feature extractor according to the invention;
Figures 5a-d show an example of the log power spectrum and cepstral coefficients generated by the feature extractor of Figure 4 for voiced and unvoiced speech; Figure 6 shows schematically the components of a conventional speech classifier forming part of the speech recogniser of Figure 2;
Figure 7 is a flow diagram showing schematically the operation of the classifier of Figure 6; Figure 8 is a block diagram showing schematically the components of a conventional sequencer forming part of the speech recogniser of Figure 2;
Figure 9 shows schematically the content of a field within a store forming part of the sequencer of Figure 8; and
Figure 10 is a flow diagram showing schematically the operation of the sequencer of Figure 8.
Referring to Figure 1, a telecommunications system including speech recognition generally comprises a microphone 1 (typically forming part of a telephone handset), a telecommunications network 2 (typically a public switched telecommunications network (PSTN)), a speech recogniser 3, connected to receive a voice signal from the network 2, and a utilising apparatus 4 connected to the speech recogniser 3 and arranged to receive therefrom a voice recognition signal, indicating recognition or otherwise of a particular word or phrase, and to take action in response thereto. For example, the utilising apparatus 4 may be a remotely operated terminal for effecting banking transactions, an information service etc.
In many cases, the utilising apparatus 4 will generate an audible response to the user, transmitted via the network 2 to a loudspeaker 5 typically forming part of the user's handset.
In operation, a user speaks into the microphone 1 and a signal is transmitted from the microphone 1 into the network 2 to the speech recogniser 3.
The speech recogniser analyses the speech signal and a signal indicating recognition or otherwise of a particular word or phrase is generated and transmitted to the utilising apparatus 4, which then takes appropriate action in the event of recognition of the speech. Generally the speech recogniser 3 is ignorant of the route taken by the signal from the microphone 1 to and through the network 2. Any one of a large variety of types or qualities of handset may be used. Likewise, within the network 2, any one of a large variety of transmission paths may be taken, including radio links, analogue and digital paths and so on. Accordingly the speech signal Y reaching the speech recogniser 3 corresponds to the speech signal S received at the microphone 1, convolved with the transform characteristics of the microphone 1, the link to the network 2, the channel through the network 2, and the link to the speech recogniser 3, which may be lumped together and designated by a single transfer characteristic H.
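Writing the individual stage responses as Hmic, Hlink and Hnet (labels assumed here for illustration only), and recalling that convolution in the time domain is multiplication in the frequency domain, the lumping described above reads:

Y(f) = S(f) · Hmic(f) · Hlink(f) · Hnet(f) = S(f) · H(f)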
Typically, the speech recogniser 3 needs to acquire data concerning the speech against which to verify the speech signal, and this data acquisition is performed by the speech recogniser in the training mode of operation, in which the speech recogniser 3 receives a speech signal from the microphone 1 to form the recognition data for that word or phrase. However, other methods of acquiring the speech recognition data are also possible.
Referring to Figure 2, a speech recogniser comprises an input 31 for receiving speech in digital form (either from a digital network or from an analog to digital converter); a frame generator 32 for partitioning the succession of digital samples into a succession of frames of contiguous samples; a feature extractor 33 for generating from a frame of samples a corresponding feature vector; a classifier 34 for receiving the succession of feature vectors and generating recognition results; a sequencer 35 for determining the predetermined utterance to which the input signal indicates the greatest similarity; and an output port 38 at which a recognition signal is supplied indicating the speech utterance which has been recognised.
As mentioned earlier, a speech recogniser generally obtains recognition data during a training phase. During training, speech signals are input to the speech recogniser 3 and a feature is extracted by the feature extractor 33 according to the invention. This feature is stored by the speech recogniser 3 for subsequent recognition. The feature may be stored in any convenient form, for example modelled by Hidden Markov Models (HMMs), a technique well known in speech processing, as will be described below. During recognition, the feature extractor extracts a similar feature from an unknown input signal and compares the feature of the unknown signal with the feature(s) stored for each word/phrase to be recognised. For simplicity, the operation of the speech recogniser in the recognition phase will be described below. In the training phase, the extracted feature is used to train a suitable classifier 34, as is well known in the art.
Frame Generator 32
The frame generator 32 is arranged to receive speech samples at a rate of, for example, 8,000 samples per second, and to form frames comprising 256 contiguous samples, at a frame rate of 1 frame every 16ms. Preferably, each frame is windowed (i.e. the samples towards the edge of the frame are multiplied by predetermined weighting constants) using, for example, a Hamming window to reduce spurious artefacts generated by the frame edges. In a preferred embodiment, the frames are overlapping (for example by 50%) so as to ameliorate the effects of the windowing. Figure 3 shows an example of a frame output by the frame generator 32.
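A minimal sketch of this arrangement (function and constant names are ours; the 8 kHz rate, 256-sample frames, Hamming window and 50% overlap follow the text, giving one frame every 16 ms):

```python
import numpy as np

SAMPLE_RATE = 8000   # samples per second
FRAME_LEN = 256      # contiguous samples per frame (32 ms at 8 kHz)
FRAME_STEP = 128     # 50% overlap => one frame every 16 ms

def frame_generator(samples):
    """Yield successive overlapping, Hamming-windowed frames."""
    window = np.hamming(FRAME_LEN)
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_STEP):
        yield samples[start:start + FRAME_LEN] * window
```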
Feature Extractor 33
The feature extractor 33 receives frames from the frame generator 32 and generates, for each frame, a feature or vector of features. According to the invention, the feature extractor 33 generates a feature relating to the pitch of the input signal. Means may additionally be provided to generate other features, for example LPC cepstral coefficients or MFCCs.
As shown in Figure 4, the feature extractor of the invention comprises an input 331 for receiving frames from the frame generator 32. The frames are input to a Fast Fourier Transform (FFT) processor 332, which generates a frequency spectrum of the input frame. A log power spectrum processor 333 then calculates the logarithm of the power spectrum of the input frame by taking the logarithm of the magnitude squared of each sample of the frequency spectrum of the input frame. The log of the power spectrum is input to a cepstrum processor 334 which calculates the cepstrum of the input power spectrum. A cepstrum is formed by taking the Fourier transform of the log of the power spectrum. Since the power spectrum is real, it is sufficient to carry out a discrete cosine transform (DCT) on the log power spectrum. The output of the cepstrum processor 334 is the set of cepstral coefficients F_n of the samples f_i of the log power spectrum of the input frame, i.e.

F_n = Σ_{i=0}^{N−1} f_i · cos[nπ(2i + 1)/2N]

where N is the number of samples and n is an integer.
Since the cepstral coefficients may be negative, the magnitudes of the cepstral coefficients may be input to the pitch detector 335 to detect pitch. Alternatively, the magnitude of each cepstral coefficient may be squared before being input to the pitch detector 335.
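A minimal sketch of this FFT, log power spectrum and DCT pipeline, assuming NumPy/SciPy (the small epsilon guarding against log(0) is our addition, not part of the text):

```python
import numpy as np
from scipy.fft import dct

def cepstral_coefficients(frame):
    # FFT processor 332: frequency spectrum of the windowed frame
    spectrum = np.fft.rfft(frame)
    # Log power spectrum processor 333: log of the magnitude squared
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)
    # Cepstrum processor 334: a DCT of the log power spectrum suffices
    # because the power spectrum is real
    return dct(log_power, type=2)
```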
Figures 5a-d show an example of the log power spectrum and the cepstral coefficients generated for a voiced input frame (Figs 5a and 5c) and an unvoiced input frame (Figs 5b and 5d) respectively. Pitch manifests itself in the power spectrum as harmonic ripples, the interval between these ripples being the pitch frequency. Carrying out a Fourier transform or DCT on the log of the power spectrum of an input signal identifies a cosine with a period that best matches the pitch frequency. In the cepstral domain, this appears as a spike with some activity on either side. This can be seen at the cepstral coefficient marked A in Figure 5c. The cepstral coefficients generated are input to a pitch detector 335 which examines the coefficients to find the coefficient having the highest value above a given threshold. Information relating to the detected coefficient is passed to a feature deriver 336 which derives a feature relating to the coefficient. According to a first embodiment of the invention, if such a coefficient is detected, the signal is deemed to be voiced, i.e. to include pitch. If the highest valued coefficient has a magnitude less than the threshold, the signal is deemed not to be voiced. The feature deriver 336 derives a feature which indicates whether the signal is voiced or not. This feature is used by the subsequent speech classifier 34.
According to a preferred embodiment of the invention only the cepstral coefficients which lie outside the range representing the normal speech frequency range are considered by the pitch detector 335. Generally the first 20 coefficients, F0 to F19, are discarded.
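A sketch of the pitch detector 335 under these two preferences (the threshold value is an illustrative assumption; the first 20 coefficients are skipped as described):

```python
import numpy as np

def detect_pitch_coefficient(ceps, threshold=5.0, low_cutoff=20):
    """Return the index of the highest-magnitude cepstral coefficient
    above the threshold, ignoring F0..F19, or None if the frame is
    deemed not to be voiced."""
    mags = np.abs(ceps[low_cutoff:])
    peak = int(np.argmax(mags))
    if mags[peak] <= threshold:
        return None
    return peak + low_cutoff
```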
As described so far, the feature is of a binary nature, indicating simply either that the input frame includes voiced speech or not. According to a further embodiment of the invention the feature deriver 336 derives a feature of a continuous nature which relates either to the magnitude of the detected cepstral coefficient, if any, and/or to the frequency represented by the detected cepstral coefficient, if any. The magnitude of the detected cepstral coefficient indicates the amount of voicing present in the input signal. The cepstral coefficient itself relates to the frequency of the pitch, which is fairly specific to a particular user.
The largest-valued cepstral coefficient may represent the d.c. component of the input signal, which would be represented in the cepstral domain as a monotonic decreasing function with a maximum at coefficient 0. If the detected coefficient (e.g. F34, marked A in Figure 5c) indicates the presence of pitch, it should occur as a spike in the cepstral domain rather than a continuous function. To determine if the detected coefficient is the peak of a spike, the average magnitude of the remaining coefficients under consideration is calculated and a ratio

C = (magnitude of detected coefficient) / (average magnitude)

is formed. The second harmonic of the pitch frequency also causes a peak around a coefficient occurring at double the value of the detected coefficient, e.g. around coefficient 68 (marked as B in Figure 5c). The magnitude of this coefficient and three or so coefficients on either side of this coefficient are omitted from the average magnitude calculation.
The feature deriver 336 may form a feature from the ratio C itself, or a binary decision may be made as to whether pitch is present by comparing the ratio to a threshold; if the ratio is above the threshold, pitch is deemed to be present.
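A sketch of this ratio test (the three-coefficient exclusion around the second harmonic follows the text; `ratio_threshold` is an assumed illustrative value):

```python
import numpy as np

def voicing_ratio(ceps, detected, low_cutoff=20, exclude=3):
    """Ratio C: magnitude of the detected coefficient over the average
    magnitude of the remaining coefficients under consideration."""
    mags = np.abs(ceps)
    keep = np.ones(len(ceps), dtype=bool)
    keep[:low_cutoff] = False    # not under consideration at all
    keep[detected] = False       # the detected coefficient itself
    # Omit the second-harmonic peak (around twice the detected index)
    # and three or so coefficients on either side of it
    second = 2 * detected
    keep[max(0, second - exclude):second + exclude + 1] = False
    return mags[detected] / np.mean(mags[keep])

def is_voiced(ceps, detected, ratio_threshold=2.0):
    # Binary decision: pitch deemed present if the ratio exceeds a threshold
    return voicing_ratio(ceps, detected) > ratio_threshold
```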
Classifier 34
Referring to Figure 6, the classifier 34 is of a conventional design and, in this embodiment, comprises an HMM classifying processor 341, an HMM state memory 342, and a mode memory 343. The state memory 342 comprises a state field 3421, 3422, ..., for each of the plurality of speech parts to be recognised. For example, a state field may be provided in the state memory 342 for each phoneme of a word to be recognised. There may also be provided a state field for noise/silence.
Each state field in the state memory 342 includes a pointer field 3421b, 3422b, ... storing a pointer address to a mode field set 361, 362, ... in mode memory 343. Each mode field set comprises a plurality of mode fields 3611, 3612, ... each comprising data defining a multidimensional Gaussian distribution of feature coefficient values which characterise the state in question. For example, if there are d coefficients in each feature (for instance 8 MFCC coefficients and the voicing feature of the invention), the data stored in each mode field 3611, 3612, ... characterising each mode is: a constant C, a set of d feature mean values μ_i, and a set of d feature deviations σ_i; in other words, a total of 2d + 1 numbers.
The number N of mode fields 3611, 3612, ... in each mode field set 361, 362, ... is variable. The mode fields are generated during the training phase and represent the feature(s) derived by the feature extractor. During recognition, the classification processor 341 is arranged to read each state field within the memory 342 in turn, and calculate for each, using the current input feature coefficient set output by the feature extractor 33 of the invention, the probability that the input feature set or vector corresponds to the corresponding state. To do so, as shown in Figure 7, the processor 341 is arranged to read the pointer in the state field; to access the mode field set in the mode memory 343 to which it points; and, for each mode field j within the mode field set, to calculate a modal probability P_j as follows:
P_j = C_j · exp(−½ · Σ_{i=1}^{d} (x_i − μ_i)² / σ_i²)

where x_i is the i-th coefficient of the current input feature vector and C_j, μ_i and σ_i are the constant, means and deviations stored in mode field j (the diagonal-covariance Gaussian form implied by the 2d + 1 numbers described above; the original equation is an image in the source).
Next, the processor 341 calculates the state probability by summing the modal probabilities P_j. Accordingly, the output of the classification processor 341 is a plurality of state probabilities P, one for each state in the state memory 342, indicating the likelihood that the input feature vector corresponds to each state. It will be understood that Figure 7 is merely illustrative of the operation of the classifier processor 341. In practice, the mode probabilities may each be calculated once, and temporarily stored, to be used in the calculation of all the state probabilities relating to the phoneme to which the modes correspond.
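A sketch of this per-state calculation, with each mode stored as the (C, μ, σ) triple of 2d + 1 numbers described above (the array layout is our assumption):

```python
import numpy as np

def state_probability(x, modes):
    """x: input feature vector of d coefficients.
    modes: iterable of (C, mu, sigma), with mu and sigma of length d.
    Returns the state probability: the sum of modal probabilities P_j."""
    total = 0.0
    for C, mu, sigma in modes:
        total += C * np.exp(-0.5 * np.sum(((x - mu) / sigma) ** 2))
    return total
```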
The classifying processor 341 may be a suitably programmed digital signal processing (DSP) device and may in particular be the same digital signal processing device as the feature extractor 33.
Sequencer 35
Referring to Figure 8, the sequencer 35 is conventional in design and, in this embodiment, comprises a state probability memory 353 which stores, for each frame processed, the state probabilities output by the classifier processor 341; a state sequence memory 352; a parsing processor 351; and a sequencer output buffer 354.
The state sequence memory 352 comprises a plurality of state sequence fields 3521, 3522, ..., each corresponding to a word or phrase sequence to be recognised consisting, in this example, of a string of phonemes. Each state sequence in the state sequence memory 352 comprises, as illustrated in Figure 9, a number of states P1, P2, ..., PN and, for each state, two probabilities: a repeat probability (Pi1) and a transition probability to the following state (Pi2). The observed sequence of states associated with a series of frames may therefore comprise several repetitions of each state P in each state sequence model 3521 etc.; for example:

Frame number:  1   2   3   4   5   6   7   8   9  ...  Z   Z+1
State:         P1  P1  P1  P2  P2  P2  P2  P2  P2 ...  Pn  Pn
As shown in Figure 10, the sequencing processor 351 is arranged to read, at each frame, the state probabilities output by the classifier processor 341, and the previous stored state probabilities in the state probability memory 353, and to calculate the most likely path of states to date over time, and to compare this with each of the state sequences stored in the state sequence memory 352. The calculation employs the well known Hidden Markov Model method described generally in "Hidden Markov Models for Automatic Speech Recognition: theory and applications", S. J. Cox, British Telecom Technology Journal, April 1988, p. 105. Conveniently, the HMM processing performed by the sequencing processor 351 uses the well known Viterbi algorithm. The sequencing processor 351 may, for example, be a microprocessor such as the Intel™ i-486™ microprocessor or the Motorola™ 68000 microprocessor, or may alternatively be a DSP device (for example, the same DSP device as is employed for any of the preceding processors).
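A sketch of Viterbi scoring for one left-to-right state sequence of the kind shown in Figure 9 (log probabilities are used to avoid underflow; the array layout is our assumption):

```python
import numpy as np

def viterbi_score(obs, repeat, trans):
    """obs[t, i]: classifier probability of state i at frame t (T x N).
    repeat[i]: repeat probability Pi1; trans[i]: transition
    probability Pi2 to the following state (cf. Figure 9).
    Returns the log probability of the best path ending in the final state."""
    T, N = obs.shape
    score = np.full(N, -np.inf)
    score[0] = np.log(obs[0, 0])          # a path must start in state 1
    for t in range(1, T):
        stay = score + np.log(repeat)     # remain in the same state
        move = np.full(N, -np.inf)
        move[1:] = score[:-1] + np.log(trans[:-1])  # advance one state
        score = np.maximum(stay, move) + np.log(obs[t])
    return score[-1]
```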
Accordingly, for each state sequence (corresponding to a word, phrase or other speech sequence to be recognised) a probability score is output by the sequencing processor 351 at each frame of input speech. For example, the state sequences may comprise the names in a telephone directory. When the end of the utterance is detected, a label signal indicating the most probable state sequence is output from the sequencing processor 351 to the output port 38, to indicate that the corresponding name, word or phrase has been recognised. The frame generator 32, classifier 34 and sequencer 35 are all conventional in design and have no limiting effect on the invention.
When used with a voice activity detector, the feature is used to control the operation of a speech responsive system. Such apparatus will thus only trigger if voiced speech is detected.
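Putting the earlier sketches together, a per-frame voicing gate of the kind described might look like this (all function names are the hypothetical ones introduced above; `handle_speech` and `samples` stand in for the speech responsive system and its input):

```python
def frame_is_voiced(frame):
    ceps = cepstral_coefficients(frame)
    detected = detect_pitch_coefficient(ceps)
    return detected is not None and is_voiced(ceps, detected)

# Trigger the speech responsive system only on voiced frames.
for frame in frame_generator(samples):
    if frame_is_voiced(frame):
        handle_speech(frame)
```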

Claims

CLAIMS
1. A feature extractor for use with an automated speech system, said feature extractor comprising: an input for receiving an input digital signal, said input signal being divided into frames; a spectrum calculation device to calculate the logarithm of the spectrum of an input frame; a cepstrum calculation device to calculate the cepstrum of the logarithm of the spectrum of a frame; a pitch detector to detect a cepstral coefficient which meets a predetermined criterion; and a feature deriver to derive a feature relating to the detected cepstral coefficient, which feature represents whether the input frame includes voiced speech.
2. Apparatus according to claim 1 wherein the spectrum calculation device calculates the log of the power spectrum of the input signal.
3. Apparatus according to claim 1 or 2 wherein the feature relates to the magnitude of the detected cepstral coefficient, if any.
4. Apparatus according to claim 1, 2 or 3 wherein the feature relates to the frequency represented by the detected cepstral coefficient, if any.
5. Apparatus according to any of claims 1 to 4 wherein only those cepstral coefficients having a magnitude greater than a predetermined threshold are considered by the pitch detector.
6. Apparatus according to any preceding claim wherein only those cepstral coefficients within a predetermined limited range of the cepstral coefficients are considered by the pitch detector.
7. Apparatus according to claim 6 wherein the cepstral coefficients which lie outside the area representing the normal speech frequency range are discarded.
8. Apparatus according to any preceding claim wherein the pitch detector includes means to examine the detected coefficient, if any, to determine if it represents a d.c. component of the input signal.
9. Apparatus according to any preceding claim wherein the feature relates to the ratio of the magnitude of the detected cepstral coefficient and the average magnitude of the remaining coefficients.
10. A speech recogniser including a feature extractor according to any preceding claim.
11. A voice activity detector including a feature extractor according to any of claims 1 to 9.
12. A method of extracting a feature from an input signal for use in speech recognition comprising: receiving an input signal, said input signal being divided into frames; calculating the logarithm of the spectrum of an input frame; calculating cepstral coefficients from the spectrum; detecting a cepstral coefficient which meets a predetermined criterion; and deriving a feature relating to the detected cepstral coefficient, which feature represents whether the input frame includes voiced speech.
13. A method according to claim 12 wherein the cepstral coefficients are calculated from the power spectrum of the input signal.
14. A method according to claim 12 or 13 wherein the feature relates to the magnitude of the detected cepstral coefficient, if any.
15. A method according to any of claims 12 to 14 wherein the feature relates to the frequency represented by the detected cepstral coefficient, if any.
16. A method according to any of claims 12 to 15 wherein only those cepstral coefficients having a magnitude greater than a predetermined threshold are considered in the detecting step.
17. A method according to any of claims 12 to 16 wherein only those cepstral coefficients within a predetermined limited range of the cepstral coefficients are considered in the detecting step.
18. A method according to any one of claims 12 to 17 wherein the detected coefficient is examined to determine if it represents a d.c. component of the input signal.
19. A method according to any of claims 12 to 18 wherein the feature relates to the ratio of the magnitude of the detected cepstral coefficient and the average magnitude of the remaining coefficients.
PCT/GB1997/000816 1996-03-29 1997-03-24 Speech processing WO1997037345A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU21669/97A AU2166997A (en) 1996-03-29 1997-03-24 Speech processing

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP96302235.5 1996-03-29
EP96302235 1996-03-29
US68426296A 1996-07-19 1996-07-19

Publications (1)

Publication Number Publication Date
WO1997037345A1

Family

ID=26143639

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1997/000816 WO1997037345A1 (en) 1996-03-29 1997-03-24 Speech processing

Country Status (1)

Country Link
WO (1) WO1997037345A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4817159A (en) * 1983-06-02 1989-03-28 Matsushita Electric Industrial Co., Ltd. Method and apparatus for speech recognition
EP0128755A1 (en) * 1983-06-07 1984-12-19 Matsushita Electric Industrial Co., Ltd. Apparatus for speech recognition
EP0219109A2 (en) * 1985-10-16 1987-04-22 Toppan Printing Co., Ltd. Method of analyzing input speech and speech analysis apparatus therefor
US4802221A (en) * 1986-07-21 1989-01-31 Ncr Corporation Digital system and method for compressing speech signals for storage and transmission
EP0442342A1 (en) * 1990-02-13 1991-08-21 Matsushita Electric Industrial Co., Ltd. Voice signal processing device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1675102A2 (en) * 2004-12-13 2006-06-28 LG Electronics Inc. Method for extracting feature vectors for speech recognition
EP1675102A3 (en) * 2004-12-13 2006-07-26 LG Electronics Inc. Method for extracting feature vectors for speech recognition
US8719019B2 (en) 2011-04-25 2014-05-06 Microsoft Corporation Speaker identification
EP3242295A1 (en) * 2016-05-06 2017-11-08 Nxp B.V. A signal processor
US10297272B2 (en) 2016-05-06 2019-05-21 Nxp B.V. Signal processor

Similar Documents

Publication Publication Date Title
KR100312919B1 (en) Method and apparatus for speaker recognition
US6195634B1 (en) Selection of decoys for non-vocabulary utterances rejection
AU712412B2 (en) Speech processing
US5611019A (en) Method and an apparatus for speech detection for determining whether an input signal is speech or nonspeech
EP0846318B1 (en) Pattern recognition
US6029124A (en) Sequential, nonparametric speech recognition and speaker identification
EP0764937B1 (en) Method for speech detection in a high-noise environment
US5596680A (en) Method and apparatus for detecting speech activity using cepstrum vectors
Ying et al. A probabilistic approach to AMDF pitch detection
WO1994022131A2 (en) Speech recognition with pause detection
US5764853A (en) Voice recognition device and method using a (GGM) Guaranteed Global minimum Mapping
Zolnay et al. Robust speech recognition using a voiced-unvoiced feature.
US6470311B1 (en) Method and apparatus for determining pitch synchronous frames
Zolnay et al. Extraction methods of voicing feature for robust speech recognition.
WO1994022132A1 (en) A method and apparatus for speaker recognition
JP2797861B2 (en) Voice detection method and voice detection device
WO1997037345A1 (en) Speech processing
Nickel et al. Robust speaker verification with principal pitch components
Balasubramaniyam et al. Feature based Speaker Embedding on conversational speeches
JPH05249987A (en) Voice detecting method and device
Islam et al. SPEAKER IDENTIFICATION SYSTEM USING EIGENFACE CLASSIFICATION ENGINE
Khaing et al. A Speech Recognition System for Myanmar Language

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CA CN JP KR MX NO NZ SG US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 97535011

Format of ref document f/p: F

NENP Non-entry into the national phase

Ref country code: CA

122 Ep: pct application non-entry in european phase