WO1998001848A1 - Speech synthesis system - Google Patents

Speech synthesis system

Info

Publication number
WO1998001848A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
voiced
pitch
speech
lpc
Prior art date
Application number
PCT/GB1997/001831
Other languages
French (fr)
Inventor
Costas Xydeas
Original Assignee
The Victoria University Of Manchester
Priority date
Filing date
Publication date
Priority claimed from GBGB9614209.6A external-priority patent/GB9614209D0/en
Application filed by The Victoria University Of Manchester filed Critical The Victoria University Of Manchester
Priority to EP97930643A priority Critical patent/EP0950238B1/en
Priority to DE69724819T priority patent/DE69724819D1/en
Priority to AT97930643T priority patent/ATE249672T1/en
Priority to JP10504943A priority patent/JP2000514207A/en
Priority to AU34523/97A priority patent/AU3452397A/en
Publication of WO1998001848A1 publication Critical patent/WO1998001848A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93Discriminating between voiced and unvoiced parts of speech signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90Pitch determination of speech signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93Discriminating between voiced and unvoiced parts of speech signals
    • G10L2025/937Signal energy in various frequency bands
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Definitions

  • the present invention relates to speech synthesis systems, and in
  • a speech communication system is to be capable of
  • Unvoiced speech is produced by turbulent air flow at a constriction and does not
  • parameters used to represent a frame are the pitch period, the magnitude and
  • phase function is also defined using linear frequency
  • randomness in the signal is introduced by adding jitter to the amplitude
  • CELP code-excited linear prediction
  • the system employs 20msecs coding frames which are classified
  • a pitch period in a given frame is
  • coefficients are coded using a differential vector quantization scheme.
  • LPC synthesis filter the output of which provides the synthesised voiced speech
  • An amount of randomness can be introduced into voiced speech by
  • Periodic voice excitation signals are mainly represented by the "slowly
  • Phase information is
  • one known system operates on 5msec frames, a pitch period being selected for voiced frames and DFT transformed to yield
  • Unvoiced speech is CELP coded.
  • each frame is converted into a coded signal including a
  • peaks are determined and used to define a pitch estimate.
  • the system avoids undue complexity and may be readily implemented.
  • the pitch estimate is defined using an iterative process.
  • single reference sample may be used, for example centred with respect to the
  • the correlation function may be clipped using a threshold value
  • a predetermined factor for example smaller than 0.9 times the
  • the pitch estimation procedure is based on a least squares
  • the algorithm defines the pitch as a number whose
  • values may be limited to integral numbers which are not consecutive, the
  • each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed
  • a threshold value is
  • Peaks may be located using a second order polynomial.
  • the samples may be
  • the threshold value may be calculated by identifying
  • Peaks may be defined as those values which are greater than
  • a peak may be rejected from consideration if
  • neighbouring peaks are of a similar magnitude, e.g. more than 80% of the
  • a harmonic may be considered as not being associated with a
  • the spectrum may be divided into bands of fixed width and a
  • the frequency range may be divided into two or more bands of variable width,
  • the spectrum may be divided into fixed bands, for example fixed
  • frequency band e.g. 0-500Hz
  • the highest frequency band for example 3500Hz to 4000Hz, may always
  • 3000Hz to 3500Hz may be automatically classified as weakly voiced.
  • the strongly/weakly voiced classification may be determined using a majority
  • alternate bands may be alternately assigned strongly voiced and weakly voiced classifications.
  • excitation signal to recover the speech signal which takes into account this classification. It is an object of the present invention to provide such a system.
  • each frame is defined as voiced or unvoiced, each frame is converted into
  • a coded signal including a pitch period value, a frame voiced/unvoiced
  • the speech signal is reconstructed by generating an
  • the excitation signal is represented by a function which includes a first
  • harmonic frequency component the frequency of which is dependant upon the
  • the random component may be introduced by reducing the amplitude of
  • harmonic oscillators assigned the weakly voiced classification for example by
  • the oscillators producing random signals may be randomised at pitch intervals. Thus for a weakly voiced band, some periodicity remains but the power of the
  • an input speech signal is processed to produce an
  • the discarded magnitude values are represented at
  • magnitude values to be quantized are always the same and predetermined on the
  • each voiced frame is converted into a coded signal including a pitch
  • the pitch segment is DFT transformed, the mean value of the
  • the selected magnitudes are recovered, and each of the
  • the input vector is transformed to a fixed size vector which is then
  • variable input vector is directly quantized with a
  • variable size training vectors are obtained from variable size training vectors and an interpolation
  • the invention is applicable in particular to pitch synchronous low bit rate
  • the interpolation process is linear.
  • the interpolation process is applied to produce from the
  • codebook vectors a set of vectors of that given dimension.
  • the dimension of the input vectors is reduced by taking into
  • the remaining amplitudes i.e. in the region of
  • 3.4kHz to 4 kHz are set to a constant value.
  • the constant value is
  • the backward prediction may be performed on a harmonic basis
  • each frame is converted into a coded signal including an estimated pitch
  • the excitation signal the excitation spectral envelope is shaped according to the
  • the result is a system which is capable of delivering high
  • the invention is based on the observation that
  • the magnitude values may be obtained by spectrally sampling a modified
  • the modified LPC synthesis filter may have reduced feed back gain and
  • the value of the feed back gain may be controlled by the performance of the LPC model such that it is
  • the reproduced speech signal may be equal to the energy of the original speech
  • each frame is converted into a coded signal including LPC filter
  • each pair of excitation signals comprising a first
  • the outputs of the first and second LPC filters are weighted by
  • a window function such as a Hamming window such that the magnitude of
  • the output of the first filter is decreasing with time and the magnitude of the
  • Figure 1 is a general block diagram of the encoding process in accordance with the present invention.
  • Figure 2 illustrates the relationship between coding and matrix
  • Figure 3 is a general block diagram of the decoding process
  • Figure 4 is a block diagram of the excitation synthesis process
  • Figure 5 is a schematic diagram of the overlap and add process
  • Figure 6 is a schematic diagram of the calculation of an instantaneous
  • Figure 7 is a block diagram of the overall voiced/unvoiced classification
  • Figure 8 is a block diagram of the pitch estimation process
  • Figure 9 is a schematic diagram of two speech segments which participate
  • Figure 10 is a schematic diagram of speech segments used in the calculation of the crosscorrelation function value
  • Figure 11 represents the value allocated to a parameter used in the
  • Figure 12 is a block diagram of the process used for calculated the
  • Figure 13 is a flow chart of a pitch estimation algorithm
  • Figure 14 is a flow chart of a procedure used in the pitch estimation
  • Figure 15 is a flow chart of a further procedure used in the pitch
  • Figure 16 is a flow chart of a further procedure used in the pitch
  • Figure 17 is a flow chart of a threshold value selection procedure
  • Figure 18 is a flow chart of the voiced/unvoiced classification process
  • Figure 19 is a schematic diagram of the voiced/unvoiced classification
  • Figure 20 is a flow chart of the procedure used to determine offset values
  • Figure 21 is a flow chart of the pitch estimation algorithm
  • Figure 22 is a flow chart of a procedure used to impose constraints on
  • Figures 23, 24 and 25 represent different portions of a flow chart of a
  • Figure 26 is a general block diagram of the LPC analysis and LPC
  • Figure 27 is a general flow chart of a strongly or weakly voiced
  • Figure 28 is a flow chart of the procedure responsible for the
  • Figure 29 represents a speech waveform obtained from a particular
  • Figure 30 shows frequency tracks obtained for the speech utterance of
  • Figure 31 shows to a larger scale a portion of Figure 30 and represents the
  • Figure 32 shows a magnitude spectrum of a particular speech segment
  • Figure 33 is a general block diagram of a system for representing
  • Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33;
  • Figure 35 is a general block diagram of a quantisation process
  • Figure 36 is a general block diagram of a differential variable size
  • Figure 37 represents the hierarchical structure of a mean gain shape
  • quantiser A system in accordance with the present invention is described below, firstly in general terms and then in greater detail.
  • the system operates on an LPC residual signal on a frame by frame basis.
  • voiced speech depends on the pitch frequency of the signal.
  • a voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to be handled in different ways.
  • Unvoiced frames are modelled in terms of an RMS value and a random time series.
  • voiced frames a pitch period estimate is obtained and used to define a pitch segment which is centred at the middle of the frame.
  • Pitch segments from adjacent frames are DFT transformed and only the resulting pitch segment magnitude information is coded and transmitted.
  • pitch segment magnitude samples are classified as strongly or weakly voiced.
  • the system transmits for every voiced frame the pitch period value, the magnitude spectral information of the pitch segment, the strong/weak voiced classification of the pitch magnitude spectral values, and the LPC coefficients.
  • the information which is transmitted for every voiced frame is, in addition to voiced/unvoiced information, the pitch period value, the magnitude spectral information of the pitch segment, and the LPC filter coefficients.
  • MG_j are decoded pitch segment magnitude values and phase_j(i) is calculated from the integral of the linearly interpolated instantaneous harmonic frequencies ω_j(i).
  • K is the largest value of j for which ω_j^n(i) lies within the signal bandwidth.
  • the initial phase for each harmonic is set to zero. Phase continuity is preserved across the boundaries of successive interpolation intervals.
  • the synthesis process is performed twice however, once using the magnitude spectral values MG_j^(n+1) of the pitch segment derived from the current (n+1)th frame and again using the magnitude values MG_j^n of the pitch segment derived in the previous nth frame.
  • the phase function phase_j(i) in each case remains the same.
  • the resulting residual signals Res_n(i) and Res_(n+1)(i) are used as inputs to corresponding LPC synthesis filters calculated for the nth and (n+1)th speech frames.
  • the two LPC synthesised speech waveforms are then weighted by W_(n+1)(i) and W_n(i) to yield the recovered speech signal.
  • the LPC excitation signal is based on a "mixed" excitation model which allows for the appropriate mixing of periodic and random excitation components in voiced frames on a frequency-band basis. This is achieved by operating the system such that the magnitude spectrum of the residual signal is examined, and applying a peak-picking process, near the harmonic frequencies, to detect possible dominant spectral peaks.
  • NRS random components are spaced at 50 Hz intervals symmetrically about each harmonic frequency ω_j
  • the amplitudes of the NRS random components are set to MG_j / (√2 × NRS). Their initial phases are selected randomly from the [-π, +π] region at
  • the hv_j information must be transmitted to be available at the receiver and, in order to reduce the bit rate allocated to hv_j, the bandwidth of the input signal is divided into a number of fixed size bands BD_k and a "strongly" or "weakly" voiced flag Bhv_k is assigned for each band.
  • a "strongly" or "weakly" voiced flag Bhv_k is assigned for each band.
  • a weakly voiced band a highly periodic signal is reproduced.
  • a signal which combines both periodic and aperiodic components is required.
  • the remaining spectral bands can be strongly or weakly voiced.
  • Figure 1 schematically illustrates processes operated by the system encoder. These processes are referred to in Figure 1 as Processes I to VII and these terms are used throughout this document.
  • a speech signal is input and processes I, III, IV, VI AND VII produce outputs for transmission.
  • Process I Assuming that the first Matrix Quantization analysis frame (MQA) of k×M samples is available, each of the k coding frames within the MQA is classified as voiced or unvoiced (V_n) using Process I.
  • a pitch estimation part of Process I provides a pitch period value P_n only when a coding frame is voiced.
  • k/m is an integer and represents the frame dimension of the matrix quantizer employed in Process III.
  • the quantized coefficients a are used to derive a residual signal R n (i).
  • P n is the pitch period value associated with the nth frame. This segment is centred in the middle of the frame.
  • the selected P_n samples are DFT transformed (Process V) to yield (P_n + 1)/2 spectral magnitude values MG_j^n, 1 ≤ j ≤ (P_n + 1)/2
  • the magnitude information is coded (using Process VI) and transmitted.
  • Process IV produces quantized Bhv information, which for voiced frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced decision V the pitch period P n , the quantized LPC coefficients a of the corresponding LPC frame, and the magnitude values MG" . In unvoiced frames only the quantized value and the quantized LPC filter coefficients a are transmitted.
  • Figure 3 schematically illustrates processes operated by the system decoder.
  • the decoder Given the received parameters of the nth coding frame and those of the previous (n-1)th coding frame, the decoder synthesises a speech signal S_n(i) that extends from the middle of the (n-1)th frame to the middle of the nth frame.
  • This synthesis process involves the generation in parallel of two excitation signals Res_n(i) and Res_(n-1)(i) which are used to drive two independent LPC synthesis filters 1/A_n(z) and 1/A_(n-1)(z), the coefficients of which are derived from the transmitted quantized coefficients a.
  • the process commences by considering the voiced unvoiced status V k , where k is equal to n or n-1, (see Figure 4).
  • V k 0
  • a gaussian random number generator RG(0,1) of zero mean and unit variance provides a time series which is subsequently scaled by the √E value received for this frame. This is effectively the required:
  • the Res_k(i) excitation signal is defined as the summation of a "harmonic" Res_k^h(i) component and a "random" Res_k^r(i) component.
  • the top path of the V_k = 1 part of the synthesis in Figure 4, which provides the harmonic component of this mixed excitation model, always calculates the instantaneous harmonic frequency function ω_j^n(i) which is associated with the interpolation interval that is defined between the middle points of the nth and (n-1)th frames (i.e. this action is independent of the value of k).
  • ω_j^n(i) is calculated using the pitch frequencies f_j^1, f_j^2 and linear interpolation, i.e.
  • the f_j^1 value is calculated during the decoding process of the previous (n-1)th coding frame. hv_j^n is the strongly/weakly voiced classification (0 or 1) of the jth harmonic ω_j^n.
  • P_n and P_(n-1) are the received pitch estimates from the n and n-1 frames.
  • the associated phase value is:
  • the random excitation signal Res_k^r(i) can be generated by the summation of random cosines located 50 Hz apart, where their phase is randomised every ε samples, and ε < M, i.e.
  • Res_k^r(i) = Σ_l A_l cos(2π f_l i + Θ_l), where the f_l are located 50 Hz apart and each Θ_l is redrawn from RU(-π, +π) every ε samples
  • 1/A_n(z) becomes 1/A_(n+1)(z) with the memory of 1/A_n(z).
  • This is valid in all cases except during an unvoiced to voiced transition, where the memory of the 1/A_(n+1)(z) filter is set to zero.
  • the coefficients of the 1/A_n(z) and 1/A_(n-1)(z) synthesis filters are calculated directly from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame size L is equal to M samples. However, when L ≠ M (usually L>M) linear interpolation is used on the filter coefficients (defined every L samples) so that the transfer function of the synthesis filter is updated every M samples.
  • the output signals of these filters denoted as X n .,(i) and X n (i), are weighted, overlapped and added as schematically illustrated in Figure 5 to yield X n (i) i.e:
  • PF(z) is the conventional post filter:
  • HP(z) is defined as: HP(z) = b(1 − c·z⁻¹)
  • a scaling factor SC is calculated every LPC frame of L samples.
  • is associated with the middle of the lth LPC frame as illustrated in Figure 6.
  • the filtered samples from the middle of the (l-1)th frame to the middle of the lth frame are then multiplied by SC_l(i) to yield the final output of the system, where:
  • the scaling process introduces an extra half LPC frame delay into the coding-decoding process.
  • the above described energy scaling procedure operates on an LPC frame basis in contrast to both the decoding and PF(z), HP(z) filtering procedures which operate on the basis of a frame of M samples.
  • Process I derives a voiced/unvoiced (V/UV) classification V_n for the nth input coding frame and also assigns a pitch estimate P_n to the middle sample M_n of this frame. This process is illustrated in Figure 7.
  • V/UV voiced/unvoiced
  • the V/UV and pitch estimation analysis frame is centred at the middle M_(n+1) of the (n+1)th coding frame with 237 samples on either side.
  • the pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the pitch estimation process.
  • the 294 input samples are used to calculate a crosscorrelation function CR(d), where d is shown in Figure 9 and 20 ⁇ d ⁇ 147.
  • Figure 9 shows the two speech segments which participate in the calculation of the crosscorrelation function value at "d" delay.
  • the crosscorrelation function ρ_d(j) is calculated for the segments {X}_d, {R}, as:
  • Figure 12 is a block diagram of the process involving the calculation of the CR function and the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as:
  • th(d) = CR(d_c) × b − (d − d_c) × a − c    (28)
  • the algorithm examines the length of the G_0 runs which exist between successive G_s segments (i.e. G_s and G_(s+1)), and when G_0 < 17, then the G_s segment with the max CR(d) value is kept. This procedure yields CR_L(d), which is then examined by the following "peak picking" procedure.
  • CR_L(d) > CR_L(d − 1) and CR_L(d) > CR_L(d + 1)
  • certain peaks can be rejected if: CR_L(loc(k)) < CR_L(loc(k + 1)) × 0.9
  • CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch Estimation algorithm (MHRPE) shown in Figure 8, whose output is P_(Mn+1).
  • MHRPE Modified High Resolution Pitch Estimation algorithm
  • The flowchart of this MHRPE procedure is shown in Figure 13, where P is initialised with 0 and, at the end, the estimated P is the requested P_(Mn+1).
  • the main pitch estimation procedure is based on a Least Squares Error (LSE) algorithm which is defined as follows: For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 × j, i.e. j ∈ {21, 23, 25, 27, 30, 33, 36, 40, 44, 48, 53, 58, 64, 70, 77, 84, 92, 101, 111, 122, 134}. (Thus 21 iterations are performed.)
  • LSE Least Squares Error
  • V/UV part of Process I calculates the status V Mn+
  • the flowchart of this part of the algorithm is shown in Figure 18 where "V" represents the output V/UV flag of this procedure. Setting the "V” flag to 1 or 0 indicates voiced or unvoiced classification respectively.
  • the "CR” parameter denotes the maximum value of the CR function which is calculated in the pitch estimation process.
  • a diagrammatic representation of the voiced/unvoiced procedure is given in Figure 19.
  • a multipoint pitch estimation algorithm accepts P_(Mn+1), P_(Mn+1+d1), P_(Mn+1+d2), V_(n-1), P_(n-1), V'_n, P'_n to provide a preliminary pitch value P'_(n+1).
  • the flowchart of this multipoint pitch estimation algorithm is given in Figure 21, where P_1, P_2 and P_0 represent the pitch estimates associated with the M_(n+1+d)
  • V'_(n+1) and P'_(n+1) produced from this section are then used in the next pitch post processing section together with V_(n-1), V'_n, P_(n-1) and P'_n to yield the final voiced/unvoiced and pitch estimate parameters V_n and P_n for the nth coding frame.
  • This pitch post processing stage is defined in the flowchart of Figures 23, 24 and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24 being the input to Figure 25.
  • "Pate" and "V n " represent the pitch estimate and voicing flag respectively, which correspond to the nth coding frame prior to post processing (i.e.
  • the LPC analysis process (Process II of Figure 1) can be performed using the Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used, although simple autocorrelation schemes could be employed without a noticeable effect in the decoded speech quality.
  • the LPC coefficients are then transformed to an LSP representation. Typical values for the number of coefficients are 10 to 12 and a 10th order filter has been used. LPC analysis processes are well known and described in the literature, for example "Digital Processing of Speech Signals", L.R.
  • LSP coefficients are used to represent the data. These 10 coefficients could be scalar quantized using 37 bits with the following bit allocation pattern [3,4,4,4,4,4,4,4,3,3]. This is a relatively simple process, but the resulting bit rate of 1850 bits/second is unnecessarily high.
  • the LSP coefficients can be Vector Quantised (VQ) using a Split-VQ technique.
  • VQ Vector Quantised
  • the Split-VQ technique an LSP parameter vector of dimension "p" is split into two or more subvectors of lower dimensions and then, each subvector is Vector Quantised separately (when Vector Quantising the subvectors a direct VQ approach is used).
  • the LSP transformed coefficient vector C, which consists of "p" consecutive coefficients (c1, c2, ..., cp), is split into "K" vectors C_k (1 ≤ k ≤ K), with the corresponding dimensions d_k (1 ≤ d_k ≤ p).
  • the Split-VQ becomes equivalent to Scalar Quantisation.
  • the Split-VQ becomes equivalent to Full
  • m(k) represents the spectral dimension of the kth submatrix and N is the SMQ
  • Er(t) is the normalised energy of the prediction error of the (l+t)th frame
  • En(t) is the RMS value of the (l+t)th speech frame
  • Aver(En) is the average RMS value of the N LPC frames used in SMQ.
  • the values of the constants ⁇ and ⁇ l are set to 0.2 and 0.15 respectively.
  • the overall SMQ quantisation process that yields the quantised LSP coefficients vectors / ' to / 1+N ' 1 for the 1 to 1+N-l analysis frames is shown in Figure 26.
  • a 5Hz bandwidth expansion is also included in the inverse quantisation process.
  • Process IV of Figure 1 This process is concerned with the mixed voiced classification of harmonics.
  • the flowchart of Process IV is given in Figure 27.
  • the R" array of 160 samples is Hamming windowed and augmented to form a 512 size array, which is then FFT processed.
  • the maximum and minimum values MGR_max, MGR_min of the resulting 256 spectral magnitude values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the magnitude spectrum.
  • the clipped MGR array is searched to define peaks MGR(P) satisfying:
  • For each peak MGR(P), "supported" by the MGR(P+1) and MGR(P−1) values, a second order polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In particular peaks are rejected:
  • spectral peaks in the neighbourhood of loc(MGR(P)) (i.e in the range (loc(MGR(P))-fo/2 to loc(MGR(P))+fo/2 where fo is the fundamental frequency in Hz), whose value is larger than 80% of MGR(P) or b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P).
  • two thresholds are defined as follows:
  • TH1 = 0.15 × fo.
  • loc(MGR_d(k)) − loc(MGR_d(k − 1)) is compared to 1.5 × fo + TH2, and if
  • classification hv is zero (weakly voiced). (loc(MGR_d(k)) is the location of the kth dominant
  • loc(k) is the location of the kth
  • loc (MGR d (k)) loc(K).
  • the spectrum is divided into bands of 500Hz each and a strongly voiced/weakly voiced flag Bhv is assigned for each band.
  • the Bhv values of the remaining 5 bands are determined using a majority decision rule on the hv j values of the j harmonics which fall within the band under consideration.
  • the hv ( of a specific harmonic j is equal to the Bhv value of the corresponding band.
  • the hv information may be transmitted with 5 bits.
  • the 680 Hz to 3400 Hz range is represented by only two variable size bands.
  • the Fc frequency that separates these two bands can be one of the following:
  • the Fc frequency is selected by examining the three bands sequentially defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics which fall within a band.
  • Figures 29 and 30 represent respectively an original speech
  • the horizontal axis represents time in terms of frames each of
  • Figure 31 shows to a larger scale a section of Figure 30, and represents
  • Waveform A represents the magnitude
  • Waveforms B, C and D represent the normalised Short-Term magnitude spectrum of the
  • the hybrid model introduces an appropriate amount of randomness where required in the 3 ⁇ /4
  • the DFT For a real-valued sequence x(i) of P points the DFT may be expressed as:
  • the P n point DFT will yield a double-side spectrum.
  • the magnitude of all the non DC components must be multiplied by a factor of 2.
  • the total number of single side magnitude spectrum values, which are used in the reconstruction process, is equal to
  • MSVSAR modified single value spectral amplitude representation
  • MSVSAR is based on the observation that some of the speech spectrum resonance and anti- resonance information is also present at the residual magnitude spectrum (G.S. Kang and S.S. Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction Vocoder", IEEE Trans. Acoust., Speech and Signal Proc, Vol. ASSP-33, pp.377-386, 1985).
  • LPC inverse filtering can not produce a residual signal of absolutely flat magnitude spectrum mainly due to: a) the "cascade representation" of formants by the LPC filter 1/A(z), which results in the magnitudes of the resonant peaks being dependent upon the pole locations of the 1/A(z) all-pole filter and b) the LPC quantisation noise.
  • G R and G N are defined as follows:
  • x_r^n(i) represents a sequence of 2P_n speech samples centred in the middle of the nth coding frame from which the mean value is being calculated and removed
  • the G l( parameter represents a constant whose value is set to 0.25.
  • Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose frequency response consists of nearly equalised resonant peaks, the locations of which are very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback gain G R is controlled by the performance of the LPC model (i.e. it is proportional to the normalised LPC prediction error). In addition Equation 34 ensures that the energy of the reproduced speech signal is equal to the energy of the original speech waveform. Robustness is increased by computing the speech RMS value over two pitch periods. Two alternative magnitude spectrum representation techniques are described below, which allow for better coding of the magnitude information and lead to a significant improvement in reconstructed speech quality.
  • the first of the alternative magnitude spectrum representation techniques is referred to below as the "Na-amplitudes system".
  • the basic principle of this MG" quantisation system is to represent accurately those MG" values which correspond to the Na largest speech Short
  • MG^n magnitudes are subjectively more important for accurate quantization. The system subsequently selects MG_j^n, j = lc(1), ..., lc(Na), and Vector Quantizes these values. If the minimum pitch value is 17 samples, the number of non-DC MG^n amplitudes is equal to 8 and for this reason Na ≤ 8. Two variations of the "Na-amplitudes system" were developed with equivalent performance and their block diagrams are depicted in Figure 33 (a) and (b) respectively.
  • This arrangement for the quantization of g or m extends the dynamic range of the coder to not less than 25dBs.
  • A is either "m” or "g”).
  • the block diagram of the adaptive μ-law quantiser is shown in Figure 34.
  • the second of the alternative magnitude spectrum representation techniques is referred to below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding systems, which employ the general synthesis formula of Equation (1) to recover speech, encounter the problem of coding a variable length, pitch dependent spectral amplitude vector MG.
  • the "Na- amplitudes" MG" quantisation schemes described in Figure 33 avoid this problem by Vector Quantising the minimum expected number of spectral amplitudes and by setting the rest of the MG" amplitudes to a fixed value.
  • a partially spectrally flat excitation model has limitations in providing high recovered speech quality.
  • the shape of the entire ⁇ MG" ⁇ magnitude spectrum should be quantised.
  • Various techniques have been proposed for coding ⁇ MG" ⁇ . Originally ADPCM has been used across the MG" values associated to a specific coding frame. Also ⁇ MG" ⁇ has been DCT transformed and coded differentially across successive MG" magnitude spectra.
  • the first VQ method involves the transformation of the input vector to a fixed size vector followed by conventional Vector Quantisation.
  • the inverse transformation on the quantised fixed size vector yields the recovered quantised MG" vector. Transformation techniques which have been used include, Linear Interpolation, Band Limited Interpolation, All Pole modelling and Non-Square transformation. However, the overall distortion produced by this approach is the summation of the VQ noise and a component, which is introduced by the transformation process.
  • the second VQ method achieves the direct quantisation of a variable input vector with a fixed size code vector. This is based on selecting only vs_n elements from each codebook vector, to form a distortion measure between a codebook vector and an input MG^n vector. Such a quantisation approach avoids the transformation distortion of the previous techniques mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation noise.
  • VQ Variable Size Spectral Vector Quantisation
  • Figure 35 highlights the VS/SVQ process.
  • Interpolation (in this case linear) is used on the S' vectors to yield S]f_ vectors of dimension vs n .
  • the S' to S ⁇ interpolation process is given by:
  • Amplitude vectors obtained from adjacent residual frames exhibit significant redundancy, which can be removed by means of backward prediction. Prediction is performed on a harmonic basis, i.e. the amplitude value of each harmonic MG_j^n is predicted from the amplitude value of the same harmonic in previous frames, i.e. MG_j^(n-1).
  • a fixed linear predictor MG b MG may be incorporated in the VS/SVQ system, and the resulting DPCM structure is shown in Figure 36 (differential VS/SVQ, (DVS/SVQ)).
  • E denotes the quantised error vector
  • the quantisation of the E_j^n, 1 ≤ j ≤ vs_n, error vector incorporates Mean Removal and Gain Shape Quantisation techniques, using the hierarchical VQ structure of Figure 36.
  • a weighted Mean Square Error is used in the VS/SVQ stage of the system.
  • W is normalised so that:
  • the pdf of the mean value of F is very broad and, as a result, the mean value differs widely from one vector to another.
  • This mean value can be regarded as statistically independent of the variation of the shape of the error vector E ⁇ and thus, can be quantised separately without paying a substantial penalty in compression efficiency.
  • the mean value of an error vector is calculated as follows:
  • M is Optimum Scalar Quantised to M̂ and is then removed from the original error vector to form Erm^n = (E^n − M̂).
  • the overall quantization distortion is attributed to the quantization of the "Mean Removed" error vectors ( Erm” ), which is performed by a Gain-Shape Vector Quantiser.
  • the objective of the Gain-Shape VQ process is to determine the gain value C and the shape vector S so as to minimise the distortion measure:
  • a gain optimised VQ search method similar to techniques used in CELP systems, is employed to find the optimum G and S_.
  • the shape Codebook (CBS) of vectors S is searched first to yield an index I, which maximises the quantity:
  • cbs is the number of codevectors in the CBS.
  • the optimum gain value is defined as:
  • each quantizer i.e. b_k, CBM_k, CBG_k, CBS_k
  • The performance of each quantizer (i.e. b_k, CBM_k, CBG_k, CBS_k) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.
  • Q_i denotes the cluster of Erm_k^n error vectors which are quantised to the S_i
  • cbs represents the total number of shape quantisation levels
  • J_n represents the CBG_k gain codebook index which encodes the Erm_k^n error vector and 1 ≤ j ≤ vs_n.
  • D_j denotes the cluster of Erm_k^n error vectors which are quantised to the G_j gain quantiser level
  • cbg represents the total number of gain quantisation levels
  • I_n represents the CBS_k shape codebook index which encodes the Erm_k^n error vector and 1 ≤ j ≤ vs_n.
  • Process VII calculates the energy of the residual signal.
  • the LPC analysis performed in Process II provides the prediction coefficients a_i, 1 ≤ i ≤ p, and the reflection coefficients k_i, 1 ≤ i ≤ p.
  • the Voiced/Unvoiced classification performed in Process I provides the short term autocorrelation coefficient for zero delay of the speech signal (RO) for the frame under consideration.
  • the Energy of the residual signal E Tha value is given as:
  • Equation (50) gives a good approximation of the residual signal energy with low computational requirements.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)
  • Aerials With Secondary Devices (AREA)
  • Optical Communication System (AREA)
  • Reduction Or Emphasis Of Bandwidth Of Signals (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)

Abstract

A speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.

Description

SPEECH SYNTHESIS SYSTEM
The present invention relates to speech synthesis systems, and in
particular to speech coding and synthesis systems which can be used in
speech communication systems operating at low bit rates.
Speech can be represented as a waveform the detailed structure of which
represents the characteristics of the vocal tract and vocal excitation of the person
producing the speech. If a speech communication system is to be capable of
providing an adequate perceived quality, the transmitted information must be
capable of representing that detailed structure. Most of the power in voiced
speech is at relatively low frequencies, for example below 2kHz. Accordingly
good quality speech synthesis can be achieved on the basis of speech waveforms
that have been low pass filtered to reject higher frequency components. The
perceived speech quality is however adversely affected if the frequency is
restricted much below 4kHz.
Many models have been suggested for defining the characteristics of
speech. The known models rely upon dividing a speech signal into blocks or
frames and deriving parameters to represent the characteristics of the speech
within each frame. Those parameters are then quantized and transmitted to a
receiver. At the receiver the quantization process is reversed to recover the
parameters, and a speech signal is then synthesised on the basis of the recovered
parameters. The common objective of the designers of the known models is to
minimise the volume of data which must be transmitted whilst maximising the
perceived quality of the speech that can be synthesised from the transmitted
data. In some of the models a distinction is made between whether or not a
particular frame is "voiced" or "unvoiced". In the case of voiced speech, speech
is produced by glottal excitation and as a result has a quasi-periodic structure.
Unvoiced speech is produced by turbulent air flow at a constriction and does not
have the "periodic" spectral structure characteristic of voiced speech. Most
models seek to take advantage of the fact that voiced speech signals evolve
relatively slowly in the context of frames the duration of which is typically 10 to
30msecs. Most models also rely upon quantization schemes intended to minimise
the amount of information which must be transmitted without significant loss of
perceived quality. As a result of the work done to date it is now possible to
produce speech synthesis systems capable of operating at bit rates of only a few
thousand bits per second.
One model which has been developed is known as "sinusoidal coding"
(R.J. McAulay and T.F. Quatieri, "Low Rate Speech Coding Based on Sinusoidal
Coding", Advances in Speech Signal Processing, Editors S. Furui and M.M.
Sondhi, Chapter 6, pp. 165-208, Marcel Dekker, New York, 1992). This
approach relies upon an FFT analysis of each input frame to produce a
magnitude spectrum, estimating the pitch period of the input frame from that
spectrum, and defining the amplitudes at the pitch related harmonics, the harmonics being multiples of the fundamental frequency of the frame. An error
measure is calculated in the time domain representing the difference between
harmonic and aharmonic speech spectra and that error measure is used to define
the degree of voicing of the input frame in terms of a frequency value. Thus the
parameters used to represent a frame are the pitch period, the magnitude and
phase values for each harmonic, and the frequency value. Proposals have been
made to operate this system such that phase information is predicted in a
coherent way across successive frames.
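By way of illustration, a minimal sketch (in Python, using NumPy) of this kind of harmonic reconstruction is given below: a frame is regenerated from a pitch period together with per-harmonic magnitudes and phases. The function and argument names are illustrative assumptions and do not come from the cited work.

```python
import numpy as np

def synthesise_harmonics(pitch_period, mags, phases, frame_len):
    """Sum of pitch-related harmonics: s(n) = sum_j A_j * cos(j*w0*n + phi_j).
    `pitch_period` is in samples; `mags` and `phases` hold one value per harmonic."""
    w0 = 2.0 * np.pi / pitch_period          # fundamental frequency in radians/sample
    n = np.arange(frame_len)
    frame = np.zeros(frame_len)
    for j, (a, phi) in enumerate(zip(mags, phases), start=1):
        frame += a * np.cos(j * w0 * n + phi)
    return frame
```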
In another system known as "multiband excitation coding" (D.W. Griffin
and J.S. Lim, "Multiband Excitation Vocoder" IEEE Transaction on Acoustics,
Speech and Signal Processing, vol. 36, pp 1223-1235, 1988 and Digital Voice
Systems Inc, "INMARSAT-M Voice Codec, Version 3.0", Voice Coding System
Description, Module 1, Appendix 1, August 1991) the amplitude and phase
functions are determined in a different way from that employed in sinusoidal
coding. The emphasis in this system is placed on dividing a spectrum into bands,
for example up to twelve bands, and evaluating the voiced/unvoiced nature of
each of these bands. Bands that are classified as unvoiced are synthesised using
random signals. Where the difference between the pitch estimates of successive
frames is relatively small, linear interpolation is used to define the required
amplitudes. The phase function is also defined using linear frequency
interpolation but in addition includes a constant displacement which is a random
variable and which depends on the number of unvoiced bands present in the short term spectrum of the input signal. The system works in a way to preserve
phase continuity between successive frames. When the pitch estimates of
successive frames are significantly different, a weighted summation of signals
produced from amplitudes and phases derived for successive frames is formed to
produce the synthesised signal.
Thus the common ground between the sinusoidal and multiband systems
referred to above is that both schemes directly model the input speech signal
which is DFT analysed, and both systems are at least partially based on the same
fundamental relationship for representing speech to be synthesised. The systems
differ however in terms of the way in which amplitudes and phase are estimated
and quantized, the way in which different interpolation methods are used to
define the necessary phase relationships, and the way in which "randomness" is
introduced in the recovered speech.
Various versions of the multiband excitation coding system have been
proposed, for example an enhanced multiband excitation speech coder (A. Das
and A. Gersho, "Variable-Dimension Spectral Coding of Speech at 2400 bps and
below with phonetic classification", IEEE Proc. ICASSP-95, pp. 492-495, May
1995) in which input frames are classified into four types, that is noise, unvoiced,
fully voiced and mixed voiced, and a variable dimension vector quantization
process for spectral magnitude is introduced, the bi-harmonic spectral modelling
system (C. Garcia-Matteo, J. L. Alba-Castro and Eduardo R. Banga, "Speech
Coding Using Bi-Harmonic Spectral Modelling", Proc. EUSIPCO-94, Edinburgh, Vol. 2, pp. 391-394, September 1994) in which the short term
magnitude spectrum is divided into two bands and a separate pitch frequency is
calculated for each band, the spectral excitation coding system (V. Cuperman, P.
Lupini and B. Bhattacharya, "Spectral Excitation Coding of Speech at 2.4 kb/s",
IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) which applies
sinusoidal based coding in the linear predictive coding (LPC) residual domain
where the synthesised residual signal is the summation of pitch harmonic
oscillators with appropriate amplitude and phase functions and amplitudes are
quantized using a non-square transformation, the band-widened harmonic
vocoder (G. Yang, G. Zanellato and H. Leich, "Band Widened Harmonic Vocoder
at 2 to 4 kbps", IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) in which
randomness in the signal is introduced by adding jitter to the amplitude
information on a per band basis, pitch synchronous multiband coding (H. Yang,
S. N. Koh and P. Sivaprakasapilai, "Pitch Synchronous Multi-Band (PSMB)
Speech Coding", IEEE Proc. 1CASSP-95, pp. 516-519, Detroit, May 1995) in
which a CELP (code-excited linear prediction) based coding scheme is used to
encode speech period segments, multi band LPC coding (S. Yeldener, M. Kondoz
and G. Evans, "High Quality Multiband LPC Coding of Speech at 2.4 kbits/s",
Electronics Letters, Vol. 27, No. 14, pp. 1287-1289, 4th July 1991) in which a single
amplitude value is allocated to each frame to in effect specify a "flat" residual
spectrum, and harmonic and noise coding (M. Nishiguchi and J. Matsumoto,
"Harmonic and Noise Coding of LPC Residuals with Classified Vector Quantisation", IEEE Proc. ICASSP-95, pp. 484-487, Detroit, May 1995) with
classified vector quantization which operates in the LPC residual domain, an
input signal being classified as voiced or unvoiced and being full band modelled.
A further type of coding system exists, that is the prototype interpolation
coding system. This relies upon the use of pitch period segments or prototypes
which are spaced apart in time and reiteration/interpolation techniques to
synthesise the signal between two prototypes. Such a system was described as
early as 1971 (J.S. Severwight, "Interpolation Reiterations Techniques for
Efficient Speech Transmission", Ph.D. Thesis, Loughborough University,
Department of Electrical Engineering, 1971). More sophisticated systems of the
same general class have been described more recently, for example in the paper
by W.B. Kleijn, "Continuous Representations in Linear Predictive Coding", Proc.
ICASSP-91, pp. 201-204, May 1991. The same author has published a series of
related papers. The system employs 20msecs coding frames which are classified
as voiced or unvoiced. Unvoiced frames are effectively CELP coded. Pitch
prototype segments are defined in adjacent voiced frames, in the LPC residual
signal, in a way which ensures maximum alignment (correlation) of the
prototypes and defines the prototype so that the main pitch excitation pulse is
not near to either of the ends of the prototype. A pitch period in a given frame is
considered to be a cycle of an artificial periodic signal from which the prototype
for the frame is obtained. The prototypes which have been appropriately selected from adjacent frames are Fourier transformed and the resulting
coefficients are coded using a differential vector quantization scheme.
With this scheme, during synthesis of voiced frames, the decoded
prototype Fourier representations for adjacent frames are used to reconstruct
the missing signal waveform between the two prototype segments using linear
interpolation. Thus the residual signal is obtained which is then presented to an
LPC synthesis filter the output of which provides the synthesised voiced speech
signal. An amount of randomness can be introduced into voiced speech by
injecting noise at frequencies larger than 2kHz, the amplitude of the noise
increasing with frequency. In addition, the periodicity of synthesised voiced
speech is controlled during the quantization of prototype parameters in
accordance with a long term signal to change ratio measure that reflects the
similarity which exists between the prototypes of adjacent frames in the residual
excitation signal.
The known prototype interpolation coding systems rely upon a Fourier
Series synthesis equation which involves a linear-with-time-interpolation process.
The assumption is that the pitch estimates for successive frames are linearly
interpolated to provide a pitch function and an associated instant fundamental
frequency. The instant phase used in the cosine and sine terms of the Fourier
series synthesis equation is the integral of the instantaneous harmonic
frequencies. This synthesis arrangement allows for the linear evolution of the instantaneous pitch and the non-linear evolution of the instantaneous harmonic
frequencies.
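The sketch below illustrates this interpolation idea under stated assumptions (it is not a transcription of any cited coder): the fundamental frequency is interpolated linearly between the pitch estimates of two adjacent frames, and each harmonic phase is obtained as a running sum, i.e. a discrete integral, of its instantaneous frequency. The function and parameter names are assumptions made for the example.

```python
import numpy as np

def interpolate_excitation(p_prev, p_curr, mags, n_samples):
    """Synthesise one interpolation interval.  `p_prev` and `p_curr` are the
    pitch periods (in samples) of the previous and current frames; `mags`
    holds the harmonic amplitudes used over the interval."""
    i = np.arange(n_samples)
    # fundamental frequency in cycles/sample, interpolated linearly over the interval
    f0 = (1.0 / p_prev) + (i / max(n_samples - 1, 1)) * (1.0 / p_curr - 1.0 / p_prev)
    out = np.zeros(n_samples)
    for j, a in enumerate(mags, start=1):
        inst_freq = j * f0                          # jth instantaneous harmonic frequency
        phase = 2.0 * np.pi * np.cumsum(inst_freq)  # integral of the instantaneous frequency
        out += a * np.cos(phase)
    return out
```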
A development of this system is described by W.B. Kleijn and J. Haagen,
"A Speech Coder Based on Decomposition of Characteristic Waveforms", Proc.
ICASSP-95, pp. 508-511, Detroit, May 1995. In the described system the Fourier
series coefficients are low pass filtered over time, with a cut-off frequency of
20Hz, to provide a "slowly evolving" waveform component for the LPC
excitation signal. The difference between this low pass component and the
original parameters provides the "rapidly evolving" components of the excitation
signal. Periodic voice excitation signals are mainly represented by the "slowly
evolving" component, whereas random unvoiced excitation signals are
represented by the "rapidly evolving" component in this dual decomposition of
the Fourier series coefficients. This removes effectively the need for treating
voiced and unvoiced frames separately. Furthermore, the rate of quantization and transmission of the two components is different. The "slowly evolving"
signal is sampled at relatively long intervals of 25msecs, but the parameters are
quantized quite accurately on the basis of spectral magnitude information. In
contrast, the spectral magnitude of the "rapidly evolving" signal is sampled
frequently, every 4msecs, but is quantized less accurately. Phase information is
randomised every 2msecs.
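A rough sketch of the decomposition idea follows: per-harmonic parameter tracks, sampled regularly in time, are low-pass filtered to give the slowly evolving component, and the remainder forms the rapidly evolving component. The track sampling rate, the filter order and the use of SciPy's Butterworth design are assumptions made for the example; the cited work specifies only the 20Hz cut-off.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def decompose_tracks(amp_tracks, track_rate_hz=400.0, cutoff_hz=20.0):
    """Split per-harmonic parameter tracks (rows = harmonics, columns = time)
    into a slowly evolving part (low-pass filtered along time) and the
    rapidly evolving remainder."""
    b, a = butter(2, cutoff_hz / (track_rate_hz / 2.0))  # 2nd-order, normalised cut-off
    slow = filtfilt(b, a, amp_tracks, axis=1)            # zero-phase low-pass over time
    rapid = amp_tracks - slow
    return slow, rapid
```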
Other developments of the prototype interpolation coding system have
been proposed. For example one known system operates on 5msec frames, a pitch period being selected for voiced frames and DFT transformed to yield
prototype spectral magnitude values. These values are quantized and the
quantized values for adjacent frames are linearly interpolated. Phase
information is defined in a manner which does not satisfy any frequency
restrictions at the interpolation boundaries. This causes problems of
discontinuity at frame boundaries. At the receiver the excitation signal is
synthesised using a decoded magnitude and estimated phase values, via an
inverse DFT process. The resulting signal is filtered by a following LPC
synthesis filter. This model is purely periodic during voiced speech, and this is
why a very short duration frame is used. Unvoiced speech is CELP coded.
The wide range of speech synthesis models currently being proposed, only
some of which are described above, and the range of alternative approaches
proposed to implement those models, indicates the interest in such systems and
the lack of any consensus as to which system provides the most advantageous
performance.
It is an object of the present invention to provide an improved low bit rate
speech synthesis system.
In known systems in which it is necessary to obtain an estimate of the
pitch of a frame of a speech signal, it has been thought necessary, if high quality
of synthesised speech is to be achieved, to obtain high resolution non-integer
pitch period estimates. This requires complex processes, and it would be highly desirable to reduce the complexity of the pitch estimation process in a manner
which did not result in degraded quality.
According to a first aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, and each frame is converted into a coded signal including a
voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered
speech segment centred about a reference sample is defined in each frame, a
correlation value is calculated for each of a series of candidate pitch estimates as
the maximum of multiple crosscorrelation values obtained from variable length
speech segments centred about the reference sample, the correlation values are
used to form a correlation function defining peaks, and the locations of the
peaks are determined and used to define a pitch estimate.
The result of the above system is that an integer pitch period value is
obtained. The system avoids undue complexity and may be readily implemented.
Preferably the pitch estimate is defined using an iterative process. A
single reference sample may be used, for example centred with respect to the
respective frame, or alternatively multiple pitch estimates may be derived for
each frame using different reference samples, the multiple pitch estimates being
combined to define a combined pitch estimate for the frame. The pitch estimate
may be modified by reference to a voiced/unvoiced status and/or pitch estimates
of adjacent frames to define a final pitch estimate. The correlation function may be clipped using a threshold value,
remaining peaks being rejected if they are adjacent to larger peaks. Peaks are
initially selected and can be rejected if they are smaller than a following peak by
more than a predetermined factor, for example smaller than 0.9 times the
following peak.
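The clipping and peak-rejection steps just described can be pictured with the following sketch (Python; the names are illustrative, and the threshold is simply supplied by the caller rather than derived as in the full system):

```python
import numpy as np

def pick_peaks(cr, threshold, reject_factor=0.9):
    """Clip the correlation function, keep local maxima of the clipped function,
    and drop a peak that is smaller than `reject_factor` times the following peak."""
    clipped = np.where(cr >= threshold, cr, 0.0)
    locs = [d for d in range(1, len(clipped) - 1)
            if clipped[d] > clipped[d - 1] and clipped[d] > clipped[d + 1]]
    kept = []
    for k, d in enumerate(locs):
        nxt = locs[k + 1] if k + 1 < len(locs) else None
        if nxt is not None and clipped[d] < reject_factor * clipped[nxt]:
            continue                 # rejected: much smaller than the following peak
        kept.append(d)
    return kept
```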
Preferably the pitch estimation procedure is based on a least squares
error algorithm. Preferably the algorithm defines the pitch as a number whose
multiples best fit the correlation function peak locations. Initial possible pitch
values may be limited to integral numbers which are not consecutive, the
increment between two successive numbers being proportional to a constant
multiplied by the lower of those two numbers.
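A minimal sketch of this least squares fit is given below. The candidate list reproduces, approximately, the non-consecutive values quoted later in the description (21, 23, 25, ..., 134), and the error measure, which scores how well integer multiples of a candidate fit the measured peak locations, is an illustrative choice rather than the exact iterative procedure used by the system.

```python
import numpy as np

def candidate_pitches(p_min=21, p_max=147, step_factor=0.1):
    """Non-consecutive integer candidates: each increment is proportional to
    the current value (step_factor times the lower of the two numbers)."""
    cands, p = [], p_min
    while p <= p_max:
        cands.append(p)
        p += max(1, int(round(step_factor * p)))
    return cands

def lse_pitch(peak_locs, candidates):
    """Return the candidate whose integer multiples best fit the correlation
    peak locations in the least-squares sense."""
    best_p, best_err = 0, np.inf
    for p in candidates:
        err = sum((loc - p * max(1, round(loc / p))) ** 2 for loc in peak_locs)
        if err < best_err:
            best_p, best_err = p, err
    return best_p
```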
It is well known from the prior art to classify individual frames as voiced
or unvoiced and to process those frames in accordance with that classification.
Unfortunately such a simple classification process does not accurately reflect the
true characteristics of speech. It is often the case that individual frames are made up of both periodic (voiced) and aperiodic (unvoiced) components. Prior
attempts to address this problem have not proved particularly effective.
It is an object of the present invention to provide an improved voiced or
unvoiced classification system.
According to a second aspect of the present invention there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, and each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed
voiced classification which classifies harmonics in the magnitude spectrum of
voiced frames as strongly voiced or weakly voiced, wherein a series of samples
centred on the middle of the frame are windowed to form a data array which is
Fourier transformed to produce a magnitude spectrum, a threshold value is
calculated and used to clip the magnitude spectrum, the clipped data is searched
to define peaks, the locations of peaks are determined, constraints are applied to
define dominant peaks, and harmonics not associated with a dominant peak are
classified as weakly voiced.
Peaks may be located using a second order polynomial. The samples may
be Hamming windowed. The threshold value may be calculated by identifying
the maximum and minimum magnitude spectrum values and defining the
threshold as a constant multiplied by the difference between the maximum and
minimum values. Peaks may be defined as those values which are greater than
the two adjacent values. A peak may be rejected from consideration if
neighbouring peaks are of a similar magnitude, e.g. more than 80% of the
magnitude, or if there are spectral magnitudes in the same range of greater
magnitudes. A harmonic may be considered as not being associated with a
dominant peak if the difference between two adjacent peaks is greater than a
predetermined threshold value.
The spectrum may be divided into bands of fixed width and a
strongly/weakly voiced classification assigned for each band. Alternatively the frequency range may be divided into two or more bands of variable width,
adjacent bands being separated at a frequency selected by reference to the
strongly/weakly voiced classification of harmonics.
Thus, the spectrum may be divided into fixed bands, for example fixed
bands each of 500Hz, or variable width bands selected in dependence upon the
strongly/weakly voiced status of harmonic components of the excitation signal. A
strongly/weakly voiced classification is then assigned to each band. The lowest
frequency band, e.g. 0-500Hz, may always be regarded as strongly voiced,
whereas the highest frequency band, for example 3500Hz to 4000Hz, may always
be regarded as weakly voiced. In the event that a current frame is voiced, and
the previous frame is unvoiced, other bands within the current frame, e.g.
3000Hz to 3500Hz may be automatically classified as weakly voiced. Generally
the strongly/weakly voiced classification may be determined using a majority
decision rule on the strongly/weakly voiced classification of those harmonics
which fall within the band in question. If there is no majority, alternate bands may be alternately assigned strongly voiced and weakly voiced classifications.
Given the classification of a voiced frame such that harmonics are
classified as either strongly or weakly voiced, it is necessary to generate an
excitation signal to recover the speech signal which takes into account this classification. It is an object of the present invention to provide such a system.
According to a third aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of frames, each frame is defined as voiced or unvoiced, each frame is converted into
a coded signal including a pitch period value, a frame voiced/unvoiced
classification and, for each voiced frame, a mixed voiced spectral band
classification which classifies harmonics within spectral bands as either strongly
or weakly voiced, and the speech signal is reconstructed by generating an
excitation signal in respect of each frame and applying the excitation signal to a
filter, wherein for each weakly voiced spectral band, an excitation signal is
generated which includes a random component in the form of a function which is
dependent upon the respective pitch period value.
Thus for each frame which has a spectral band that is classified as weakly
voiced, the excitation signal is represented by a function which includes a first
harmonic frequency component, the frequency of which is dependent upon the
pitch period value appropriate to that frame, and a second random component
which is superimposed upon the first component.
The random component may be introduced by reducing the amplitude of
harmonic oscillators assigned the weakly voiced classification, for example by
reducing the power of the harmonics by 50%, while disturbing the oscillator
frequencies, for example by shifting the oscillators randomly in frequency in the
range of 0 to 30 Hz such that the frequency is no longer a multiple of the
fundamental frequency, and then adding further random signals. The phase of
the oscillators producing random signals may be randomised at pitch intervals. Thus for a weakly voiced band, some periodicity remains but the power of the
periodic component is reduced and then combined with a random component.
In a speech synthesis system in which a speech signal is represented in
part by spectral information in the form of harmonic magnitude values, it is
possible to process an input speech signal to produce a series of spectral
magnitude values and then to use all of those magnitude values at harmonic
locations in subsequent processing steps. In many circumstances however at
least some of the magnitude values contain little information which is useful in
the recovery of the input speech signal. Accordingly when magnitude values are
quantized for transmission to a receiver it is sensible to discard magnitude values
which contain little useful information.
In one known system an input speech signal is processed to produce an
LPC residual signal which in turn is processed to provide harmonic magnitude
values, but only a fixed number of those magnitude values is vector quantized for
transmission to a receiver. The discarded magnitude values are represented at
the receiver as identical constant values. This known system reduces
redundancy but is inflexible in that the locations of the fixed number of
magnitude values to be quantized are always the same and predetermined on the
basis of assumptions that may be inappropriate in particular circumstances.
It is an object of the present invention to provide an improved magnitude
value quantization system. According to a fourth aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, and each voiced frame is converted into a coded signal including a pitch
period value, LPC coefficients, and pitch segment spectral magnitude
information, wherein the spectral magnitude information is quantized by
sampling the LPC short term magnitude spectrum at harmonic frequencies, the
locations of the largest spectral samples are determined to identify which of the
magnitudes are relatively more important for accurate quantization, and the
magnitudes so identified are selected and vector quantized.
Thus rather than relying upon a simple location selection strategy of a
fixed number of magnitude values for quantization and transmission, for
example the "low part" of the magnitude spectrum, the invention selects only
those values which make a significant contribution according to the subjectively
important LPC magnitude spectrum, thereby reducing redundancy without
compromising quality.
In one arrangement in accordance with the invention a pitch segment of
Pn LPC residual samples is obtained, where Pn is the pitch period value of the
nth frame, the pitch segment is DFT transformed, the mean value of the
resultant spectral magnitudes is calculated, the mean value is quantized and used
as a normalisation factor for the selected magnitudes, and the resulting
normalised amplitudes are quantized. Alternatively, the RMS value of the pitch segment is calculated, the RMS
value is quantized and used as a normalisation factor for the selected
magnitudes, and the resulting normalised amplitudes are quantized.
At the receiver, the selected magnitudes are recovered, and each of the
other magnitude values is reproduced as a constant value.
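As an informal illustration of the selection and normalisation steps described above, the following Python sketch derives the harmonic magnitudes of a pitch segment, samples the LPC short-term magnitude spectrum at the harmonic frequencies to find the perceptually more important positions, and normalises the selected magnitudes by their mean value before quantization. The function name, the number of selected magnitudes and the A(z) sign convention are illustrative assumptions.

import numpy as np

def select_and_normalise(residual_pitch_segment, lpc_coeffs, n_select=8):
    """Sketch of the magnitude-selection idea; n_select is illustrative."""
    P = len(residual_pitch_segment)
    # DFT of the pitch segment: one magnitude per harmonic (DC dropped).
    mags = np.abs(np.fft.rfft(residual_pitch_segment))[1:]
    # Sample the LPC short-term magnitude spectrum at the harmonic frequencies.
    w = 2.0 * np.pi * np.arange(1, len(mags) + 1) / P        # harmonic frequencies (rad/sample)
    a = np.asarray(lpc_coeffs)                               # assumes A(z) = 1 - sum a_k z^-k
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(w, k)) @ a
    lpc_env = 1.0 / np.abs(A)
    # Keep the harmonics where the LPC envelope is largest.
    idx = np.argsort(lpc_env)[::-1][:n_select]
    # Normalise the selected magnitudes by the mean value before vector quantization.
    mean_val = np.mean(mags)
    return idx, mags[idx] / mean_val, mean_val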
Interpolation coding systems which employ a pitch-related synthesis
formula to recover speech generally encounter the problem of coding a variable
length, pitch dependant spectral amplitude vector. The quantization scheme
referred to above in which only the magnitudes of relatively greater importance
are quantized avoids this problem by quantizing only a fixed number of
magnitude values and setting the rest of the magnitude values to a constant
value. Thus at the receiver a fixed length vector can be recovered. Such a
solution to the problem however may result in a relatively spectrally flat
excitation model which has limitations in providing high recovered speech
quality.
In an ideal world output speech quality would be maximised by
quantizing the entire shape of the magnitude spectrum, and various approaches
have been proposed for coding the entire magnitude spectrum. In one approach,
the spectrum is DFT transformed and coded differentially across successive
spectra. This and similar coding schemes are rather inefficient however and
operate with relatively high bit rates. The introduction of vector quantization allowed for the development of sinusoidal and prototype interpolation systems
which operate at lower bit rates, typically around 2.4Kbits/sec.
Two vector quantization methodologies have been reported which
quantize a variable size input vector with a fixed size code vector. In a first
approach, the input vector is transformed to a fixed size vector which is then
conventionally vector quantized. An inverse transform of the quantized fixed
size vector yields the recovered quantized vector. Transformation techniques
which have been used include linear interpolation, band limited interpolation, all
pole modelling and non-square transformation. This approach however
produces an overall distortion which is the summation of the vector quantization
noise and a component which is introduced by the transformation process. In a
second known approach, a variable input vector is directly quantized with a
fixed size code vector. This approach is based on selecting only a limited number
of elements from each codebook vector to form a distortion measure between a
codebook vector and an input vector. Such a quantization approach avoids the
transformation distortion of the alternative technique mentioned above and
results in an overall distortion that is equal to the vector quantization noise, but
this noise is nevertheless significant.
It is an object of the present invention to provide an improved variable
sized spectral vector quantization scheme.
According to a fifth aspect of the present invention, there is provided a
speech synthesis system in which a variable size input vector of coefficients to be transmitted to a receiver for the reconstruction of a speech signal is vector
quantized using a codebook defined by vectors of fixed size, the codebook vectors
of fixed size are obtained from variable size training vectors and an interpolation
technique which is an integral part of the codebook generation process, codebook
vectors are compared to the variable sized input vector using the interpolation
process, and an index associated with the codebook entry with the smallest
difference from the comparison is transmitted, the index being used to address a
further codebook at the receiver and thereby derive an associated fixed size
codebook vector, and the interpolation process being used to recover from the
derived fixed sized codebook vector an approximation of the variable sized input
vector.
The invention is applicable in particular to pitch synchronous low bit rate
coders of the type described in this document and takes advantage of the
underlying principle of such coders which means that the shape of the magnitude
spectrum is represented by a relatively small number of equally spaced samples.
Preferably the interpolation process is linear. For an input vector of given dimension, the interpolation process is applied to produce from the
codebook vectors a set of vectors of that given dimension. A distortion measure
is then derived to compare the interpolated set of vectors and the input vector
and the codebook vector which yields the minimum distortion is selected.
Preferably the dimension of the input vectors is reduced by taking into
account only the harmonic amplitudes within the input bandwidth range, for example 0 to 3.4kHz. Preferably the remaining amplitudes i.e. in the region of
3.4kHz to 4 kHz are set to a constant value. Preferably, the constant value is
equal to the mean value of the quantized amplitudes.
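By way of example only, the following Python sketch shows the interpolation-based comparison described above: each fixed-size codebook vector is linearly interpolated to the dimension of the variable-size input vector, the closest entry is selected, and at the receiver the addressed entry is interpolated back to the required dimension. A plain squared-error distortion is assumed for the comparison.

import numpy as np

def quantise_variable_vector(x, codebook):
    """Return the index of the fixed-size codebook vector which, after
    linear interpolation to the dimension of x, is closest to x."""
    x = np.asarray(x, dtype=float)
    best_idx, best_err = -1, np.inf
    for idx, cv in enumerate(codebook):
        src = np.linspace(0.0, 1.0, num=len(cv))
        dst = np.linspace(0.0, 1.0, num=len(x))
        cv_n = np.interp(dst, src, cv)           # interpolate to the input dimension
        err = float(np.sum((x - cv_n) ** 2))
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx

def decode_variable_vector(index, codebook, n):
    """Receiver side: interpolate the addressed entry to the pitch
    dependent dimension n."""
    cv = codebook[index]
    src = np.linspace(0.0, 1.0, num=len(cv))
    dst = np.linspace(0.0, 1.0, num=n)
    return np.interp(dst, src, cv)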
Amplitude vectors obtained from adjacent residual frames exhibit
significant amounts of redundancy which can be removed by means of backward
prediction. The backward prediction may be performed on a harmonic basis
such that the amplitude value of each harmonic of one frame is predicted from the amplitude value of the same harmonic in the previous frame or frames. A
fixed linear predictor may be incorporated in the system, together with mean
removal and gain shape quantization processes which operate on a resulting
error magnitude vector.
Although the above described variable sized vector quantization scheme
provides advantageous characteristics, and in particular provides for good
perceived signal quality at a bit rate of for example 2.4Kbits/sec, in some
environments a lower bit rate would be highly desirable even at the loss of some
quality. It would be possible for example to rely upon a single value
representation and quantization strategy on the assumption that the magnitude
spectrum of the pitch segment in the residual domain has an approximately flat
shape. Unfortunately systems based on this assumption have a rather poor
decoded speech quality.
It is an object of the present invention to overcome the above limitation in
lower bit rate systems. According to a sixth aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, each frame is converted into a coded signal including an estimated pitch
period, an estimate of the energy of a speech segment the duration of which is a
function of the estimated pitch period, and LPC filter coefficients defining an
LPC spectral envelope, and a speech signal of related power to the power of the
input speech signal is reconstructed by generating an excitation signal using
spectral amplitudes which are defined from a modified LPC spectral envelope
sampled at the harmonic frequencies defined by the pitch period.
Thus, although a single value is used to represent the spectral envelope of
the excitation signal, the excitation spectral envelope is shaped according to the
LPC spectral envelope. The result is a system which is capable of delivering high
quality speech at 1.5Kbits/sec. The invention is based on the observation that
some of the speech spectrum resonance and anti-resonance information is also
present in the residual magnitude spectrum, since LPC inverse filtering cannot
produce a residual signal of absolutely flat magnitude spectrum. As a
consequence, the LPC residual signal is itself highly intelligible.
The magnitude values may be obtained by spectrally sampling a modified
LPC synthesis filter characteristic at the harmonic locations related to the pitch
period. The modified LPC synthesis filter may have reduced feedback gain and
a frequency response which consists of equalised resonant peaks, the locations of
which are close to the LPC synthesis resonant locations. The value of the feedback gain may be controlled by the performance of the LPC model such that it is
for example proportional to the normalised LPC prediction error. The energy of
the reproduced speech signal may be equal to the energy of the original speech
waveform.
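As a rough illustration of the sampling step described above, the following Python sketch evaluates a modified LPC synthesis envelope at the harmonic frequencies defined by the pitch period. A single fixed gain g stands in for the reduced feedback gain; in the description above that gain would be controlled by the LPC model performance, so both its value and the exact form of the modification are assumptions made for this example.

import numpy as np

def excitation_magnitudes(lpc_coeffs, pitch_period, g=0.5):
    """Sample a modified LPC synthesis envelope at the harmonics 2*pi*j/P.
    g (< 1) is an illustrative stand-in for the reduced feedback gain."""
    a = np.asarray(lpc_coeffs)                   # assumes A(z) = 1 - sum a_k z^-k
    k = np.arange(1, len(a) + 1)
    n_harm = pitch_period // 2                   # harmonics below pi
    w = 2.0 * np.pi * np.arange(1, n_harm + 1) / pitch_period
    A = 1.0 - np.exp(-1j * np.outer(w, k)) @ (g * a)   # feedback gain reduced by g
    return 1.0 / np.abs(A)                       # magnitudes assigned to the harmonics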
It is well known that in prototype interpolation coding speech synthesis
systems there are often substantial similarities between the prototypes of
adjacent frames in the residual excitation signals. This has been used in various
systems to improve perceived speech quality by ensuring that there is a smooth
evolution of the speech signal over time.
It is an object of the present invention to provide an improved speech
synthesis system in which the excitation and vocal tract dynamics are
substantially preserved in the recovered speech signal.
According to a seventh aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, each frame is converted into a coded signal including LPC filter
coefficients and at least one parameter associated with a pitch segment
magnitude, and the speech signal is reconstructed by generating two excitation
signals in respect of each frame, each pair of excitation signals comprising a first
excitation signal generated on the basis of the pitch segment magnitude
parameter or parameters of one frame and a second excitation signal generated
on the basis of the pitch segment magnitude parameter or parameters of a
second frame which follows and is adjacent to the said one frame, applying the first excitation signal to a first LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said one frame and applying the
second excitation signal to a second LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said second frame, and weighting
and combining the outputs of the first and second LPC filters to produce one
frame of a synthesised speech signal.
Preferably the first and second excitation signals include the same phase
function and different phase contributions from the two LPC filters involved in
the above double synthesis process. This reduces the degree of pitch periodicity
in the recovered signals. This and the combination of the first and second LPC
filter outputs ensures an effective smooth evolution of the speech spectral
envelope on a sample by sample basis.
Preferably the outputs of the first and second LPC filters are weighted by
half a window function such as a Hamming window such that the magnitude of
the output of the first filter is decreasing with time and the magnitude of the
output of the second filter is increasing with time.
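The weighting and combination of the two filter outputs can be sketched as follows, assuming a raised-cosine half window as a stand-in for the half Hamming window mentioned above and assuming that the LPC denominators are supplied in the form [1, -a_1, ..., -a_p]; the generation of the two excitation signals themselves is omitted.

import numpy as np
from scipy.signal import lfilter

def combine_double_synthesis(exc_prev, exc_curr, a_prev, a_curr):
    """Filter each excitation through its own LPC synthesis filter 1/A(z)
    and overlap-add the outputs with complementary half windows."""
    M = len(exc_prev)
    x_prev = lfilter([1.0], a_prev, exc_prev)    # output of the first LPC filter
    x_curr = lfilter([1.0], a_curr, exc_curr)    # output of the second LPC filter
    i = np.arange(M)
    w_down = 0.5 + 0.5 * np.cos(np.pi * i / M)   # decreasing half window
    w_up = 1.0 - w_down                          # increasing half window
    return x_prev * w_down + x_curr * w_up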
According to an eighth aspect of the present invention, there is provided a
speech coding system which operates on a frame by frame basis, and in which
information is transmitted which represents each frame as either voiced or
unvoiced and, for each voiced frame, represents that frame by a pitch period
value, quantized magnitude spectral information, and LPC filter coefficients, the
received pitch period value and magnitude spectral information being used to generate residual signals at the receiver which are applied to LPC speech
synthesis filters the characteristics of which are determined by the transmitted
filter coefficients, wherein each residual signal is synthesised according to a
sinusoidal mixed excitation synthesis process, and a recovered speech signal is
derived from the residual signals.
Embodiments of the present invention will now be described, by way of
example, with reference to the accompanying drawings, in which:
Figure 1 is a general block diagram of the encoding process in accordance with the present invention;
Figure 2 illustrates the relationship between coding and matrix
quantisation frames;
Figure 3 is a general block diagram of the decoding process;
Figure 4 is a block diagram of the excitation synthesis process;
Figure 5 is a schematic diagram of the overlap and add process;
Figure 6 is a schematic diagram of the calculation of an instantaneous
scaling factor;
Figure 7 is a block diagram of the overall voiced/unvoiced classification
and pitch estimation process;
Figure 8 is a block diagram of the pitch estimation process;
Figure 9 is a schematic diagram of two speech segments which participate
in the calculation of a crosscorrelation function value;
Figure 10 is a schematic diagram of speech segments used in the calculation of the crosscorrelation function value;
Figure 11 represents the value allocated to a parameter used in the
calculation of the crosscorrelation function value for different delays;
Figure 12 is a block diagram of the process used for calculating the
crosscorrelation function and the selection of its peaks;
Figure 13 is a flow chart of a pitch estimation algorithm;
Figure 14 is a flow chart of a procedure used in the pitch estimation
process;
Figure 15 is a flow chart of a further procedure used in the pitch
estimation process;
Figure 16 is a flow chart of a further procedure used in the pitch
estimation process;
Figure 17 is a flow chart of a threshold value selection procedure;
Figure 18 is a flow chart of the voiced/unvoiced classification process;
Figure 19 is a schematic diagram of the voiced/unvoiced classification
process with respect to parameters generated during the pitch estimation
process;
Figure 20 is a flow chart of the procedure used to determine offset values;
Figure 21 is a flow chart of the pitch estimation algorithm;
Figure 22 is a flow chart of a procedure used to impose constraints on
output pitch estimates to ensure smooth evolution of pitch values with time;
Figures 23, 24 and 25 represent different portions of a flow chart of a
pitch post processing procedure;
Figure 26 is a general block diagram of the LPC analysis and LPC
quantisation process;
Figure 27 is a general flow chart of a strongly or weakly voiced
classification process;
Figure 28 is a flow chart of the procedure responsible for the
strongly/weakly voiced classification;
Figure 29 represents a speech waveform obtained from a particular
speech utterance;
Figure 30 shows frequency tracks obtained for the speech utterance of
Figure 29;
Figure 31 shows to a larger scale a portion of Figure 30 and represents the
difference between strongly and weakly voiced classifications;
Figure 32 shows a magnitude spectrum of a particular speech segment
and the corresponding LPC spectral envelope and the normalised short term
magnitude spectra of the corresponding residual segment, excitation segment
obtained using a binary excitation model and an excitation segment obtained
using the strongly/weakly voiced model;
Figure 33 is a general block diagram of a system for representing and
quantising magnitude information;
Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33;
Figure 35 is a general block diagram of a quantisation process;
Figure 36 is a general block diagram of a differential variable size
spectral vector quantiser; and
Figure 37 represents the hierarchical structure of a mean gain shape
quantiser.
A system in accordance with the present invention is described below, firstly in general terms and then in greater detail. The system operates on an LPC residual signal on a frame by frame basis.
Speech is synthesised using the following general expression:

s(i) = Σ_{k=0} A_k(i) cos(ϑ_k(i) + φ_k)   (1)

where i is the sampling instant and A_k(i) represents the amplitude value of the kth cosine term cos(Θ_k(i)) (with Θ_k(i) = ϑ_k(i) + φ_k) as a function of i. In voiced speech ϑ_k(i) depends on the pitch frequency of the signal.
A voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to be handled in different ways. Unvoiced frames are modelled in terms of an RMS value and a random time series. In voiced frames a pitch period estimate is obtained and used to define a pitch segment which is centred at the middle of the frame. Pitch segments from adjacent frames are DFT transformed and only the resulting pitch segment magnitude information is coded and transmitted. Furthermore, pitch segment magnitude samples are classified as strongly or weakly voiced. Thus, the information which is transmitted for every voiced frame is, in addition to the voiced/unvoiced information, the pitch period value, the magnitude spectral information of the pitch segment, the strongly/weakly voiced classification of the pitch magnitude spectral values, and the LPC filter coefficients.
At the receiver a synthesis process, that includes interpolation, is used to reconstruct the waveform between the middle points of the current (n+1)th and previous nth frames. The basic synthesis equation for the residual signal is:

Res(i) = Σ_{j=0}^{K} MG_j cos(phase_j(i))   (2)
where MG_j are decoded pitch segment magnitude values and phase_j(i) is calculated from the integral of the linearly interpolated instantaneous harmonic frequencies ω_j(i). K is the largest value of j for which ω_j(i) < π.
In the transitions from unvoiced to voiced, the initial phase for each harmonic is set to zero. Phase continuity is preserved across the boundaries of successive interpolation intervals.
The synthesis process is performed twice however, once using the magnitude spectral values MG_j^{n+1} of the pitch segment derived from the current (n+1)th frame and again using the magnitude values MG_j^n of the pitch segment derived in the previous nth frame. The phase function phase_j(i) in each case remains the same. The resulting residual signals Res_n(i) and Res_{n+1}(i) are used as inputs to corresponding LPC synthesis filters calculated for the nth and (n+1)th speech frames. The two LPC synthesised speech waveforms are then weighted by W_{n+1}(i) and W_n(i) to yield the recovered speech signal.
Thus the overall synthesis process, for successive voiced frames, can be described by:
S(i) = W_n(i) Σ_{j=0}^{K} H^n(ω_j^n(i)) MG_j^n cos[phase_j^n(i) + φ^n(ω_j^n(i))]
       + W_{n+1}(i) Σ_{j=0}^{K} H^{n+1}(ω_j^n(i)) MG_j^{n+1} cos[phase_j^n(i) + φ^{n+1}(ω_j^n(i))]   (3)

where H^n(ω_j^n(i)) is the frequency response of the nth frame LPC synthesis filter calculated at the ω_j^n(i) harmonic frequency function at the ith instant, and φ^n(ω_j^n(i)) is the associated phase response of this filter. ω_j^n(i) and phase_j^n(i) are the frequency and phase functions defined for the sampling instants i, with i covering the middle of the nth frame to the middle of the (n+1)th frame segments. K is the largest value of j for which ω_j^n(i) < π. The above speech synthesis process introduces two "phase dispersion" terms, i.e. φ^n(ω_j^n(i)) and φ^{n+1}(ω_j^n(i)), which effectively reduce the degree of pitch periodicity in the recovered signal. In addition, this "double synthesis" arrangement followed by an overlap-add process ensures an effective smooth evolution of the speech spectral envelope (LPC) on a sample by sample basis.
The LPC excitation signal is based on a "mixed" excitation model which allows for the appropriate mixing of periodic and random excitation components in voiced frames on a frequency-band basis. This is achieved by operating the system such that the magnitude spectrum of the residual signal is examined, and applying a peak-picking process, near the ω_j resonant frequencies, to detect possible dominant spectral peaks. A peak associated with a frequency ω_j indicates a high degree of voicing (represented by hv_j=1) for that harmonic. The absence of an adjacent spectral peak, on the other hand, indicates a certain degree of randomness (represented by hv_j=0). When hv_j=1 (to indicate "strong" voicing) the contribution of the jth harmonic to the synthesis process is MG_j cos(phase_j(i)). However, when hv_j=0 (to indicate "weak" voicing) the frequency of the jth harmonic is slightly dithered, its magnitude MG_j is reduced to MG_j/√2 and random cosine terms are added
symmetrically alongside the jth harmonic ω_j. The terms "strong" and "weak" are used in this sense below. The number NRS of these random terms is

NRS = 2 × ⌈ ω_j / (4π × (50/f_s)) ⌉   (4)

where ⌈ ⌉ indicates rounding off to the next larger integer value. Furthermore, the NRS random components are spaced at 50 Hz intervals symmetrically about ω_j, ω_j being located in the middle of such a 50 Hz interval. The amplitudes of the NRS random components are set to MG_j/√(2 × NRS). Their initial phases are selected randomly from the [-π, +π] region at
pitch period intervals. The hv_j information must be transmitted to be available at the receiver and, in order to reduce the bit rate allocated to hv_j, the bandwidth of the input signal is divided into a number of fixed size bands BD_k and a "strongly" or "weakly" voiced flag Bhv_k is assigned for each band. In a "strongly" voiced band, a highly periodic signal is reproduced. In a "weakly" voiced band, a signal which combines both periodic and aperiodic components is required. These bands are classified as strongly voiced (Bhv_k=1) or weakly voiced (Bhv_k=0) using a majority decision rule approach on the hv_j classification values of the harmonics ω_j contained within each frequency band.
Further restrictions can be imposed on the strongly/weakly voiced profiles resulting from the classification of bands. For example, the first λ bands may always be strongly voiced, i.e. hv_j=1 for BD_k with k=1,2,...,λ, and λ being a variable. The remaining spectral bands can be strongly or weakly voiced.
Figure 1 schematically illustrates processes operated by the system encoder. These processes are referred to in Figure 1 as Processes I to VII and these terms are used throughout this document. Figure 2 represents the relationship between analysis/coding frame sizes employed. These are M samples per coding frame, e.g. 160 samples per frame, and k frames are analysed in a block, for example k=4. This block size is used for matrix quantization. A speech signal is input and Processes I, III, IV, VI and VII produce outputs for transmission.
Assuming that the first Matrix Quantization analysis frame (MQA) of k×M samples is available, each of the k coding frames within the MQA is classified as voiced or unvoiced (V_n) using Process I. A pitch estimation part of Process I provides a pitch period value P_n only when a coding frame is voiced. Process II operates in parallel on the input speech samples and estimates p LPC filter coefficients a (for example p=10) every L samples (L is a multiple of M, i.e. L=m×M, and m may be equal to, for example, 2). In addition, k/m is an integer and represents the frame dimension of the matrix quantizer employed in Process III. Thus the LPC filter coefficients are quantized, using Process III, and transmitted. The quantized coefficients a are used to derive a residual signal R_n(i).
When an input coding frame is unvoiced, the energy E_n of the residual obtained for this frame is calculated (Process VII). √E_n is then quantized and transmitted.
When the nth coding frame is classified as voiced, a segment of P_n residual samples is obtained (P_n is the pitch period value associated with the nth frame). This segment is centred in the middle of the frame. The selected P_n samples are DFT transformed (Process V) to yield ⌈(P_n + 1)/2⌉ spectral magnitude values MG_j^n, 0 ≤ j < ⌈(P_n + 1)/2⌉, and ⌊(P_n + 1)/2⌋ phase values. The phase information is neglected. The magnitude information is coded (using Process VI) and transmitted. In addition a segment of 20 msecs, which is centred in the middle of the nth coding frame, is obtained from the residual signal R_n(i). This is input to Process IV, together with P_n, to provide the strongly/weakly voiced classification parameters hv_j^n of the harmonics ω_j^n. Process IV produces quantized Bhv information, which for voiced frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced decision V_n, the pitch period P_n, the quantized LPC coefficients a of the corresponding LPC frame, and the magnitude values MG_j^n. In unvoiced frames only the quantized √E_n value and the quantized LPC filter coefficients a are transmitted.
Figure 3 schematically illustrates processes operated by the system decoder. In general terms, given the received parameters of the nth coding frame and those of the previous (n-1)th coding frame, the decoder synthesises a speech signal S_n(i) that extends from the middle of the (n-1)th frame to the middle of the nth frame. This synthesis process involves the generation in parallel of two excitation signals Res_n(i) and Res_{n-1}(i) which are used to drive two independent LPC synthesis filters 1/A_n(z) and 1/A_{n-1}(z), the coefficients of which are derived from the transmitted quantized coefficients a. The outputs X_n(i) and X_{n-1}(i) of these synthesis filters are weighted and added to provide a speech segment which is then post filtered to yield the recovered speech S_n(i). The excitation synthesis process used in both paths of Figure 3 is shown in more detail in Figure 4.
The process commences by considering the voiced/unvoiced status V_k, where k is equal to n or n-1 (see Figure 4). When the frame is unvoiced, i.e. V_k=0, a gaussian random number generator RG(0,1) of zero mean and unit variance provides a time series which is subsequently scaled by the √E_k value received for this frame. This is effectively the required:

Res_k(i) = √E_k × RG(0,1)   (5)

signal which is then presented to the corresponding LPC synthesis filter 1/A_k(z), k=n or n-1. Performance could be increased if the √E_k value was calculated, quantized and transmitted every 5 msecs. Thus, provided that bits are available when coding unvoiced speech, four √E_k^ξ, ξ=0,...,3, values are transmitted for every unvoiced frame of 20 msecs duration (160 samples).
In the case where V_k=1, the Res_k(i) excitation signal is defined as the summation of a "harmonic" Res_k^h(i) component and a "random" Res_k^r(i) component. The top path of the V_k=1 part of the synthesis in Figure 4, which provides the harmonic component of this mixed excitation model, always calculates the instantaneous harmonic frequency function ω_j^n(i) which is associated with the interpolation interval that is defined between the middle points of the nth and (n-1)th frames (i.e. this action is independent of the value of k). Thus, when decoding the nth frame, ω_j^n(i) is calculated using the pitch frequencies f_j^{1,n}, f_j^{2,n} and linear interpolation, i.e.
ω_j^n(i) = 2π [ f_j^{1,n} + (f_j^{2,n} - f_j^{1,n}) × i/M ]   (6)

with 0 ≤ j < ⌈(P_max + 1)/2⌉, 0 ≤ i ≤ M and P_max = max[P_n, P_{n-1}]
The frequencies f_j^{1,n} and f_j^{2,n} are defined as follows:
I) When both the nth and (n-1)th coding frames are voiced, i.e. V_n=1 and V_{n-1}=1, then the pitch frequencies are estimated as follows:
a) If |P_n - P_{n-1}| < 0.2 × (P_n + P_{n-1})   (7)
which means that the pitch values of the nth and (n-1)th coding frames are rather similar, then:
f_j^{2,n} = j/P_n + (1 - hv_j^n) × RU(-a, +a)   (8)
and f_j^{1,n} is defined (Equation 9) either in terms of the combined pitch values P_n, P_{n-1} and P_{n-2} or, otherwise, as the f_j^{2,n-1} value.
The f_j^{2,n-1} value is calculated during the decoding process of the previous (n-1)th coding frame. hv_j^n is the strongly/weakly voiced classification (0 or 1) of the jth harmonic ω_j^n. P_n and P_{n-1} are the received pitch estimates for the n and n-1 frames. RU(-a,+a) indicates the output of a random number generator with uniform pdf within the -a to +a range (a=0.00375).
b) If |P_n - P_{n-1}| > 0.2 × (P_n + P_{n-1})   (10)
then f_j^{2,n} and f_j^{1,n} are again given by expressions of the form of Equation (8), but offset by a term b × j (Equations 11 and 12), the sign and size of b being chosen so that the harmonic frequencies move gradually from their values in the (n-1)th frame.
Notice that in case (b), which applies for significantly different P_n and P_{n-1} pitch estimates, Equations 11 and 12 ensure that the rate of change of the ω_j^n(i) function is restricted.
II) When one of the two coding frames (i.e. n, n-1) is unvoiced, one of the following two definitions is applicable:
a) for V_{n-1}=0 and V_n=1, the pitch frequencies for 0 ≤ j < ⌈(P_n + 1)/2⌉ are derived from the received pitch period P_n, and f_j^{1,n} is given by Equation (8);
b) for V_{n-1}=1 and V_n=0, f_j^{2,n} is set to the f_j^{2,n-1} value, which has been calculated during the decoding process of the previous (n-1)th coding frame, and f_j^{1,n} = f_j^{2,n}.
Given ω_j^n(i), the instantaneous phase function phase_j^n(i) is calculated by:

phase_j^n(i) = 2π (f_j^{2,n} - f_j^{1,n}) i(i + 1)/(2M) + 2π f_j^{1,n} i + phase_j^{n-1}(M)   (13)

for 0 ≤ j < ⌈(P_max + 1)/2⌉ and 0 ≤ i ≤ M
Furthermore, the "harmonic" component Res_k^h(i) of the residual signal is given by:

Res_k^h(i) = Σ_j MG_j^k cos(phase_j^n(i))   (14)

where k = n or n-1, the summation runs over those j for which ω_j^n(i) ≤ π (terms with ω_j^n(i) > π are excluded), and MG_j^k, j = 0, ..., ⌈(P_k + 1)/2⌉ - 1, are the received magnitude values of the kth coding frame, with k=n or k=n-1.
The second path of the V_k=1 case in Figure 4 provides the random excitation component Res_k^r(i). In particular, given the recovered strongly/weakly voiced classification values hv_j^k, the system calculates for those harmonics with hv_j^k=0 the number NRS of random sinusoidal components which are used to randomise the corresponding harmonic. This is:

NRS = 2 × ⌈ ω_j^k / (4π × (50/f_s)) ⌉   (15)

where f_s is the sampling frequency. Notice that the NRS random sinusoidal components are located symmetrically about the corresponding harmonic ω_j^k and they are spaced 50 Hz apart.
The instantaneous frequency of the qth random component, q = 0, 1, ..., NRS-1, for the jth harmonic ω_j^k is calculated by:

ω_{j,q}^k(i) = ω_j^k(i) + 2π × (25/f_s) + (q - NRS/2) × 2π × (50/f_s)   (16)

for 0 ≤ q < NRS and 0 ≤ i ≤ M
The associated phase value Ph_{j,q}^k(i) (Equation 17) is obtained by accumulating the instantaneous frequency ω_{j,q}^k(i) over the sampling instants i, 0 ≤ i ≤ M, starting from an initial phase φ = RU(-π, +π). In addition, the Ph_{j,q}^k(i) function is randomised at pitch intervals (i.e. when the phase of the fundamental harmonic component is a multiple of 2π, i.e. mod(phase_0^k(i), 2π) = 0).
Given Ph_{j,q}^k(i), the random excitation component Res_k^r(i) is calculated as follows:

Res_k^r(i) = Σ_{j: hv_j^k=0} Σ_{q=0}^{NRS-1} C_{j,q}(i) × (MG_j^k / √(2 × NRS)) × cos(Ph_{j,q}^k(i))   (18)

where

C_{j,q}(i) = 0 if ω_{j,q}^k(i) > π, and C_{j,q}(i) = 1 if ω_{j,q}^k(i) ≤ π
Thus for V_k=1 voiced coding frames, the mixed excitation residual is formed as:

Res_k(i) = Res_k^h(i) + Res_k^r(i)   (19)
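A minimal Python sketch of this mixed excitation, following the reconstruction of Equations 14 to 19 above, is given below. Strongly voiced harmonics are synthesised as plain cosines at multiples of the pitch frequency, while weakly voiced harmonics are attenuated and surrounded by NRS random components spaced 50 Hz apart. Frame-to-frame frequency interpolation, phase continuity and the pitch-interval phase randomisation are omitted, and the indexing of harmonics is simplified.

import numpy as np

def mixed_excitation(MG, hv, pitch, M=160, fs=8000.0):
    """Sketch of the V_k=1 mixed excitation.
    MG[j] : received magnitude of harmonic j+1; hv[j] : 1 strong, 0 weak;
    pitch : pitch period in samples."""
    i = np.arange(M)
    w0 = 2.0 * np.pi / pitch                     # fundamental (rad/sample)
    res = np.zeros(M)
    for idx, (mg, v) in enumerate(zip(MG, hv)):
        wj = (idx + 1) * w0
        if wj >= np.pi:
            continue
        if v == 1:                               # strongly voiced harmonic
            res += mg * np.cos(wj * i)
        else:                                    # weakly voiced: attenuate and randomise
            nrs = 2 * int(np.ceil(wj / (4.0 * np.pi * (50.0 / fs))))
            res += (mg / np.sqrt(2.0)) * np.cos(wj * i)
            for q in range(nrs):
                wq = wj + 2*np.pi*(25.0/fs) + (q - nrs/2.0) * 2*np.pi*(50.0/fs)
                if 0.0 < wq < np.pi:
                    phi = np.random.uniform(-np.pi, np.pi)
                    res += (mg / np.sqrt(2.0 * nrs)) * np.cos(wq * i + phi)
    return res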
Notice that when V_k=0, instead of using Equation 5, the random excitation signal Res_k(i) can be generated by the summation of random cosines located 50 Hz apart, where their phase is randomised every λ samples, with λ < M, i.e.

Res_k(i) = Σ_l cos(2π i (l × 50/f_s) + δ(i - λ × ξ - ζ) × RU(-π, +π))   (20)

where ξ = 0, 1, 2, ..., 0 ≤ i < M, the index l running over the 50 Hz spaced cosine components, and
ζ is defined so as to ensure that the phase of the cos terms is randomised every λ samples across frame boundaries. The resulting Res_n(i) and Res_{n-1}(i) excitation sequences, see Figure 4, are processed by the corresponding 1/A_n(z) and 1/A_{n-1}(z) LPC synthesis filters. When coding the next (n+1)th frame, 1/A_{n-1}(z) becomes 1/A_n(z) (including the memory) and 1/A_n(z) becomes 1/A_{n+1}(z) with the memory of 1/A_n(z). This is valid in all cases except during an unvoiced to voiced transition, where the memory of the 1/A_{n+1}(z) filter is set to zero. The coefficients of the 1/A_n(z) and 1/A_{n-1}(z) synthesis filters are calculated directly from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame size L is equal to M samples. However, when L≠M (usually L>M) linear interpolation is used on the filter coefficients (defined every L samples) so that the transfer function of the synthesis filter is updated every M samples. The output signals of these filters, denoted as X_{n-1}(i) and X_n(i), are weighted, overlapped and added as schematically illustrated in Figure 5 to yield X̂_n(i), i.e.:
X̂_n(i) = X_{n-1}(i) W_{n-1}(i) + X_n(i) W_n(i)

where W_{n-1}(i) (Equation 21) and W_n(i) (Equation 22) are complementary half-window weighting functions defined over 0 ≤ i < M, W_{n-1}(i) decreasing and W_n(i) increasing with i.
X̂_n(i) is then filtered via a PF(z) post filter and a high pass filter HP(z) to yield the speech segment S'_n(i). PF(z) is the conventional post filter:

PF(z) = (1 - μ z^{-1}) × A_n(z/b) / A_n(z/c)   (23)

with b=0.5, c=0.8 and μ = 0.5 k_1^n, where k_1^n is the first reflection coefficient of the nth coding frame. HP(z) is defined as:

HP(z) = (b_1 - c_1 z^{-1}) / (1 - a_1 z^{-1})   (24)

with b_1 = c_1 = 0.9807 and a_1 = 0.961481.
In order to ensure that the energy of the recovered S(i) signal is preserved, as compared to that of the X(i) sequence, a scaling factor SC is calculated every LPC frame of L samples.
SC_l = √( E'_l / E_l )   (25)

where E'_l = Σ_{i=0}^{L-1} X_l(i)² and E_l = Σ_{i=0}^{L-1} S'_l(i)².

SC_l is associated with the middle of the lth LPC frame as illustrated in Figure 6. The filtered samples from the middle of the (l-1)th frame to the middle of the lth frame are then multiplied by SC_l(i) to yield the final output of the system,

S(i) = S'(i) × SC_l(i)

where:

SC_l(i) = SC_l W_l(i) + SC_{l-1} W_{l-1}(i),  0 ≤ i < L   (26)

and

W_l(i) = 0.5 - 0.5 cos(π i / L),  0 ≤ i < L
W_{l-1}(i) = 0.5 + 0.5 cos(π i / L),  0 ≤ i < L
The scaling process introduces an extra half LPC frame delay into the coding-decoding process.
The above described energy scaling procedure operates on an LPC frame basis in contrast to both the decoding and PF(z), HP(z) filtering procedures which operate on the basis of a frame of M samples.
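The scaling of Equations 25 and 26 can be sketched as follows; the handling of the previous frame's scale factor at start-up and the exact frame alignment are simplified assumptions of the example.

import numpy as np

def apply_energy_scaling(x_frame, s_frame, sc_prev):
    """Per-LPC-frame energy scaling (Equations 25-26 above).
    x_frame : LPC filter output used as the energy reference, L samples
    s_frame : post-filtered samples to be scaled, L samples
    sc_prev : scale factor of the previous LPC frame."""
    L = len(s_frame)
    e_x = float(np.sum(np.asarray(x_frame) ** 2))
    e_s = float(np.sum(np.asarray(s_frame) ** 2))
    sc = np.sqrt(e_x / e_s) if e_s > 0 else 1.0        # Equation 25
    i = np.arange(L)
    w_curr = 0.5 - 0.5 * np.cos(np.pi * i / L)         # rises over the frame
    w_prev = 0.5 + 0.5 * np.cos(np.pi * i / L)         # falls over the frame
    sc_i = sc * w_curr + sc_prev * w_prev              # Equation 26
    return np.asarray(s_frame) * sc_i, sc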
Details of the coding processes represented in Figure 1 will now be described. Process I derives a voiced/unvoiced (V/UV) classification V_n for the nth input coding frame and also assigns a pitch estimate P_n to the middle sample M_n of this frame. This process is illustrated in Figure 7.
The V/UV and pitch estimation analysis frame is centred at the middle M_{n+1} of the (n+1)th coding frame with 237 samples on either side. The signal x(i) in the above analysis frame is low pass filtered with a cut off frequency fc=1.45 kHz and the resulting (-147, 147) samples centred about M_{n+1} are used in a pitch estimation algorithm, which yields an estimate P_{M_{n+1}}. The pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the pitch estimation process. The 294 input samples are used to calculate a crosscorrelation function CR(d), where d is shown in Figure 9 and 20 ≤ d ≤ 147. Figure 9 shows the two speech segments which participate in the calculation of the crosscorrelation function value at delay d. In particular, for a given value of d, the crosscorrelation function ρ_d(j) is calculated for the segments {x_L}_d, {x_R}_d as:

ρ_d(j) = Σ_i (x_L^d(i) - x̄_L^d)(x_R^d(i) - x̄_R^d) / √( Σ_i (x_L^d(i) - x̄_L^d)² × Σ_i (x_R^d(i) - x̄_R^d)² )   (27)

where x_L^d(i) = x(M_{n+1} - d + j + i), x_R^d(i) = x(M_{n+1} + j + i), for 0 ≤ i ≤ d-j-1, j = 0, 1, ..., f(d). Figure 10 schematically represents the x_L^d and x_R^d speech segments used in the calculation of the value CR(d), and the non-linear relationship between d and f(d) is given in Figure 11. x̄_L^d and x̄_R^d represent the mean value of the {x_L}_d and {x_R}_d sequences respectively.
The algorithm then selects max[ρ_d(j)] and defines CR(d) = max_{0 ≤ j ≤ f(d)} [ρ_d(j)], 20 ≤ d ≤ 147. In addition to CR(d), the box in Figure 8 labelled "Calculation of CR function and selection of its peaks", whose detailed diagram is shown in Figure 12, also provides the locations loc(k) of the peaks of the CR(d) function, where k=1,2,...,Np and Np is the number of peaks in a CR(d) function.
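The computation of CR(d) can be sketched in Python as follows, under the reconstruction of Equation 27 above. The non-linear relationship f(d) is shown only in Figure 11, so it is approximated here by d // 4; that approximation and the function name are assumptions of the example.

import numpy as np

def cr_function(x, m, d_min=20, d_max=147):
    """Sketch of CR(d) around reference sample m of the filtered signal x."""
    cr = {}
    for d in range(d_min, d_max + 1):
        best = -1.0
        for j in range(0, d // 4 + 1):           # f(d) approximated by d // 4
            n = d - j                            # segment length
            if n < 2:
                break
            left = x[m - d + j : m - d + j + n]  # x_L^d
            right = x[m + j : m + j + n]         # x_R^d
            l = left - np.mean(left)
            r = right - np.mean(right)
            denom = np.sqrt(np.sum(l * l) * np.sum(r * r))
            if denom > 0:
                best = max(best, float(np.sum(l * r) / denom))
        cr[d] = best
    return cr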
Figure 12 is a block diagram of the process involving the calculation of the CR function and the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as:

th(d) = CR(d_max) × b - (d - d_max) × a - c   (28)

where d_max is the location of the maximum of CR(d), c=0.08 when (V'_n = 1) AND (d > 0.875 × P'_n) AND (d < 1.125 × P'_n), or c=0 elsewhere, and a and b are constants. Using this threshold the CR(d) function is clipped to CR_L(d), i.e.

CR_L(d) = 0 for CR(d) < th(d)
CR_L(d) = CR(d) otherwise.
CR_L(d) contains segments G_s, s = 1, 2, 3, ..., of positive values separated by G_0 runs of zero values. The algorithm examines the length of the G_0 runs which exist between successive G_s segments (i.e. G_s and G_{s+1}), and when G_0 < 17, then the G_s segment with the maximum CR_L(d) value is kept. This procedure yields the reduced CR_L(d), which is then examined by the following "peak picking" procedure. In particular those CR_L(d) values are selected for which CR_L(d) > CR_L(d - 1) and CR_L(d) > CR_L(d + 1). However certain peaks can be rejected if CR_L(loc(k)) ≤ CR_L(loc(k + 1)) × 0.9. This ensures that the final CR_L(loc(k)), k=1,...,Np, does not contain spurious low level CR_L(d) peaks. The locations d of the above defined CR_L(d) peaks are given by loc(k), k=1,2,...,Np.
CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch Estimation algorithm (MHRPE) shown in Figure 8, whose output is P_{M_{n+1}}. The flowchart of this MHRPE procedure is shown in Figure 13, where P is initialised with 0 and, at the end, the estimated P is the requested P_{M_{n+1}}. In Figure 13 the main pitch estimation procedure is based on a Least Squares Error (LSE) algorithm which is defined as follows. For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 × j, i.e. j ∈ {21, 23, 25, 27, 30, 33, 36, 40, 44, 48, 53, 58, 64, 70, 77, 84, 92, 101, 111, 122, 134} (thus 21 iterations are performed):
1) Form the multiplication factor vector u_j = round(loc / j), i.e. for each peak location the nearest integer multiple index of the candidate pitch j.
2) Reject possible pitch j and go back to (1) if: a) the same element occurs in u_j twice; b) the elements of u_j have a prime number as a common factor.
3) Form the following error quantity:
E_j = loc^T loc - 2 p_j u_j^T loc + p_j² u_j^T u_j, where p_j = (u_j^T loc) / (u_j^T u_j)
4) Select the p_js value for which the associated error quantity E_js is minimum (i.e. js: E_js ≤ E_j for all j ∈ {21, 23, ..., 134}). Set P = p_js.
The next two general conditions, "Reject Highest Delay" loc(Np) and "Reject Lowest Delay" loc(1), are included in order to reject false pitch "double" or "half" values and in general to provide constraints on the pitch estimates of the system. The "Reject Highest Delay" condition involves 3 constraints: i) if P=0 then reject loc(Np); ii) if loc(Np) > 100 then find the local maximum CR(d_lm) in CR(d) in the vicinity of the estimated pitch P (i.e. 0.8×P to 1.2×P) and compare this with th(d_lm), which is determined as in Equation 28, rejecting loc(Np) when CR(d_lm) < th(d_lm) - 0.02; iii) if the error E_js of the LSE algorithm is larger than 50 and u_P(Np) = Np with Np > 2 then reject loc(Np). The flowchart of this is given in Figure 14.
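The core of steps 1 to 4 can be sketched as follows; the rounding used to form u_j follows the reading given above, the rejection of false "double" and "half" values is omitted, and the candidate grid is assumed to be the list quoted in the text.

import numpy as np

def lse_pitch(loc, candidates):
    """Least-squares fit of a pitch whose multiples best match the peak
    locations loc(k).  Returns (pitch, error); 0 if every candidate is rejected."""
    loc = np.asarray(loc, dtype=float)
    best_p, best_e = 0.0, np.inf
    for j in candidates:
        u = np.round(loc / j)                    # multiplication factor vector
        if np.all(u == 0):
            continue
        if len(set(u.tolist())) != len(u):       # same element occurs twice
            continue
        if np.gcd.reduce(u.astype(int)) > 1:     # common factor -> harmonic ambiguity
            continue
        p = float(u @ loc) / float(u @ u)        # LS estimate p_j
        e = float(loc @ loc) - 2.0 * p * float(u @ loc) + p * p * float(u @ u)
        if e < best_e:
            best_p, best_e = p, e
    return best_p, best_e

# Example usage with the candidate grid given in the text:
# lse_pitch([42, 84, 126], [21, 23, 25, 27, 30, 33, 36, 40, 44, 48, 53, 58,
#                           64, 70, 77, 84, 92, 101, 111, 122, 134])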
The "Reject Lowest Delay" general condition, whose flowchart is given in Figure 15, rejects loc(l) when the following three constraints are simultaneously satisfied: i) The density of detection of the peaks of the correlation coefficient function is less than or equal to 0.75. i.e.
NP < 0.75 u,, (Np) ii) If the location of the first peak is neglected (i.e. loc(l)), then the remaining locations exhibit a common factor. Hi) The value of the correlation coefficient function at the locations of the missing peaks is relatively small compared to adjacent detected peaks, i.e.
if u_P(k+1) - u_P(k) > 1, for k=1,...,Np, then for i = u_P(k)+1 : u_P(k+1)-1: a) find the local maximum CR(d_lm) in the range from (i-0.1)×loc(1) to (i+0.1)×loc(1); b) if CR(d_lm) < 0.97 × CR(u_P(k)) then Reject Lowest Delay, END; else continue.
This concludes the pitch estimation procedure of Figure 7, whose output is P_{M_{n+1}}. As is also illustrated in Figure 7 however, in parallel to the pitch estimation, Process I obtains 160 samples centred at the middle M_{n+1} of the coding frame, removes their mean value, and then calculates R0, R1 and the average R_av of the energies of the previous K non-silence coding frames. K is fixed to 50 for the first 50 non-silence coding frames, increases from 50 to 100 with the next 50 non-silence coding frames, and then remains constant at the value of 100. The flowchart of the procedure that calculates R_av, R1, R0 and updates the R_av buffer is shown in Figure 16, where "Count" represents the number of non-silence speech frames, and "++" denotes increase by one. Notice that TH is an adaptive threshold that is representative of a silence (non speech) frame and is defined as in Figure 17. The CR value used here is the maximum of the CR function obtained in the pitch estimation process.
Given R0, R1, R_av and this CR value, the V/UV part of Process I calculates the status V_{M_{n+1}} of the (n+1)th frame. The flowchart of this part of the algorithm is shown in Figure 18, where "V" represents the output V/UV flag of this procedure. Setting the "V" flag to 1 or 0 indicates voiced or unvoiced classification respectively. The "CR" parameter denotes the maximum value of the CR function which is calculated in the pitch estimation process. A diagrammatic representation of the voiced/unvoiced procedure is given in Figure 19.
Having the V_{M_{n+1}} value, the P_{M_{n+1}} estimate and the V'_n and P'_n estimates which have been produced from Process I operating on the previous nth coding frame, as illustrated in Figure 7, part b, two further locations M_{n+1}+d1 and M_{n+1}+d2 are estimated and the corresponding [-147, 147] segments of filtered speech samples are obtained. These additional two analysis frames are used as input to the "Pitch Estimation process" of Figure 8 to yield P_{M_{n+1}+d1} and P_{M_{n+1}+d2}. The procedure for calculating d1 and d2 is given in the flowchart of Figure 20.
The final step in part (a) of Process I of Figure 7 applies the previous V/UV classification procedure (Figure 18) with inputs R0, R1, R_av and the corresponding maximum CR values, to yield a preliminary value V'_{n+1}.
In addition, a multipoint pitch estimation algorithm accepts P_{M_{n+1}}, P_{M_{n+1}+d1}, P_{M_{n+1}+d2}, V_{n-1}, P_{n-1}, V'_n, P'_n to provide a preliminary pitch value P'_{n+1}. The flowchart of this multipoint pitch estimation algorithm is given in Figure 21, where P_1, P_2 and P_0 represent the pitch estimates associated with the M_{n+1}+d1, M_{n+1}+d2 and M_{n+1} points respectively, and P denotes the output pitch estimate of the process, that is P'_{n+1}.
Finally part (b) of Process I of Figure 7 imposes constraints on the V'_{n+1} and P'_{n+1} estimates in order to ensure a smooth evolution for the pitch parameter. The flowchart of this section is given in Figure 22. At the start of this process "V" and "P" represent the voicing flag and pitch estimate values before constraints are applied (the preliminary values in Figure 7) whereas at the end of the process "V" and "P" represent the voicing flag and pitch estimate values after the constraints have been applied (V'_{n+1} and P'_{n+1}). The V'_{n+1} and P'_{n+1} produced from this section are then used in the next pitch post processing section together with V_{n-1}, V'_n, P_{n-1} and P'_n to yield the final voiced/unvoiced and pitch estimate parameters V_n and P_n for the nth coding frame. This pitch post processing stage is defined in the flowchart of Figures 23, 24 and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24 being the input to Figure 25. At the start of this procedure "P_n" and "V_n" represent the pitch estimate and voicing flag respectively, which correspond to the nth coding frame prior to post processing (i.e. P'_n, V'_n) whereas at the end of the procedure "P_n" and "V_n" represent the final pitch estimate and voicing flag associated with the nth frame (i.e. P_n, V_n). The LPC analysis process (Process II of Figure 1) can be performed using the Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used, although simple autocorrelation schemes could be employed without a noticeable effect on the decoded speech quality. The LPC coefficients are then transformed to an LSP representation. Typical values for the number of coefficients are 10 to 12 and a 10th order filter has been used. LPC analysis processes are well known and described in the literature, for example "Digital Processing of Speech Signals", L.R. Rabiner and R.W. Schafer, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1978. Similarly, LSP representations are well known, for example from "Line Spectrum Pair and Speech Data Compression", F. Soong and B.H. Juang, Proc. ICASSP-84, pp 1.10.1-1.10.4, 1984. Accordingly these processes and representations will not be described further in this document.
In Process II, ten LSP coefficients are used to represent the data. These 10 coefficients could be scalar quantized using 37 bits with the following bit allocation pattern [3,4,4,4,4,4,4,4,3,3]. This is a relatively simple process, but the resulting bit rate of 1850 bits/second is unnecessarily high. Alternatively the LSP coefficients can be Vector Quantised (VQ) using a Split-VQ technique. In the Split-VQ technique an LSP parameter vector of dimension "p" is split into two or more subvectors of lower dimensions and then each subvector is Vector Quantised separately (when Vector Quantising the subvectors a direct VQ approach is used). In effect, the LSP transformed coefficient vector C, which consists of "p" consecutive coefficients (c_1, c_2, ..., c_p), is split into "K" vectors C_k (1 ≤ k ≤ K), with the corresponding dimensions d_k (1 ≤ d_k ≤ p), such that Σ_{k=1}^{K} d_k = p.
In particular, when "K" is set to "p" (i.e. when C is partitioned into "p" elements) the Split-VQ becomes equivalent to Scalar Quantisation. On the other hand, when "K" is set to unity (K=1, d_k=p) the Split-VQ becomes equivalent to Full Search VQ. The above Split-VQ approach leads to an LPC filter bit rate of the order of 1.3 to 1.4 Kbits/sec. In order to further minimize the bit rate of the voice coded system described in this document a Split Matrix VQ (SMQ) has been developed in the University of Manchester and reported in "Efficient Coding of LSP Parameters using Split Matrix Quantisation", C. Xydeas and C. Papanastasiou, Proc. ICASSP-95, pp 740-743, 1995. This method results in transparent LPC quantisation at 900 bits/sec and offers a flexible way to obtain, for a given quantisation accuracy, the required memory/complexity characteristics for Process III. An important feature of SMQ is a new weighted Euclidean distance which is defined in detail as follows.
D(L_k(l), L'_k(l)) = Σ_{t=0}^{N-1} Σ_{s=0}^{m(k)-1} (LSP_{S(k)+s}^{l+t} - LSP'_{S(k)+s}^{l+t})² × w(s,t)² × w_l(t)   (29)

where L'_k(l) represents the kth (k=1,...,K) quantized submatrix and LSP'_{S(k)+s}^{l+t} are its elements. m(k) represents the spectral dimension of the kth submatrix and N is the SMQ frame dimension. Note also that S(k) = Σ_{j=0}^{k-1} m(j), m(0) = 1 and Σ_{k=1}^{K} m(k) = p.

w_l(t) = (1 - E_r(t))^α × E_n(t) / Aver(E_n)  for transmission frames 0 ≤ t ≤ N-1   (30)

when the N LPC frames consist of both voiced and unvoiced frames, and

w_l(t) = E_n(t)^{α1}  otherwise

where E_r(t) is the normalised energy of the prediction error of the (l+t)th frame, E_n(t) is the RMS value of the (l+t)th speech frame and Aver(E_n) is the average RMS value of the N LPC frames used in SMQ. The values of the constants α and α1 are set to 0.2 and 0.15 respectively.

Also:

w(s,t) = | P(LSP_{S(k)+s}^{l+t}) |^β   (31)

where P(LSP_{S(k)+s}^{l+t}) is the value of the power envelope spectrum of the (l+t)th speech frame at the LSP_{S(k)+s}^{l+t} frequency, and β is equal to 0.15.
The overall SMQ quantisation process that yields the quantised LSP coefficient vectors l^l to l^{l+N-1} for the l to l+N-1 analysis frames is shown in Figure 26. This figure also includes the inverse process, which accepts the above l^{l+i} vectors, i=0,...,N-1, and provides the corresponding LPC coefficient vectors a^l to a^{l+N-1}. The a^{l+i}, i=0,...,N-1, coefficient vectors are modified, prior to the LPC to LSP transformation, by a 10 Hz bandwidth expansion as indicated in Figure 26. A 5 Hz bandwidth expansion is also included in the inverse quantisation process.
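For orientation only, the following Python sketch illustrates the basic Split-VQ idea introduced above, i.e. quantising consecutive LSP subvectors against separate codebooks with a plain squared-error criterion. It does not implement the SMQ matrix extension or the weighted distance of Equations 29 to 31, and the codebooks themselves are assumed to be available.

import numpy as np

def split_vq(lsp, codebooks):
    """Quantise an LSP vector by splitting it into consecutive subvectors,
    one per codebook.  codebooks[k] has shape (entries_k, d_k) and
    sum(d_k) must equal len(lsp)."""
    indices, quantised, pos = [], [], 0
    for cb in codebooks:
        d = cb.shape[1]
        sub = np.asarray(lsp[pos:pos + d])
        errs = np.sum((cb - sub) ** 2, axis=1)   # direct VQ: squared error
        best = int(np.argmin(errs))
        indices.append(best)
        quantised.append(cb[best])
        pos += d
    return indices, np.concatenate(quantised)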
Process IV of Figure 1 will now be described. This process is concerned with the mixed voiced classification of harmonics. When the nth coding frame is classified as voiced, the residual signal R_n(i) of length 160 samples centred at the middle M_n of the nth coding frame and the pitch period P_n for that frame are used to determine the strongly voiced (hv_j=1)/weakly voiced (hv_j=0) classification associated with the jth harmonic ω_j^n. The flowchart of Process IV is given in Figure 27. The R_n array of 160 samples is Hamming windowed and augmented to form a 512 size array, which is then FFT processed. The maximum and minimum values MGR_max, MGR_min of the resulting 256 spectral magnitude values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the magnitude spectrum. The clipped MGR array is searched to define peaks MGR(P) satisfying:

MGR(P) > MGR(P+1) and MGR(P) > MGR(P-1)

For each peak MGR(P), "supported" by the MGR(P+1) and MGR(P-1) values, a second order polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In particular peaks are rejected:
a) if there are spectral peaks in the neighbourhood of loc(MGR(P)) (i.e. in the range loc(MGR(P))-fo/2 to loc(MGR(P))+fo/2, where fo is the fundamental frequency in Hz), whose value is larger than 80% of MGR(P), or b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P). After applying these two constraints the remaining spectral peaks are characterised as "dominant" peaks. The objective of the remaining part of the process is to examine if there is a "dominant" peak near a given harmonic j×ω_0, in which case the harmonic is classified as strongly voiced and hv_j=1, otherwise hv_j=0. In particular, two thresholds are defined as follows:
TH1 = 0.15×fo, TH2 = (1.5/P_n)×fo, with fo = (1/P_n)×f_s, where f_s is the sampling frequency.
The difference (loc(MGRd(k)) - loc(MGRd(k-1))) is compared to 1.5×fo + TH2 and, if it is larger, a related harmonic is not associated with a "dominant" peak and the corresponding classification hv_j is zero (weakly voiced). loc(MGRd(k)) is the location of the kth dominant peak and k=1,...,D, where D is the number of dominant peaks. This procedure is described in detail in Figure 28, in which it should be noted that the harmonic index j does not always correspond to the magnitude spectrum peak index k, and loc(k) is the location of the kth dominant peak, i.e. loc(MGRd(k)) = loc(k).
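A compact sketch of Process IV follows (Python). The clipping constant c_thr and the final "is a dominant peak within TH1 of j×fo" test are simplifications of the full Figure 28 logic, which also uses TH2 and the peak-spacing rule described above; function and variable names are illustrative.

    import numpy as np

    def classify_harmonics(residual_160, pitch, fft_size=512, c_thr=0.1):
        # residual_160: 160 residual samples centred on the middle of the coding frame
        # pitch:        pitch period Pn in samples; fo (in FFT bins) = fft_size / Pn
        x = np.asarray(residual_160, dtype=float) * np.hamming(len(residual_160))
        mgr = np.abs(np.fft.rfft(x, fft_size))[:fft_size // 2]        # 256 magnitude values
        th0 = mgr.min() + c_thr * (mgr.max() - mgr.min())             # clipping threshold TH0
        mgr = np.where(mgr > th0, mgr, 0.0)

        fo_bins = fft_size / float(pitch)
        peaks = []                                                    # (location, height)
        for p in range(1, len(mgr) - 1):
            if mgr[p] > mgr[p - 1] and mgr[p] > mgr[p + 1]:
                # second-order polynomial fit through the three supporting values
                d = 0.5 * (mgr[p - 1] - mgr[p + 1]) / (mgr[p - 1] - 2 * mgr[p] + mgr[p + 1])
                peaks.append((p + d, mgr[p] - 0.25 * (mgr[p - 1] - mgr[p + 1]) * d))

        dominant = []
        for loc, val in peaks:
            lo, hi = loc - fo_bins / 2.0, loc + fo_bins / 2.0
            rivals = [v for l, v in peaks if l != loc and lo <= l <= hi]
            band = mgr[max(int(lo), 0):min(int(hi) + 1, len(mgr))]
            if all(v <= 0.8 * val for v in rivals) and (band.size == 0 or band.max() <= val):
                dominant.append(loc)

        n_harm = (int(pitch) + 2) // 2 - 1                            # non-DC harmonics
        hv = np.zeros(n_harm, dtype=int)
        for j in range(1, n_harm + 1):
            target = j * fo_bins
            if dominant and min(abs(d - target) for d in dominant) < 0.15 * fo_bins:  # TH1
                hv[j - 1] = 1
        return hv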
In order to minimise the bit rate associated with the transmission of the hv_j information, two schemes have been employed which coarsely represent hv_j.
Scheme I
The spectrum is divided into bands of 500 Hz each and a strongly voiced/weakly voiced flag Bhv is assigned for each band. The first and last 500 Hz bands, i.e. 0 to 500 Hz and 3500 to 4000 Hz, are always regarded as strongly voiced (Bhv=1) and weakly voiced (Bhv=0) respectively. When Vn=1 and Vn-1=1 the 500 to 1000 Hz band is classified as voiced, i.e. Bhv=1. Furthermore, when Vn=1 and Vn-1=0 the 3000 to 3500 Hz band is classified as weakly voiced, i.e. Bhv=0. The Bhv values of the remaining 5 bands are determined using a majority decision rule on the hv_j values of the j harmonics which fall within the band under consideration. When the number of harmonics for a given band is even and no clear majority can be established, i.e. the number of harmonics with hv_j=1 is equal to the number of harmonics with hv_j=0, then the value of Bhv for that band is set to the opposite of the value assigned to the immediately preceding band. At the decoding process the hv_j of a specific harmonic j is equal to the Bhv value of the corresponding band. Thus the hv_j information may be transmitted with 5 bits.
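A sketch of the Scheme I banding rule (Python; the per-harmonic hv flags and the previous-frame voicing flag are assumed to come from Process IV and Process I respectively). Only the five free band flags would actually be transmitted.

    def scheme1_band_flags(hv, f0, v_prev):
        # hv:     per-harmonic strongly(1)/weakly(0) voiced flags of the current voiced frame
        # f0:     fundamental frequency in Hz
        # v_prev: voiced/unvoiced flag V_{n-1} of the previous frame
        bhv = [None] * 8
        bhv[0], bhv[7] = 1, 0                      # 0-500 Hz and 3500-4000 Hz are fixed
        if v_prev == 1:
            bhv[1] = 1                             # 500-1000 Hz forced strongly voiced
        else:
            bhv[6] = 0                             # 3000-3500 Hz forced weakly voiced
        for b in range(8):
            if bhv[b] is not None:
                continue
            in_band = [hv[j] for j in range(len(hv))
                       if b * 500.0 <= (j + 1) * f0 < (b + 1) * 500.0]
            ones = sum(in_band)
            if 2 * ones > len(in_band):
                bhv[b] = 1
            elif 2 * ones < len(in_band):
                bhv[b] = 0
            else:                                  # tie: opposite of the preceding band
                bhv[b] = 1 - bhv[b - 1]
        return bhv

    flags = scheme1_band_flags([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], f0=250.0, v_prev=1)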
Scheme II
In this case the 680 Hz to 3400 Hz range is represented by only two variable size bands. When Vn=1 and Vn-1=0 the Fc frequency that separates these two bands can be one of the following:
(A) 680, 1360, 2040, 2720, whereas, when Vn=1 and Vn-1=1, Fc can be one of the following frequencies:
(B) 1360, 2040, 2720, 3400.
Furthermore, the 0 to 680 Hz and 3400 to 4000 Hz bands are always represented with Bhv=1 and Bhv=0 respectively. The Fc frequency is selected by examining the three bands sequentially defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics which fall within a band. When a band with a mixed voiced classification Bhv=0 is found, i.e. the number of harmonics with hv_j=0 is larger than the number of harmonics with hv_j=1, then Fc is set to the lower boundary of this band and the remaining spectral region is classified as Bhv=0. In this case only 2 bits are allocated to define Fc. The lower band is strongly voiced with Bhv=1, whereas the higher band is weakly voiced with Bhv=0. To illustrate the effect of the mixed voiced classification on the speech synthesised from the
transmitted information, Figures 29 and 30 represent respectively an original speech
waveform obtained for the utterance "Industrial shares were mostly a" and frequency tracks
obtained for that utterance. The horizontal axis represents time in terms of frames each of
20msec duration. Figure 31 shows to a larger scale a section of Figure 30, and represents
frequency tracks by full lines for the case when the voiced frames are all deemed to be
strongly voiced (hv=l) and by dashed lines when the strongly/weakly voiced classification is
taken into account so as to introduce random perturbations when hv=0.
Figure 32 shows four waveforms A, B, C and D. Waveform A represents the magnitude
spectrum of a speech segment and the corresponding LPC spectral envelope (log10 domain).
Waveforms B, C and D represent the normalised Short-Term magnitude spectrum of the
corresponding residual segment (B), the excitation segment obtained using the binary
(voiced/unvoiced) excitation model (C), and the excitation segment obtained using the
strongly voiced/weakly voiced/unvoiced hybrid excitation model (D). It will be noted that
the hybrid model introduces an appropriate amount of randomness where required in the 3π/4
to π range such that curve D is a much closer approximation to curve B than curve C.
Process V of Figure 1 will now be described. Once the residual signal has been derived, a segment of Pn samples is obtained in the residual signal domain. The magnitude spectrum of the segment, which contains excitation source information, is derived by applying a Pn-point DFT. An alternative solution, in order to avoid the computational complexity of the Pn-point DFT, is to apply a fixed length FFT (128 points) and to find the value of the magnitude spectrum at the desired points, using linear interpolation.
For a real-valued sequence x(i) of P points the DFT may be expressed as:

X(k) = Σ_{i=0}^{P-1} x(i)·e^{-j2πik/P},   k = 0,...,P-1

The Pn-point DFT will yield a double-sided spectrum. Thus, in order to represent the excitation signal as a superposition of sinusoidal signals, the magnitude of all the non-DC components must be multiplied by a factor of 2. The total number of single side magnitude spectrum values, which are used in the reconstruction process, is equal to ⌈(Pn + 1)/2⌉.
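A sketch of Process V (Python). The scaling convention used here (divide by Pn, double the non-DC terms) is one way to obtain sinusoid amplitudes from the Pn-point DFT; the names are illustrative.

    import numpy as np

    def pitch_segment_magnitudes(residual, centre, pitch):
        # Single-sided magnitudes of one pitch segment of the residual signal
        p = int(pitch)
        seg = np.asarray(residual[centre - p // 2: centre - p // 2 + p], dtype=float)
        spec = np.fft.fft(seg)                      # Pn-point DFT, double-sided
        n_mag = (p + 2) // 2                        # ceil((Pn + 1) / 2) values incl. DC
        mags = np.abs(spec[:n_mag]) / p
        mags[1:] *= 2.0                             # fold the negative frequencies
        return mags

    mg = pitch_segment_magnitudes(np.random.randn(400), centre=200, pitch=57)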
Process VI of Figure 1 will now be described. The DFT (Process V) applied on the Pn samples of a pitch segment in the residual domain yields ⌈(Pn + 1)/2⌉ spectral magnitudes (MG_j^n, 0 ≤ j < ⌈(Pn + 1)/2⌉) and ⌈(Pn + 1)/2⌉ phase values. The phase information is neglected. However, the continuity of the phase between adjacent voiced frames is preserved. Moreover, the contribution of the DC magnitude component is assumed to be negligible and thus MG_0^n is set to 0. In this way, the non-DC magnitude spectrum is assumed to contain all the perceptually important information.
Based on the assumption of an "approximately" flat shape magnitude spectrum for the pitch residual segment, various methods could be used to represent the entire magnitude spectrum with a single value. Specifically, a modified single value spectral amplitude representation (MSVSAR) technique is described below.
MSVSAR is based on the observation that some of the speech spectrum resonance and anti-resonance information is also present in the residual magnitude spectrum (G.S. Kang and S.S. Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction Vocoder", IEEE Trans. Acoust., Speech and Signal Proc., Vol. ASSP-33, pp.377-386, 1985). LPC inverse filtering cannot produce a residual signal of absolutely flat magnitude spectrum, mainly due to: a) the "cascade representation" of formants by the LPC filter 1/A(z), which results in the magnitudes of the resonant peaks being dependent upon the pole locations of the 1/A(z) all-pole filter, and b) the LPC quantisation noise. As a consequence, the LPC residual signal is itself highly intelligible. Based on this observation the MG_j^n magnitudes are obtained by spectral sampling, at the harmonic locations ω_j^n, j=1,...,⌈(Pn + 1)/2⌉, of a modified LPC synthesis filter that is defined as follows:

MP(z) = G_N / (1 - G_R·Σ_{i=1}^{p} a_i^n·z^{-i})   (32)

where a_i^n, i=1,...,p, represent the p quantised LPC coefficients of the nth coding frame and G_R and G_N are defined as follows:
G_R = G_K·Π_{i=1}^{p} (1 - (K_i^n)²)   (33)

and

G_N = sqrt( (1/(2Pn))·Σ_{i=1}^{2Pn} x_n²(i) ) / sqrt( (1/2)·Σ_{j=1}^{⌈(Pn+1)/2⌉} ( MP(ω_j^n)·H(ω_j^n) )² )   (34)
where K_i^n, i=1,...,p, are the reflection coefficients of the nth coding frame, x_n(i) represents a sequence of 2Pn speech samples centred in the middle of the nth coding frame, from which the mean value is calculated and removed, and MP(ω_j^n) and H(ω_j^n) represent the frequency responses of the MP(z) and 1/A(z) filters respectively at the ω_j^n frequency. Notice that the MP(ω_j^n) values are calculated assuming G_N=1. The G_K parameter represents a constant whose value is set to 0.25.
Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose frequency response consists of nearly equalised resonant peaks, the locations of which are very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback gain GR is controlled by the performance of the LPC model (i.e. it is proportional to the normalised LPC prediction error). In addition Equation 34 ensures that the energy of the reproduced speech signal is equal to the energy of the original speech waveform. Robustness is increased by computing the speech RMS value over two pitch periods. Two alternative magnitude spectrum representation techniques are described below, which allow for better coding of the magnitude information and lead to a significant improvement in reconstructed speech quality.
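A sketch of the MSVSAR magnitude generation (Python). The forms of Equations (33) and (34) used here follow the readings given above (G_R proportional to the normalised prediction error, G_N from energy matching) and should be taken as assumptions rather than the exact patented expressions.

    import numpy as np

    def msvsar_magnitudes(lpc_q, refl, speech_2pn, pitch, gk=0.25):
        # lpc_q:      quantised LPC coefficients a_i of the nth frame
        # refl:       reflection coefficients K_i of the nth frame
        # speech_2pn: 2*Pn mean-removed speech samples centred on the frame middle
        a = np.asarray(lpc_q, dtype=float)
        k = np.asarray(refl, dtype=float)
        gr = gk * np.prod(1.0 - k ** 2)                     # reduced feedback gain (Eq. 33 reading)
        n_harm = (int(pitch) + 2) // 2 - 1                  # non-DC harmonics
        w = 2.0 * np.pi * np.arange(1, n_harm + 1) / pitch  # harmonic frequencies in rad/sample

        def all_pole(gain, omega):
            z = np.exp(-1j * np.outer(omega, np.arange(1, len(a) + 1)))
            return 1.0 / np.abs(1.0 - gain * (z @ a))

        mp = all_pole(gr, w)                                # |MP(w_j)| evaluated with GN = 1
        h = all_pole(1.0, w)                                # |H(w_j)| = |1/A(w_j)|
        x = np.asarray(speech_2pn, dtype=float)
        gn = np.sqrt(np.mean(x ** 2) / (0.5 * np.sum((mp * h) ** 2)))  # energy matching (Eq. 34 reading)
        return gn * mp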
The first of the alternative magnitude spectrum representation techniques is referred to below as the "Na amplitude system". The basic principle of this MG_j^n quantisation system is to represent accurately those MG_j^n values which correspond to the Na largest speech Short Term (ST) spectral envelope values. In particular, given the LPC coefficients of the nth coding frame, the ST magnitude spectrum envelope is calculated (i.e. sampled) at the harmonic frequencies ω_j^n and the locations lc(j), j=1,...,Na, of the largest Na spectral samples are determined. These locations indicate effectively which of the ⌈(Pn + 1)/2⌉ - 1 MG_j^n magnitudes are subjectively more important for accurate quantization. The system subsequently selects MG_j^n, j=lc(1),...,lc(Na), and Vector Quantizes these values. If the minimum pitch value is 17 samples, the number of non-DC MG_j^n amplitudes is equal to 8 and for this reason Na ≤ 8. Two variations of the "Na-amplitudes system" were developed with equivalent performance and their block diagrams are depicted in Figure 33 (a) and (b) respectively.
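The location-selection step of the Na-amplitudes system can be sketched as follows (Python; illustrative names). The LPC envelope is sampled at the harmonic frequencies and the indices of its Na largest samples give lc(1),...,lc(Na).

    import numpy as np

    def select_na_locations(lpc_q, pitch, na=8):
        a = np.asarray(lpc_q, dtype=float)
        n_harm = (int(pitch) + 2) // 2 - 1                 # number of non-DC harmonics
        w = 2.0 * np.pi * np.arange(1, n_harm + 1) / pitch
        z = np.exp(-1j * np.outer(w, np.arange(1, len(a) + 1)))
        envelope = 1.0 / np.abs(1.0 - z @ a)               # ST envelope |1/A| at the harmonics
        largest = np.argsort(envelope)[::-1][:min(na, n_harm)]
        return np.sort(largest + 1)                        # harmonic indices lc(j), 1-based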
i) Na-amplitudes system with Mean Normalization Factor. In this variation, a pitch segment of Pn residual samples Rn(i), centred about the middle Mn of the nth coding frame, is obtained and DFT transformed. The mean value of the spectral magnitudes MG_j^n, j=1,...,⌈(Pn + 1)/2⌉ - 1, is calculated as:

m = (1/(⌈(Pn + 1)/2⌉ - 1))·Σ_{j=1}^{⌈(Pn+1)/2⌉-1} MG_j^n   (35)

m is quantized and then used as the normalization factor of the Na selected amplitudes MG_j^n, j=lc(1),...,lc(Na). The resulting Na amplitudes are then vector quantized.
ii) Na-amplitudes system with RMS Normalization Factor. In this variation the RMS value of the pitch segment centred about the middle Mn of the nth coding frame is calculated as:

g = sqrt( (1/Pn)·Σ_{i=1}^{Pn} Rn²(i) )   (36)

g is quantized and then used as the normalization factor of the Na selected amplitudes MG_j^n, j=lc(1),...,lc(Na). These normalized amplitudes are then Vector Quantised. Notice that the Pn-point DFT operation can be avoided in this case, since the magnitude spectrum of the pitch segment is calculated only at the Na selected harmonic frequencies ω_j^n, j=lc(1),...,lc(Na).
In both cases the quantisation of the m and g factors, used to normalize the MG_j^n values, is performed using an adaptive μ-law quantiser with a non-linear characteristic:

c(x) = x_max·( ln(1 + μ·|x|/x_max) / ln(1 + μ) )·sgn(x),   with μ=255   (37)

This arrangement for the quantization of g or m extends the dynamic range of the coder to not less than 25 dB.
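A sketch of the companding arrangement assumed for Equation (37) (Python). The adaptation of x_max from frame to frame, which Figure 34 describes, is not shown; the uniform quantiser and bit allocation here are illustrative.

    import numpy as np

    def mu_law_compress(x, x_max, mu=255.0):
        return np.sign(x) * x_max * np.log1p(mu * np.abs(x) / x_max) / np.log1p(mu)

    def mu_law_expand(c, x_max, mu=255.0):
        return np.sign(c) * (x_max / mu) * np.expm1(np.abs(c) * np.log1p(mu) / x_max)

    def quantise_factor(value, x_max, bits=5):
        # Compand, quantise uniformly, expand; used for the m or g normalisation factor
        levels = 2 ** bits
        c = mu_law_compress(value, x_max)
        idx = int(round((levels - 1) * float(np.clip(c / x_max, 0.0, 1.0))))
        return float(mu_law_expand(idx * x_max / (levels - 1), x_max)), idx

    g_hat, code = quantise_factor(137.4, x_max=2000.0)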
At the receiver end the decoder recovers the MG_j^n magnitudes as MG_j^n = MG'_j^n × A, j=lc(1),...,lc(Na). The remaining ⌈(Pn + 1)/2⌉ - Na - 1 MG_j^n values are set to a constant value A (where A is either "m" or "g"). The block diagram of the adaptive μ-law quantiser is shown in Figure 34. The second of the alternative magnitude spectrum representation techniques is referred to below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding systems which employ the general synthesis formula of Equation (1) to recover speech encounter the problem of coding a variable length, pitch dependent spectral amplitude vector MG^n. The "Na-amplitudes" MG_j^n quantisation schemes described in Figure 33 avoid this problem by Vector Quantising the minimum expected number of spectral amplitudes and by setting the rest of the MG_j^n amplitudes to a fixed value. However, such a partially spectrally flat excitation model has limitations in providing high recovered speech quality. Thus, in order to improve the output speech quality, the shape of the entire {MG_j^n} magnitude spectrum should be quantised. Various techniques have been proposed for coding {MG_j^n}. Originally ADPCM has been used across the MG_j^n values associated with a specific coding frame. Also {MG_j^n} has been DCT transformed and coded differentially across successive MG_j^n magnitude spectra. However, these coding schemes are rather inefficient and operate with relatively high bit rates. The introduction of Vector Quantisation on the {MG_j^n} spectral amplitude vectors allowed for the development of Sinusoidal and Prototype Interpolation systems which operate at around 2.4 Kbits/sec. Two known {MG_j^n} VQ methods are described below which quantise a variable size (vs_n) input vector with a fixed size (fxs) codevector.
i) The first VQ method involves the transformation of the input vector to a fixed size vector followed by conventional Vector Quantisation. The inverse transformation on the quantised fixed size vector yields the recovered quantised MG^n vector. Transformation techniques which have been used include Linear Interpolation, Band Limited Interpolation, All Pole modelling and Non-Square transformation. However, the overall distortion produced by this approach is the summation of the VQ noise and a component which is introduced by the transformation process.
ii) The second VQ method achieves the direct quantisation of a variable size input vector with a fixed size codevector. This is based on selecting only vs_n elements from each codebook vector, to form a distortion measure between a codebook vector and an input MG^n vector. Such a quantisation approach avoids the transformation distortion of the techniques mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation noise.
An improved VQ method will now be described, which is referred to below as the Variable Size Spectral Vector Quantisation (VS/SVQ) scheme. This scheme was developed to take advantage of the underlying principle that the actual shape of the {MG_j^n} magnitude spectrum is defined by a minimum of ⌈(Pn + 1)/2⌉ equally spaced samples. If we consider the maximum expected pitch estimate Pmax, then any {MG_j^n} spectral shape can be represented adequately by ⌈(Pmax + 1)/2⌉ samples. This suggests that the fixed size fxs of the codebook vectors S^i representing the MG^n shapes should not be larger than ⌈(Pmax + 1)/2⌉. Of course this also implies that, given the ⌈(Pmax + 1)/2⌉ samples of a codebook vector, the complete spectral shape, defined at any frequency, is obtained via an interpolation process.
Figure 35 highlights the VS/SVQ process. The codebook CBS, having cbs fixed fxs-dimension vectors S_j^i, j=1,...,fxs and i=1,...,cbs, where fxs is ⌈(Pmax + 1)/2⌉, is used to quantise an input vector MG_j^n, j=1,...,vs_n, of dimension vs_n. Interpolation (in this case linear) is used on the S^i vectors to yield S̃^i vectors of dimension vs_n. The S^i to S̃^i interpolation process is given by:

S̃_j^i = S_u^i + (S_{u+1}^i - S_u^i)·(x_j - u),   x_j = 1 + (j - 1)·(fxs - 1)/(vs_n - 1),   u = ⌊x_j⌋   (38)

for i=1,...,cbs and j=1,...,vs_n. This process effectively defines S̃^i spectral shapes at the ω_j^n frequencies of the MG^n vector. A distortion measure D(S̃^i, MG^n) is then defined between the S̃^i and MG^n vectors, and the codebook vector S^I that yields the minimum distortion is selected and its index I is transmitted. Of course in the receiver Equation (38) is used to define the quantised MG^n vector from the received codevector S^I.
If we assume that Pmax = 120 then fxs = 60. However this value can be reduced to 50 without significant degradation by low pass filtering the signal synthesised from Equation (1). This is achieved by setting to zero all the harmonics MG_j^n in the region of 3.4 to 4.0 kHz, in which case:

vs_n = ⌊(3400/fs)·Pn⌋ if this value is less than 50, vs_n = 50 otherwise   (39)

and vs_n ≤ fxs.
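The VS/SVQ search of Figure 35 reduces to a few lines (Python). The interpolation follows the reading of Equation (38) given above; the weighting vector, if supplied, would be the perceptual weighting described further below.

    import numpy as np

    def stretch(codevector, vs_n):
        # Linear interpolation of a fixed-size (fxs) codevector to vs_n points (Eq. 38)
        fxs = len(codevector)
        return np.interp(np.linspace(0.0, fxs - 1.0, vs_n), np.arange(fxs), codevector)

    def vs_svq_encode(mg, codebook, weights=None):
        # Quantise a variable-size amplitude vector with fixed-size codevectors
        mg = np.asarray(mg, dtype=float)
        w = np.ones_like(mg) if weights is None else np.asarray(weights, dtype=float)
        errors = [np.sum(w * (mg - stretch(cv, len(mg))) ** 2) for cv in codebook]
        return int(np.argmin(errors))                      # index I is transmitted

    def vs_svq_decode(index, codebook, vs_n):
        return stretch(codebook[index], vs_n)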
Amplitude vectors obtained from adjacent residual frames exhibit significant redundancy, which can be removed by means of backward prediction. Prediction is performed on a harmonic basis, i.e. the amplitude value of each harmonic MG_j^n is predicted from the amplitude value of the same harmonic in previous frames, i.e. MG_j^{n-1}. A fixed linear predictor MG̃_j^n = b·MĜ_j^{n-1} may be incorporated in the VS/SVQ system, and the resulting DPCM structure is shown in Figure 36 (differential VS/SVQ, (DVS/SVQ)). In particular, error vectors are formed as the difference between the original spectral amplitudes MG_j^n and their predicted ones MG̃_j^n, i.e.: E_j^n = MG_j^n - MG̃_j^n for j=1,...,vs_n.
where the predicted spectral amplitudes MG̃_j^n are given as:

MG̃_j^n = b·MĜ_j^{n-1}   for 1 ≤ j ≤ vs_{n-1}   (40)

and

MG̃_j^n = (1/vs_{n-1})·Σ_{k=1}^{vs_{n-1}} MĜ_k^{n-1}   for vs_{n-1} < j ≤ vs_n   (41)

Furthermore the quantised spectral amplitudes MĜ_j^n are given as:

MĜ_j^n = MG̃_j^n + Ê_j^n   (42)

where Ê_j^n denotes the quantised error vector.
The quantisation of the E_j^n, 1 ≤ j ≤ vs_n, error vector incorporates Mean Removal and Gain Shape Quantisation techniques, using the hierarchical VQ structure of Figure 36.
A weighted Mean Square Error is used in the VS/SVQ stage of the system. The weighting function is defined as the frequency response of the filter W(z) = 1/A_n(z/γ), where A_n(z) is the short-term linear prediction filter and γ is a constant, defined as γ=0.93. Such a weighting function, which is proportional to the short-term envelope spectrum, results in substantially improved decoded speech quality. The weighting function W_j^n is normalised so that:

Σ_{j=1}^{vs_n} W_j^n = 1   (43)
The pdf of the mean value of E^n is very broad and, as a result, the mean value differs widely from one vector to another. This mean value can be regarded as statistically independent of the variation of the shape of the error vector E^n and thus can be quantised separately without paying a substantial penalty in compression efficiency. The mean value of an error vector is calculated as follows:

M = (1/vs_n)·Σ_{j=1}^{vs_n} E_j^n   (44)

M is Optimum Scalar Quantised to M̂ and is then removed from the original error vector to form Erm^n = (E^n - M̂). The overall quantization distortion is attributed to the quantization of the "Mean Removed" error vectors (Erm^n), which is performed by a Gain-Shape Vector Quantiser.
The objective of the Gain-Shape VQ process is to determine the gain value G and the shape vector S so as to minimise the distortion measure:

D(Erm^n, G×S) = Σ_{j=1}^{vs_n} W_j^n·(Erm_j^n - G·S_j)²   (45)
A gain optimised VQ search method, similar to techniques used in CELP systems, is employed to find the optimum G and S. The shape codebook (CBS) of vectors S is searched first to yield an index I which maximises the quantity:

Q(i) = ( Σ_{j=1}^{vs_n} W_j^n·Erm_j^n·S̃_j^i )² / Σ_{j=1}^{vs_n} W_j^n·(S̃_j^i)²   for i=1,...,cbs   (46)

where cbs is the number of codevectors in the CBS. The optimum gain value is defined as:

G = Σ_{j=1}^{vs_n} W_j^n·Erm_j^n·S̃_j^I / Σ_{j=1}^{vs_n} W_j^n·(S̃_j^I)²   (47)
and is Optimum Scalar Quantised to Ĝ.
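The gain-optimised search of Equations (46) and (47) in sketch form (Python; the shape vectors are assumed to have already been interpolated to vs_n points):

    import numpy as np

    def gain_shape_search(erm, shapes, w):
        # Pick the shape that maximises Q(i) (Eq. 46), then compute its optimum gain (Eq. 47)
        erm = np.asarray(erm, dtype=float)
        w = np.asarray(w, dtype=float)
        best_i, best_q = 0, -np.inf
        for i, s in enumerate(shapes):
            s = np.asarray(s, dtype=float)
            num, den = np.sum(w * erm * s) ** 2, np.sum(w * s * s)
            q = num / den if den > 0.0 else -np.inf
            if q > best_q:
                best_i, best_q = i, q
        s = np.asarray(shapes[best_i], dtype=float)
        gain = np.sum(w * erm * s) / np.sum(w * s * s)     # scalar quantised afterwards
        return best_i, gain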
During shape quantisation the principles of VS/SVQ are employed, in the sense that the S̃^i, vs_n size vectors are produced using Linear Interpolation on fxs size codevectors S^i. Both trained and randomly generated shape CBS codebooks were investigated. Although Erm^n has noise-like characteristics, systems using randomly generated shape codebooks resulted in unsatisfactory muffled decoded speech and were inferior to systems employing trained shape codebooks. A closed-loop joint predictor and VQ design process was employed to design the CBS codebook, the optimum scalar quantisers CBM and CBG of the mean M and gain G values respectively, and also to define the prediction coefficient b of Figure 36. In particular, the following steps take place in the design process.
STEP A0 (k=0). Given a training sequence of MG_j^n the predictor b^0 is calculated in an open loop fashion (i.e. MG̃_j^n = b×MG_j^{n-1} for 1 ≤ j ≤ ⌈(Pn + 1)/2⌉ when V_{n-1}=1, or MG̃_j^n = 0 elsewhere). Furthermore, the CBM^0 mean, CBG^0 gain and CBS^0 shape codebooks are designed independently and again in an open loop fashion using unquantized E^{n,0}. In particular:
a) Given a training sequence of error vectors E^{n,0}, the mean value of each E^{n,0} is calculated and used in the training process of an Optimum Scalar Quantiser (CBM^0).
b) Given a training sequence of error vectors E^{n,0} and the CBM^0 mean quantiser, the mean value of each error vector is calculated, quantised using the CBM^0 quantiser and removed from the original error vectors E^{n,0} to yield a sequence of "Mean Removed" training vectors Erm^{n,0}.
c) Given a training sequence of Erm^{n,0} vectors, each "Mean Removed" training vector is normalised to unit power (i.e. is divided by the factor G = sqrt( Σ_j W_j^n·(Erm_j^{n,0})² )), linearly interpolated to fxs points, and then used in the training process of a conventional Vector Quantiser of fxs dimension (CBS^0).
d) Given a training sequence of Erm^{n,0} vectors and the CBS^0 shape codebook, each "Mean Removed" training vector is encoded using Equations 46 and 47 and the value G of Equation 47 is used in the training process of an Optimum Scalar Quantiser (CBG^0). k is set to 1 (k=1).
STEP A1 Given a training sequence of MG_j^n and the mean, gain and shape codebooks of the previous k-1 iterations (i.e. CBM^{k-1}, CBG^{k-1}, CBS^{k-1}), the optimum prediction coefficient b^k is calculated.
STEP A2 Given a training sequence of MG_j^n, an optimum prediction coefficient b^k and CBM^{k-1}, CBG^{k-1}, CBS^{k-1}, a training sequence of error vectors E^{n,k} is formed, which is then used for the design of new mean, gain and shape codebooks (i.e. CBM^k, CBG^k, CBS^k).
STEP A3 The performance of the kth iteration quantization system (i.e. b^k, CBM^k, CBG^k, CBS^k) is evaluated and compared against the quantization system of the previous iteration (i.e. b^{k-1}, CBM^{k-1}, CBG^{k-1}, CBS^{k-1}). If the quantization distortion converges to a minimum, the quantization design process stops. Otherwise, k=k+1 and Steps A1, A2 and A3 are repeated.
The performance of each quantizer (i.e. b^k, CBM^k, CBG^k, CBS^k) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.
The design for the Mean-Shape-Gain Quantiser used in STEP A2 is performed using the following two steps :
STEP B1 Given a training sequence of error vectors E^{n,k}, the mean value of each E^{n,k} is calculated and used in the training process of an Optimum Scalar Quantiser (CBM^k).
STEP B2 Given a training sequence of error vectors E^{n,k} and the CBM^k mean quantizer, the mean value of each error vector is calculated, quantized and removed from the original error vectors E^{n,k} to yield a sequence of "Mean Removed" training vectors Erm^{n,k}, which are then used as the training data in the design of an optimum Gain Shape Quantizer (CBG^k and CBS^k). This involves Steps C1 - C4 below. (The quantization design process is performed under the assumption of an independent gain shape quantiser structure, i.e. an input error vector Erm^{n,k} can be represented by any possible combination of S^i codebook shape vectors and G gain quantizer levels.)
STEP C1 (v=0). Given a training sequence of vectors Erm^{n,k} and initial CBG^{k,0} and CBS^{k,0} gain and shape codebooks respectively, compute the overall average distortion D^{k,0} as in Equation (45). Set v equal to 1 (v=1).
STEP C2 Given a training sequence of vectors Erm^{n,k} and the CBG^{k,v-1} gain codebook from the previous iteration, compute the new shape codebook CBS^{k,v} which minimises the VQ distortion measure. Notice that the optimum CBS^{k,v} shape codebook is obtained when the distortion measure of Equation (45) is a minimum and this is achieved in M1^{k,v} iterations.
STEP C3 Given a training sequence of vectors Erm^{n,k} and the CBS^{k,v} shape codebook, compute a new gain quantiser CBG^{k,v} which minimises the distortion measure of Equation (45). This optimum CBG^{k,v} gain quantiser is obtained when the distortion measure of Equation (45) is a minimum and this is achieved in M2^{k,v} iterations.
STEP C4 Given a training sequence of vectors Erm^{n,k} and the shape and gain codebooks CBS^{k,v} and CBG^{k,v}, compute the average overall distortion measure. If (D^{k,v-1} - D^{k,v})/D^{k,v} < ε, stop. Otherwise, v=v+1 and go back to STEP C2.
The centroids S_u^{k,v,m}, i=1,...,cbs and u=1,...,fxs, of the shape codebook CBS^{k,v,m} are updated during the mth iteration performed in STEP C2 (m=1,...,M1^{k,v}) by minimising the distortion measure of Equation (45) over the cluster of training vectors quantised to each shape vector, with each vector's gain fixed at its CBG^{k,v-1} level; each centroid element is obtained as a ratio of accumulated weighted gain-error cross terms to accumulated weighted squared-gain terms, the linear-interpolation weights f_{u,j} mapping the vs_n-point error vectors onto the fxs centroid points (Equation 48).
Q_i denotes the cluster of Erm^{n,k} error vectors which are quantised to the S^{i,k,v,m-1} codebook shape vector, cbs represents the total number of shape quantisation levels, J_n represents the CBG^{k,v-1} gain codebook index which encodes the Erm^{n,k} error vector and 1 ≤ j ≤ vs_n.
The gain centroids G_i^{k,v,m}, i=1,...,cbg, of the CBG^{k,v,m} gain quantiser, which are computed during the mth iteration in STEP C3 (m=1,...,M2^{k,v}), are given as:

G_i^{k,v,m} = Σ_{n∈D_i} Σ_{j=1}^{vs_n} W_j^n·Erm_j^{n,k}·S̃_j^{I_n} / Σ_{n∈D_i} Σ_{j=1}^{vs_n} W_j^n·(S̃_j^{I_n})²   (49)
where D_i denotes the cluster of Erm^{n,k} error vectors which are quantised to the G_i^{k,v,m-1} gain quantiser level, cbg represents the total number of gain quantisation levels, I_n represents the CBS^{k,v} shape codebook index which encodes the Erm^{n,k} error vector and 1 ≤ j ≤ vs_n.
The above design process is applied to obtain the optimum shape codebook CBS, the optimum gain and mean quantizers CBG and CBM, and the optimum prediction coefficient b, which was finally set to b=0.35.
Process VII calculates the energy of the residual signal. The LPC analysis performed in Process II provides the prediction coefficients a_i, 1 ≤ i ≤ p, and the reflection coefficients k_i, 1 ≤ i ≤ p. On the other hand, the Voiced/Unvoiced classification performed in Process I provides the short term autocorrelation coefficient for zero delay of the speech signal (R0) for the frame under consideration. Hence, the energy E_n of the residual signal is given as:

E_n = R0·Π_{i=1}^{p} (1 - k_i²)   (50)
The above expression represents the minimum prediction error as it is obtained from the Linear Prediction process. However, because of quantization distortion the parameters of the LPC filter used in the coding-decoding process are slightly different from the ones that achieve minimum prediction error. Thus, Equation (50) gives a good approximation of the residual signal energy with low computational requirements. The accurate E_n value can be given as:

E_n = (1/M)·Σ_{i=1}^{M} R_n²(i)   (51)

where R_n(i) is the residual signal obtained with the quantised LPC coefficients and M is the frame length in samples.
The resulting √E_n is then Scalar Quantised using an adaptive μ-law quantiser arrangement similar to the one depicted in Figure 34. In the case where more than one √E_n value is used in the system, i.e. the energy E_n is calculated for a number of subframes, then E_{n,ξ} is given by the general equation:

E_{n,ξ} = (1/M_s)·Σ_{i=1}^{M_s} R_n²(i + ξ·M_s),   0 ≤ ξ ≤ Ξ-1   (52)
Notice that when Ξ=1, M_s=M and for Ξ=4, M_s=M/4.
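Process VII is a few lines of arithmetic (Python). The subframe split of Equation (52) is shown for Ξ=4; the averaging convention follows the readings of Equations (50)-(52) given above, and the names are illustrative.

    import numpy as np

    def residual_energy_from_lpc(r0, reflection):
        # Eq. (50): minimum prediction error from R0 and the reflection coefficients
        k = np.asarray(reflection, dtype=float)
        return r0 * np.prod(1.0 - k ** 2)

    def residual_subframe_energies(residual, n_sub=4):
        # Eq. (52)-style subframe energies; sqrt of each is mu-law scalar quantised
        r = np.asarray(residual, dtype=float)
        m_s = len(r) // n_sub
        return np.array([np.mean(r[s * m_s:(s + 1) * m_s] ** 2) for s in range(n_sub)])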

Claims

1. A speech synthesis system in which a speech signal is divided into a series
of frames, and each frame is converted into a coded signal including a
voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered
speech segment centred about a reference sample is defined in each frame, a
correlation value is calculated for each of a series of candidate pitch estimates as
the maximum of multiple crosscorrelation values obtained from variable length
speech segments centred about the reference sample, the correlation values are
used to form a correlation function defining peaks, and the locations of the
peaks are determined and used to define a pitch estimate.
2. A system according to claim 1, wherein the pitch estimate is defined using
an iterative process.
3. A system according to claim 1 or 2, wherein a single reference sample may
be used, centred with respect to the respective frame.
4. A system according to claim 1 or 2, wherein multiple pitch estimates are
derived for each frame using different reference samples, the multiple pitch
estimates being combined to define a combined pitch estimate for the frame.
5. A system according to any preceding claim, wherein the pitch estimate is
modified by reference to a voiced/unvoiced status and/or pitch estimates of
adjacent frames to define a final pitch estimate.
6. A system according to any preceding claim, wherein the correlation
function is clipped using a threshold value, remaining peaks being rejected if
they are adjacent to larger peaks.
7. A system according to claim 6, wherein peaks are selected which are
larger than either adjacent peak and peaks are rejected if they are smaller than a
following peak by more than a predetermined factor.
8. A system according to any preceding claim, wherein the pitch estimation
procedure is based on a least squares error algorithm.
9. A system according to claim 8, wherein the pitch estimation algorithm
defines the pitch value as a number whose multiples best fit the correlation
function peak locations.
10. A system according to any preceding claim, wherein possible pitch values
are limited to integral numbers which are not consecutive, the increment between two successive numbers being proportional to a constant multiplied by
the lower of those two numbers.
11. A speech synthesis system in which a speech signal is divided into a series
of frames, and each frame is converted into a coded signal including pitch
segment magnitude spectral information, a voiced/unvoiced classification, and a
mixed voiced classification which classifies harmonics in the magnitude spectrum
of voiced frames as strongly voiced or weakly voiced, wherein a series of samples
centred on the middle of the frame are windowed to form a data array which is
Fourier transformed to produce a magnitude spectrum, a threshold value is
calculated and used to clip the magnitude spectrum, the clipped data is searched
to define peaks, the locations of peaks are determined, constraints are applied to
define dominant peaks, and harmonics not associated with a dominant peak are
classified as weakly voiced.
12. A system according to claim 11, wherein peaks are located using a second
order polynomial.
13. A system according to claim 11 or 12, wherein the samples are Hamming
windowed.
14. A system according to claim 11, 12 or 13, wherein the threshold value is
calculated by identifying the maximum and minimum magnitude spectrum
values and defining the threshold as a constant multiplied by the difference
between the maximum and minimum values.
15. A system according to any one of claims 11 to 14, wherein peaks are
defined as those values which are greater than the two adjacent values, a peak
being rejected from consideration if neighbouring peaks are of a similar
magnitude or if there are spectral magnitudes in the same range of greater
magnitude.
16. A system according to any one of claims 11 to 15, wherein a harmonic is
considered as not being associated with a dominant peak if the difference
between two adjacent peaks is greater than a predetermined threshold value.
17. A system according to any one of claims 11 to 16, wherein the spectrum is
divided into bands of fixed width and a strongly/weakly voiced classification is
assigned for each band.
18. A system according to any one of claims 11 to 17, wherein the frequency
range is divided into two or more bands of variable width, adjacent bands being separated at a frequency selected by reference to the strongly/weakly voiced
classification of harmonics.
19. A system according to claim 17 or 18, wherein the lowest frequency band
is regarded as strongly voiced, whereas the highest frequency band is regarded
as weakly voiced.
20. A system according to claim 19, wherein, in the event that a current frame is
voiced, and the following frame is unvoiced, further bands within the current
frame will be automatically classified as weakly voiced.
21. A system according to claim 19 or 20, wherein the strongly/weakly voiced
classification is determined using a majority decision rule on the strongly/weakly
voiced classification of those harmonics which fall within the band in question.
22. A system according to claim 21, wherein, if there is no majority, alternate
bands are alternately assigned strongly voiced and weakly voiced classifications.
23. A speech synthesis system in which a speech signal is divided into a series
of frames, each frame is defined as voiced or unvoiced, each frame is converted
into a coded signal including a pitch period value, a frame voiced/unvoiced
classification and, for each voiced frame, a mixed voiced spectral band classification which classifies harmonics within spectral bands as either strongly
or weakly voiced, and the speech signal is reconstructed by generating an
excitation signal in respect of each frame and applying the excitation signal to a
filter, wherein for each weakly voiced spectral band, an excitation signal is
generated which includes a random component in the form of a function which is
dependent upon the respective pitch period value.
24. A system according to claim 23, wherein the spectrum is divided into
bands and a strongly/weakly voiced classification is assigned to each band.
25. A system according to claim 23 or 24, wherein the random component is
introduced by reducing the amplitude of harmonic oscillators assigned the
weakly voiced classification, disturbing the oscillator frequencies such that the
frequency is no longer a multiple of the fundamental frequency, and then adding
further random signals.
26. A system according to claim 25, wherein the phase of the oscillators is
randomised.
27. A speech synthesis system in which a speech signal is divided into a series
of frames, and each voiced frame is converted into a coded signal including a
pitch period value, LPC coefficients and pitch segment spectral magnitude information, wherein the spectral magnitude information is quantized by
sampling the LPC short term magnitude spectrum at harmonic frequencies, the
locations of the largest spectral samples are determined to identify which of the
magnitudes are relatively more important for accurate quantization, and the
magnitudes so identified are selected and vector quantized.
28. A system according to claim 27, wherein a pitch segment of Pn LPC
residual samples is obtained, where Pn is the pitch period value of the nth frame,
the pitch segment is DFT transformed, the mean value of the resultant spectral
magnitudes is calculated, the mean value is quantized and used as a
normalisation factor for the selected magnitudes, and the resulting normalised
amplitudes are quantized.
29. A system according to claim 27, wherein the RMS value of the pitch
segment is calculated, the RMS value is quantized and used as a normalisation
factor for the selected magnitudes, and the resulting normalised amplitudes are
quantized.
30. A system according to any one of claims 27 to 29, wherein, at the receiver,
the selected magnitudes are recovered, and each of the other magnitude values is
reproduced as a constant value.
31. A speech synthesis system in which a variable size input vector of
coefficients to be transmitted to a receiver for the reconstruction of a speech
signal is vector quantized using a codebook defined by vectors of fixed size, the
codebook vectors of fixed size are obtained from variable sized training vectors
and an interpolation technique which is an integral part of the codebook
generation process, codebook vectors are compared to the variable sized input
vector using the interpolation process, and an index associated with the codebook
entry with the smallest difference from the comparison is transmitted, the index
being used to address a further codebook at the receiver and thereby derive an
associated fixed size codebook vector, and the interpolation process being used to
recover from the derived fixed sized codebook vector an approximation of the
variable sized input vector.
32. A system according to claim 31, wherein the interpolation process is
linear, and for an input vector of given dimension, the interpolation process is
applied to produce from the codebook vectors a set of vectors of that given
dimension, a distortion measure is then derived to compare the interpolated set
of vectors and the input vector, and the codebook vector is selected which yields
the minimum distortion.
33. A system according to claim 32, wherein the dimension of the vectors is
reduced by taking into account only the harmonic amplitudes within an input
bandwidth range.
34. A system according to claim 33, wherein the remaining amplitudes are set
to a constant value.
35. A system according to claim 34, wherein the constant value is equal to the
mean value of the quantized amplitudes.
36. A system according to any one of claims 31 to 35, wherein redundancy
between amplitude vectors obtained from adjacent residual frames is removed
by means of backward prediction.
37. A system according to claim 36, wherein the backward prediction is
performed on a harmonic basis such that the amplitude value of each harmonic
of one frame is predicted from the amplitude value of the same harmonic in the
previous frame or frames.
38. A speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including an estimated
pitch period, an estimate of the energy of a speech segment, the duration of which is a function of the estimated pitch period, and LPC filter coefficients defining an
LPC spectral envelope, and a speech signal of related power to the power of the
input speech signal is reconstructed by generating an excitation signal using
spectral amplitudes which are defined from a modified LPC spectral envelope
sampled at harmonic frequencies defined by the pitch period.
39. A system according to claim 38, wherein the magnitude values are
obtained by spectrally sampling a modified LPC synthesis filter characteristic at
the harmonic locations related to the pitch period.
40. A system according to claim 39, wherein the modified LPC synthesis filter
has reduced feedback gain and a frequency response which consists of equalised
resonant peaks, the locations of which are close to the LPC synthesis resonant
locations.
41. A system according to claim 40, wherein the value of the feedback gain is
controlled by the performance of the LPC model such that it is related to the
normalised LPC prediction error.
42. A system according to any one of claims 38 to 41, wherein the energy of
the reproduced speech signal is equal to the energy of the original speech
waveform.
43. A speech synthesis system in which a speech signal is divided into a series
of frames, each frame is converted into a coded signal including LPC filter
coefficients and at least one parameter associated with a pitch segment
magnitude, and the speech signal is reconstructed by generating two excitation
signals in respect of each frame, each pair of excitation signals comprising a first
excitation signal generated on the basis of the pitch segment magnitude
parameter or parameters of one frame and a second excitation signal generated
on the basis of the pitch segment magnitude parameter or parameters of a
second frame which follows and is adjacent to the said one frame, applying the
first excitation signal to a first LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said one frame and applying the
second excitation signal to a second LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said second frame, and weighting
and combining the outputs of the first and second LPC filters to produce one
frame of a synthesised speech signal.
44. A system according to claim 43, wherein the first and second excitation
signals include the same phase function and different phase contributions from
the two LPC filters.
45. A system according to claim 44, wherein the outputs of the first and
second LPC filters are weighted by half a window function such that the
magnitude of the output of the first filter is decreasing with time and the
magnitude of the output of the second filter is increasing with time.
46. A speech coding system which operates on a frame by frame basis, and in
which information is transmitted which represents each frame as either voiced or
unvoiced and, for each voiced frame, represents that frame by a pitch period
value, quantized magnitude spectral information, and LPC filter coefficients, the
received pitch period value and magnitude spectral information being used to
generate residual signals at the receiver which are applied to LPC speech
synthesis filters the characteristics of which are determined by the transmitted
filter coefficients, wherein each residual signal is synthesised according to a
sinusoidal mixed excitation synthesis process, and a recovered speech signal is
derived from the residual signals.
47. A speech synthesis system substantially as hereinbefore described with reference to the accompanying drawings.
PCT/GB1997/001831 1996-07-05 1997-07-07 Speech synthesis system WO1998001848A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP97930643A EP0950238B1 (en) 1996-07-05 1997-07-07 Speech coding and decoding system
DE69724819T DE69724819D1 (en) 1996-07-05 1997-07-07 VOICE CODING AND DECODING SYSTEM
AT97930643T ATE249672T1 (en) 1996-07-05 1997-07-07 VOICE CODING AND DECODING SYSTEM
JP10504943A JP2000514207A (en) 1996-07-05 1997-07-07 Speech synthesis system
AU34523/97A AU3452397A (en) 1996-07-05 1997-07-07 Speech synthesis system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB9614209.6 1996-07-05
GBGB9614209.6A GB9614209D0 (en) 1996-07-05 1996-07-05 Speech synthesis system
US2181596P 1996-07-16 1996-07-16
US021,815 1996-07-16

Publications (1)

Publication Number Publication Date
WO1998001848A1 true WO1998001848A1 (en) 1998-01-15

Family

ID=26309651

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1997/001831 WO1998001848A1 (en) 1996-07-05 1997-07-07 Speech synthesis system

Country Status (7)

Country Link
EP (1) EP0950238B1 (en)
JP (1) JP2000514207A (en)
AT (1) ATE249672T1 (en)
AU (1) AU3452397A (en)
CA (1) CA2259374A1 (en)
DE (1) DE69724819D1 (en)
WO (1) WO1998001848A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2357683A (en) * 1999-12-24 2001-06-27 Nokia Mobile Phones Ltd Voiced/unvoiced determination for speech coding
GB2398981A (en) * 2003-02-27 2004-09-01 Motorola Inc Speech communication unit and method for synthesising speech therein
CN114519996A (en) * 2022-04-20 2022-05-20 北京远鉴信息技术有限公司 Method, device and equipment for determining voice synthesis type and storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2784218B1 (en) * 1998-10-06 2000-12-08 Thomson Csf LOW-SPEED SPEECH CODING METHOD
DE102004007191B3 (en) 2004-02-13 2005-09-01 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio coding
DE102004007184B3 (en) 2004-02-13 2005-09-22 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method and apparatus for quantizing an information signal
DE102004007200B3 (en) 2004-02-13 2005-08-11 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Device for audio encoding has device for using filter to obtain scaled, filtered audio value, device for quantizing it to obtain block of quantized, scaled, filtered audio values and device for including information in coded signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0490740A1 (en) * 1990-12-11 1992-06-17 Thomson-Csf Method and apparatus for pitch period determination of the speech signal in very low bitrate vocoders
EP0703565A2 (en) * 1994-09-21 1996-03-27 International Business Machines Corporation Speech synthesis method and system
WO1996027870A1 (en) * 1995-03-07 1996-09-12 British Telecommunications Public Limited Company Speech synthesis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0490740A1 (en) * 1990-12-11 1992-06-17 Thomson-Csf Method and apparatus for pitch period determination of the speech signal in very low bitrate vocoders
EP0703565A2 (en) * 1994-09-21 1996-03-27 International Business Machines Corporation Speech synthesis method and system
WO1996027870A1 (en) * 1995-03-07 1996-09-12 British Telecommunications Public Limited Company Speech synthesis

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2357683A (en) * 1999-12-24 2001-06-27 Nokia Mobile Phones Ltd Voiced/unvoiced determination for speech coding
US6915257B2 (en) 1999-12-24 2005-07-05 Nokia Mobile Phones Limited Method and apparatus for speech coding with voiced/unvoiced determination
GB2398981A (en) * 2003-02-27 2004-09-01 Motorola Inc Speech communication unit and method for synthesising speech therein
GB2398981B (en) * 2003-02-27 2005-09-14 Motorola Inc Speech communication unit and method for synthesising speech therein
CN114519996A (en) * 2022-04-20 2022-05-20 北京远鉴信息技术有限公司 Method, device and equipment for determining voice synthesis type and storage medium

Also Published As

Publication number Publication date
CA2259374A1 (en) 1998-01-15
DE69724819D1 (en) 2003-10-16
AU3452397A (en) 1998-02-02
ATE249672T1 (en) 2003-09-15
EP0950238A1 (en) 1999-10-20
EP0950238B1 (en) 2003-09-10
JP2000514207A (en) 2000-10-24

Similar Documents

Publication Publication Date Title
EP1576585B1 (en) Method and device for robust predictive vector quantization of linear prediction parameters in variable bit rate speech coding
EP3039676B1 (en) Adaptive bandwidth extension and apparatus for the same
RU2389085C2 (en) Method and device for introducing low-frequency emphasis when compressing sound based on acelp/tcx
KR101604774B1 (en) Multi-reference lpc filter quantization and inverse quantization device and method
US6871176B2 (en) Phase excited linear prediction encoder
US7039581B1 (en) Hybrid speed coding and system
US7222070B1 (en) Hybrid speech coding and system
EP1989703A1 (en) Apparatus and method for encoding and decoding signal
US7139700B1 (en) Hybrid speech coding and system
EP0950238B1 (en) Speech coding and decoding system
Gottesman et al. Enhanced waveform interpolative coding at 4 kbps
Champion et al. High-order allpole modelling of the spectral envelope
Jamrozik et al. Modified multiband excitation model at 2400 bps
US20050065786A1 (en) Hybrid speech coding and system
Drygajilo Speech Coding Techniques and Standards
Villette Sinusoidal speech coding for low and very low bit rate applications
So et al. Multi-frame GMM-based block quantisation of line spectral frequencies
Bhaskar et al. Low bit-rate voice compression based on frequency domain interpolative techniques
CA2511516C (en) Method and device for robust predictive vector quantization of linear prediction parameters in variable bit rate speech coding
Lee et al. An Efficient Segment-Based Speech Compression Technique for Hand-Held TTS Systems
Papanastasiou LPC-Based Pitch Synchronous Interpolation Speech Coding
Yang et al. A 5.4 kbps speech coder based on multi-band excitation and linear predictive coding
Zhang Speech transform coding using ranked vector quantization
WO2001009880A1 (en) Multimode vselp speech coder
Lupini Harmonic coding of speech at low bit rates

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH KE LS MW SD SZ UG ZW AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase

Ref document number: 2259374

Country of ref document: CA

Ref country code: CA

Ref document number: 2259374

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 1997930643

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 09214308

Country of ref document: US

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP Wipo information: published in national office

Ref document number: 1997930643

Country of ref document: EP

WWG Wipo information: grant in national office

Ref document number: 1997930643

Country of ref document: EP