WO2004025626A1 - Phoneme to speech converter - Google Patents

Phoneme to speech converter

Info

Publication number
WO2004025626A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
formant
noise
pitch
level
Prior art date
Application number
PCT/AU2003/001098
Other languages
French (fr)
Inventor
Leslie Doherty
Original Assignee
Leslie Doherty
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Leslie Doherty filed Critical Leslie Doherty
Priority to AU2003254398A priority Critical patent/AU2003254398A1/en
Publication of WO2004025626A1 publication Critical patent/WO2004025626A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A method of generating speech waveforms from phoneme data is disclosed. The synthesizer generates speech directly from component waveforms. Structured waveforms with frequency and amplitude selected by input data are used to output phoneme signals that are perceived as voiced speech. Unvoiced phonemes are generated from stored data. Voiced affricates and voiced plosives are generated from a combination of structured waveforms and stored data. A method of converting phonetic symbols with pitch, duration and amplitude levels into phoneme data is also provided, enabling modulated signals with intonations resembling normal speech to be generated.

Description

PHONEME TO SPEECH CONVERTER
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to speech synthesizers, and more particularly to architectures of speech synthesisers and methods of producing speech from numerical and symbolic data.
2. Description of the Related Art
Speech synthesis is the computer generation of sound that resembles human speech.
Speech signals are divided into small sound units called phonemes. Phonemes have prosody characteristics such as pitch, amplitude and duration as well as special distinctive characteristics that enable them to be identified and distinguished as elements in the communication process in a similar manner to the letters used in written communications such as this document. Speech synthesizers recreate speech by generating concatenated phonemes from a series of numerical codes representing a speech signal. There are several techniques for implementing this process. One method of speech synthesis is the use of coded binary numbers representing excitation and filter parameters derived from the analysis of speech. Speech parameters, stored as binary numbers, are used to regenerate speech by excitation of a time-varying digital filter. A processor supplies overall control of speech production. The process of speech production is typically a digital process up to the point of a digital-to-analogue converter, which supplies an analogue signal to drive a loudspeaker.
In an alternative embodiment, the vocal tract is simulated by a dozen or so connected pipes of different diameter and excitation is represented by a pulse stream at the vocal-chord rate for voiced sounds or a random noise source for unvoiced parts of speech. The reflection coefficients at the junctions of the pipes are obtained from a linear prediction analysis of the speech waveform.
An alternative to the time-varying filter approach is a speech generation system that stores speech as digitised phonetic segments, usually in a compressed form, and regenerates speech by concatenating segments of the stored speech. The segments may be phrases such as "at the next roundabout", "at the end of the road", "turn", "left", "right", which can be put together in several alternative ways to give instructions such as "at the end of the road turn left" or "at the next roundabout turn left". Such synthesisers can use shorter speech elements such as syllables, diphones and monophones to generate words; for example "shi", "sli", "ni", "im" and "ip" can be put together to form "shim", "slim", "ship", "slip" and "nip". Some attempts have been made to generate speech by simply adding together modulated sine waves; these methods use frequency analysis of real speech to determine the major frequency components and manipulate phase and modulation parameters to reproduce speech. These methods can sometimes produce understandable speech, but the quality of the speech is generally poor. The nearest published patent specification to this one is Kagoshima 2002/0138253, which uses repetitive waveforms in a similar way to this invention. There are a number of differences between the 2002/0138253 specification and this invention: the 2002/0138253 specification has no requirement to reset the waveform generators at the start of each pitch period; it requires the waveform generator output to be modified by filters or by multiplication by a second characteristic waveform; and it gives no data for the generation of a voiced phoneme language set and no information relating to the implementation of diphthongs and glides.
SUMMARY OF THE INVENTION
Although vowels and structured consonants appear to consist of simple sinusoids, they are difficult to analyse by formal mathematical methods and produce complex spectra that are as difficult to interpret as the original waveform. This difficulty in analysis has led to confusion in understanding. Further, many implementations have used the Fourier method of waveform construction, adding together a fundamental frequency and an appropriate selection of harmonics, but, because the pitch is not a simple sub-harmonic of the vowel formants, an inordinate number of frequencies are required to generate a simple sinusoidal formant. However, the observation and analysis of speech waveforms leads to the understanding that a much simpler method can be used to reproduce voiced speech sounds using simple sinusoids.
Each identifiable voiced phoneme has between one and three characteristic formant frequencies. The first formant frequency lies within the range of 150 to 1400 Hz. The second formant, if present, lies between 700 and 1200 Hz, while the third formant, present in the |i| and |e| vowel sounds, is between 1200 and 5000 Hz. However, simply generating and adding these frequencies together will not produce anything other than music; adding a sine wave of the pitch frequency will only add to the cacophony and will not help. How to change this "music" into speech is the subject of this invention.
Figure 1 shows part of the "i" phoneme sampled at 25,000 Hz by a 10-bit ADC; zero is represented by a level of 512. Although there are small changes in the waveform with each repetition, it can be seen that the same waveform is repeated several times at an interval of 146 samples. The time of each repetition is the pitch period, and by dividing the sampling frequency by the number of samples in the pitch period the pitch frequency can be calculated; in the case shown in Figure 1 the pitch frequency is 171 Hz. Examination of the waveform in Figure 1 over the length of a single pitch period shows that there are two major frequency components that can be determined by measuring the number of samples between waveform peaks. The lower frequency spans 56 samples and the higher frequency component varies from 8 to 11 samples according to its position. The characteristic formant frequencies are therefore 446 Hz for the lower formant and between 2273 Hz and 3125 Hz for the higher formant.
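For concreteness, the arithmetic described above can be written out directly. This is a minimal sketch using only the sample counts quoted in the text; the constants are those stated for Figure 1:

```python
# Pitch and formant frequencies recovered from sample counts, as described above.
SAMPLE_RATE = 25_000  # Hz, the stated sampling rate of Figure 1

pitch_hz = SAMPLE_RATE / 146            # 146 samples per repetition -> ~171 Hz

low_formant_hz = SAMPLE_RATE / 56       # 56 samples between peaks  -> ~446 Hz
high_formant_min_hz = SAMPLE_RATE / 11  # 11 samples                -> ~2273 Hz
high_formant_max_hz = SAMPLE_RATE / 8   # 8 samples                 -> 3125 Hz
```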
To synthesize the "i" phoneme, all that is necessary is to add together the characteristic component frequencies of 446 Hz and 2273 Hz and then repeatedly output a length of the composite waveform as dictated by the pitch period for the duration of the phoneme. Figure 2 shows the synthesised waveform using this process. The amplitude of the components in this example is 200 for the lower frequency and 60 for the higher frequency. It may be noticed that over the pitch period of the phoneme segments in both Figure 1 and Figure 2, the amplitude of the waveform reduces gradually with time. This reduction in amplitude over the pitch period is called "damping". Damping is applied to the synthesised waveform by multiplying the generated formant waves by a damping waveform that gradually reduces in amplitude over the pitch period.
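As a software illustration of this process (a sketch, not the patent's hardware embodiment), the following builds one pitch period from the component sinusoids, applies a damping profile, and tiles the damped period for the phoneme duration. The linear damping profile and its floor value are assumptions, since the text says only that the amplitude reduces gradually over the period:

```python
import numpy as np

SAMPLE_RATE = 25_000  # Hz, matching Figure 1

def synth_voiced(formants, pitch_hz, duration_s, damping_floor=0.3):
    """Sum the formant sinusoids over one pitch period, damp the period,
    then repeat it for the phoneme duration."""
    period = int(SAMPLE_RATE / pitch_hz)
    t = np.arange(period) / SAMPLE_RATE
    wave = sum(a * np.sin(2 * np.pi * f * t) for f, a in formants)
    damping = np.linspace(1.0, damping_floor, period)  # assumed profile
    return np.tile(wave * damping, int(duration_s * pitch_hz))

# The "i" phoneme values quoted above: 446 Hz at amplitude 200,
# 2273 Hz at amplitude 60, pitch 171 Hz.
i_wave = synth_voiced([(446, 200), (2273, 60)], pitch_hz=171, duration_s=0.2)
```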
Figure 3 and Figure 4 present typical settings for the English phonemes. These settings may be adjusted over a fairly wide range to accommodate a variety of accents and personal characteristics and can be worked out by trial and error. The first column is the phoneme symbol used to access the data. The second column contains words with examples of the phoneme's pronunciation. Columns 3 to 8 contain alternating frequency and amplitude formant setting data for up to 3 formants; the frequencies and amplitudes nominated are typical, and variation about these values adds personality and lifelike variation to the synthesizer. Column 9 is the default duration in milliseconds, column 10 is a number representing the extent of randomness, and column 11 is the phoneme type indicator.
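In software, the tables of Figures 3 and 4 map naturally onto a keyed record per phoneme. The sketch below shows the column layout described above; the numeric values are placeholders only, since the actual figures are not reproduced here:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PhonemeEntry:
    """One row of the Figure 3/4 tables: up to three (frequency, amplitude)
    formant pairs, a default duration, a randomness figure and a type."""
    example_word: str
    formants: Tuple[Optional[Tuple[int, int]], ...]  # columns 3-8
    duration_ms: int                                 # column 9
    randomness: int                                  # column 10
    ptype: str                                       # column 11

# Placeholder rows only; the real settings come from Figures 3 and 4.
PHONEMES = {
    "i": PhonemeEntry("machine", ((446, 200), (2273, 60), None), 120, 2, "voiced"),
    "s": PhonemeEntry("see", (None, None, None), 100, 0, "noise"),
}
```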
Figure 4 shows data for phonemes that contain noise. Where no first formant is present the phoneme contains only noise. Noise phonemes can be obtained from any white noise source, such as a random number generator, by filtering to the required bandwidth and centre frequency. Alternatively, pre-recorded segments of noise phonemes may be used; these may be generated or real speech samples. To generate phonemes with formants and noise, the noise is simply added to the synthesised voiced waveform at the level indicated in column 8 (a3) of Figure 4. Many voiced affricates have a great deal of variation in pitch and formants; the randomness figure in column 10 is increased for these phonemes to induce this variation as the waveform is being generated.
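A minimal sketch of the filtered-white-noise option described above, band-passing a random-number-generator source around the required centre frequency; the Butterworth design and filter order are assumptions, as the text does not name a filter type:

```python
import numpy as np
from scipy.signal import butter, lfilter

def noise_phoneme(centre_hz, bandwidth_hz, duration_s, level, fs=25_000):
    """Band-limit a white noise source to the required bandwidth and centre
    frequency, then scale to the a3 level from the table."""
    samples = np.random.uniform(-1.0, 1.0, int(duration_s * fs))
    lo = max(centre_hz - bandwidth_hz / 2, 1.0)
    hi = min(centre_hz + bandwidth_hz / 2, fs / 2 - 1.0)
    b, a = butter(4, [lo, hi], btype="band", fs=fs)  # assumed filter choice
    return level * lfilter(b, a, samples)
```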
Diphthongs are complex vowel sounds and glides are complex combinations of consonants and vowels. The word "patent" has a diphthong as the first vowel. This particular diphthong starts with the "e" phoneme and ends with the "i" phoneme. No diphthongs or glides are listed in Figure 3 or Figure 4 because they may be generated using the data of the phonemes already included. To generate a diphthong or a glide, interpolation is used over the duration of the phoneme to generate the appropriate waveform. This interpolation applies to the first formant settings, and to higher formants if they exist in both component phonemes, as in the case of "ei". In the case where one of the diphthong component phonemes contains a second or third formant and the other component does not, the second or third formant is switched in or out over the duration of the diphthong.
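The interpolation described above can be sketched as a per-pitch-period glide between the two component phonemes' formant settings; the linear interpolation law is an assumption, and the "e" values here are placeholders:

```python
def diphthong_track(start, end, n_periods):
    """Linearly interpolate (frequency, amplitude) formant pairs from the
    starting phoneme's settings to the ending phoneme's, one step per
    pitch period of the diphthong."""
    track = []
    for k in range(n_periods):
        frac = k / max(n_periods - 1, 1)
        track.append([(f0 + frac * (f1 - f0), a0 + frac * (a1 - a0))
                      for (f0, a0), (f1, a1) in zip(start, end)])
    return track

# "ei": glide from assumed "e" settings to the "i" settings quoted earlier.
track = diphthong_track([(600, 180), (1900, 70)], [(446, 200), (2273, 60)], 20)
```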
Input to this type of synthesizer may be direct pitch and formant parameters for voiced speech, with type and related amplitude and duration for speech with noise. Quiet intervals between phonemes can be simulated with a blank type phoneme that produces no output. This type of input is usually derived from the analysis of real speech and produces a close approximation to the original speech. Alternatively, phonetic symbols may be used to generate speech from look up tables based on data similar to that in Figures 3 and 4, using either default values or input parameters for pitch, level and duration. Text-to-speech converters that convert words and phrases to compatible phonetic symbols with pitch, level and duration information can also be incorporated into the design.
BRIEF DESCRIPTION OF THE DRAWINGS
The following drawings have been referred to in the foregoing summary of the invention:
Figure 1 is a sample of recorded speech;
Figure 2 is a sample of synthesized speech;
Figure 3 is a table of phonemes that do not contain noise and
Figure 4 is a table of phonemes that contain noise.
The following drawings are provided to clarify the detailed description of the preferred embodiment of the invention:
Figure 5 is a block schematic illustrating the preferred hardware embodiment of the invention;
Figure 6 is a schematic block diagram of a formant generator;
Figure 7 is a block schematic of a noise formant generator;
Figure 8 is the overall level and damping control schematic block diagram;
Figure 9 is a schematic diagram of the mixer and output circuit and
Figure 10 is a block schematic of the clock generator and controller.
Figure 11 is a flow chart of the control sequence required to operate the synthesizer.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The architecture of a speech synthesizer according to this invention is shown in Figure 5. A controller 1 is used to organise and translate data received through an input port 2. The said controller 1 coordinates three formant generators 3, 4 and 5 as well as a noise generator 6. The outputs of 3, 4, 5 and 6 are combined and amplified by the mixer 7 before being output to the loudspeaker. Timing of the synthesizer originates from the clock generator 9. A data buss 10 is used to carry parameters from the controller 1 to all the generators 3, 4, 5 and 6 and the level control circuit 11, which is a means of adjusting the overall output signal level as well as generating the damping waveform.
There are three identical formant generators 3, 4 and 5; Figure 6 and this description apply to all three. Frequency selection 12 is a register and counter that controls the clock rate of the formant generator. Data on the buss 10 is latched into the register when the set frequency control line 13 is logical zero. The high-speed clock 14 is divided by the number in the register, producing a pulse that clocks the address counter 15 through connection 16. At the end of every pitch period the address counter is reset to zero when the reset connection 17 is logical zero. The address counter 15 outputs are connected to the wave look up table 18 inputs via an address buss 19. The wave look up table 18 contains the formant wave shape, which may be a sine wave, in digital form; this is converted into analogue form by the multiplying digital to analogue converter (DAC) 20. The wave look up table 18 is connected to the DAC 20 through connector 21. The formant amplitude is controlled through the amplitude memory register 22, which is set from the data buss 10 when the set amplitude 23 is logical zero. A multiplying DAC 24 converts the digital output 25 of the amplitude memory 22 into a voltage. This voltage is governed by the input damping 26 to the DAC as well as the amplitude digital output 25. The voltage is connected to the reference voltage input of the wave generator multiplying DAC 20 through connector 27 and controls the amplitude of the formant output 28.
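A behavioural software model of this circuit (a sketch under stated assumptions, not the hardware itself) can make the structure easier to follow: the frequency register divides the high-speed clock, the address counter steps through the wave look up table, and the two multiplying DACs become multiplications by the amplitude and damping values. The 256-entry table size is an assumption:

```python
import numpy as np

class FormantGenerator:
    """Behavioural model of Figure 6; reference numerals from the text are
    noted in comments, and table size and scaling are assumptions."""
    TABLE_SIZE = 256

    def __init__(self):
        # Wave look up table 18: one cycle of the formant wave shape.
        self.table = np.sin(2 * np.pi * np.arange(self.TABLE_SIZE) / self.TABLE_SIZE)
        self.divider = 1       # frequency selection register 12
        self.amplitude = 0.0   # amplitude memory register 22
        self.address = 0       # address counter 15
        self.subcount = 0

    def reset(self):
        # Reset connection 17: the counter returns to zero each pitch period.
        self.address = 0
        self.subcount = 0

    def tick(self, damping):
        # One high-speed clock 14: divide by the register, step the table,
        # then scale by amplitude and damping (multiplying DACs 24 and 20).
        self.subcount += 1
        if self.subcount >= self.divider:
            self.subcount = 0
            self.address = (self.address + 1) % self.TABLE_SIZE
        return self.table[self.address] * self.amplitude * damping
```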
Figure 7 is a block schematic of the noise generator 6. In the preferred embodiment, ten different types of speech noise are digitised into 1024 byte segments and stored in a noise look up table 29; any one of the speech noise types, or no noise, may be set into the noise select register 30 from the data buss 10 when the set noise type control 31 is logical zero. The address counter 32 divides the clock 33 and the more significant bits of the address counter 32 are input to the sequence mixer 34 through connectors 35. The sequence mixer 34 randomises the address so that sequences longer than 1024 samples can be output from the noise look up table 29. The address select inputs of the noise look up table 29 are controlled by the noise select register 30 through connector 36, the output of the sequence mixer 34 through connector 37a and the lesser significant bits of the address counter 32 through connector 37b. The combined address inputs select the noise data output 38 of the noise look up table 29. If longer noise segments are used, the address counter 32 may be connected directly to the noise look up table 29 without the necessity of a sequence mixer 34. The noise level is controlled through the noise level memory register 39, which is set from the data buss 10 when the set noise level 40 is logical zero. A multiplying DAC 41 converts the digital output 42 of the noise level memory 39 into a voltage. This voltage is governed by the input level 43 to the DAC as well as the noise level digital output 42. The voltage is connected to the reference voltage input of the noise generator multiplying DAC 44 through connector 45 and controls the amplitude of the noise output 46. The digital noise output 38 is converted into an analogue signal by the noise generator multiplying DAC 44.
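The noise generator can be modelled in the same behavioural style; the XOR used for the sequence mixer below is an assumed realisation, since the text says only that the mixer randomises the address:

```python
import numpy as np

class NoiseGenerator:
    """Behavioural model of Figure 7: ten 1024-sample noise segments,
    addressed by a counter whose upper bits are mixed into the lower bits
    so the output sequence exceeds 1024 samples before repeating."""
    SEGMENT = 1024

    def __init__(self, n_types=10, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.uniform(-1.0, 1.0, (n_types, self.SEGMENT))  # table 29
        self.noise_type = 0   # noise select register 30
        self.level = 0.0      # noise level memory register 39
        self.counter = 0      # address counter 32

    def tick(self):
        # Clock 33 steps the counter; the sequence mixer 34 (assumed XOR)
        # combines upper and lower counter bits to form the table address.
        self.counter += 1
        upper = (self.counter >> 10) & (self.SEGMENT - 1)
        lower = self.counter & (self.SEGMENT - 1)
        return self.table[self.noise_type, lower ^ upper] * self.level
```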
The mixer circuit 7 is shown in Figure 9. Formant inputs 28(3), 28(4) and 28(5) from the three formant generators 3, 4 and 5 are combined with the noise input 46 through a resistor network at the input of amplifier 57. The combined signal output from the amplifier 57 is fed to a loudspeaker 8.
The overall amplitude level and damping are controlled by the level control 11 illustrated by Figure 8. The level memory 47 is set to the parameter on the data buss 10 when the set level control 48 is a logical zero. A multiplying DAC 49 converts the value 50 in the level memory 47 to an analogue voltage, level 43, which is a proportion of the reference voltage 51. The damping output, damping 26, is generated by the damping look up table 52, which stores the damping profile as a series of digital levels. The address counter 53 is reset at the beginning of each pitch period by the reset control 17. The clock input 33 causes the address counter 53, through connection 54, to select each word of the damping look up table 52 in turn. The multiplying DAC 55 converts the digital output 56 of the damping look up table 52 into an analogue voltage, damping 26.
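In the same behavioural style, the damping path reduces to a stored profile stepped once per pitch period; the linear fade shape is an assumption, as the text says only that the table stores the profile as a series of digital levels:

```python
import numpy as np

# Damping look up table 52: an assumed linear fade over the pitch period.
DAMPING_TABLE = np.linspace(1.0, 0.3, 256)

def damping_value(sample_in_period, period_samples):
    """Map the current position within the pitch period onto the table, as
    the address counter 53 does when clocked by 33 and reset by 17."""
    index = min(sample_in_period * len(DAMPING_TABLE) // period_samples,
                len(DAMPING_TABLE) - 1)
    return DAMPING_TABLE[index]
```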
The overall functioning of the synthesiser is controlled for simplicity of implementation by a micro-controller 1. The operation of the controller requires a high-speed clock 14 provided by the clock generator 58. The high-speed clock pulse 14 is divided down to the lower frequency sample rate clock 33 by the clock divider 59. Outputs from the controller 1 are the data buss 10, through which all parameters are passed to the formant, noise and level circuits, and control lines, which are used for setting the parameters into particular registers. The control lines are normally at a logical one level but drop to a logical zero level when the particular register is selected; when the control lines return to a logical one level the data from the buss 10 is latched into the register.
The control process is indicated in Figure 11, which presents a flow chart of the operation. On a timer interrupt, the reset control 17 is set to a logical zero. Next the overall level and pitch period are obtained from input data 2 along with formant and noise parameters. The pitch period depends on the frequency of the clock 33. Normally the pitch frequency is higher than 60 Hz, and to calculate the pitch period all that needs to be done is to divide the frequency of the clock 33 by the pitch frequency. If the pitch frequency is zero, the duration of the phoneme is used as the pitch period. A useful method for enabling the synthesizer to sing is to select a pitch period based on musical notes. This can be implemented by selecting a pitch period from a table containing the sequence of pitch periods related to musical notes indexed by numbers from 1 to 60. When the pitch period has been obtained the timer interrupt is set to interrupt at the end of the pitch period. Following the set up of the timer, the output segment routine is called and each formant set frequency 13 and set amplitude 23 control line is set to logical zero for a short time in turn while the data is output on the data buss 10. After the formant parameters have been set into the formant generators 3, 4 and 5, the set noise type 31 and set noise level 40 are set to logical zero for a short time in turn as the noise type and noise level are output on the data buss 10. Following this process, the overall level is output on the data buss 10 and the set level 48 is set to logical zero for a short time, and then the reset is released by the interrupt routine, which then returns to the main program. The main function of the controller 1 may be simply a program to set up the interrupt for the first time and then input phoneme data on a pitch-by-pitch basis; this is useful for a vocoder application when speech data has been analysed and the input data to the synthesizer is in the form of pitch, level, noise and formant parameters. Alternatively, the controller may input only phoneme symbols, in which case the symbols have to be translated into duration, pitch, level, noise and formant parameters. Phoneme symbols may be supplemented by prosody data, in which case the default duration, pitch and level parameters would be replaced by the input data. A further option would be to include a text-to-phoneme converter in the controller. Such converters use pronunciation dictionaries and prosody models to convert the input text to phonetic symbols and prosody data. The phonetic symbols and prosody data can then be used to generate speech.
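The pitch-period arithmetic and the singing table described above are simple to state in code; the lowest note and equal-tempered tuning in this sketch are assumptions, as the text gives only the 1-to-60 index range:

```python
CLOCK_HZ = 25_000  # assumed frequency of the sample rate clock 33

def pitch_period_samples(pitch_hz, phoneme_duration_s):
    """Divide the clock frequency by the pitch frequency; a pitch of zero
    means the whole phoneme duration serves as the 'period'."""
    if pitch_hz == 0:
        return int(phoneme_duration_s * CLOCK_HZ)
    return int(CLOCK_HZ / pitch_hz)

# Singing: pitch periods for musical notes indexed 1 to 60, assuming
# equal-tempered semitones upward from C2 (~65.4 Hz).
NOTE_PERIODS = {n: int(CLOCK_HZ / (65.4 * 2 ** ((n - 1) / 12)))
                for n in range(1, 61)}
```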
Alternative implementations include:
1. Overall level control may be omitted or applied after combining formants and noise
2. Damping control may be omitted or applied after combining formants
3. The use of gain controlled amplifiers instead of multiplying DACs
4. The use of digital arithmetic for setting levels instead of multiplying DACs
5. The use of digital addition of formants and noise instead of analogue summing
6. Formant generators may be replaced by re-settable variable frequency analogue or digital sine wave generators
7. White noise generators or exciters with bandwidth and frequency control may replace noise generators.
8. The controller may change the wave, noise and damping look up table data
9. Application specific integrated circuit design with or without an embedded controller may be used to implement the synthesizer
10. Field programmable logic arrays or electrically programmable logic devices may be used to implement the digital circuits
11. Implementation by computer software.

Claims

CLAIMS
What is claimed is:
1. A speech synthesis apparatus comprising: one or more formant wave generators that are reset synchronously with the pitch period, and a noise generator.
2. An apparatus according to claim 1, wherein: damping is applied to the formant waveforms.
3. An apparatus according to claim 1 or 2, wherein: overall level control is applied to the synthesizer.
4. An apparatus according to claim 1, 2 or 3, wherein: the formant wave and noise generators are controlled by a micro-controller.
5. An apparatus according to claim 4, wherein: stored parameters are used to control the formant wave and noise generators.
6. An apparatus according to claim 5, wherein: stored prosody parameters are used to control pitch period, level and duration.
7. An apparatus according to claim 5 or 6, wherein: a dictionary, a context list and an heuristic rules list in a speech reference database are used to generate formant and noise parameters, pitch, level and duration.
8. An apparatus according to claim 5 or 6, wherein: a dictionary, a context list and an heuristic rules list in a speech reference database are used to generate phoneme symbols, pitch, level and duration.
9. An apparatus according to claim 8, wherein: the dictionary includes stored sampled words and phonics and an encoding designating the pronunciation of the words and phonics; and a stored context list.
10. A computer system comprising: a processor; a memory coupled to the processor; and program code executable on the processor for generating speech sounds by one or more formant wave generators that are reset synchronously with the pitch period, and a noise generator.
11. A computer system according to claim 10, wherein: damping is applied to the formant waveforms.
12. A computer system according to claim 10 or 11, wherein: overall level control is applied to the synthesizer.
13. A computer system according to claim 10, 11 or 12, wherein: the formant wave and noise generators are controlled by a micro-controller.
14. A computer system according to claim 13, wherein: stored parameters are used to control the formant wave and noise generators.
15. A computer system according to claim 14, wherein: stored prosody parameters are used to control pitch period, level and duration.
16. A computer system according to claim 14 or 15, wherein: a dictionary, a context list and an heuristic rules list in a speech reference database are used to generate formant and noise parameters, pitch, level and duration.
17. A computer system according to claim 14 or 15, wherein: a dictionary, a context list and an heuristic rules list in a speech reference database are used to generate phoneme symbols, pitch, level and duration.
18. A computer system according to claim 17, wherein: the dictionary includes stored sampled words and phonics and an encoding designating the pronunciation of the words and phonics; and a stored context list.
19. A telephone system comprising: a telephone; a controller coupled to the telephone; a speech synthesis apparatus according to claim 1, 2 or 3, wherein the formant wave and noise generators are controlled by the said controller.
20. A communication apparatus comprising: an interface for connecting to a communication system; and a speech apparatus according to any of the claims 1 to 18 coupled to the interface.
21. A communication apparatus according to claim 20 wherein the interface communicates with a modem.
22. A speech synthesis apparatus that includes an apparatus for generating specific preset pitch periods such as musical tones from a set of encoded pitch input values.
PCT/AU2003/001098 2002-09-10 2003-08-28 Phoneme to speech converter WO2004025626A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003254398A AU2003254398A1 (en) 2002-09-10 2003-08-28 Phoneme to speech converter

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US40955302P 2002-09-10 2002-09-10
US60/409,553 2002-09-10

Publications (1)

Publication Number Publication Date
WO2004025626A1 true WO2004025626A1 (en) 2004-03-25

Family ID=31993976

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2003/001098 WO2004025626A1 (en) 2002-09-10 2003-08-28 Phoneme to speech converter

Country Status (2)

Country Link
AU (1) AU2003254398A1 (en)
WO (1) WO2004025626A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5884253A (en) * 1992-04-09 1999-03-16 Lucent Technologies, Inc. Prototype waveform speech coding with interpolation of pitch, pitch-period waveforms, and synthesis filter
US5727125A (en) * 1994-12-05 1998-03-10 Motorola, Inc. Method and apparatus for synthesis of speech excitation waveforms
US5970440A (en) * 1995-11-22 1999-10-19 U.S. Philips Corporation Method and device for short-time Fourier-converting and resynthesizing a speech signal, used as a vehicle for manipulating duration or pitch
US6240384B1 (en) * 1995-12-04 2001-05-29 Kabushiki Kaisha Toshiba Speech synthesis method
US6332121B1 (en) * 1995-12-04 2001-12-18 Kabushiki Kaisha Toshiba Speech synthesis method
EP1246163A2 (en) * 2001-03-26 2002-10-02 Kabushiki Kaisha Toshiba Speech synthesis method and speech synthesizer

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2935212A1 (en) * 2008-08-19 2010-02-26 Sagem Defense Securite Data signal i.e. voice signal, transmission method for telephonic network in e.g. hotel, involves decoding voice signal, at level of receiver, by extracting structural component and comparing component with look-up table to retrieve data

Also Published As

Publication number Publication date
AU2003254398A1 (en) 2004-04-30

Similar Documents

Publication Publication Date Title
US4692941A (en) Real-time text-to-speech conversion system
US4624012A (en) Method and apparatus for converting voice characteristics of synthesized speech
KR940002854B1 (en) Sound synthesizing system
Bonada et al. Synthesis of the singing voice by performance sampling and spectral models
US4398059A (en) Speech producing system
US5915237A (en) Representing speech using MIDI
EP0059880A2 (en) Text-to-speech synthesis system
JP2564641B2 (en) Speech synthesizer
US6829577B1 (en) Generating non-stationary additive noise for addition to synthesized speech
Lerner Computers: Products that talk: Speech-synthesis devices are being incorporated into dozens of products as difficult technical problems are solved
WO2004025626A1 (en) Phoneme to speech converter
d’Alessandro et al. The speech conductor: gestural control of speech synthesis
O'Shaughnessy Design of a real-time French text-to-speech system
Peterson et al. Objectives and techniques of speech synthesis
JP2008058379A (en) Speech synthesis system and filter device
Lukaszewicz et al. Microphonemic method of speech synthesis
Quarmby et al. Implementation of a parallel-formant speech synthesiser using a single-chip programmable signal processor
Santos et al. Text-to-speech conversion in Spanish a complete rule-based synthesis system
JP3081300B2 (en) Residual driven speech synthesizer
JP3994333B2 (en) Speech dictionary creation device, speech dictionary creation method, and program
KR970003093B1 (en) Synthesis unit drawing-up method for high quality korean text to speech transformation
Datta et al. Epoch Synchronous Overlap Add (ESOLA)
KR100202539B1 (en) Voice synthetic method
Muralishankar et al. Human touch to Tamil speech synthesizer
JP4305022B2 (en) Data creation device, program, and tone synthesis device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CN JP SE SG TR UA US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP