WO2004025626A1 - Phoneme to speech converter - Google Patents

Phoneme to speech converter

Info

Publication number
WO2004025626A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
formant
noise
pitch
level
Prior art date
Application number
PCT/AU2003/001098
Other languages
French (fr)
Inventor
Leslie Doherty
Original Assignee
Leslie Doherty
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Leslie Doherty filed Critical Leslie Doherty
Priority to AU2003254398A priority Critical patent/AU2003254398A1/en
Publication of WO2004025626A1 publication Critical patent/WO2004025626A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A method of generating speech waveforms from phoneme data is disclosed. The synthesizer generates speech directly from component waveforms. Structured waveforms with frequency and amplitude selected by input data are used to output phoneme signals that are perceived as voiced speech. Unvoiced phonemes are generated from stored data. Voiced affricates and voiced plosives are generated from a combination of structured waveforms and stored data. A method of converting phonetic symbols with pitch, duration and amplitude levels into phoneme data is also provided, enabling modulated signals with intonations resembling normal speech to be generated.

Description

PHONEME TO SPEECH CONVERTER
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to speech synthesizers, and more particularly to architectures of speech synthesisers and methods of producing speech from numerical and symbolic data.
2. Description of the Related Art
Speech synthesis is the computer generation of sound that resembles human speech.
Speech signals are divided into small sound units called phonemes. Phonemes have prosody characteristics such as pitch, amplitude and duration as well as special distinctive characteristics that enable them to be identified and distinguished as elements in the communication process in a similar manner to the letters used in written communications such as this document. Speech synthesizers recreate speech by generating concatenated phonemes from a series of numerical codes representing a speech signal. There are several techniques for implementing this process. One method of speech synthesis is the use of coded binary numbers representing excitation and filter parameters derived from the analysis of speech. Speech parameters, stored as binary numbers, are used to regenerate speech by excitation of a time-varying digital filter. A processor supplies overall control of speech production. The process of speech production is typically a digital process up to the point of a digital-to-analogue converter, which supplies an analogue signal to drive a loudspeaker.
In an alternative embodiment, the vocal tract is simulated by a dozen or so connected pipes of different diameter and excitation is represented by a pulse stream at the vocal-chord rate for voiced sounds or a random noise source for unvoiced parts of speech. The reflection coefficients at the junctions of the pipes are obtained from a linear prediction analysis of the speech waveform.
An alternative to the time-varying filter approach is a speech generation system that stores speech as digitised phonetic segments, usually in a compressed form, and regenerates speech by concatenating segments of the stored speech. The segments may be phrases such as "at the next roundabout", "at the end of the road", "turn", "left", "right", which can be put together in several alternative ways to give instructions such as "at the end of the road turn left" or "at the next roundabout turn left". Such synthesisers can use shorter speech elements such as syllables, diphones and monophones to generate words; for example "shi", "sli", "ni", "im" and "ip" can be put together to form "shim", "slim", "ship", "slip" and "nip". Some attempts have been made to generate speech by simply adding together modulated sine waves; these methods use frequency analysis of real speech to determine the major frequency components and manipulate phase and modulation parameters to reproduce speech. These methods can sometimes produce understandable speech, but the quality of the speech is generally poor. The nearest published patent specification to this one is Kagoshima 2002/0138253, which uses repetitive waveforms in a similar way to this invention. There are a number of differences between the 2002/0138253 specification and this invention: the 2002/0138253 specification has no requirement to reset the waveform generators at the start of each pitch period; it requires the waveform generator output to be modified by filters or by multiplication by a second characteristic waveform; and it gives no data for the generation of a voiced phoneme language set and no information relating to the implementation of diphthongs and glides.
SUMMARY OF THE INVENTION
Although vowels and structured consonants appear to consist of simple sinusoids, they are difficult to analyse by formal mathematical methods and produce complex spectra that are as difficult to interpret as the original waveform. This difficulty in analysis has led to confusion in understanding. Further, many implementations have used the Fourier method of waveform construction, adding together a fundamental frequency and an appropriate selection of harmonics, but, because the pitch is not a simple sub-harmonic of the vowel formants, an inordinate number of frequencies are required to generate a simple sinusoidal formant. However, the observation and analysis of speech waveforms leads to the understanding that a much simpler method can be used to reproduce voiced speech sounds using simple sinusoids.
Each identifiable voiced phoneme has between one and three characteristic formant frequencies. The first formant frequency lies within the range of 150 to 1400 Hz. The second formant, if present, lies between 700 and 1200 Hz, while the third formant, present in the |i| and |e| vowel sounds, is between 1200 and 5000 Hz. However, simply generating and adding these frequencies together will not produce anything other than music; adding a sine wave of the pitch frequency will only add to the cacophony and will not help. How to change this "music" into speech is the subject of this invention.
Figure 1 shows part of the "i" phoneme sampled at 25,000 Hz by a 10-bit ADC; zero is represented by a level of 512. Although there are small changes in the waveform with each repetition, it can be seen that the same waveform is repeated several times at an interval of 146 samples. The time of each repetition is the pitch period, and by dividing the sampling frequency by the number of samples in the pitch period the pitch frequency can be calculated; in the case shown in Figure 1 the pitch frequency is 171 Hz. Examination of the waveform in Figure 1 over the length of a single pitch period shows that there are two major frequency components that can be determined by measuring the number of samples between waveform peaks. The lower frequency spans 56 samples and the higher frequency component varies from 8 to 11 samples according to its position. The characteristic formant frequencies are therefore 446 Hz for the lower formant and between 2273 Hz and 3125 Hz for the higher formant.
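For concreteness, the arithmetic described above can be written out directly. This is a minimal sketch using only the sample counts quoted in the text; the constants are those stated for Figure 1:

```python
# Pitch and formant frequencies recovered from sample counts, as described above.
SAMPLE_RATE = 25_000  # Hz, the stated sampling rate of Figure 1

pitch_hz = SAMPLE_RATE / 146            # 146 samples per repetition -> ~171 Hz

low_formant_hz = SAMPLE_RATE / 56       # 56 samples between peaks  -> ~446 Hz
high_formant_min_hz = SAMPLE_RATE / 11  # 11 samples                -> ~2273 Hz
high_formant_max_hz = SAMPLE_RATE / 8   # 8 samples                 -> 3125 Hz
```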
To synthesize the "i" phoneme, all that is necessary is to add together the characteristic component frequencies of 446 Hz and 2273 Hz and then repeatedly output a length of the composite waveform as dictated by the pitch period for the duration of the phoneme. Figure 2 shows the synthesised waveform using this process. The amplitude of the components in this example is 200 for the lower frequency and 60 for the higher frequency. It may be noticed that over the pitch period of the phoneme segments in both Figure 1 and Figure 2, the amplitude of the waveform reduces gradually with time. This reduction in amplitude over the pitch period is called "damping". Damping is applied to the synthesised waveform by multiplying the generated formant waves by a damping waveform that gradually reduces in amplitude over the pitch period.
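As a software illustration of this process (a sketch, not the patent's hardware embodiment), the following builds one pitch period from the component sinusoids, applies a damping profile, and tiles the damped period for the phoneme duration. The linear damping profile and its floor value are assumptions, since the text says only that the amplitude reduces gradually over the period:

```python
import numpy as np

SAMPLE_RATE = 25_000  # Hz, matching Figure 1

def synth_voiced(formants, pitch_hz, duration_s, damping_floor=0.3):
    """Sum the formant sinusoids over one pitch period, damp the period,
    then repeat it for the phoneme duration."""
    period = int(SAMPLE_RATE / pitch_hz)
    t = np.arange(period) / SAMPLE_RATE
    wave = sum(a * np.sin(2 * np.pi * f * t) for f, a in formants)
    damping = np.linspace(1.0, damping_floor, period)  # assumed profile
    return np.tile(wave * damping, int(duration_s * pitch_hz))

# The "i" phoneme values quoted above: 446 Hz at amplitude 200,
# 2273 Hz at amplitude 60, pitch 171 Hz.
i_wave = synth_voiced([(446, 200), (2273, 60)], pitch_hz=171, duration_s=0.2)
```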
Figure 3 and Figure 4 present typical settings for the English phonemes. These settings may be adjusted over a fairly wide range to accommodate a variety of accents and personal characteristics and can be worked out by trial and error. The first column is the phoneme symbol used to access the data. The second column contains words with examples of the phoneme's pronunciation. Columns 3 to 8 contain alternating frequency and amplitude formant setting data for up to 3 formants; the frequencies and amplitudes nominated are typical, and variation about these values adds personality and lifelike variation to the synthesizer. Column 9 is the default duration in milliseconds, column 10 is a number representing the extent of randomness, and column 11 is the phoneme type indicator.
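In software, the tables of Figures 3 and 4 map naturally onto a keyed record per phoneme. The sketch below shows the column layout described above; the numeric values are placeholders only, since the actual figures are not reproduced here:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PhonemeEntry:
    """One row of the Figure 3/4 tables: up to three (frequency, amplitude)
    formant pairs, a default duration, a randomness figure and a type."""
    example_word: str
    formants: Tuple[Optional[Tuple[int, int]], ...]  # columns 3-8
    duration_ms: int                                 # column 9
    randomness: int                                  # column 10
    ptype: str                                       # column 11

# Placeholder rows only; the real settings come from Figures 3 and 4.
PHONEMES = {
    "i": PhonemeEntry("machine", ((446, 200), (2273, 60), None), 120, 2, "voiced"),
    "s": PhonemeEntry("see", (None, None, None), 100, 0, "noise"),
}
```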
Figure 4 shows data for phonemes that contain noise. Where no first formant is present the phoneme contains only noise. Noise phonemes can be obtained from any white noise source, such as a random number generator, by filtering to the required bandwidth and centre frequency. Alternatively, pre-recorded segments of noise phonemes may be used; these may be generated or real speech samples. To generate phonemes with formants and noise, the noise is simply added to the synthesised voiced waveform at the level indicated in column 8 (a3) of Figure 4. Many voiced affricates have a great deal of variation in pitch and formants; the randomness figure in column 10 is increased for these phonemes to induce this variation as the waveform is being generated.
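A minimal sketch of the filtered-white-noise option described above, band-passing a random-number-generator source around the required centre frequency; the Butterworth design and filter order are assumptions, as the text does not name a filter type:

```python
import numpy as np
from scipy.signal import butter, lfilter

def noise_phoneme(centre_hz, bandwidth_hz, duration_s, level, fs=25_000):
    """Band-limit a white noise source to the required bandwidth and centre
    frequency, then scale to the a3 level from the table."""
    samples = np.random.uniform(-1.0, 1.0, int(duration_s * fs))
    lo = max(centre_hz - bandwidth_hz / 2, 1.0)
    hi = min(centre_hz + bandwidth_hz / 2, fs / 2 - 1.0)
    b, a = butter(4, [lo, hi], btype="band", fs=fs)  # assumed filter choice
    return level * lfilter(b, a, samples)
```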
Diphthongs are complex vowel sounds and glides are complex combinations of consonants and vowels. The word "patent" has a diphthong as the first vowel. This particular diphthong starts with the "e" phoneme and ends with the "i" phoneme. No diphthongs or glides are listed in Figure 3 or Figure 4 because they may be generated using the data of the phonemes already included. To generate a diphthong or a glide, interpolation is used over the duration of the phoneme to generate the appropriate waveform. This interpolation applies to the first formant settings, and to higher formants if they exist in both component phonemes, as in the case of "ei". In the case where one of the diphthong component phonemes contains a second or third formant and the other component does not, the second or third formant is switched in or out over the duration of the diphthong.
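The interpolation described above can be sketched as a per-pitch-period glide between the two component phonemes' formant settings; the linear interpolation law is an assumption, and the "e" values here are placeholders:

```python
def diphthong_track(start, end, n_periods):
    """Linearly interpolate (frequency, amplitude) formant pairs from the
    starting phoneme's settings to the ending phoneme's, one step per
    pitch period of the diphthong."""
    track = []
    for k in range(n_periods):
        frac = k / max(n_periods - 1, 1)
        track.append([(f0 + frac * (f1 - f0), a0 + frac * (a1 - a0))
                      for (f0, a0), (f1, a1) in zip(start, end)])
    return track

# "ei": glide from assumed "e" settings to the "i" settings quoted earlier.
track = diphthong_track([(600, 180), (1900, 70)], [(446, 200), (2273, 60)], 20)
```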
Input to this type of synthesizer may be direct pitch and formant parameters for voiced speech, with type and related amplitude and duration for speech with noise. Quiet intervals between phonemes can be simulated with a blank type phoneme that produces no output. This type of input is usually derived from the analysis of real speech and produces a close approximation to the original speech. Alternatively, phonetic symbols may be used to generate speech from look up tables based on data similar to that in Figures 3 and 4, using either default values or input parameters for pitch, level and duration. Text-to-speech converters that convert words and phrases to compatible phonetic symbols with pitch, level and duration information can also be incorporated into the design.
BRIEF DESCRIPTION OF THE DRAWINGS
The following drawings have been referred to in the foregoing summary of the invention:
Figure 1 is a sample of recorded speech;
Figure 2 is a sample of synthesized speech;
Figure 3 is a table of phonemes that do not contain noise and
Figure 4 is a table of phonemes that contain noise.
The following drawings are provided to clarify the detailed description of the preferred embodiment of the invention:
Figure 5 is a block schematic illustrating the preferred hardware embodiment of the invention;
Figure 6 is a schematic block diagram of a formant generator;
Figure 7 is a block schematic of a noise formant generator;
Figure 8 is the overall level and damping control schematic block diagram;
Figure 9 is a schematic diagram of the mixer and output circuit and
Figure 10 is a block schematic of the clock generator and controller.
Figure 11 is a flow chart of the control sequence required to operate the synthesizer.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The architecture of a speech synthesizer according to this invention is shown in Figure 5. A controller 1 is used to organise and translate data received through an input port 2. The said controller 1 coordinates three formant generators 3, 4 and 5 as well as a noise generator 6. The outputs of 3, 4, 5 and 6 are combined and amplified by the mixer 7 before being output to the loudspeaker. Timing of the synthesizer originates from the clock generator 9. A data buss 10 is used to carry parameters from the controller 1 to all the generators 3, 4, 5 and 6 and the level control circuit 11, which is a means of adjusting the overall output signal level as well as generating the damping waveform.
There are three identical formant generators 3, 4 and 5; Figure 6 and this description apply to all three. Frequency selection 12 is a register and counter that controls the clock rate of the formant generator. Data on the buss 10 is latched into the register when the set frequency control line 13 is logical zero. The high-speed clock 14 is divided by the number in the register, producing a pulse that clocks the address counter 15 through connection 16. At the end of every pitch period the address counter is reset to zero when the reset connection 17 is logical zero. The address counter 15 outputs are connected to the wave look up table 18 inputs via an address buss 19. The wave look up table 18 contains the formant wave shape, which may be a sine wave, in digital form; this is converted into analogue form by the multiplying digital to analogue converter (DAC) 20. The wave look up table 18 is connected to the DAC 20 through connector 21. The formant amplitude is controlled through the amplitude memory register 22, which is set from the data buss 10 when the set amplitude 23 is logical zero. A multiplying DAC 24 converts the digital output 25 of the amplitude memory 22 into a voltage. This voltage is governed by the input damping 26 to the DAC as well as the amplitude digital output 25. The voltage is connected to the reference voltage input of the wave generator multiplying DAC 20 through connector 27 and controls the amplitude of the formant output 28.
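A behavioural software model of this circuit (a sketch under stated assumptions, not the hardware itself) can make the structure easier to follow: the frequency register divides the high-speed clock, the address counter steps through the wave look up table, and the two multiplying DACs become multiplications by the amplitude and damping values. The 256-entry table size is an assumption:

```python
import numpy as np

class FormantGenerator:
    """Behavioural model of Figure 6; reference numerals from the text are
    noted in comments, and table size and scaling are assumptions."""
    TABLE_SIZE = 256

    def __init__(self):
        # Wave look up table 18: one cycle of the formant wave shape.
        self.table = np.sin(2 * np.pi * np.arange(self.TABLE_SIZE) / self.TABLE_SIZE)
        self.divider = 1       # frequency selection register 12
        self.amplitude = 0.0   # amplitude memory register 22
        self.address = 0       # address counter 15
        self.subcount = 0

    def reset(self):
        # Reset connection 17: the counter returns to zero each pitch period.
        self.address = 0
        self.subcount = 0

    def tick(self, damping):
        # One high-speed clock 14: divide by the register, step the table,
        # then scale by amplitude and damping (multiplying DACs 24 and 20).
        self.subcount += 1
        if self.subcount >= self.divider:
            self.subcount = 0
            self.address = (self.address + 1) % self.TABLE_SIZE
        return self.table[self.address] * self.amplitude * damping
```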
Figure 7 is a block schematic of the noise generator 6. In the preferred embodiment, ten different types of speech noise are digitised into 1024 byte segments and stored in a noise look up table 29; any one of the speech noise types, or no noise, may be set into the noise select register 30 from the data buss 10 when the set noise type control 31 is logical zero. The address counter 32 divides the clock 33 and the more significant bits of the address counter 32 are input to the sequence mixer 34 through connectors 35. The sequence mixer 34 randomises the address so that sequences longer than 1024 samples can be output from the noise look up table 29. The address select inputs of the noise look up table 29 are controlled by the noise select register 30 through connector 36, the output of the sequence mixer 34 through connector 37a and the lesser significant bits of the address counter 32 through connector 37b. The combined address inputs select the noise data output 38 of the noise look up table 29. If longer noise segments are used, the address counter 32 may be connected directly to the noise look up table 29 without the necessity of a sequence mixer 34. The noise level is controlled through the noise level memory register 39, which is set from the data buss 10 when the set noise level 40 is logical zero. A multiplying DAC 41 converts the digital output 42 of the noise level memory 39 into a voltage. This voltage is governed by the input level 43 to the DAC as well as the noise level digital output 42. The voltage is connected to the reference voltage input of the noise generator multiplying DAC 44 through connector 45 and controls the amplitude of the noise output 46. The digital noise output 38 is converted into an analogue signal by the noise generator multiplying DAC 44.
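The noise generator can be modelled in the same behavioural style; the XOR used for the sequence mixer below is an assumed realisation, since the text says only that the mixer randomises the address:

```python
import numpy as np

class NoiseGenerator:
    """Behavioural model of Figure 7: ten 1024-sample noise segments,
    addressed by a counter whose upper bits are mixed into the lower bits
    so the output sequence exceeds 1024 samples before repeating."""
    SEGMENT = 1024

    def __init__(self, n_types=10, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.uniform(-1.0, 1.0, (n_types, self.SEGMENT))  # table 29
        self.noise_type = 0   # noise select register 30
        self.level = 0.0      # noise level memory register 39
        self.counter = 0      # address counter 32

    def tick(self):
        # Clock 33 steps the counter; the sequence mixer 34 (assumed XOR)
        # combines upper and lower counter bits to form the table address.
        self.counter += 1
        upper = (self.counter >> 10) & (self.SEGMENT - 1)
        lower = self.counter & (self.SEGMENT - 1)
        return self.table[self.noise_type, lower ^ upper] * self.level
```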
The mixer circuit 7 is shown in Figure 9. Formant inputs 28(3), 28(4) and 28(5) from the three formant generators 3, 4 and 5 are combined with the noise input 46 through a resistor network at the input of amplifier 57. The combined signal output from the amplifier 57 is fed to a loudspeaker 8.
The overall amplitude level and damping are controlled by the level control 11 illustrated by Figure 8. The level memory 47 is set to the parameter on the data buss 10 when the set level control 48 is a logical zero. A multiplying DAC 49 converts the value 50 in the level memory 47 to an analogue voltage, level 43, which is a proportion of the reference voltage 51. The damping output, damping 26, is generated by the damping look up table 52, which stores the damping profile as a series of digital levels. The address counter 53 is reset at the beginning of each pitch period by the reset control 17. The clock input 33 causes the address counter 53, through connection 54, to select each word of the damping look up table 52 in turn. The multiplying DAC 55 converts the digital output 56 of the damping look up table 52 into an analogue voltage, damping 26.
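In the same behavioural style, the damping path reduces to a stored profile stepped once per pitch period; the linear fade shape is an assumption, as the text says only that the table stores the profile as a series of digital levels:

```python
import numpy as np

# Damping look up table 52: an assumed linear fade over the pitch period.
DAMPING_TABLE = np.linspace(1.0, 0.3, 256)

def damping_value(sample_in_period, period_samples):
    """Map the current position within the pitch period onto the table, as
    the address counter 53 does when clocked by 33 and reset by 17."""
    index = min(sample_in_period * len(DAMPING_TABLE) // period_samples,
                len(DAMPING_TABLE) - 1)
    return DAMPING_TABLE[index]
```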
The overall functioning of the synthesiser is controlled for simplicity of implementation by a micro-controller 1. The operation of the controller requires a high-speed clock 14 provided by the clock generator 58. The high-speed clock pulse 14 is divided down to the lower frequency sample rate clock 33 by the clock divider 59. Outputs from the controller 1 are the data buss 10, through which all parameters are passed to the formant, noise and level circuits, and control lines, which are used for setting the parameters into particular registers. The control lines are normally at a logical one level but drop to a logical zero level when the particular register is selected; when the control lines return to a logical one level the data from the buss 10 is latched into the register.
The control process is indicated in Figure 11, which presents a flow chart of the operation. On a timer interrupt, the reset control 17 is set to a logical zero. Next the overall level and pitch period are obtained from input data 2 along with formant and noise parameters. The pitch period depends on the frequency of the clock 33. Normally the pitch frequency is higher than 60 Hz, and to calculate the pitch period all that needs to be done is to divide the frequency of the clock 33 by the pitch frequency. If the pitch frequency is zero, the duration of the phoneme is used as the pitch period. A useful method for enabling the synthesizer to sing is to select a pitch period based on musical notes. This can be implemented by selecting a pitch period from a table containing the sequence of pitch periods related to musical notes indexed by numbers from 1 to 60. When the pitch period has been obtained the timer interrupt is set to interrupt at the end of the pitch period. Following the set up of the timer, the output segment routine is called and each formant set frequency 13 and set amplitude 23 control line is set to logical zero for a short time in turn while the data is output on the data buss 10. After the formant parameters have been set into the formant generators 3, 4 and 5, the set noise type 31 and set noise level 40 are set to logical zero for a short time in turn as the noise type and noise level are output on the data buss 10. Following this process, the overall level is output on the data buss 10 and the set level 48 is set to logical zero for a short time, and then the reset is released by the interrupt routine, which then returns to the main program. The main function of the controller 1 may be simply a program to set up the interrupt for the first time and then input phoneme data on a pitch-by-pitch basis; this is useful for a vocoder application when speech data has been analysed and the input data to the synthesizer is in the form of pitch, level, noise and formant parameters. Alternatively, the controller may input only phoneme symbols, in which case the symbols have to be translated into duration, pitch, level, noise and formant parameters. Phoneme symbols may be supplemented by prosody data, in which case the default duration, pitch and level parameters would be replaced by the input data. A further option would be to include a text-to-phoneme converter in the controller. Such converters use pronunciation dictionaries and prosody models to convert the input text to phonetic symbols and prosody data. The phonetic symbols and prosody data can then be used to generate speech.
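The pitch-period arithmetic and the singing table described above are simple to state in code; the lowest note and equal-tempered tuning in this sketch are assumptions, as the text gives only the 1-to-60 index range:

```python
CLOCK_HZ = 25_000  # assumed frequency of the sample rate clock 33

def pitch_period_samples(pitch_hz, phoneme_duration_s):
    """Divide the clock frequency by the pitch frequency; a pitch of zero
    means the whole phoneme duration serves as the 'period'."""
    if pitch_hz == 0:
        return int(phoneme_duration_s * CLOCK_HZ)
    return int(CLOCK_HZ / pitch_hz)

# Singing: pitch periods for musical notes indexed 1 to 60, assuming
# equal-tempered semitones upward from C2 (~65.4 Hz).
NOTE_PERIODS = {n: int(CLOCK_HZ / (65.4 * 2 ** ((n - 1) / 12)))
                for n in range(1, 61)}
```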
Alternative implementations include:
1. Overall level control may be omitted or applied after combining formants and noise
2. Damping control may be omitted or applied after combining formants
3. The use of gain controlled amplifiers instead of multiplying DACs
4. The use of digital arithmetic for setting levels instead of multiplying DACs
5. The use of digital addition of formants and noise instead of analogue summing
6. Formant generators may be replaced by re-settable variable frequency analogue or digital sine wave generators
7. White noise generators or exciters with bandwidth and frequency control may replace noise generators.
8. The controller may change the wave, noise and damping look up table data
9. Application specific integrated circuit design with or without an embedded controller may be used to implement the synthesizer
10. Field programmable logic arrays or electrically programmable logic devices may be used to implement the digital circuits
11. Implementation by computer software.

Claims

CLAIMS
What is claimed is:
1. A speech synthesis apparatus comprising: one or more formant wave generators that are reset synchronously with the pitch period, and a noise generator.
2. An apparatus according to claim 1, wherein: damping is applied to the formant waveforms.
3. An apparatus according to claim 1 or 2, wherein: overall level control is applied to the synthesizer.
4. An apparatus according to claim 1, 2 or 3, wherein: the formant wave and noise generators are controlled by a micro-controller.
5. An apparatus according to claim 4, wherein: stored parameters are used to control the formant wave and noise generators.
6. An apparatus according to claim 5, wherein: stored prosody parameters are used to control pitch period, level and duration.
7. An apparatus according to claim 5 or 6, wherein: a dictionary, a context list and an heuristic rules list in a speech reference database are used to generate formant and noise parameters, pitch, level and duration.
8. An apparatus according to claim 5 or 6, wherein: a dictionary, a context list and an heuristic rules list in a speech reference database are used to generate phoneme symbols, pitch, level and duration.
9. An apparatus according to claim 8, wherein: the dictionary includes stored sampled words and phonics and an encoding designating the pronunciation of the words and phonics; and a stored context list.
10. A computer system comprising: a processor; a memory coupled to the processor; and program code executable on the processor for generating speech sounds by one or more formant wave generators that are reset synchronously with the pitch period, and a noise generator.
11. A computer system according to claim 10, wherein: damping is applied to the formant waveforms.
12. A computer system according to claim 10 or 11, wherein: overall level control is applied to the synthesizer.
13. A computer system according to claim 10, 11 or 12, wherein: the formant wave and noise generators are controlled by a micro-controller.
14. A computer system according to claim 13, wherein: stored parameters are used to control the formant wave and noise generators.
15. A computer system according to claim 14, wherein: stored prosody parameters are used to control pitch period, level and duration.
16. A computer system according to claim 14 or 15, wherein: a dictionary, a context list and an heuristic rules list in a speech reference database are used to generate formant and noise parameters, pitch, level and duration.
17. A computer system according to claim 14 or 15, wherein: a dictionary, a context list and an heuristic rules list in a speech reference database are used to generate phoneme symbols, pitch, level and duration.
18. A computer system according to claim 17, wherein: the dictionary includes stored sampled words and phonics and an encoding designating the pronunciation of the words and phonics; and a stored context list.
19. A telephone system comprising: a telephone; a controller coupled to the telephone; a speech synthesis apparatus according to claim 1, 2 or 3, wherein the formant wave and noise generators are controlled by the said controller.
20. A communication apparatus comprising: an interface for connecting to a communication system; and a speech apparatus according to any of the claims 1 to 18 coupled to the interface.
21. A communication apparatus according to claim 20 wherein the interface communicates with a modem.
22. A speech synthesis apparatus that includes an apparatus for generating specific preset pitch periods such as musical tones from a set of encoded pitch input values.
PCT/AU2003/001098 2002-09-10 2003-08-28 Phoneme to speech converter WO2004025626A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003254398A AU2003254398A1 (en) 2002-09-10 2003-08-28 Phoneme to speech converter

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US40955302P 2002-09-10 2002-09-10
US60/409,553 2002-09-10

Publications (1)

Publication Number Publication Date
WO2004025626A1 true WO2004025626A1 (en) 2004-03-25

Family ID=31993976

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2003/001098 WO2004025626A1 (en) 2002-09-10 2003-08-28 Phoneme to speech converter

Country Status (2)

Country Link
AU (1) AU2003254398A1 (en)
WO (1) WO2004025626A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5884253A (en) * 1992-04-09 1999-03-16 Lucent Technologies, Inc. Prototype waveform speech coding with interpolation of pitch, pitch-period waveforms, and synthesis filter
US5727125A (en) * 1994-12-05 1998-03-10 Motorola, Inc. Method and apparatus for synthesis of speech excitation waveforms
US5970440A (en) * 1995-11-22 1999-10-19 U.S. Philips Corporation Method and device for short-time Fourier-converting and resynthesizing a speech signal, used as a vehicle for manipulating duration or pitch
US6240384B1 (en) * 1995-12-04 2001-05-29 Kabushiki Kaisha Toshiba Speech synthesis method
US6332121B1 (en) * 1995-12-04 2001-12-18 Kabushiki Kaisha Toshiba Speech synthesis method
EP1246163A2 (en) * 2001-03-26 2002-10-02 Kabushiki Kaisha Toshiba Speech synthesis method and speech synthesizer

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2935212A1 (en) * 2008-08-19 2010-02-26 Sagem Defense Securite Data signal i.e. voice signal, transmission method for telephonic network in e.g. hotel, involves decoding voice signal, at level of receiver, by extracting structural component and comparing component with look-up table to retrieve data

Also Published As

Publication number Publication date
AU2003254398A1 (en) 2004-04-30

Similar Documents

Publication Publication Date Title
US4692941A (en) Real-time text-to-speech conversion system
US4624012A (en) Method and apparatus for converting voice characteristics of synthesized speech
KR940002854B1 (en) Sound synthesizing system
Bonada et al. Synthesis of the singing voice by performance sampling and spectral models
US4398059A (en) Speech producing system
US5915237A (en) Representing speech using MIDI
EP0059880A2 (en) Text-to-speech synthesis system
JP2564641B2 (en) Speech synthesizer
US6829577B1 (en) Generating non-stationary additive noise for addition to synthesized speech
Lerner Computers: Products that talk: Speech-synthesis devices are being incorporated into dozens of products as difficult technical problems are solved
WO2004025626A1 (en) Phoneme to speech converter
d’Alessandro et al. The speech conductor: gestural control of speech synthesis
O'Shaughnessy Design of a real-time French text-to-speech system
Peterson et al. Objectives and techniques of speech synthesis
JP2008058379A (en) Speech synthesis system and filter device
Lukaszewicz et al. Microphonemic method of speech synthesis
Quarmby et al. Implementation of a parallel-formant speech synthesiser using a single-chip programmable signal processor
Santos et al. Text-to-speech conversion in Spanish a complete rule-based synthesis system
JP3081300B2 (en) Residual driven speech synthesizer
JP3994333B2 (en) Speech dictionary creation device, speech dictionary creation method, and program
KR970003093B1 (en) Synthesis unit drawing-up method for high quality korean text to speech transformation
Datta et al. Epoch Synchronous Overlap Add (ESOLA)
KR100202539B1 (en) Voice synthetic method
Muralishankar et al. Human touch to Tamil speech synthesizer
JP4305022B2 (en) Data creation device, program, and tone synthesis device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CN JP SE SG TR UA US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP