WO2004072952A1 - Speech Synthesis Processing System - Google Patents
- Publication number
- WO2004072952A1 (PCT/JP2004/001712)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
- G10L19/097—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters using prototype waveform decomposition or prototype waveform interpolative [PWI] coders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
Definitions
- The present invention relates to a pitch waveform signal dividing device, an audio signal compression device, a database, an audio signal restoration device, a speech synthesis device, a pitch waveform signal dividing method, an audio signal compression method, an audio signal restoration method, a speech synthesis method, a recording medium, and a program.
- Speech synthesis identifies the words and phrases represented by text data and the dependencies between them, and determines how the sentence should be read based on the identified words, phrases, and dependencies. Then, based on the phonetic character string representing the determined reading, it determines the waveforms of the phonemes constituting the voice and the pattern of their durations and pitches (fundamental frequencies), and outputs a sound having the determined waveform.
- To obtain these waveforms, a speech dictionary in which speech data representing speech waveforms is accumulated is searched.
- To make the synthesized speech sound natural, the speech dictionary must accumulate a huge amount of speech data.
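The dictionary-lookup step described above can be illustrated with a minimal sketch. This is not the patent's implementation; the `SPEECH_DICT` table, the phoneme labels, and the tiny waveform fragments are hypothetical placeholders for a real speech dictionary holding recorded waveforms.

```python
# Hypothetical toy speech dictionary: phoneme label -> waveform fragment.
# A real dictionary stores thousands of recorded waveforms per voice.
SPEECH_DICT = {
    "k": [0.1, 0.3],
    "a": [0.5, 0.4, 0.2],
    "s": [0.05, -0.05],
}

def synthesize(phonetic_string):
    """Look up each phoneme's waveform in the dictionary and concatenate."""
    wave = []
    for p in phonetic_string:
        wave.extend(SPEECH_DICT[p])
    return wave

print(len(synthesize(["k", "a", "s", "a"])))  # → 10
```

The dictionary grows with every phoneme variant stored, which is why its size, and the storage needed to hold it, becomes the bottleneck discussed next.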
- In portable devices, the storage device holding the speech dictionary generally needs to be physically small, and reducing its size generally also means reducing its storage capacity.
- Therefore, entropy coding, a method of compressing data by exploiting its regularity (specifically, arithmetic coding, Huffman coding, and the like), has been used to compress data representing speech uttered by humans.
- However, compression efficiency was low, because the audio data as a whole does not necessarily have clear periodicity.
- The waveform of human speech consists of sections of various lengths that have regularity and sections with no clear regularity. Therefore, when entropy coding is applied to the entire audio data representing a human voice, compression efficiency is low.
- Pitch fluctuation is also a problem. Pitch is easily influenced by a speaker's emotion and intent; although it can broadly be regarded as a single constant period, in reality it fluctuates slightly. Consequently, when the same speaker utters the same phoneme over several pitch periods, the pitch interval is usually not constant. As a result, the waveform representing one phoneme often lacks exact regularity, and the efficiency of compression by entropy coding is often low.
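The effect of pitch fluctuation on entropy-style compression can be demonstrated with a small experiment. As a rough stand-in for the arithmetic or Huffman coders mentioned above, the sketch below uses zlib (LZ77 plus Huffman coding); the waveforms, the 80-sample period, and the wobble amount are arbitrary choices for illustration.

```python
import math
import zlib

N_CYCLES, PERIOD = 100, 80

# Strictly periodic waveform: one cycle repeated byte-for-byte.
cycle = bytes(int(127 + 100 * math.sin(2 * math.pi * i / PERIOD)) % 256
              for i in range(PERIOD))
periodic = cycle * N_CYCLES

# Fluctuating pitch: the period drifts slightly, so no two cycles match exactly.
fluct = bytearray()
phase = 0.0
for i in range(N_CYCLES * PERIOD):
    period = PERIOD + 3 * math.sin(2 * math.pi * i / 1000)  # slight pitch wobble
    phase += 2 * math.pi / period
    fluct.append(int(127 + 100 * math.sin(phase)) % 256)

r_periodic = len(zlib.compress(periodic, 9)) / len(periodic)
r_fluct = len(zlib.compress(bytes(fluct), 9)) / len(fluct)
print(f"periodic:    {r_periodic:.3f} of original size")
print(f"fluctuating: {r_fluct:.3f} of original size")
```

The strictly periodic signal compresses far better, which is precisely the regularity the invention tries to restore by normalizing each unit pitch before entropy coding.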
- The present invention has been made in view of the above situation, and its object is to provide a pitch waveform signal dividing device, a pitch waveform signal dividing method, and a recording medium and program capable of efficiently compressing the data capacity of data representing voice.
- The present invention also provides an audio signal compression device and an audio signal compression method for efficiently compressing the data capacity of data representing audio, an audio signal restoration device and an audio signal restoration method for restoring data compressed by such a device and method, and a database and a recording medium holding data compressed by such a device and method.
- A further object of the present invention is to provide a voice synthesis device and a voice synthesis method that perform voice synthesis using data compressed by such an audio signal compression device and audio signal compression method.
- a pitch waveform signal dividing device includes: a filter for acquiring an audio signal representing an audio waveform and filtering the audio signal to extract a pitch signal;
- Phase adjusting means for dividing the audio signal into sections based on the pitch signal extracted by the filter, and for each section, adjusting a phase based on a correlation with the pitch signal;
- Sampling means for determining a sampling length based on the phase for each section whose phase has been adjusted by the phase adjusting means and performing sampling in accordance with the sampling length to generate a sampling signal
- Audio signal processing means for processing the sampling signal into a pitch waveform signal based on the result of the adjustment by the phase adjustment means and the value of the sampling length;
- Pitch waveform signal dividing means for detecting a boundary between adjacent phonemes contained in the voice represented by the pitch waveform signal and / or an end of the voice, and dividing the pitch waveform signal at the detected boundary and / or end;
- the pitch waveform signal dividing means may determine whether or not the intensity of the difference between two adjacent unit-pitch sections of the pitch waveform signal is equal to or greater than a predetermined amount, and when it determines that it is, it may detect the boundary between the two sections as a boundary between adjacent phonemes or an end of speech.
- the pitch waveform signal dividing means may determine, based on the intensity of the portion of the pitch signal belonging to the two sections, whether or not the two sections represent a fricative sound, and when it determines that they do, it may determine that the boundary between the two sections is neither a boundary between adjacent phonemes nor an end of speech, regardless of whether or not the intensity of the difference between the two sections is equal to or greater than the predetermined amount.
- the pitch waveform signal dividing means may determine whether or not the intensity of the portion of the pitch signal belonging to the two sections is equal to or less than a predetermined amount, and when it determines that it is, it may determine that the boundary between the two sections is neither a boundary between adjacent phonemes nor an end of speech, regardless of whether or not the intensity of the difference between the two sections is equal to or greater than the predetermined amount.
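The boundary-detection rules above, split where adjacent unit pitches differ strongly, but never inside low-intensity (unvoiced/fricative-like) stretches, can be sketched as follows. The fixed section length, the RMS measures used as stand-ins for "intensity", and the thresholds are all assumptions for illustration, not values from the patent.

```python
import math

UNIT = 64  # samples per unit-pitch section after normalization (assumed)

def split_at_boundaries(sections, diff_thresh, low_thresh):
    """Return indices where a phoneme boundary (or end of speech) is detected.

    sections: list of unit-pitch sections, each a list of UNIT samples.
    A boundary is flagged when the RMS difference between two adjacent
    sections reaches diff_thresh, except when the signal level over the two
    sections is at or below low_thresh (the unvoiced/fricative exception).
    """
    boundaries = []
    for k in range(len(sections) - 1):
        a, b = sections[k], sections[k + 1]
        # Intensity of the difference between the two adjacent unit pitches.
        diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / UNIT)
        # Signal intensity over the two sections (stand-in for the
        # pitch-signal intensity used in the claims).
        level = math.sqrt(sum(x * x for x in a + b) / (2 * UNIT))
        if level <= low_thresh:
            continue  # never split inside low-intensity regions
        if diff >= diff_thresh:
            boundaries.append(k + 1)
    return boundaries

# Two synthetic "phonemes": five identical cycles each, differing in shape.
tone1 = [[100 * math.sin(2 * math.pi * i / UNIT) for i in range(UNIT)]
         for _ in range(5)]
tone2 = [[60 * math.sin(4 * math.pi * i / UNIT) for i in range(UNIT)]
         for _ in range(5)]
print(split_at_boundaries(tone1 + tone2, diff_thresh=30.0, low_thresh=5.0))  # → [5]
```

Within each tone the difference is zero, so only the junction between the two waveform shapes is reported as a phoneme boundary.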
- the pitch waveform signal dividing device obtains an audio signal representing an audio waveform, and includes audio signal processing means for processing the audio signal into a pitch waveform signal by dividing it into a plurality of sections corresponding to the unit pitch of the audio and making the phases of these sections substantially the same;
- Pitch waveform signal dividing means for detecting a boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or an end of the voice, and dividing the pitch waveform signal at the detected boundary and/or end;
- the pitch waveform signal dividing device provides means for detecting, in a pitch waveform signal representing the waveform of a voice, a boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or an end of the voice;
- the audio signal compression device includes:
- Phase adjusting means for dividing the audio signal into sections based on the pitch signal extracted by the filter, and for each section, adjusting a phase based on a correlation with the pitch signal;
- Sampling means for determining a sampling length based on the phase for each section whose phase has been adjusted by the phase adjusting means and performing sampling in accordance with the sampling length to generate a sampling signal
- Audio signal processing means for processing the sampling signal into a pitch waveform signal based on the result of the adjustment by the phase adjustment means and the value of the sampling length;
- Phoneme data generation means for generating phoneme data by detecting a boundary between adjacent phonemes contained in the voice represented by the pitch waveform signal and/or an end of the voice, and dividing the pitch waveform signal at the detected boundary and/or end;
- Data compression means for compressing the data by subjecting the generated phoneme data to entropy coding
- the pitch waveform signal dividing means may determine whether or not the intensity of the difference between two adjacent unit-pitch sections of the pitch waveform signal is equal to or greater than a predetermined amount, and when it determines that it is, it may detect the boundary between the two sections as a boundary between adjacent phonemes or an end of speech.
- the pitch waveform signal dividing means may determine, based on the intensity of the portion of the pitch signal belonging to the two sections, whether or not the two sections represent a fricative sound, and when it determines that they do, it may determine that the boundary between the two sections is neither a boundary between adjacent phonemes nor an end of speech, regardless of whether or not the intensity of the difference between the two sections is equal to or greater than the predetermined amount.
- the pitch waveform signal dividing means may determine whether or not the intensity of the portion of the pitch signal belonging to the two sections is equal to or less than a predetermined amount, and when it determines that it is, it may determine that the boundary between the two sections is neither a boundary between adjacent phonemes nor an end of speech, regardless of whether or not the intensity of the difference between the two sections is equal to or greater than the predetermined amount.
- the audio signal compression device includes:
- An audio signal representing the waveform of a voice is acquired, and the audio signal is processed into a pitch waveform signal by dividing it into a plurality of sections corresponding to the unit pitch of the voice and making the phases of these sections substantially the same,
- Phoneme data generation means for generating phoneme data by detecting boundaries between adjacent phonemes contained in the voice represented by the pitch waveform signal and/or ends of the voice, and dividing the pitch waveform signal at the detected boundaries and/or ends;
- Data compression means for compressing the generated phoneme data by subjecting it to entropy coding
- the audio signal compression device according to the sixth aspect of the present invention.
- Phoneme data generating means for generating phoneme data by dividing the pitch waveform signal at the detected boundary and/or end;
- Data compression means for performing data compression by performing entropy coding on the generated phoneme data
- the data compression means may perform data compression by subjecting the generated phoneme data to non-linear quantization and then entropy-coding the result.
- the data compression means may acquire the phoneme data that has been subjected to data compression, determine the quantization characteristic of the non-linear quantization based on the data amount of the acquired phoneme data, and perform the non-linear quantization in accordance with the determined characteristic.
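A μ-law-style compander is one common form of the non-linear quantization described here. The sketch below is only an illustration of the idea; the μ value, level count, and the rule for adapting the characteristic to the data amount are invented for the example, not taken from the patent.

```python
import math

def mulaw_quantize(x, mu=255, levels=256):
    """Non-linearly quantize x in [-1, 1]: fine steps near zero, coarse near ±1."""
    comp = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return round((comp + 1) / 2 * (levels - 1))

def mulaw_dequantize(q, mu=255, levels=256):
    """Invert the companding to recover an approximation of x."""
    comp = q / (levels - 1) * 2 - 1
    return math.copysign((math.exp(abs(comp) * math.log1p(mu)) - 1) / mu, comp)

# An adaptive characteristic, as in the claim, might simply drop to fewer
# levels when the compressed output so far exceeds some size budget, e.g.:
#   levels = 128 if compressed_bytes > budget else 256   (hypothetical rule)

q = mulaw_quantize(0.01)
print(q, mulaw_dequantize(q))  # small amplitudes survive with little error
```

Because speech samples cluster near zero, companding spends the available quantization levels where they matter most, which keeps the quantization error audibly small before entropy coding.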
- the audio signal compression device may further include a unit that sends out the compressed phoneme data to the outside via a network.
- the audio signal compression device may further include means for recording the data-compressed phoneme data on a recording medium readable by a computer.
- the database according to the seventh aspect of the present invention includes:
- It is characterized by storing phoneme data obtained by taking a pitch waveform signal, produced by dividing an audio signal representing an audio waveform into a plurality of sections corresponding to the unit pitch of the audio and making the phases of these sections substantially the same, and dividing it at the boundary between adjacent phonemes contained in the voice represented by the pitch waveform signal and/or at the end of the voice.
- the database according to the eighth aspect of the present invention includes:
- It is characterized by storing phoneme data obtained by dividing a pitch waveform signal representing the waveform of a voice at the boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or at the end of the voice.
- a computer-readable recording medium includes:
- It is characterized by recording phoneme data obtained by dividing a pitch waveform signal, produced by dividing an audio signal representing an audio waveform into a plurality of sections corresponding to the unit pitch of the audio and making the phases of these sections substantially the same, at the boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or at the end of the voice.
- a computer-readable recording medium includes:
- the phoneme data may have been subjected to entropy coding. Further, the phoneme data may be subjected to the non-linear quantization and then to the entropy coding.
- the audio signal restoring device includes data acquisition means for acquiring phoneme data obtained by dividing a pitch waveform signal, produced by dividing an audio signal representing an audio waveform into a plurality of sections corresponding to the unit pitch of the audio and making the phases of these sections substantially the same, at the boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or at the end of the voice;
- Restoring means for decoding the obtained phoneme data
- the phoneme data may have been subjected to entropy coding, and the restoring means may decode the acquired phoneme data and restore the phase of the decoded phoneme data to the phase before the above processing was performed.
- the phoneme data may be subjected to the non-linear quantization and then to entropy coding,
- the restoring means may decode the acquired phoneme data, perform non-linear inverse quantization, and restore the phase of the decoded and inversely quantized phoneme data to the phase before the above processing was performed.
- the data acquisition means may include means for acquiring the phoneme data from outside via a network.
- the data acquisition unit may include a unit that acquires the phoneme data by reading the phoneme data from a computer-readable recording medium that records the phoneme data.
- Phoneme data storage means for recording the obtained phoneme data or the decoded phoneme data
- a text input means for inputting text information representing the text
- Synthesizing means for generating data representing synthetic speech by searching for phoneme data representing a waveform of a phoneme constituting the sentence from the phoneme data storage means, and combining the searched phoneme data with each other;
- Sound piece storage means for storing a plurality of voice data representing sound pieces
- Prosody prediction means for predicting the prosody of a speech unit constituting the input sentence
- the combining means includes:
- Missing-part synthesis means for synthesizing data representing a speech piece that could not be selected, by retrieving from the phoneme data storage means phoneme data representing the waveforms of the phonemes that make up that speech piece and combining the retrieved phoneme data together.
- the speech unit storage means may store measured prosody data representing a temporal change in pitch of the speech unit represented by the audio data in association with the audio data,
- the selecting means may select, from among the voice data representing the waveform of a voice piece whose reading matches that of a voice piece constituting the sentence, the voice data for which the pitch time change represented by the associated measured prosody data is closest to the prosody prediction result.
- the storage means may store phonetic data representing reading of voice data in association with the voice data,
- the selecting means may treat voice data associated with phonetic data representing a reading that matches the reading of a speech piece constituting the sentence as voice data representing the waveform of a speech piece having a common reading with that speech piece.
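The selection rule above, among same-reading candidates, pick the one whose measured pitch trajectory best matches the prosody prediction, might look like the following. The candidate names, pitch values, and the mean-squared distance measure are illustrative assumptions.

```python
def select_unit(candidates, predicted_pitch):
    """candidates: (unit_id, measured_pitch_curve) pairs sharing one reading.
    Return the id whose measured pitch curve is closest to the prediction."""
    def dist(curve):
        n = min(len(curve), len(predicted_pitch))
        return sum((curve[i] - predicted_pitch[i]) ** 2 for i in range(n)) / n
    return min(candidates, key=lambda c: dist(c[1]))[0]

# Two recordings of the same word with different pitch contours (values in Hz).
units = [("hana_falling", [120, 118, 115]), ("hana_rising", [150, 160, 170])]
print(select_unit(units, predicted_pitch=[125, 120, 116]))  # → hana_falling
```

When no candidate is close enough, or none exists, the missing-part synthesis means above falls back to building the speech piece from individual phoneme data.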
- the data acquisition means may include means for acquiring the phoneme data from outside via a network.
- the data acquisition unit may include a unit that acquires the phoneme data by reading the phoneme data from a computer-readable recording medium that records the phoneme data.
- a pitch waveform signal dividing method obtains an audio signal representing an audio waveform, extracts the pitch signal by filtering the audio signal,
- the audio signal is divided into sections based on the extracted pitch signal, and the phase of each section is adjusted based on the correlation with the pitch signal.
- a sampling length is determined based on the phase, and a sampling signal is generated by performing sampling according to the sampling length.
- the sampling signal is processed into a pitch waveform signal
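The "determine a sampling length and resample" step above is what makes every unit pitch directly comparable: each variable-length pitch period is stretched or squeezed to a common length. A minimal linear-interpolation sketch (the target length of 5 is arbitrary):

```python
def resample_period(samples, target_len):
    """Linearly resample one pitch period to a fixed number of samples."""
    n = len(samples)
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)  # fractional source index
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, n - 1)]
        out.append(samples[i] * (1 - frac) + frac * nxt)
    return out

print(resample_period([0.0, 1.0, 0.0], 5))  # → [0.0, 0.5, 1.0, 0.5, 0.0]
```

Because every period now has the same length, adjacent cycles line up sample-for-sample, which is what gives the later entropy-coding stage its regularity; the original sampling lengths must be retained so the restoration side can undo the normalization.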
- the pitch waveform signal dividing method obtains a sound signal representing a sound waveform and processes it into a pitch waveform signal by dividing it into a plurality of sections corresponding to the unit pitch of the sound and making the phases of these sections substantially the same,
- the pitch waveform signal dividing method detects a boundary between adjacent phonemes included in the voice represented by a pitch waveform signal representing the waveform of the voice, and/or an end of the voice,
- the audio signal compression method obtains an audio signal representing an audio waveform, filters the audio signal to extract a pitch signal,
- the audio signal is divided into sections based on the pitch signal extracted by the filter, and for each section the phase is adjusted based on the correlation with the pitch signal,
- a sampling length is determined based on the phase, and a sampling signal is generated by performing sampling according to the sampling length.
- the sampling signal is processed into a pitch waveform signal
- the generated phoneme data is subjected to entropy coding to compress the data.
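The phase-adjustment step in this method, aligning each section so it starts at the same point in the pitch cycle, can be sketched as a search for the shift that maximizes correlation with the pitch signal. The circular shift, the 64-sample section, and the sinusoidal pitch reference are assumptions for the example.

```python
import math

def best_shift(section, pitch_ref, max_shift):
    """Return the circular shift of `section` that maximizes its correlation
    with the pitch reference, so each unit-pitch section starts in phase."""
    n = len(section)
    def corr(shift):
        return sum(section[(i + shift) % n] * pitch_ref[i % len(pitch_ref)]
                   for i in range(n))
    return max(range(max_shift), key=corr)

N = 64
ref = [math.sin(2 * math.pi * i / N) for i in range(N)]
sec = [math.sin(2 * math.pi * (i + 16) / N) for i in range(N)]  # 16-sample lag
print(best_shift(sec, ref, N))  # → 48 (48 more samples complete the cycle)
```

The chosen shifts must be recorded alongside the data, since the restoration method later puts each section's phase back where it was.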
- the audio signal compression method provides an audio signal compression method for acquiring an audio signal representing a waveform of an audio and dividing the audio signal into a plurality of sections corresponding to a unit pitch of the audio.
- the generated phoneme data is subjected to entropy coding to compress the data.
- the generated phoneme data is subjected to entropy coding to compress the data.
- the audio signal restoring method acquires phoneme data obtained by dividing a pitch waveform signal, produced by dividing an audio signal representing the waveform of a voice into a plurality of sections corresponding to the unit pitch of the voice and making the phases of these sections substantially the same, at the boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or at the end of the voice.
- a speech synthesis method includes:
- a pitch waveform signal obtained by performing processing for making the phases of these intervals substantially the same is divided at the boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or at the end of the voice to obtain phoneme data, which is acquired,
- the acquired phoneme data or the decoded phoneme data is stored, and sentence information representing a sentence is input.
- Phoneme data representing the waveform of phonemes constituting the sentence is searched for from the stored phoneme data, and the searched phoneme data is combined with each other to generate data representing a synthesized speech.
- the program according to the twenty-first aspect of the present invention includes:
- a filter for acquiring an audio signal representing the audio waveform, filtering the audio signal to extract a pitch signal, Phase adjusting means for dividing the audio signal into sections based on the pitch signal extracted by the filter, and for each section, adjusting a phase based on a correlation with the pitch signal;
- Sampling means for determining a sampling length based on the phase for each section whose phase has been adjusted by the phase adjusting means and performing sampling in accordance with the sampling length to generate a sampling signal
- Audio signal processing means for processing the sampling signal into a pitch waveform signal based on the result of the adjustment by the phase adjustment means and the value of the sampling length;
- Pitch waveform signal dividing means for detecting a boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or an end of the voice, and dividing the pitch waveform signal at the detected boundary and/or end;
- the program according to the twenty-second aspect of the present invention includes:
- An audio signal representing the waveform of a voice is acquired, and the audio signal is processed into a pitch waveform signal by dividing it into a plurality of sections corresponding to the unit pitch of the voice and making the phases of these sections substantially the same,
- Pitch waveform signal dividing means for detecting a boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or an end of the voice, and dividing the pitch waveform signal at the detected boundary and/or end;
- the program according to the twenty-third aspect of the present invention includes:
- the program according to the twenty-fourth aspect of the present invention includes:
- Phase adjusting means for dividing the audio signal into sections based on the pitch signal extracted by the filter, and for each section, adjusting a phase based on a correlation with the pitch signal;
- Sampling means for determining a sampling length based on the phase for each section whose phase has been adjusted by the phase adjusting means and performing sampling in accordance with the sampling length to generate a sampling signal
- Audio signal processing means for processing the sampling signal into a pitch waveform signal based on the result of the adjustment by the phase adjustment means and the value of the sampling length;
- Data compression means for compressing the generated phoneme data by subjecting it to entropy coding
- the program according to the twenty-fifth aspect of the present invention includes:
- An audio signal representing the waveform of a voice is acquired, and the audio signal is processed into a pitch waveform signal by dividing it into a plurality of sections corresponding to the unit pitch of the voice and making the phases of these sections substantially the same,
- Data compression means for performing data compression by entropy encoding the generated phoneme data
- the program according to the twenty-sixth aspect of the present invention includes:
- Phoneme data generation means for generating phoneme data by dividing the pitch waveform signal at the detected boundary and / or end,
- Data compression means for compressing the generated phoneme data by subjecting it to entropy coding
- the program according to the twenty-seventh aspect of the present invention includes:
- a pitch waveform signal obtained by performing processing for making the phases of these intervals substantially the same is divided at the boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or at the end of the voice to obtain phoneme data, which is acquired,
- a program according to a twenty-eighth aspect of the present invention includes:
- a pitch waveform signal obtained by performing processing for making the phases of these intervals substantially the same is divided at the boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or at the end of the voice to obtain phoneme data, which is acquired,
- Phoneme data storage means for storing the obtained phoneme data or the decoded phoneme data
- a text input means for inputting text information representing the text
- Synthesizing means for generating data representing synthetic speech by searching for phoneme data representing a waveform of a phoneme constituting the sentence from the phoneme data storage means, and combining the searched phoneme data with each other;
- a computer-readable recording medium includes:
- Phase adjusting means for dividing the audio signal into sections based on the pitch signal extracted by the filter, and adjusting a phase of each section based on a correlation with the pitch signal;
- Sampling means for determining a sampling length based on the phase for each section whose phase has been adjusted by the phase adjusting means and performing sampling in accordance with the sampling length to generate a sampling signal
- Audio signal processing means for processing the sampling signal into a pitch waveform signal based on the result of the adjustment by the phase adjustment means and the value of the sampling length;
- a pitch waveform signal dividing means for detecting a boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or an end of the voice, and dividing the pitch waveform signal at the detected boundary and/or end;
- a computer-readable recording medium includes:
- An audio signal representing the waveform of a voice is acquired, and the audio signal is processed into a pitch waveform signal by dividing it into a plurality of sections corresponding to the unit pitch of the voice and making the phases of these sections substantially the same,
- a pitch waveform signal dividing unit that detects a boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and/or an end of the voice, and divides the pitch waveform signal at the detected boundary and/or end;
- a computer-readable recording medium includes:
- the computer-readable recording medium includes:
- Phase adjustment means for dividing the audio signal into sections based on the pitch signal extracted by the filter, and for each section, adjusting the phase based on the correlation with the pitch signal,
- Sampling means for determining a sampling length based on the phase for each section whose phase has been adjusted by the phase adjusting means and performing sampling in accordance with the sampling length to generate a sampling signal
- Audio signal processing means for processing the sampling signal into a pitch waveform signal based on the result of the adjustment by the phase adjustment means and the value of the sampling length;
- a data compression means for compressing the data by subjecting the generated phoneme data to entropy coding
- a computer-readable recording medium includes:
- An audio signal representing the waveform of a voice is acquired, and the audio signal is processed into a pitch waveform signal by dividing it into a plurality of sections corresponding to the unit pitch of the voice and making the phases of these sections substantially the same,
- It is characterized by recording a program to make it function.
- a computer-readable recording medium includes:
- Phoneme data generating means for generating phoneme data by dividing the pitch waveform signal at the detected boundary and/or end,
- Data compression means for compressing the generated phoneme data by subjecting the generated phoneme data to entropy coding
- It is characterized by recording a program for causing the computer to function as the means described above.
- a computer-readable recording medium includes:
- a signal obtained by performing processing for making the phases of these intervals substantially the same is referred to as the pitch waveform signal.
- It is characterized by recording a program for causing the computer to function as the means described above.
- a computer-readable recording medium includes:
- a signal obtained by performing processing for making the phases of these intervals substantially the same is referred to as the pitch waveform signal.
- Phoneme data storage means for storing the obtained phoneme data or the decoded phoneme data
- a text input means for inputting text information representing the text
- Synthesizing means for searching the phoneme data storage means for phoneme data representing a waveform of a phoneme constituting the sentence, and combining the searched phoneme data with each other to generate data representing a synthesized voice;
- a computer-readable recording medium includes:
- Phase adjusting means for dividing the audio signal into sections based on the pitch signal extracted by the filter, and adjusting a phase of each section based on a correlation with the pitch signal;
- Sampling means for determining a sampling length based on the phase for each section whose phase has been adjusted by the phase adjusting means and performing sampling in accordance with the sampling length to generate a sampling signal
- Audio signal processing means for processing the sampling signal into a pitch waveform signal based on the result of the adjustment by the phase adjustment means and the value of the sampling length;
- Pitch waveform signal dividing means for detecting a boundary between adjacent phonemes included in the voice represented by the pitch waveform signal and / or an end of the voice, and dividing the pitch waveform signal at the detected boundary and / or end;
- a computer-readable recording medium includes:
- An audio signal representing a waveform of an audio is acquired, and when the audio signal is divided into a plurality of sections corresponding to a unit pitch of the audio, the phases of these sections are made substantially the same, to thereby convert the audio signal into a pitch waveform signal.
- Pitch waveform signal dividing means for detecting a boundary between adjacent phonemes contained in the voice represented by the pitch waveform signal and / or an end of the voice, and dividing the pitch waveform signal at the detected boundary and / or end;
- a computer-readable recording medium includes:
- a computer-readable recording medium includes:
- Phase adjusting means for dividing the audio signal into sections based on the pitch signal extracted by the filter, and adjusting a phase of each section based on a correlation with the pitch signal;
- Sampling means for determining a sampling length based on the phase for each section whose phase has been adjusted by the phase adjusting means, and performing sampling in accordance with the sampling length to generate a sampling signal
- Audio signal processing means for processing the sampling signal into a pitch waveform signal based on the result of the adjustment by the phase adjustment means and the value of the sampling length;
- Data compression means for performing data compression by performing entropy coding on the generated phoneme data
- a computer-readable recording medium includes:
- An audio signal representing a waveform of an audio is acquired, and when the audio signal is divided into a plurality of sections corresponding to a unit pitch of the audio, the phases of these sections are made substantially the same, to thereby convert the audio signal into a pitch waveform signal.
- Data compression means for compressing the generated phoneme data by subjecting the generated phoneme data to entropy coding
- a computer-readable recording medium includes:
- Phoneme data generation means for generating phoneme data by dividing the pitch waveform signal at the detected boundary and / or end,
- Data compression means for compressing the generated phoneme data by subjecting the generated phoneme data to entropy coding
- a computer-readable recording medium includes:
- a signal obtained by performing processing for making the phases of these intervals substantially the same is referred to as the pitch waveform signal.
- a computer-readable recording medium includes:
- a signal obtained by performing processing for making the phases of these intervals substantially the same is referred to as the pitch waveform signal.
- Phoneme data storage means for storing the obtained phoneme data or the phoneme data whose phase has been restored
- a text input means for inputting text information representing the text
- Synthesizing means for generating data representing synthetic speech by searching for phoneme data representing the waveform of phonemes constituting the sentence from the phoneme data storage means, and combining the searched phoneme data with each other;
- It is characterized by recording a program for causing the computer to function as the means described above.
- a pitch waveform signal division device, a pitch waveform signal division method, and a program for realizing efficient compression of the data capacity of data representing voice are realized.
- an audio signal compression device and an audio signal compression method for efficiently compressing the data volume of data representing audio,
- an audio signal decompression device and an audio signal decompression method for decompressing data compressed by such an audio signal compression device and audio signal compression method, a database and a recording medium for storing data compressed by such an audio signal compression device and audio signal compression method, and
- a voice synthesizing apparatus and a voice synthesizing method for performing voice synthesis using data compressed by such an audio signal compression device and audio signal compression method are also realized.
- FIG. 1 is a block diagram showing a configuration of a pitch waveform data divider according to a first embodiment of the present invention.
- FIG. 2 is a diagram showing the first half of the operation flow of the pitch waveform data divider of FIG.
- FIG. 3 is a diagram showing the latter half of the operation flow of the pitch waveform data divider in FIG.
- Figs. 4(a) and 4(b) are graphs showing the waveform of the audio data before the phase shift, and Fig. 4(c) is a graph showing the waveform of the audio data after the phase shift.
- FIG. 5(a) is a graph showing the timings at which the pitch waveform data divider of FIG. 1 or FIG. 6 separates the waveform of FIG. 17(a), and
- FIG. 5(b) is a graph showing the timings at which the pitch waveform data divider of FIG. 1 or FIG. 6 separates the waveform of FIG. 17(b).
- FIG. 6 is a block diagram showing a configuration of a pitch waveform data divider according to a second embodiment of the present invention.
- FIG. 7 is a block diagram showing a configuration of a pitch waveform extracting unit of the pitch waveform data divider.
- FIG. 8 is a block diagram showing the configuration of a phoneme data compression unit of a synthesized speech utilization system according to a third embodiment of the present invention.
- FIG. 9 is a block diagram showing a configuration of the speech synthesis unit.
- FIG. 10 is a block diagram showing the configuration of the speech synthesis unit.
- FIG. 11 is a diagram schematically showing the data structure of a speech unit database.
- FIG. 12 is a flowchart showing processing of a personal computer that performs the function of a phoneme data supply unit.
- FIG. 13 is a flowchart showing a process in which a personal computer performing the function of the phoneme data utilization unit acquires phoneme data.
- FIG. 14 is a flowchart showing a speech synthesis process when a personal computer performing the function of the phoneme data utilization unit acquires free text data.
- FIG. 15 is a flowchart showing a process when a personal computer performing the function of the phoneme data utilization unit acquires distribution character string data.
- FIG. 16 is a flowchart showing a speech synthesis process when the personal computer performing the function of the phoneme data utilization unit acquires the standard message data and the utterance speed data.
- FIG. 17 (a) is a graph showing an example of a waveform of a voice uttered by a person
- FIG. 17 (b) is a graph for explaining the timing of dividing the waveform in the conventional technology.
- FIG. 1 is a diagram showing the configuration of a pitch waveform data divider according to a first embodiment of the present invention. As shown in the figure, this pitch waveform data divider comprises a recording medium drive device SMD (a flexible disk drive, a CD-ROM drive, or the like) that reads data recorded on a recording medium (for example, a flexible disk or a CD-R (Compact Disc-Recordable)), and a computer C1 connected to the recording medium drive device SMD.
- As shown in the figure, the computer C1 comprises a processor 101 composed of a CPU (Central Processing Unit), a DSP (Digital Signal Processor), and the like, a volatile memory 102 consisting of RAM (Random Access Memory), a non-volatile memory 104 such as a hard disk device, an input unit 105 such as a keyboard, a display unit 106 such as a liquid crystal display, and a serial communication control unit 103 which consists of a USB (Universal Serial Bus) interface circuit and the like and controls serial communication with the outside.
- the computer C1 stores a phoneme separation program in advance, and executes the phoneme separation program to perform processing described later. (First embodiment: operation)
- FIGS. 2 and 3 are diagrams showing the operation flow of the pitch waveform data divider of FIG. 1.
- First, when the user sets the recording medium on which the audio data representing the audio waveform is recorded in the recording medium drive device SMD and instructs the computer C1 to start the phoneme separation program, the computer C1 starts processing of the phoneme separation program.
- the computer C1 reads audio data from the recording medium via the recording medium drive device SMD (FIG. 2, step S1). It is assumed that the audio data has a digital signal format modulated by, for example, PCM (Pulse Code Modulation), and represents audio sampled at a constant period that is sufficiently shorter than the audio pitch.
- the computer C1 generates filtered voice data (pitch signal) by filtering the voice data read from the recording medium (step S2).
- the pitch signal shall consist of digital data having a sampling interval substantially equal to the sampling interval of audio data.
- The characteristics of the filtering performed to generate the pitch signal are determined by the computer C1 through feedback processing based on a pitch length described later and the times at which the instantaneous value of the pitch signal becomes 0 (the times at which zero crossings occur).
- Specifically, the computer C1 performs, for example, cepstrum analysis or analysis based on an autocorrelation function on the read audio data to identify the fundamental frequency of the audio represented by the audio data, and determines the absolute value of the reciprocal of the fundamental frequency (that is, the pitch length) (step S3).
- (Alternatively, the computer C1 may identify two fundamental frequencies by performing both cepstrum analysis and analysis based on the autocorrelation function, and may use the average of the absolute values of the reciprocals of these two fundamental frequencies as the pitch length.)
- Specifically, in the cepstrum analysis, the intensity of the read audio data is first converted to a value substantially equal to the logarithm of the original value (the base of the logarithm is arbitrary). Next, the spectrum of the converted audio data (that is, the cepstrum) is determined by the fast Fourier transform method (or any other method that generates data representing the result of Fourier transform of a discrete variable). Then, the minimum value among the frequencies giving the maximum values of this cepstrum is specified as the fundamental frequency.
- On the other hand, in the analysis based on the autocorrelation function, the autocorrelation function r(l) represented by the right side of Equation 1 is first specified using the read speech data. Then, among the frequencies giving the maximum value of the function (periodogram) obtained as a result of Fourier transform of the autocorrelation function r(l), the minimum value exceeding a predetermined lower limit is specified as the fundamental frequency.
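- The autocorrelation route can likewise be sketched. For brevity this sketch picks the lag that maximizes r(l) directly in the time domain, rather than taking the periodogram of r(l) as the procedure around Equation 1 describes; the minimum-lag cutoff is an assumed guard against the trivial peak near lag 0:

```python
import math

def pitch_length_by_autocorrelation(samples, min_lag):
    """Estimate the pitch length (in samples) as the lag l >= min_lag
    that maximizes the autocorrelation r(l) of the waveform."""
    n = len(samples)
    best_lag, best_r = min_lag, float("-inf")
    for lag in range(min_lag, n // 2):
        r = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag

# a 100 Hz sine sampled at 8 kHz has a pitch length of 80 samples
fs, f0 = 8000, 100
x = [math.sin(2 * math.pi * f0 * i / fs) for i in range(800)]
est = pitch_length_by_autocorrelation(x, min_lag=40)
```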
- Next, the computer C1 identifies the times at which the pitch signal crosses zero (step S4). Then, the computer C1 determines whether or not the pitch length and the zero-cross period of the pitch signal differ from each other by a predetermined amount or more (step S5).
- If it is determined that they do not differ, the above-described filtering is performed with band-pass filter characteristics such that the reciprocal of the zero-cross period is the center frequency (step S6). On the other hand, if it is determined that they differ by the predetermined amount or more, the above-described filtering is performed with band-pass filter characteristics such that the reciprocal of the pitch length is the center frequency (step S7). In either case, it is desirable that the pass band width of the filtering be such that the upper limit of the pass band always falls within twice the fundamental frequency of the voice represented by the voice signal.
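- The choice of center frequency in steps S5 to S7 amounts to a small feedback rule, sketched below. Periods are given in seconds, and the 20% disagreement threshold is an assumed stand-in for the patent's unspecified "predetermined amount":

```python
def choose_center_frequency(pitch_length, zero_cross_period, max_rel_diff=0.2):
    """Steps S5-S7: use the reciprocal of the zero-cross period as the
    band-pass center frequency unless it disagrees with the pitch length
    obtained from cepstrum/autocorrelation analysis by too much."""
    if abs(pitch_length - zero_cross_period) < max_rel_diff * pitch_length:
        return 1.0 / zero_cross_period  # step S6: trust the zero-cross period
    return 1.0 / pitch_length           # step S7: fall back to the pitch length
```

For example, a 10 ms pitch length with a 10.2 ms zero-cross period keeps the zero-cross estimate, while a 20 ms zero-cross period (an octave error) falls back to the analyzed pitch length.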
- Next, the computer C1 divides the audio data read from the recording medium at the timings when boundaries of unit periods (for example, one cycle) of the generated pitch signal come (specifically, the timings when the pitch signal crosses zero) (step S8). Then, for each of the divided sections, the computer C1 determines the correlation between the pitch signal in the section and the audio data in the section with its phase varied in various ways, and specifies the phase of the audio data at which the correlation is highest as the phase of the audio data in this section (step S9). Then, the computer C1 shifts the phase of each section of the audio data so that the sections have substantially the same phase (step S10).
- Specifically, for each section, the computer C1 obtains the value cor represented by the right-hand side of Equation 2 for various values of φ (where φ is an integer of 0 or more) representing the phase. Then, the value Ψ of φ that maximizes the value cor is specified as the value representing the phase of the voice data in this section. As a result, the phase value at which the correlation with the pitch signal is highest is determined for this section. Then, the computer C1 shifts the phase of the voice data in this section by (−Ψ).
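- The search for Ψ can be sketched as an exhaustive circular-shift correlation. This is a simplification: Equation 2's exact form of cor is not reproduced here, and a plain inner product against one reference pitch cycle stands in for it:

```python
import math

def best_phase_offset(section, pitch_ref):
    """Return the circular shift psi (in samples) that maximizes the
    correlation between the shifted section and one cycle of the pitch
    signal; the section is then shifted by -psi to align its phase."""
    n = len(section)
    best_psi, best_cor = 0, float("-inf")
    for psi in range(n):
        cor = sum(section[(i + psi) % n] * pitch_ref[i] for i in range(n))
        if cor > best_cor:
            best_psi, best_cor = psi, cor
    return best_psi

n = 64
ref = [math.sin(2 * math.pi * i / n) for i in range(n)]          # pitch cycle
late = [math.sin(2 * math.pi * (i + 10) / n) for i in range(n)]  # 10-sample lead
psi = best_phase_offset(late, ref)
aligned = [late[(i + psi) % n] for i in range(n)]  # phase-aligned section
```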
- Fig. 4 (c) shows an example of the waveform represented by the data obtained by shifting the phase of the audio data as described above.
- the two sections shown as "#1" and "#2" have different phases due to the influence of pitch fluctuation, as shown in Fig. 4(b).
- In the sections #1 and #2 of the waveform represented by the phase-shifted audio data, as shown in FIG. 4(c), the effects of the pitch fluctuation are removed and the phases are uniform.
- the value of the starting point of each section is close to zero.
- the time length of the section is about one pitch.
- However, if the time lengths of the sections are left unequal, a problem arises: the longer a section is, the greater the number of samples in the section and hence the greater the data amount of the pitch waveform data, or else the sampling intervals grow larger, making the speech represented by the pitch waveform data inaccurate.
- the computer C1 performs Lagrange interpolation on the phase-shifted audio data (step S11). That is, data representing a value to be interpolated between samples of the phase-shifted audio data by the Lagrange interpolation method is generated.
- the phase-shifted audio data and the Lagrange interpolation data constitute the interpolated audio data.
- Next, the computer C1 resamples each section of the interpolated audio data. The computer C1 also generates pitch information, which is data indicating the original number of samples in each section (step S12). It is assumed that the computer C1 performs sampling so that the numbers of samples in the sections of the pitch waveform data are substantially equal to each other, and that the samples are equally spaced within the same section.
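- Resampling every section to a common sample count while keeping the original counts as pitch information can be sketched as follows. Linear interpolation is used here for brevity where the text uses Lagrange interpolation, and the target length of 8 samples is an arbitrary choice:

```python
def resample_section(section, target_len):
    """Linearly resample one unit-pitch section to target_len equally
    spaced samples covering the same span."""
    n = len(section)
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)  # fractional source index
        i = int(pos)
        frac = pos - i
        nxt = section[min(i + 1, n - 1)]
        out.append(section[i] * (1 - frac) + frac * nxt)
    return out

sections = [[0.0, 1.0, 0.0, -1.0],
            [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]]
pitch_info = [len(s) for s in sections]               # original sample counts
uniform = [resample_section(s, 8) for s in sections]  # pitch waveform data
```

The pitch information (here the list of original sample counts) is what later allows each section's original time length to be restored.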
- the pitch information functions as information indicating the original time length of the unit pitch of the audio data.
- Next, for each one-pitch section of the audio data whose section time lengths have been aligned in step S12 (that is, the pitch waveform data), starting from the second one-pitch section from the beginning, the computer C1 generates difference data representing the sum of the differences between the instantaneous values of the waveform represented by that one pitch and the instantaneous values of the waveform represented by the immediately preceding one pitch (FIG. 3, step S13).
- In step S13, for example, when the k-th one-pitch section from the beginning is specified, the computer C1 may temporarily store the (k−1)-th one-pitch section in advance, and generate data representing the value on the right side of Equation 3 using the specified k-th one-pitch section and the temporarily stored (k−1)-th one-pitch section.
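- A minimal sketch of the per-pitch difference computation follows. The exact form of Equation 3 is not reproduced here; a sum of absolute sample-wise differences is assumed as the stand-in:

```python
def pitch_difference(cur, prev):
    """Sum of the differences between the instantaneous values of the
    current one-pitch section and the immediately preceding one; small
    when both sections belong to the same phoneme."""
    return sum(abs(a - b) for a, b in zip(cur, prev))

prev_pitch = [0.0, 0.8, 0.0, -0.8]
same_phoneme = [0.0, 0.8, 0.0, -0.8]   # repeated waveform -> tiny difference
new_phoneme = [0.3, 0.1, -0.4, 0.2]    # different waveform -> large difference
```

Because the pitch waveform data has normalized section lengths and phases, corresponding samples line up and this difference is near zero inside a phoneme and jumps at phoneme boundaries.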
- Next, the computer C1 performs a filtering process on the latest difference data generated in step S13 using a low-pass filter (step S14).
- The pass band characteristics of the filtering of the difference data and of the absolute value of the pitch signal in step S14 need only be such that the probability that an error suddenly occurring in the difference data or the pitch signal causes an erroneous determination in step S15 is sufficiently low. In general, passband characteristics equivalent to those of a second-order IIR (Infinite Impulse Response) type low-pass filter are sufficient.
- Next, the computer C1 determines whether the boundary between the section for the latest one pitch of the pitch waveform data and the section for the immediately preceding one pitch is a boundary between two phonemes (or an end of the speech), is in the middle of one phoneme, is in the middle of a fricative sound, or is in the middle of a silent state (step S15).
- the computer C1 makes a determination using, for example, the fact that a voice uttered by a person has the following properties (a) and (b). That is,
- A fricative sound has few spectral components corresponding to the fundamental frequency component and the harmonic components of the sound emitted from the vocal cords, and has no clear periodicity, so that the correlation between two adjacent sections of a fricative is low.
- step S15 the computer C1 performs determination according to the following determination conditions (1) to (4). That is,
- (1) When the intensity of the filtered difference data is equal to or greater than a predetermined amount, the boundary between the two sections used to generate the difference data is determined to be a boundary between two different phonemes (or an end of the voice),
- (2) Otherwise, the boundary between the two sections used to generate the difference data is determined to be in the middle of one phoneme.
- As the intensity of the filtered pitch signal, for example, a peak value of its absolute value, an effective value, or an average value of the absolute values may be used.
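- The multi-way decision of step S15 can be sketched as threshold logic. Both thresholds below are hypothetical placeholders for the patent's unspecified "predetermined amounts", and fricative and silence are lumped together for brevity:

```python
def classify_boundary(diff_intensity, pitch_intensity,
                      diff_thresh=1.0, pitch_thresh=0.5):
    """Classify the boundary between the latest one-pitch section and
    the preceding one, from the filtered difference-data intensity and
    the filtered pitch-signal intensity."""
    if pitch_intensity < pitch_thresh:
        # little periodic energy from the vocal cords: fricative or silence
        return "fricative_or_silence"
    if diff_intensity >= diff_thresh:
        # adjacent pitches differ strongly: boundary of two phonemes (or end)
        return "phoneme_boundary"
    return "inside_phoneme"
```

Only the `"phoneme_boundary"` outcome triggers the division of the pitch waveform data in step S16.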
- In step S15, if the computer C1 determines that the boundary between the latest one-pitch section of the pitch waveform data and the immediately preceding one-pitch section is a boundary between two different phonemes (or an end of the voice) (that is, if the above case (1) applies), it divides the pitch waveform data at the boundary between these two sections (step S16). On the other hand, if it determines that the boundary is not a boundary between two different phonemes (or an end of the voice), the process returns to step S13.
- the pitch waveform data is divided into a set of sections (phoneme data) corresponding to one phoneme.
- the computer C1 outputs these phoneme data and the pitch information generated in step S12 to the outside via its own serial communication control unit (step S17).
- The phoneme data obtained as a result of performing the above-described processing on the voice data having the waveform shown in FIG. 17(a) is obtained by dividing the voice data at the timings "t1" to "t19", which are the boundaries between different phonemes (or the ends of the voice), as shown, for example, in FIG. 5(a).
- the pitch waveform data is audio data in which the time length of a section corresponding to a unit pitch is standardized and the influence of pitch fluctuation is removed. For this reason, each phoneme data has an accurate periodicity throughout.
- Since phoneme data has the features described above, if phoneme data is subjected to data compression using an entropy coding method (specifically, a method such as arithmetic coding or Huffman coding), the phoneme data is compressed efficiently.
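- The gain from entropy coding can be illustrated with Huffman code lengths. This is a sketch, not the coder the patent mandates (arithmetic coding would behave similarly), and the four-level test signal is an artificial stand-in for quantized, periodic phoneme data:

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return a dict mapping each symbol to its Huffman code length; the
    total coded size approaches the entropy of the distribution, which is
    low for the highly repetitive samples of phoneme data."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# four quantization levels repeating periodically, as in pitch waveform data
data = [1, 2, 3, 4] * 64
lengths = huffman_code_lengths(data)
coded_bits = sum(lengths[s] for s in data)  # 2 bits/sample vs. 8-bit raw
```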
- the sound data is processed into pitch waveform data to remove the effects of pitch fluctuations.
- Further, the sum of the differences between two adjacent one-pitch sections represented by the pitch waveform data takes a sufficiently small value if the two sections represent the waveform of the same phoneme. Therefore, the risk of an error occurring in the determination in step S15 is reduced.
- the time length of each section of the pitch waveform data must be restored to the time length of the original voice data.
- the original audio data can be easily restored.
- the configuration of the pitch waveform data divider is not limited to the above.
- the computer C1 may acquire audio data serially transmitted from the outside via the serial communication control unit.
- audio data may be obtained from outside via a communication line such as a telephone line, a dedicated line, or a satellite line.
- the computer C1 only needs to include, for example, a modem and a DSU (Data Service Unit). Further, if audio data is obtained from a device other than the recording medium drive SMD, the computer C1 does not necessarily need to include the recording medium drive SMD.
- the computer C1 may include a sound collecting device including a microphone, an AF amplifier, a sampler, an A / D (Analog-to-Digital) converter, a PCM encoder, and the like.
- the sound collector amplifies the sound signal representing the sound collected by its own microphone, samples it, performs A / D conversion, and performs PCM modulation on the sampled sound signal to convert the sound data. You only need to get it.
- the audio data obtained by the computer C1 does not necessarily need to be a PCM signal.
- the computer C1 may write the phoneme data to a recording medium set in the recording medium drive SMD via the recording medium drive SMD. Alternatively, the data may be written to an external storage device such as a hard disk device. In these cases, the computer C1 only needs to include a control circuit such as a recording medium drive device or a hard disk controller.
- The computer C1 may also perform entropy encoding on the phoneme data and output the entropy-encoded phoneme data, according to the control of the phoneme separation program or other programs stored therein.
- The computer C1 does not need to perform both the cepstrum analysis and the analysis based on the autocorrelation function; the reciprocal of the fundamental frequency obtained by only one of the cepstrum analysis and the analysis based on the autocorrelation function may be treated directly as the pitch length.
- The amount by which the computer C1 shifts the phase of the audio data in each section does not need to be (−Ψ).
- For example, the computer C1 may set a real number δ, common to all sections, representing the initial phase, and shift the phase of the audio data by (−Ψ+δ).
- the position at which the computer C1 separates the audio data does not necessarily need to be the timing at which the pitch signal crosses zero, and may be, for example, the timing at which the pitch signal has a predetermined non-zero value.
- If the initial phase δ is set to 0 and the audio data is divided at the timings when the pitch signal crosses zero, the value at the start point of each section becomes close to 0, and the amount of noise included in each section as a result of dividing the audio data is reduced.
- difference data does not necessarily need to be generated sequentially according to the arrangement order of each section of the audio data, and each piece of difference data representing the sum of differences between adjacent one-pitch sections in the pitch waveform data is arbitrarily determined. They may be generated in order or in parallel.
- the filtering of the difference data need not be performed sequentially, but may be performed in an arbitrary order or in parallel.
- the interpolation of the phase-shifted audio data does not necessarily have to be performed by the Lagrange interpolation method.
- a linear interpolation method may be used, or the interpolation itself may be omitted.
- the computer C1 may generate and output information for identifying which of the phoneme data indicates a fricative or silence state.
- If the fluctuation of the pitch of the voice data to be processed into phoneme data is negligible, the computer C1 does not need to shift the phase of the voice data; the voice data may be regarded as pitch waveform data, and the processing from step S13 onward may be performed. Also, interpolation and resampling of the audio data are not necessarily required.
- the computer C1 does not need to be a dedicated system, but may be a personal computer or the like.
- The phoneme separation program may be installed on the computer C1 from a medium (a CD-ROM, an MO, a flexible disk, or the like) storing the phoneme separation program, or the phoneme separation program may be uploaded to a bulletin board system (BBS) and distributed via a communication line.
- Alternatively, a carrier wave may be modulated by a signal representing the phoneme separation program, the obtained modulated wave may be transmitted, and a device receiving this modulated wave may demodulate the modulated wave to restore the phoneme separation program.
- The phoneme separation program can execute the above-described processing by being activated and executed by the computer C1 under the control of the OS, in the same manner as other application programs.
- When the OS performs a part of the processing, the phoneme separation program stored in the recording medium may exclude the part that controls that processing.
- FIG. 6 is a diagram showing a configuration of a pitch waveform data divider according to a second embodiment of the present invention.
- the pitch waveform data divider includes a speech input unit 1, a pitch waveform extraction unit 2, a difference calculation unit 3, a difference data filter unit 4, a pitch absolute value signal generation unit 5, a pitch It comprises a logarithmic signal filter unit 6, a comparison unit 7, and an output unit 8.
- the audio input unit 1 is configured by, for example, a recording medium drive similar to the recording medium drive SMD in the first embodiment.
- the voice input unit 1 obtains voice data representing a voice waveform by reading it from a recording medium on which the voice data is recorded, and supplies the voice data to the pitch waveform extraction unit 2.
- The audio data is in the form of a PCM-modulated digital signal, and is assumed to represent audio sampled at a fixed period sufficiently shorter than the audio pitch.
- the pitch waveform extraction section 2, difference calculation section 3, difference data filter section 4, pitch absolute value signal generation section 5, pitch absolute value signal filter section 6, comparison section 7, and output section 8 are all DSPs, CPUs, etc. And a memory for storing a program to be executed by the processor.
- the pitch waveform extracting unit 2 divides the audio data supplied from the audio input unit 1 into sections corresponding to a unit pitch (for example, one pitch) of the audio represented by the audio data. Then, by performing phase shift and resampling of each section obtained by the division, the time length and the phase of each section are aligned to be substantially the same.
- audio data (pitch waveform data) in which the phase and time length of each section are aligned is supplied to the difference calculator 3.
- the pitch waveform extraction unit 2 generates a pitch signal described later, uses the pitch signal by itself as described later, and supplies the pitch signal to the pitch absolute value signal generation unit 5.
- the pitch waveform extraction unit 2 generates sample number information indicating the original number of samples in each section of the audio data, and supplies the information to the output unit 8.
- As shown in FIG. 7, the pitch waveform extraction unit 2 includes a cepstrum analysis unit 201, an autocorrelation analysis unit 202, a weight calculation unit 203, a BPF (Band-Pass Filter) coefficient calculation unit 204, a band-pass filter 205, a zero-cross analysis unit 206, a waveform correlation analysis unit 207, a phase adjustment unit 208, an interpolation unit 209, and a pitch length adjustment unit 210.
- Part or all of the functions of the cepstrum analysis unit 201, the autocorrelation analysis unit 202, the weight calculation unit 203, the BPF coefficient calculation unit 204, the band-pass filter 205, the zero-cross analysis unit 206, the waveform correlation analysis unit 207, the phase adjustment unit 208, the interpolation unit 209, and the pitch length adjustment unit 210 may be performed by a single processor.
- the pitch waveform extraction unit 2 specifies the pitch length by using both the cepstrum analysis and the analysis based on the autocorrelation function.
- The cepstrum analysis unit 201 specifies the fundamental frequency of the sound represented by the sound data by performing cepstrum analysis on the sound data supplied from the sound input unit 1, generates data indicating the specified fundamental frequency, and supplies the data to the weight calculation unit 203.
- Specifically, when the audio data is supplied from the audio input unit 1, the cepstrum analysis unit 201 first converts the intensity of the audio data into a value substantially equal to the logarithm of the original value (the base of the logarithm is arbitrary). Next, the cepstrum analysis unit 201 determines the spectrum of the converted speech data (that is, the cepstrum) by the fast Fourier transform method (or any other method that generates data representing the result of Fourier transform of a discrete variable).
- Then, the minimum value among the frequencies giving the maximum values of the cepstrum is specified as the fundamental frequency, data indicating the specified fundamental frequency is generated, and the data is supplied to the weight calculation unit 203.
- the autocorrelation analysis unit 202 identifies the fundamental frequency of the audio represented by the audio data based on the autocorrelation function of the waveform of the audio data. Then, data indicating the specified fundamental frequency is generated and supplied to the weight calculator 203.
- When the autocorrelation analysis unit 202 is supplied with the audio data from the audio input unit 1, it first specifies the autocorrelation function r(l). Then, among the frequencies giving maxima of the periodogram obtained as a result of Fourier transforming the specified autocorrelation function r(l), the minimum value exceeding a predetermined lower limit is specified as the fundamental frequency, and data indicating the specified fundamental frequency is generated and supplied to the weight calculator 203.
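- A simplified Python sketch of autocorrelation-based fundamental frequency estimation follows; unlike the unit described above, it picks the lag maximizing r(l) directly rather than taking the periodogram of r(l), and all names and parameters are illustrative:

```python
import numpy as np

def autocorr_f0(samples, rate, f0_min=50.0, f0_max=500.0):
    # Simplified stand-in: choose the lag maximizing r(l) within the
    # plausible pitch range; that lag is the pitch period in samples.
    x = samples - samples.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r(l) for l >= 0
    lag_lo = int(rate / f0_max)
    lag_hi = int(rate / f0_min)
    best_lag = lag_lo + np.argmax(r[lag_lo:lag_hi])
    return rate / best_lag

rate = 8000
t = np.arange(1200) / rate
frame = np.sign(np.sin(2 * np.pi * 200 * t))   # 200 Hz periodic waveform
f0 = autocorr_f0(frame, rate)
```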
- The BPF coefficient calculation unit 204 receives the data indicating the average pitch length from the weight calculation unit 203 and the zero-cross signal from the zero-cross analysis unit 206. Based on these, it determines whether or not the average pitch length and the zero-cross period differ from each other by a predetermined amount or more. If it determines that they do not, it controls the frequency characteristic of the bandpass filter 205 so that the reciprocal of the zero-cross period becomes the center frequency (the center frequency of the pass band of the bandpass filter 205). On the other hand, if it determines that they differ by the predetermined amount or more, it controls the frequency characteristic of the bandpass filter 205 so that the reciprocal of the average pitch length becomes the center frequency.
- the bandpass filter 205 performs the function of a FIR (Finite Impulse Response) type filter whose center frequency is variable.
- the band-pass filter 205 sets its own center frequency to a value according to the control of the BPF coefficient calculation unit 204.
- The bandpass filter 205 filters the audio data supplied from the audio input unit 1, and supplies the filtered audio data (the pitch signal) to the zero-cross analysis unit 206, the waveform correlation analysis unit 207, and the pitch absolute value signal generator 5.
- the pitch signal is composed of digital data having a sampling interval substantially equal to the sampling interval of the audio data. It is desirable that the bandwidth of the band-pass filter 205 is such that the upper limit of the pass band of the band-pass filter 205 always falls within twice the fundamental frequency of the voice represented by the voice data.
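- The variable-center-frequency band-pass filtering can be sketched with a windowed-sinc FIR design (an assumption of this sketch; the patent only requires an FIR-type filter whose center frequency is variable, not this particular design):

```python
import numpy as np

def fir_bandpass(center_hz, rate, half_width_hz=60.0, taps=255):
    # Windowed-sinc design: difference of two low-pass prototypes,
    # shaped by a Hamming window; center_hz is set by the controller
    # (e.g. to the reciprocal of the zero-cross period).
    n = np.arange(taps) - (taps - 1) / 2
    lo = (center_hz - half_width_hz) / rate
    hi = (center_hz + half_width_hz) / rate
    h = 2 * hi * np.sinc(2 * hi * n) - 2 * lo * np.sinc(2 * lo * n)
    return h * np.hamming(taps)

rate = 8000
h = fir_bandpass(200.0, rate)        # center frequency = 1 / pitch length
t = np.arange(2000) / rate
mixed = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 1000 * t)
pitch_signal = np.convolve(mixed, h, mode="same")
```

Filtering the mixed signal keeps the 200 Hz fundamental and strongly attenuates the 1000 Hz component, leaving a near-sinusoidal pitch signal.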
- The zero-cross analysis unit 206 specifies the timings at which the instantaneous value of the pitch signal supplied from the band-pass filter 205 becomes 0 (the timings at which zero crossings occur), and supplies a signal representing the specified timings (the zero-cross signal) to the BPF coefficient calculator 204. In this way, the pitch length of the audio data is specified.
- Alternatively, the zero-cross analysis unit 206 may specify the timings at which the instantaneous value of the pitch signal reaches a predetermined value other than 0, and supply a signal representing the specified timings to the BPF coefficient calculation unit 204 in place of the zero-cross signal.
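- A minimal sketch of the zero-cross analysis, assuming upward crossings and a synthetic 200 Hz pitch signal (function names are illustrative):

```python
import numpy as np

def zero_cross_period(pitch_signal, rate):
    # Upward zero crossings: instants where the instantaneous value reaches 0
    # coming from below; their average spacing is the pitch length in seconds.
    s = np.sign(pitch_signal)
    crossings = np.where((s[:-1] < 0) & (s[1:] >= 0))[0]
    return np.diff(crossings).mean() / rate

rate = 8000
t = np.arange(800) / rate
pitch_signal = np.sin(2 * np.pi * 200 * t)    # pitch signal from the filter
pitch_length = zero_cross_period(pitch_signal, rate)
```

The reciprocal of this zero-cross period is the candidate center frequency that the BPF coefficient calculation unit compares against the reciprocal of the average pitch length.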
- The waveform correlation analysis unit 207 divides the audio data at timings that come to a boundary of a unit period (for example, one period) of the pitch signal. Then, for each of the divided sections, it determines the correlation between the pitch signal in the section and the audio data in the section whose phase is variously changed, and specifies the phase of the audio data giving the highest correlation as the phase of the audio data in that section. In this way, the phase of the audio data is specified for each section.
- The waveform correlation analysis unit 207 specifies the above-described value ψ for each section, generates data indicating the value ψ, and supplies it to the phase adjustment unit 208 as phase data indicating the phase of the audio data in that section. It is desirable that the time length of a section be about one pitch.
- The phase adjustment unit 208 receives the audio data from the audio input unit 1 and the data indicating the phase ψ of each section of the audio data from the waveform correlation analysis unit 207, and aligns the phases of the sections by shifting the phase of the audio data in each section by (−ψ). Then, the phase-shifted audio data is supplied to the interpolation unit 209.
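- The correlation search over variously changed phases, and the subsequent shift by (−ψ), can be sketched as follows; circular shifts stand in for the phase changes, and the one-pitch section is synthetic:

```python
import numpy as np

def best_phase(section, pitch_section):
    # Try every circular shift of the section and keep the one giving the
    # highest correlation with the pitch signal over the same section.
    scores = [np.dot(np.roll(section, -k), pitch_section)
              for k in range(len(section))]
    return int(np.argmax(scores))

period = 40
t = np.arange(period)
pitch_section = np.sin(2 * np.pi * t / period)
section = np.roll(pitch_section, 7)      # audio lagging the pitch signal
psi = best_phase(section, pitch_section)
aligned = np.roll(section, -psi)         # shifting by (-psi) aligns the phase
```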
- the interpolation unit 209 performs Lagrange interpolation on the audio data (phase-shifted audio data) supplied from the phase adjustment unit 208 and supplies the result to the pitch length adjustment unit 210.
- The pitch length adjustment unit 210 resamples each section of the supplied audio data so that the time lengths of the sections become substantially identical to each other. Then, the audio data in which the time lengths of the sections are aligned (that is, the pitch waveform data) is supplied to the difference calculation unit 3.
- The pitch length adjustment unit 210 also generates sample number information indicating the original number of samples of each section of this audio data (the number of samples in each section at the time the audio data was supplied from the audio input unit 1 to the pitch length adjustment unit 210) and supplies it to the output unit 8.
- the sample number information is information for specifying the original time length of each section of the pitch waveform data, and corresponds to the pitch information in the first embodiment.
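- A sketch of the pitch length adjustment: each one-pitch section is resampled to a common length while its original sample count is kept as the sample number information (linear resampling is an assumption of this sketch; the patent allows interpolation such as Lagrange interpolation upstream):

```python
import numpy as np

def equalize_sections(sections, target_len):
    # Resample every one-pitch section to a common sample count, and keep
    # the original counts as the sample number information.
    sample_counts = [len(s) for s in sections]
    equalized = [np.interp(np.linspace(0.0, len(s) - 1, target_len),
                           np.arange(len(s)), s) for s in sections]
    return equalized, sample_counts

sections = [np.sin(np.linspace(0, 2 * np.pi, n)) for n in (38, 40, 43)]
pitch_waveform, sample_counts = equalize_sections(sections, 40)
```

Because the counts are retained, each section's original time length can be restored later by resampling back.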
- The difference calculation unit 3 generates, for each one-pitch section from the second onward from the beginning of the pitch waveform data, difference data representing the sum of the differences between that one-pitch section and the immediately preceding one-pitch section (specifically, for example, the above-mentioned value), and supplies it to the difference data filter unit 4.
- The difference data filter unit 4 generates data (filtered difference data) representing the result of filtering, with a low-pass filter, each difference data supplied from the difference calculation unit 3, and supplies it to the comparison unit 7.
- The pass band characteristic of the filtering of the difference data by the difference data filter unit 4 need only be such that the probability that the later-described determination performed by the comparison unit 7 becomes erroneous due to an error suddenly occurring in the difference data is sufficiently low.
- In general, it is preferable that the pass band characteristic of the difference data filter unit 4 be that of a second-order IIR type low-pass filter.
- the pitch absolute value signal generator 5 generates a signal (pitch absolute value signal) representing the absolute value of the instantaneous value of the pitch signal supplied from the pitch waveform extractor 2, and generates a pitch absolute value signal filter 6 To supply.
- The pitch absolute value signal filter unit 6 generates data (the filtered pitch signal) representing the result of filtering, with a low-pass filter, the pitch absolute value signal supplied from the pitch absolute value signal generator 5, and supplies it to the comparison unit 7.
- The pass band characteristic of the filtering by the pitch absolute value signal filter unit 6 need only be such that the probability that the determination performed by the comparison unit 7 becomes erroneous due to an error suddenly occurring in the pitch absolute value signal is sufficiently low. In general, it is preferable that the pass band characteristic of the pitch absolute value signal filter unit 6 be that of a second-order IIR type low-pass filter.
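- A second-order IIR low-pass of the kind both filter units prefer can be sketched with the standard biquad form; the cutoff, Q, and coefficient formulas below are illustrative choices, not taken from the patent:

```python
import math

def biquad_lowpass(cutoff_hz, rate, q=0.7071):
    # Second-order IIR low-pass coefficients (RBJ biquad cookbook form).
    w0 = 2 * math.pi * cutoff_hz / rate
    alpha = math.sin(w0) / (2 * q)
    c = math.cos(w0)
    a0 = 1 + alpha
    b = [(1 - c) / (2 * a0), (1 - c) / a0, (1 - c) / (2 * a0)]
    a = [1.0, -2 * c / a0, (1 - alpha) / a0]
    return b, a

def iir_filter(b, a, x):
    # Direct form I: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(3) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, 3) if n - i >= 0)
        y.append(acc)
    return y

b, a = biquad_lowpass(100.0, 8000.0)
smoothed = iir_filter(b, a, [1.0] + [0.0] * 200)   # impulse response
```

Such a filter smooths out the sudden, isolated errors in the difference data and the pitch absolute value signal while passing their slowly varying trends (unity gain at DC).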
- The comparison unit 7 determines, for each boundary between one-pitch sections adjacent to each other in the pitch waveform data, whether that boundary is a boundary between two different phonemes (or an end of the speech), a point in the middle of one phoneme, a point in the middle of a fricative sound, or a point during a silent state.
- The above-described determination by the comparison unit 7 need only be performed based on the above-described properties (a) and (b) of speech uttered by a person; for example, it need only be performed according to the above-described determination conditions (1) to (4).
- As the value of the intensity of the filtered pitch signal, for example, a peak value of the absolute value, an effective value, or an average of the absolute values may be used.
- The comparison unit 7 divides the pitch waveform data at those boundaries, among the boundaries between one-pitch sections adjacent to each other in the pitch waveform data, that are boundaries between two different phonemes (or ends of the voice). Then, each piece of data obtained by dividing the pitch waveform data (that is, the phoneme data) is supplied to the output unit 8.
- The output unit 8 includes, for example, a control circuit that controls serial communication with the outside in accordance with a standard such as RS232C, a processor such as a CPU, a memory that stores a program to be executed by the processor, and the like.
- The output unit 8 receives the phoneme data generated by the comparison unit 7 and the sample number information generated by the pitch waveform extraction unit 2, generates a bit stream representing the phoneme data and the sample number information, and outputs it.
- The pitch waveform data divider shown in FIG. 6 also processes voice data having the waveform shown in FIG. 5(a) into pitch waveform data and then divides it at the timings "t1" through "t19" shown in FIG. 5(a). As shown in FIG. 5(b), the boundary "T0" between two adjacent phonemes is correctly selected as a division timing.
- Each phoneme data generated by the pitch waveform data divider shown in FIG. 6 is not a mixture of the waveforms of a plurality of phonemes, and each phoneme data has accurate periodicity throughout. Therefore, if the pitch waveform data divider shown in FIG. 6 performs data compression on the generated phoneme data by the method of entropy coding, this phoneme data is compressed efficiently.
- Since the time length of each section of the pitch waveform data can be specified using the sample number information, the original voice data can be restored easily by restoring the time length of each section of the pitch waveform data to the time length in the original voice data.
- the configuration of the pitch waveform data divider is not limited to the above.
- The voice input unit 1 may acquire voice data from outside via a communication line such as a telephone line, a dedicated line, or a satellite line; in this case, it need only include a communication control unit composed of, for example, a modem and a DSU.
- the sound input unit 1 may include a sound collection device including a microphone, an AF amplifier, a sampler, an A / D converter, a PCM encoder, and the like.
- The sound collection device amplifies the audio signal representing the sound collected by its own microphone, samples it and performs A/D conversion, and then applies PCM modulation to the sampled audio signal, thereby acquiring audio data.
- the audio data acquired by the audio input unit 1 does not necessarily have to be a PCM signal.
- the pitch waveform extraction unit 2 may not include the cepstrum analysis unit 201 (or the autocorrelation analysis unit 202).
- In this case, the weight calculation unit 203 may use the reciprocal of the fundamental frequency obtained by the cepstrum analysis unit 201 (or the autocorrelation analysis unit 202) as the average pitch length as it is.
- The zero-cross analysis unit 206 may also supply the pitch signal supplied from the band-pass filter 205 as it is to the BPF coefficient calculation unit 204 as the zero-cross signal.
- the output unit 8 may output the phoneme data and the sample number information to the outside via a communication line or the like.
- the output unit 8 only needs to include a communication control unit composed of, for example, a modem or a DSU.
- The output unit 8 may include a recording medium drive device; in this case, the output unit 8 may write the phoneme data and the sample number information into a storage area of a recording medium set in the recording medium drive device.
- A single modem, DSU, or recording medium drive device may constitute both the audio input unit 1 and the output unit 8.
- The amount by which the phase adjustment unit 208 shifts the phase of the audio data in each section need not be (−ψ), and the positions at which the waveform correlation analysis unit 207 divides the audio data need not necessarily be timings at which the pitch signal crosses zero.
- the interpolation unit 209 does not necessarily need to perform the interpolation of the phase-shifted audio data by the Lagrange interpolation method.
- the interpolation unit 209 may employ a linear interpolation method.
- The phase adjustment unit 208 may also supply the phase-shifted audio data immediately to the pitch length adjustment unit 210 without interpolation.
- the comparing unit 7 may generate and output information for specifying which one of the phoneme data indicates a fricative sound or a silent state.
- the comparison unit 7 may perform entropy coding on the generated phoneme data and then supply the generated phoneme data to the output unit 8.
- FIG. 8 is a diagram showing the configuration of this synthesized speech utilization system.
- this synthesized speech utilization system is composed of a phoneme data supply unit T and a phoneme data utilization unit U.
- the phoneme data supply unit T generates phoneme data, performs data compression, and outputs the data as compressed phoneme data, which will be described later.
- The phoneme data use unit U takes as input the compressed phoneme data output from the phoneme data supply unit T, restores the phoneme data, and performs speech synthesis using the restored phoneme data.
- the phoneme data supply unit T includes, for example, an audio data division unit T1, a phoneme data compression unit T2, and a compressed phoneme data output unit T3.
- the audio data division unit T1 has, for example, substantially the same configuration as the pitch waveform data divider according to the above-described first or second embodiment.
- The audio data division unit T1 acquires audio data from the outside, processes this audio data into pitch waveform data, then divides it into sets of sections each corresponding to one phoneme to generate phoneme data and pitch information (sample number information), and supplies them to the phoneme data compression unit T2.
- The audio data division unit T1 may also acquire information representing the text read aloud in the audio data used to generate the phoneme data, convert this information into a phonetic character string representing the phonemes by a known method, and attach (label) each phonetic character included in the phonetic character string obtained by the conversion to the phoneme data representing the phoneme that reads out that phonetic character.
- Each of the phoneme data compression unit T2 and the compressed phoneme data output unit T3 includes a processor such as a DSP or a CPU, a memory for storing a program to be executed by the processor, and the like. Note that a single processor may perform some or all of the functions of the phoneme data compression unit T2 and the compressed phoneme data output unit T3, and a processor that performs the function of the audio data division unit T1 may further perform part or all of the functions of the phoneme data compression unit T2 and the compressed phoneme data output unit T3. As shown in FIG. 9, the phoneme data compression unit T2 includes a non-linear quantization section T21, a compression ratio setting section T22, and an entropy coding section T23.
- The nonlinear quantization unit T21 generates nonlinear quantized phoneme data equivalent to a quantized version of the values obtained by nonlinearly compressing the instantaneous values of the waveform represented by the phoneme data (specifically, for example, the values obtained by substituting the instantaneous values into an upwardly convex function). Then, the generated nonlinear quantized phoneme data is supplied to the entropy coding unit T23.
- Note that the nonlinear quantization unit T21 obtains compression characteristic data from the compression ratio setting unit T22 in order to specify the correspondence between the pre-compression and post-compression values of the instantaneous values, and performs the compression in accordance with the specified correspondence.
- Specifically, the nonlinear quantization unit T21 acquires, from the compression ratio setting unit T22, data specifying the function global_gain(xi) included on the right side of Equation 4 as compression characteristic data. Then, it performs nonlinear quantization by changing the instantaneous value of each frequency component after the nonlinear compression to a value substantially equal to the quantized value of Xri(xi) shown on the right side of Equation 4.
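- Since Equation 4 and global_gain(xi) are not reproduced here, a μ-law-style upwardly convex curve stands in for the compression characteristic in this illustrative round-trip sketch (all names and constants are assumptions):

```python
import math

MU = 255.0        # convexity of the stand-in curve (assumed, not from Equation 4)
LEVELS = 256      # number of quantization levels (assumed)

def nonlinear_quantize(x):
    # Upwardly convex compression (mu-law-style stand-in for Equation 4),
    # followed by uniform quantization of the compressed value.
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return round(compressed * (LEVELS / 2 - 1))

def nonlinear_dequantize(code):
    # Inverse characteristic, as used by the nonlinear inverse quantizer U3.
    compressed = code / (LEVELS / 2 - 1)
    return math.copysign(math.expm1(abs(compressed) * math.log1p(MU)) / MU,
                         compressed)

restored = nonlinear_dequantize(nonlinear_quantize(0.02))
```

The convex curve devotes more quantization levels to small instantaneous values, so quiet parts of the waveform survive the coarse quantization with small relative error.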
- The compression ratio setting unit T22 generates the above-described compression characteristic data specifying the correspondence (hereinafter, the compression characteristics) between the values before and after compression of the instantaneous values by the nonlinear quantization unit T21, and supplies it to the nonlinear quantization unit T21 and the entropy coding unit T23.
- Specifically, compression characteristic data for specifying the above-mentioned function global_gain(xi) is generated and supplied to the nonlinear quantization unit T21 and the entropy coding unit T23.
- To determine the compression characteristics, the compression ratio setting unit T22 obtains, for example, the compressed phoneme data from the entropy coding unit T23, and obtains the ratio of the data amount of the compressed phoneme data obtained from the entropy coding unit T23 to the data amount of the phoneme data obtained from the audio data division unit T1. It then determines whether or not this ratio is larger than a predetermined target compression ratio (for example, about 1/100). When it determines that the obtained ratio is larger than the target compression ratio, the compression ratio setting unit T22 determines the compression characteristics so that the compression ratio becomes smaller than the current one. On the other hand, when it determines that the obtained ratio is equal to or less than the target compression ratio, it determines the compression characteristics so that the compression ratio becomes larger than the current one.
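- The feedback between the achieved and the target compression ratio can be sketched as follows; the toy coder model (halving the output size per strength step) and all names are purely illustrative:

```python
def tune_compression(data_len, compressed_len_for, target_ratio=0.01, max_steps=16):
    # Feedback loop: while the achieved ratio exceeds the target, make the
    # compression characteristic more aggressive; otherwise stop.
    strength = 1
    ratio = compressed_len_for(strength) / data_len
    for _ in range(max_steps):
        if ratio <= target_ratio:
            break
        strength += 1                       # compress harder: ratio shrinks
        ratio = compressed_len_for(strength) / data_len
    return strength, ratio

# Toy model of the coder: each extra step of strength halves the output size.
strength, ratio = tune_compression(100000, lambda s: 100000 // (2 ** s))
```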
- The entropy coding unit T23 entropy-encodes the nonlinear quantized phoneme data supplied from the nonlinear quantization unit T21, the pitch information supplied from the audio data division unit T1, and the compression characteristic data supplied from the compression ratio setting unit T22 (specifically, for example, converts them into an arithmetic code or a Huffman code), and supplies the entropy-encoded data, as compressed phoneme data, to the compression ratio setting unit T22 and the compressed phoneme data output unit T3.
- the compressed phoneme data output unit T3 outputs the compressed phoneme data supplied from the entropy coding unit T23.
- the method of outputting is arbitrary.
- For example, the compressed phoneme data may be recorded on a computer-readable recording medium (for example, a CD (Compact Disc), a DVD (Digital Versatile Disc), or a flexible disk), or transmitted serially in a form conforming to standards such as Ethernet (registered trademark), USB (Universal Serial Bus), IEEE1394, or RS232C.
- the compressed phoneme data may be transmitted in parallel.
- The compressed phoneme data output unit T3 may also distribute the compressed phoneme data by a method such as uploading it to an external server via a network such as the Internet.
- When the compressed phoneme data output unit T3 records compressed phoneme data on a recording medium, it need only further include a recording medium drive device that writes data to the recording medium in accordance with instructions from a processor or the like; when it transmits the data serially, it need only further include a control circuit that controls external serial communication in accordance with standards such as Ethernet (registered trademark), USB, IEEE1394, or RS232C.
- The phoneme data use unit U includes a compressed phoneme data input unit U1, an entropy code decoding unit U2, a nonlinear inverse quantization unit U3, a phoneme data restoration unit U4, and a speech synthesis unit U5.
- Each of the compressed phoneme data input unit U1, the entropy code decoding unit U2, the nonlinear inverse quantization unit U3, and the phoneme data restoration unit U4 is composed of a processor such as a DSP or a CPU, and a memory for storing programs to be executed by this processor. Note that a single processor may perform part or all of the functions of the compressed phoneme data input unit U1, the entropy code decoding unit U2, the nonlinear inverse quantization unit U3, and the phoneme data restoration unit U4.
- The compressed phoneme data input unit U1 acquires the above-mentioned compressed phoneme data from the outside, and supplies the acquired compressed phoneme data to the entropy code decoding unit U2.
- The method by which the compressed phoneme data input unit U1 acquires the compressed phoneme data is arbitrary; for example, it may be obtained by reading compressed phoneme data recorded on a computer-readable recording medium, or by receiving compressed phoneme data transmitted serially in a form conforming to standards such as Ethernet (registered trademark), USB, IEEE1394, or RS232C, or transmitted in parallel.
- the compressed phoneme data input unit U1 may acquire the compressed phoneme data by a method such as downloading the compressed phoneme data stored in an external server via a network such as the Internet.
- When the compressed phoneme data input unit U1 reads compressed phoneme data from a recording medium, it need only further include a recording medium drive device that reads data from the recording medium in accordance with instructions from a processor or the like. Also, when it receives serially transmitted compressed phoneme data, it need only include a control circuit that controls the serial communication.
- The entropy code decoding unit U2 restores the nonlinear quantized phoneme data, the pitch information, and the compression characteristic data from the compressed phoneme data supplied from the compressed phoneme data input unit U1 (that is, the data obtained by entropy-encoding the nonlinear quantized phoneme data, pitch information, and compression characteristic data). Then, the restored nonlinear quantized phoneme data and compression characteristic data are supplied to the nonlinear inverse quantizer U3, and the restored pitch information is supplied to the phoneme data restoration unit U4.
- The nonlinear inverse quantizer U3 restores the phoneme data before the nonlinear quantization by changing the instantaneous values of the waveform represented by the nonlinear quantized phoneme data in accordance with a characteristic inverse to the compression characteristic indicated by the compression characteristic data. Then, it supplies the restored phoneme data to the phoneme data restoration unit U4.
- The phoneme data restoration unit U4 changes the time length of each section of the phoneme data supplied from the nonlinear inverse quantization unit U3 so that it becomes the time length indicated by the pitch information supplied from the entropy code decoding unit U2.
- the time length of the section may be changed by, for example, changing the interval and / or the number of samples in the section.
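- Restoring a section's time length from the pitch (sample number) information can be sketched as a resampling back to the recorded count; linear interpolation is an assumption of this sketch:

```python
import numpy as np

def restore_section(section, original_count):
    # Resample the equal-length section back to the sample count recorded
    # in the pitch (sample number) information.
    positions = np.linspace(0.0, len(section) - 1, original_count)
    return np.interp(positions, np.arange(len(section)), section)

equal_len_section = np.sin(np.linspace(0, 2 * np.pi, 40))
restored = restore_section(equal_len_section, 43)   # original length: 43 samples
```

This is the inverse of the equalization performed by the pitch length adjustment on the encoding side: the number of samples changes while the waveform shape is preserved.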
- The phoneme data restoration unit U4 supplies the phoneme data in which the time length of each section has been changed, that is, the restored phoneme data, to a waveform database U506 of the speech synthesis unit U5, described later.
- The speech synthesis unit U5 consists of a language processing unit U501, a word dictionary U502, a sound processing unit U503, a search unit U504, a decompression unit U505, a waveform database U506, a speech unit editing unit U507, a search unit U508, a speech unit database U509, a speech speed conversion unit U510, and a speech unit registration unit R.
- Each of the language processing unit U501, the sound processing unit U503, the search unit U504, the decompression unit U505, the speech unit editing unit U507, the search unit U508, and the speech speed conversion unit U510 includes a processor such as a CPU or a DSP, a memory for storing a program to be executed by the processor, and the like, and performs the processing described later.
- A single processor may perform part or all of the functions of the language processing unit U501, the sound processing unit U503, the search unit U504, the decompression unit U505, the speech unit editing unit U507, the search unit U508, and the speech speed conversion unit U510. Further, a processor that performs the functions of the compressed phoneme data input unit U1, the entropy code decoding unit U2, the nonlinear inverse quantization unit U3, or the phoneme data restoration unit U4 may further perform part or all of those functions.
- The word dictionary U502 is composed of a data-rewritable nonvolatile memory such as an EEPROM (Electrically Erasable Programmable Read Only Memory) or a hard disk device, and a control circuit that controls writing of data to the nonvolatile memory.
- the processor may perform the function of this control circuit.
- A processor that performs part or all of the functions of the compressed phoneme data input unit U1, entropy code decoding unit U2, nonlinear inverse quantization unit U3, phoneme data restoration unit U4, language processing unit U501, sound processing unit U503, search unit U504, decompression unit U505, speech unit editing unit U507, search unit U508, and speech speed conversion unit U510 may perform the function of the control circuit of the word dictionary U502.
- In the word dictionary U502, words and the like including ideographic characters (for example, kanji) and phonograms (for example, kana and phonetic symbols) representing the readings of those words and the like are stored in association with each other in advance by the manufacturer of the speech synthesis system or the like.
- The word dictionary U502 may also acquire, in accordance with a user operation, a word or the like including an ideographic character and a phonogram representing its reading from outside, and store them in association with each other.
- Note that, of the nonvolatile memory constituting the word dictionary U502, a portion for storing data stored in advance may be constituted by a non-rewritable nonvolatile memory such as a PROM (Programmable Read Only Memory).
- The waveform database U506 is composed of a data-rewritable nonvolatile memory such as an EEPROM or a hard disk device, and a control circuit for controlling the writing of data to the nonvolatile memory.
- the processor may perform the function of this control circuit.
- A processor that performs part or all of the functions of the search unit U508 and the speech speed conversion unit U510 may perform the function of the control circuit of the waveform database U506.
- In the waveform database U506, phonograms and phoneme data representing the waveforms of the phonemes represented by the phonograms are stored in association with each other in advance. The waveform database U506 also stores the phoneme data supplied from the phoneme data restoration unit U4 and the phonetic characters representing the phonemes whose waveforms are represented by that phoneme data in association with each other. Note that, of the nonvolatile memory constituting the waveform database U506, a portion for storing data stored in advance may be constituted by a non-rewritable nonvolatile memory such as a PROM.
- the waveform database U506 may store, together with the phoneme data, data representing voice separated by units such as VCV (Vowel-Consonant-Vowel) syllables.
- The speech unit database U509 is composed of a data-rewritable nonvolatile memory such as an EEPROM or a hard disk device.
- The speech unit database U509 stores, for example, data having the data structure shown in the figure; that is, as shown there, the data stored in the speech unit database U509 is divided into four parts: a header portion HDR, an index portion IDX, a directory portion DIR, and a data portion DAT.
- Data is stored in the speech unit database U509 in advance by, for example, the manufacturer of the speech synthesis system, and/or by the speech unit registration unit R performing the operation described later.
- Note that, of the nonvolatile memory constituting the speech unit database U509, a portion that stores data stored in advance may be constituted by a non-rewritable nonvolatile memory such as a PROM.
- The header portion HDR stores data for identifying the speech unit database U509, and data indicating the data amounts, data formats, attribution of copyright, and the like of the index portion IDX, the directory portion DIR, and the data portion DAT.
- The data portion DAT stores compressed speech unit data obtained by performing entropy coding on speech unit data representing the waveforms of speech units.
- a speech unit refers to one continuous section including one or more phonemes in a voice, and usually includes one or more words.
- the speech unit data before the entropy encoding need only be composed of data in the same format as the phoneme data (for example, digital format data subjected to PCM).
- FIG. 11 illustrates a case in which compressed speech unit data of 1401h bytes, representing the waveform of a speech unit reading "Saitama", is stored in the data portion DAT at a logical position starting at address 01A36A6h. (In this specification and the drawings, a number suffixed with "h" represents a hexadecimal number.)
- At least the data of (A) (that is, the speech unit reading data) of the data sets (A) to (E) described above are stored in the storage area of the speech unit database U509 sorted in an order determined by the phonetic characters represented by the speech unit reading data (for example, in the order of the kana syllabary if the phonetic characters are kana, or in alphabetical order, arranged in descending order of address).
- The pitch component data described above consists of data approximating the frequency of the pitch component of the speech unit with a linear function of the elapsed time from the beginning of the speech unit, that is, data indicating the intercept β and the gradient α of this linear function. The unit of the gradient α may be, for example, [hertz/second], and the unit of the intercept β may be, for example, [hertz].
- The pitch component data preferably further includes data (not shown) indicating whether or not the speech unit represented by the compressed speech unit data is nasalized and whether or not it is devoiced.
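- Fitting the pitch component data (gradient α and intercept β) to a measured pitch track can be sketched with a least-squares linear fit; the sample pitch contour below is illustrative:

```python
import numpy as np

def pitch_component(times, f0_track):
    # Least-squares linear fit of the pitch frequency against elapsed time:
    # gradient alpha in [hertz/second], intercept beta in [hertz].
    alpha, beta = np.polyfit(times, f0_track, 1)
    return alpha, beta

times = np.linspace(0.0, 0.5, 11)        # elapsed time from the unit's start [s]
f0_track = 220.0 - 40.0 * times          # a falling pitch contour
alpha, beta = pitch_component(times, f0_track)
```

Storing only α and β (plus the nasalization/devoicing flags) compresses the whole pitch contour of a speech unit into a few values.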
- The index portion IDX stores data for specifying the approximate logical position of data in the directory portion DIR based on the speech unit reading data. Specifically, for example, assuming that the speech unit reading data represents kana, a kana character and data (a directory address) indicating the range of addresses in which speech unit reading data whose first character is that kana character can be found are stored in association with each other. Note that a single nonvolatile memory may perform some or all of the functions of the word dictionary U502, the waveform database U506, and the speech unit database U509.
- The speech piece registration unit R consists of a recorded speech piece data set storage unit U511, a speech piece database creation unit U512, and a compression unit U513. Note that the speech piece registration unit R may be detachably connected to the speech piece database U509; in this case, except when new data is to be written to the speech piece database U509, the main unit M may be made to perform the operations described below with the speech piece registration unit R detached from the main unit M.
- The recorded speech piece data set storage unit U511 is composed of a rewritable non-volatile memory such as a hard disk device and is connected to the speech piece database creation unit U512. Note that the recorded speech piece data set storage unit U511 may be connected to the speech piece database creation unit U512 via a network.
- The recorded speech piece data set storage unit U511 stores, in association with each other, phonograms representing the readings of speech pieces and speech piece data representing waveforms obtained by collecting actual utterances of those speech pieces; these are stored in advance by the manufacturer of this speech synthesis system. The speech piece data may consist, for example, of PCM-format digital data.
- The speech piece database creation unit U512 and the compression unit U513 each include a processor such as a CPU, a memory storing a program to be executed by the processor, and the like, and perform the processing described later in accordance with this program.
- A single processor may perform some or all of the functions of the speech piece database creation unit U512 and the compression unit U513, and a processor performing some or all of the functions of the compressed phoneme data input unit U1, the entropy code decoding unit U2, the nonlinear inverse quantization unit U3, the phoneme data restoration unit U4, the language processing unit U501, the sound processing unit U503, the search unit U504, the decompression unit U505, the speech piece editing unit U507, the search unit U508, and the speech speed conversion unit U510 may further perform the functions of the speech piece database creation unit U512 and the compression unit U513. Also, the processor performing the functions of the speech piece database creation unit U512 and the compression unit U513 may double as the control circuit of the recorded speech piece data set storage unit U511.
- The speech piece database creation unit U512 reads the mutually associated phonograms and speech piece data from the recorded speech piece data set storage unit U511, and identifies the time change of the frequency of the pitch component and the utterance speed of the speech represented by the speech piece data. The utterance speed may be identified, for example, by counting the number of samples of the speech piece data.
- The time change of the frequency of the pitch component may be identified, for example, by performing cepstrum analysis on the speech piece data. Specifically, the waveform represented by the speech piece data is divided into many small portions on the time axis, each obtained small portion is converted so that its intensity becomes a value substantially equal to the logarithm of its original value (the base of the logarithm is arbitrary), and the spectrum of the converted small portion (that is, the cepstrum) is obtained by the fast Fourier transform (or any other method of generating data representing the result of Fourier-transforming a discrete variable). Then, the minimum value among the frequencies giving maxima of this cepstrum is identified as the frequency of the pitch component in this small portion.
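The cepstrum-based pitch estimation described above can be sketched roughly as follows; a naive DFT stands in for the FFT, and the synthetic frame, the log floor, and the search range are illustrative assumptions rather than values from the patent:

```python
import math

def dft_mag(x):
    """Magnitude of the DFT of a real sequence (naive O(N^2), illustration only)."""
    n = len(x)
    return [abs(sum(x[t] * complex(math.cos(2 * math.pi * k * t / n),
                                   -math.sin(2 * math.pi * k * t / n))
                    for t in range(n))) for k in range(n)]

def pitch_period_by_cepstrum(frame):
    """Estimate the pitch period (in samples) of one small portion."""
    spec = dft_mag(frame)
    # Logarithm of the spectral intensity; the small floor keeps numerical
    # noise in near-zero bins out of the log.
    log_spec = [math.log(s + 1e-3) for s in spec]
    ceps = dft_mag(log_spec)           # "spectrum of the log spectrum"
    lo, hi = 20, len(frame) // 2       # skip the spectral-envelope region
    peak = max(ceps[lo:hi])
    # Take the smallest lag whose cepstral value is essentially maximal,
    # mirroring the "minimum among the maxima" rule in the text.
    return next(q for q in range(lo, hi) if ceps[q] >= 0.99 * peak)

# Synthetic voiced frame: period 50 samples (160 Hz at 8 kHz sampling).
frame = [math.sin(2 * math.pi * t / 50) + 0.5 * math.sin(4 * math.pi * t / 50)
         for t in range(400)]
print(pitch_period_by_cepstrum(frame))  # 50
```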
- Alternatively, the time change of the frequency of the pitch component may be identified by extracting a pitch signal by filtering the speech piece data in substantially the same manner as the method performed by the pitch waveform data divider of the first or second embodiment or by the audio data division unit T1, dividing the waveform represented by the speech piece data into sections of unit pitch length based on the extracted pitch signal, converting the speech piece data into a pitch waveform signal, and then performing cepstrum analysis or the like using the obtained pitch waveform signal as the speech piece data.
- the speech unit database creation unit U512 supplies the speech unit data read out from the recorded speech unit data set storage unit U511 to the compression unit U513.
- The compression unit U513 creates compressed speech piece data by entropy-coding the speech piece data supplied from the speech piece database creation unit U512, and returns it to the speech piece database creation unit U512.
- When the utterance speed of the speech piece data and the time change of the frequency of its pitch component have been identified, and this speech piece data has been entropy-coded and returned as compressed speech piece data from the compression unit U513, the speech piece database creation unit U512 writes this compressed speech piece data into the storage area of the speech piece database U509 as data constituting the data section DAT.
- Also, the speech piece database creation unit U512 writes the phonograms read from the recorded speech piece data set storage unit U511, as data indicating the reading of the speech piece represented by the written compressed speech piece data, into the storage area of the speech piece database U509 as speech piece reading data.
- Further, the head address of the written compressed speech piece data within the storage area of the speech piece database U509 is identified, and this address is written into the storage area of the speech piece database U509 as the data of (B) described above. In addition, the data length of this compressed speech piece data is identified, and the identified data length is written into the storage area of the speech piece database U509 as the data of (C).
- Next, the operation of this speech synthesis system will be explained, assuming first that the language processing unit U501 has acquired from the outside free text data describing a sentence (free text) containing ideographic characters, prepared by the user as a target for synthesizing speech with this speech synthesis system.
- the method by which the language processing unit U501 acquires the free text data is arbitrary.
- For example, the language processing unit U501 may acquire the free text data from an external device or a network via an interface circuit (not shown), or may read it, via a recording medium drive (not shown), from a recording medium (for example, a floppy (registered trademark) disk or CD-ROM) set in that drive.
- Also, the processor performing the function of the language processing unit U501 may pass text data used in other processing being executed by itself to the processing of the language processing unit U501 as free text data.
- Upon acquiring the free text data, the language processing unit U501 identifies, by searching the word dictionary U502, the phonogram representing the reading of each ideographic character included in the free text, and replaces each ideographic character with the identified phonogram. The language processing unit U501 then supplies the phonogram string, obtained as a result of replacing all ideographic characters in the free text with phonograms, to the sound processing unit U503.
- When the sound processing unit U503 is supplied with the phonogram string, it instructs the search unit U504 to search, for each phonogram included in the string, for the waveform of the unit voice represented by that phonogram.
- The search unit U504 searches the waveform database U506 in response to this instruction and retrieves phoneme data representing the waveform of the unit voice represented by each phonogram included in the phonogram string. It then supplies the retrieved phoneme data to the sound processing unit U503 as the search result.
- The sound processing unit U503 supplies the phoneme data supplied from the search unit U504 to the speech piece editing unit U507 in an order corresponding to the sequence of the phonograms in the phonogram string supplied from the language processing unit U501.
- Upon receiving the phoneme data from the sound processing unit U503, the speech piece editing unit U507 combines the phoneme data with each other in the order in which they are supplied and outputs the result as data representing synthesized speech (synthesized speech data). This synthesized speech, synthesized on the basis of the free text, corresponds to speech synthesized by the rule synthesis method.
- the method by which the sound piece editing unit U507 outputs synthesized speech data is arbitrary.
- For example, the synthesized speech represented by the synthesized speech data may be reproduced via a D/A (Digital-to-Analog) converter (not shown).
- the data may be transmitted to an external device or a network via an interface circuit (not shown), or may be written to a recording medium set in a recording medium drive device (not shown) via the recording medium drive device.
- the processor performing the function of the sound piece editing unit U507 may transfer the synthesized speech data to another process executed by itself.
- Next, assume that the sound processing unit U503 has acquired data representing a phonogram string distributed from the outside (distribution character string data). (The method by which the sound processing unit U503 acquires the distribution character string data is also arbitrary; for example, it may acquire the distribution character string data in the same manner as the language processing unit U501 acquires the free text data.)
- the sound processing unit U503 handles the phonetic character string represented by the distribution character string data in the same manner as the phonetic character string supplied from the language processing unit U501.
- the search unit U504 searches for phoneme data corresponding to phonetic characters included in the phonetic character string represented by the distribution character string data.
- The retrieved phoneme data are supplied to the speech piece editing unit U507 via the sound processing unit U503, and the speech piece editing unit U507 combines them with each other in an order corresponding to the sequence of the phonograms in the phonogram string represented by the distribution character string data, and outputs the result as synthesized speech data. This synthesized speech data, synthesized on the basis of the distribution character string data, also represents speech synthesized by the rule synthesis method.
- Next, assume that the speech piece editing unit U507 has acquired fixed message data, utterance speed data, and collation level data.
- The fixed message data is data representing a fixed message as a phonogram string, and the utterance speed data is data indicating a specified value of the utterance speed of the fixed message represented by the fixed message data (a specified value of the time length for uttering this fixed message).
- The collation level data is data specifying the search conditions for the search processing, described later, performed by the search unit U508; hereinafter it takes one of the values "1", "2", and "3", with "3" indicating the strictest search condition.
- The method by which the speech piece editing unit U507 acquires the fixed message data, the utterance speed data, and the collation level data is arbitrary; for example, they may be acquired in the same manner as the language processing unit U501 acquires the free text data.
- When the fixed message data, the utterance speed data, and the collation level data are supplied to the speech piece editing unit U507, it instructs the search unit U508 to search for all compressed speech piece data associated with phonograms matching the phonograms representing the readings of the speech pieces included in the fixed message.
- In response to the instruction of the speech piece editing unit U507, the search unit U508 searches the speech piece database U509, retrieves the corresponding compressed speech piece data together with the above-described speech piece reading data, speed initial value data, and pitch component data associated with that compressed speech piece data, and supplies the retrieved compressed speech piece data to the decompression unit U505. Even when a plurality of compressed speech piece data correspond to one speech piece, all of the corresponding compressed speech piece data are retrieved as candidates for the data used for speech synthesis.
- On the other hand, when there is a speech piece for which no compressed speech piece data could be retrieved, the search unit U508 generates data identifying the corresponding speech piece (hereinafter referred to as missing part identification data).
- the decompression unit U505 restores the compressed speech piece data supplied from the search unit U508 to the speech piece data before being compressed, and returns it to the search unit U508.
- The search unit U508 supplies the speech piece data returned from the decompression unit U505, together with the retrieved speech piece reading data, speed initial value data, and pitch component data, to the speech speed conversion unit U510 as the search result. When missing part identification data has been generated, the missing part identification data is also supplied to the speech speed conversion unit U510.
- Meanwhile, the speech piece editing unit U507 instructs the speech speed conversion unit U510 to convert the speech piece data supplied to the speech speed conversion unit U510 so that the time length of the speech piece represented by the data matches the speed indicated by the utterance speed data.
- In response to the instruction of the speech piece editing unit U507, the speech speed conversion unit U510 converts the speech piece data supplied from the search unit U508 so as to match the instruction and supplies it to the speech piece editing unit U507. Specifically, for example, the original time length of the speech piece data supplied from the search unit U508 may be identified on the basis of the retrieved speed initial value data, and the speech piece data may then be resampled so that its number of samples corresponds to a time length matching the speed indicated by the speech piece editing unit U507.
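The resampling step can be sketched as follows; linear interpolation is my choice of resampling method, since the patent does not specify one:

```python
# Hedged sketch of speech speed conversion: the speech piece is resampled
# so that its sample count corresponds to the requested time length.
# Linear interpolation is an assumption; any resampling method would do.

def resample_to_length(samples, target_len):
    """Linearly interpolate `samples` to exactly `target_len` samples."""
    if target_len <= 1 or len(samples) <= 1:
        return list(samples[:target_len])
    out = []
    step = (len(samples) - 1) / (target_len - 1)
    for i in range(target_len):
        pos = i * step
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

# Shortening a 10-sample piece to 8 samples makes the utterance faster
# while preserving the overall waveform shape.
orig = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
fast = resample_to_length(orig, 8)
print(len(fast))  # 8
```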
- The speech speed conversion unit U510 also supplies the speech piece reading data and the pitch component data supplied from the search unit U508 to the speech piece editing unit U507, and when missing part identification data is supplied from the search unit U508, that missing part identification data is also supplied to the speech piece editing unit U507.
- When no utterance speed data is supplied to the speech piece editing unit U507, the speech piece editing unit U507 may simply instruct the speech speed conversion unit U510 to supply the speech piece data supplied to the speech speed conversion unit U510 without conversion, and in response to this instruction the speech speed conversion unit U510 may supply the speech piece data supplied from the search unit U508 to the speech piece editing unit U507 as it is.
- When the speech piece editing unit U507 is supplied with the speech piece data, the speech piece reading data, and the pitch component data from the speech speed conversion unit U510, it selects, from among the supplied speech piece data, one speech piece data representing a waveform that can be regarded as close to the waveform of each speech piece constituting the fixed message. Here, the speech piece editing unit U507 changes, in accordance with the acquired collation level data, the conditions that must be satisfied for a waveform to be treated as close to a speech piece of the fixed message.
- Specifically, the speech piece editing unit U507 first predicts the prosody (accent, intonation, stress, and so on) of the fixed message by analyzing the fixed message represented by the fixed message data on the basis of a prosody prediction method such as the Fujisaki model or ToBI (Tone and Break Indices).
- Then, for example, if the value of the collation level data is "1", all the speech piece data supplied from the speech speed conversion unit U510 (that is, speech piece data whose reading matches a speech piece in the fixed message) are treated as representing waveforms close to the waveforms of the speech pieces in the fixed message.
- If the value of the collation level data is "2", a speech piece data is treated as representing a close waveform only when it satisfies the condition of (1) (that is, the condition that the phonograms indicating the reading match) and, in addition, the content of the pitch component data representing the frequency of the pitch component of the speech piece data matches the prediction result for the accent of the corresponding speech piece in the fixed message.
- The prediction result for the accent of a speech piece in the fixed message can be identified from the prediction result for the prosody of the fixed message; the speech piece editing unit U507 may, for example, interpret the position at which the frequency of the pitch component is predicted to be highest as the predicted accent position. As for the position of the accent of the speech piece represented by the speech piece data, the position at which the frequency of the pitch component is highest may be identified on the basis of the above-described pitch component data and interpreted as the accent position.
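Under the linear pitch model carried by the pitch component data, locating the highest-pitch position reduces to checking the sign of the gradient; this simplified reading is mine and is not spelled out in the patent:

```python
# Illustrative only: with pitch frequency modeled as alpha * t + beta over
# a piece of length `duration`, the frequency is highest at t = 0 when the
# pitch falls (alpha < 0) and at t = duration when it rises.

def accent_position(alpha, duration):
    """Time (s) at which the modeled pitch-component frequency peaks."""
    return 0.0 if alpha < 0 else duration

print(accent_position(-20.0, 0.4))  # 0.0  (falling pitch: accent at onset)
print(accent_position(15.0, 0.4))   # 0.4  (rising pitch: accent at end)
```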
- If the value of the collation level data is "3", a speech piece data is treated as representing a close waveform only when it satisfies the condition of (2) (that is, matching phonograms and accent) and, in addition, whether the voice represented by the speech piece data is voiced or devoiced matches the prosody prediction result for the fixed message.
- The speech piece editing unit U507 may determine whether the voice represented by a speech piece data is voiced or devoiced on the basis of the pitch component data supplied from the speech speed conversion unit U510.
- If, for one speech piece, a plurality of speech piece data satisfy the conditions set in accordance with the collation level data, the speech piece editing unit U507 narrows down these plural speech piece data according to conditions stricter than the set ones. Specifically, for example, if the set condition corresponds to the collation level data value "1" and a plurality of speech piece data apply, those also matching the search condition corresponding to the value "2" are selected; if a plurality of speech piece data are still selected, those also matching the search condition corresponding to the value "3" are selected from among them; and if a plurality of speech piece data remain even after narrowing down by the search condition corresponding to the value "3", the remainder may be narrowed down to one by an arbitrary standard.
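The narrowing procedure might look like the following sketch; the candidate fields and the fall-back behavior when a stricter test would empty the set are my assumptions:

```python
# Hypothetical sketch of collation-level narrowing: candidates that match
# the reading are filtered by successively stricter conditions (accent,
# then voicing/devoicing); if a stricter test would empty the set, the
# earlier matches are kept, and any remaining tie is broken arbitrarily.

def select_speech_piece(candidates, target):
    conditions = [
        lambda c: c["reading"] == target["reading"],        # level "1"
        lambda c: c["accent_pos"] == target["accent_pos"],  # level "2"
        lambda c: c["devoiced"] == target["devoiced"],      # level "3"
    ]
    remaining = list(candidates)
    for cond in conditions:
        if len(remaining) <= 1:
            break
        filtered = [c for c in remaining if cond(c)]
        if filtered:
            remaining = filtered
    return remaining[0] if remaining else None  # arbitrary tie-break: first

candidates = [
    {"id": 1, "reading": "さいたま", "accent_pos": 0, "devoiced": False},
    {"id": 2, "reading": "さいたま", "accent_pos": 1, "devoiced": False},
    {"id": 3, "reading": "さいたま", "accent_pos": 1, "devoiced": True},
]
target = {"reading": "さいたま", "accent_pos": 1, "devoiced": True}
print(select_speech_piece(candidates, target)["id"])  # 3
```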
- On the other hand, when missing part identification data is also supplied from the speech speed conversion unit U510, the speech piece editing unit U507 extracts from the fixed message data the phonogram string representing the reading of the speech piece indicated by the missing part identification data, supplies it to the sound processing unit U503, and instructs the sound processing unit U503 to synthesize the waveform of this speech piece.
- the sound processing unit U503 that receives the instruction handles the phonetic character string supplied from the speech unit editing unit U507 in the same manner as the phonetic character string represented by the distribution character string data.
- As a result, phoneme data representing the waveform of the voice indicated by each phonogram included in the phonogram string is retrieved by the search unit U504 and supplied from the search unit U504 to the sound processing unit U503.
- the sound processing unit U503 supplies the phoneme data to the speech unit editing unit U507.
- Upon receiving this phoneme data from the sound processing unit U503, the speech piece editing unit U507 combines this phoneme data and the speech piece data selected by the speech piece editing unit U507 from among the speech piece data supplied from the speech speed conversion unit U510, in an order corresponding to the arrangement of the speech pieces in the fixed message indicated by the fixed message data, and outputs the result as data representing the synthesized speech. When no missing part identification data is included in the data supplied from the speech speed conversion unit U510, the speech piece editing unit U507 immediately combines the selected speech piece data in an order corresponding to the arrangement of the speech pieces in the fixed message indicated by the fixed message data, and outputs the result as data representing the synthesized speech, without instructing the sound processing unit U503 to synthesize a waveform.
- The configuration of this speech synthesis system is not limited to the one described above. For example, the speech piece database U509 does not necessarily need to store the speech piece data in a compressed state. When the speech piece database U509 stores waveform data and speech piece data in an uncompressed state, the speech synthesis unit U5 need not include the decompression unit U505.
- Also, the waveform database U506 may store the phoneme data in a compressed state. In this case, the decompression unit U505 may retrieve from the search unit U504 the phoneme data that the search unit U504 has retrieved from the waveform database U506, decompress it, and return it to the search unit U504, and the search unit U504 may then treat the returned phoneme data as the search result.
- Also, the speech piece database creation unit U512 may read the speech piece data and phonogram strings that are the material of new compressed speech piece data to be added to the speech piece database U509 from a recording medium set in a recording medium drive (not shown), via this recording medium drive.
- the speech unit registration unit R does not necessarily need to include the recorded speech unit data set storage unit U511.
- Also, the pitch component data may be data representing the time change of the pitch length of the speech piece represented by the speech piece data. In this case, the speech piece editing unit U507 may identify, on the basis of the pitch component data, the position at which the pitch length is shortest and interpret this position as the accent position.
- Also, the speech piece editing unit U507 may store in advance prosody registration data representing the prosody of a specific speech piece, and when the fixed message includes this specific speech piece, treat the prosody represented by the prosody registration data as the result of the prosody prediction.
- Also, the speech piece editing unit U507 may newly store results of past prosody prediction as prosody registration data.
- The speech piece database creation unit U512 may include a microphone, an amplifier, a sampling circuit, an A/D (Analog-to-Digital) converter, a PCM encoder, and the like.
- In this case, instead of acquiring the speech piece data from the recorded speech piece data set storage unit U511, the speech piece database creation unit U512 may create the speech piece data by amplifying an audio signal representing the voice collected by its own microphone, sampling and A/D-converting it, and then subjecting the sampled audio signal to PCM modulation.
- Also, the speech piece editing unit U507 may supply the waveform data returned from the sound processing unit U503 to the speech speed conversion unit U510 so that the time length of the waveform represented by the waveform data matches the speed indicated by the utterance speed data.
- Also, the speech piece editing unit U507 may, for example, acquire free text data together with the language processing unit U501, select speech piece data representing a waveform close to the waveform of a speech piece included in the free text represented by the free text data by performing substantially the same processing as the processing for selecting speech piece data representing a waveform close to the waveform of a speech piece included in the fixed message, and use the selected data for speech synthesis.
- In this case, the sound processing unit U503 need not cause the search unit U504 to search for the phoneme data representing the waveform of the speech piece selected by the speech piece editing unit U507. The speech piece editing unit U507 may notify the sound processing unit U503 of speech pieces that the sound processing unit U503 need not synthesize, and the sound processing unit U503 may, in response to this notification, stop searching for the waveforms of the unit voices constituting those speech pieces.
- Also, the speech piece editing unit U507 may, for example, acquire distribution character string data together with the sound processing unit U503, select speech piece data representing a waveform close to the waveform of a speech piece included in the distribution character string represented by the distribution character string data by performing substantially the same processing as the processing for selecting speech piece data representing a waveform close to the waveform of a speech piece included in the fixed message, and use the selected data for speech synthesis. In this case too, the sound processing unit U503 need not cause the search unit U504 to search for the phoneme data representing the waveform of the speech piece selected by the speech piece editing unit U507.
- The phoneme data supply unit T and the phoneme data utilization unit U need not be dedicated systems. A phoneme data supply unit T that performs the above-described processing can be configured by installing, on a personal computer, a program from a recording medium storing a program for causing the personal computer to execute the operations of the above-described audio data division unit T1, phoneme data compression unit T2, and compressed phoneme data output unit T3. Likewise, a phoneme data utilization unit U that executes the above-described processing can be configured by installing, on a personal computer, a program from a recording medium storing a program for causing the personal computer to execute the operations of the compressed phoneme data input unit U1, the entropy code decoding unit U2, the nonlinear inverse quantization unit U3, the phoneme data restoration unit U4, and the speech synthesis unit U5.
- A personal computer that executes the above-described program and functions as the phoneme data supply unit T performs, as the process corresponding to the operation of the phoneme data supply unit T, the process shown in FIG. 12.
- FIG. 12 is a flowchart showing the processing of the personal computer for performing the function of the phoneme data supply unit T.
- First, the personal computer performing the function of the phoneme data supply unit T acquires audio data representing a speech waveform (FIG. 12, step S001).
- Then, this personal computer (hereinafter called the phoneme data supply computer) generates phoneme data and pitch information by performing substantially the same processing as steps S2 to S16 performed by the computer C1 of the first embodiment (step S002).
- Next, the phoneme data supply computer generates the above-mentioned compression characteristic data (step S003), generates, in accordance with the compression characteristic data, nonlinear quantized phoneme data corresponding to values obtained by nonlinearly compressing the instantaneous values of the waveform represented by the phoneme data generated in step S002 (step S004), and generates compressed phoneme data by entropy-coding the generated nonlinear quantized phoneme data, the pitch information generated in step S002, and the compression characteristic data generated in step S003 (step S005).
- Next, the phoneme data supply computer determines whether the ratio of the data amount of the compressed phoneme data most recently generated in step S005 to the data amount of the phoneme data generated in step S002 (that is, the current compression rate) has reached the target compression rate (step S006); if it has, the process proceeds to step S007, and if it has not, the process returns to step S003.
- When the process returns from step S006 to step S003, the phoneme data supply computer determines the compression characteristics so that the compression rate becomes smaller than the current one if the current compression rate is larger than the target compression rate, and determines the compression characteristics so that the compression rate becomes larger than the current one if the current compression rate is smaller than the target compression rate.
- In step S007, the phoneme data supply computer outputs the compressed phoneme data most recently generated in step S005.
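The feedback loop of steps S003 through S007 can be sketched as below; zlib stands in for the entropy coder and a simple quantization step width stands in for the compression characteristics, both being my assumptions rather than the patent's method:

```python
import zlib

def quantize_and_pack(samples, step):
    """Coarsely quantize integer samples and entropy-code them (zlib here)."""
    quantized = bytes(min(255, abs(s) // step) for s in samples)
    return zlib.compress(quantized)

def compress_to_target(samples, target_ratio, max_tries=16):
    """Coarsen the quantization until the compression rate reaches target."""
    step = 1
    raw_size = len(samples)               # one byte per sample, for simplicity
    for _ in range(max_tries):
        blob = quantize_and_pack(samples, step)   # steps S004-S005
        ratio = len(blob) / raw_size              # current compression rate (S006)
        if ratio <= target_ratio:
            return blob, step                     # target reached: output (S007)
        step *= 2                                 # coarser characteristic (S003)
    return blob, step

samples = [(i * i * 37) % 200 for i in range(1000)]  # stand-in waveform
blob, step = compress_to_target(samples, 0.1)
print(len(blob) / 1000 <= 0.1)  # True
```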
- A personal computer that executes the above-described program and functions as the phoneme data utilization unit U performs the processes shown in FIGS. 13 to 16 as the processes corresponding to the operation of the phoneme data utilization unit U.
- FIG. 13 is a flowchart showing the process in which the personal computer performing the function of the phoneme data utilization unit U acquires phoneme data.
- FIG. 14 is a flowchart showing a speech synthesis process when the personal computer performing the function of the phoneme data utilization unit U acquires the free text data.
- FIG. 15 is a flowchart showing a speech synthesis process when the personal computer performing the function of the phoneme data utilization unit U obtains the distribution character string data.
- FIG. 16 is a flowchart showing the speech synthesis process in the case where the personal computer performing the function of the phoneme data utilization unit U acquires fixed message data and utterance speed data.
- When the personal computer performing the function of the phoneme data utilization unit U (hereinafter called the phoneme data utilizing computer) acquires compressed phoneme data output by the phoneme data supply unit T or the like (FIG. 13, step S101), it decodes this compressed phoneme data, which corresponds to entropy-coded nonlinear quantized phoneme data, pitch information, and compression characteristic data, thereby restoring the nonlinear quantized phoneme data, the pitch information, and the compression characteristic data (step S102).
- Next, the phoneme data utilizing computer restores the phoneme data before quantization by changing the instantaneous values of the waveform represented by the restored nonlinear quantized phoneme data in accordance with a characteristic inverse to the compression characteristic indicated by the compression characteristic data (step S103).
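The inverse-characteristic restoration of step S103 can be illustrated with a μ-law-style companding pair; the μ-law curve is a stand-in, since the patent leaves the exact nonlinear characteristic to the compression characteristic data:

```python
import math

MU = 255.0  # companding constant; an illustrative assumption

def compress_value(x):
    """Forward nonlinear compression of an instantaneous value, |x| <= 1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand_value(y):
    """The inverse characteristic: undoes compress_value."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Restoring a compressed instantaneous value recovers the original:
print(round(expand_value(compress_value(0.5)), 9))  # 0.5
```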
- Next, the phoneme data utilizing computer changes the time length of each section of the phoneme data restored in step S103 so that it becomes the time length indicated by the pitch information restored in step S102 (step S104).
- Then, the phoneme data utilizing computer stores the phoneme data whose section time lengths have been changed, that is, the restored phoneme data, in the waveform database U506 (step S105).
- When the phoneme data utilizing computer acquires free text data from the outside, it identifies, for each ideographic character included in the free text represented by the free text data, the phonogram representing its reading by searching the general word dictionary 2 or the user word dictionary 3, and replaces the ideographic character with the identified phonogram (step S202).
- the method by which the phoneme data-using computer obtains free text data is optional.
- When the phoneme data utilizing computer obtains the phonogram string in which all ideographic characters in the free text have been replaced with phonograms, it searches the waveform database 7 for the waveform of the unit voice represented by each phonogram included in the phonogram string, and retrieves phoneme data representing the waveform of the unit voice represented by each phonogram included in the phonogram string (step S203).
- the phoneme data utilizing computer combines the retrieved phoneme data in the order of the phonetic characters in the phonetic character string and outputs the result as synthesized voice data (step S204).
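Steps S203 and S204 together amount to a lookup-and-concatenate pass over the phonetic character string. The single-character keys and the waveform lists below are hypothetical placeholders for the unit-speech waveforms held in the waveform database 7.

```python
# Hypothetical waveform database keyed by phonetic character; the real
# waveform database 7 stores unit-speech waveforms as phoneme data.
WAVEFORM_DB = {"a": [0.1, 0.2], "i": [0.3], "u": [0.4, 0.5]}

def synthesize(phonogram_string):
    # Steps S203-S204 sketch: look up each phonogram's unit-speech waveform
    # and join the waveforms in string order into synthesized voice data.
    voice = []
    for ch in phonogram_string:
        voice.extend(WAVEFORM_DB[ch])
    return voice
```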
- the method by which the phoneme data utilizing computer outputs the synthesized speech data is arbitrary.
- when the phoneme data utilizing computer obtains the above-mentioned distribution character string data from an external source by an arbitrary method (FIG. 15, step S301), then, for each phonetic character included in the phonetic character string represented by the distribution character string data, it searches the waveform database 7 for the waveform of the unit speech represented by that phonetic character and retrieves the phoneme data representing the waveform of the unit speech represented by each phonetic character in the string (step S302).
- the phoneme data utilizing computer combines the retrieved phoneme data in the order of the phonetic characters in the phonetic character string and outputs the result as synthesized speech data by the same processing as in step S204 (step S303).
- when the phoneme data utilizing computer obtains the above-mentioned fixed message data and utterance speed data from outside by an arbitrary method (FIG. 16, step S401), it first retrieves all the compressed speech piece data associated with phonetic characters matching the readings of the speech pieces included in the fixed message represented by the fixed message data (step S402).
- in step S402, the above-described speech piece reading data, speed initial value data, and pitch component data associated with the corresponding compressed speech piece data are also retrieved. If more than one item of compressed speech piece data corresponds to a single speech piece, all of the corresponding items are retrieved. If, on the other hand, there is a speech piece for which no compressed speech piece data can be retrieved, the above-mentioned missing portion identification data is generated.
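The retrieval of step S402 can be sketched as gathering every candidate piece per reading while recording the readings that have no piece at all. The store below is a hypothetical stand-in for the compressed speech piece data, and representing the missing portion identification data as a plain list of readings is an assumption for illustration.

```python
# Hypothetical store of compressed speech piece data keyed by reading;
# a single reading may map to several recorded pieces.
PIECE_DB = {
    "こんにちは": ["piece_hello_1", "piece_hello_2"],
    "です": ["piece_desu"],
}

def retrieve_pieces(message_readings):
    # Step S402 sketch: collect all candidate pieces per reading, and
    # generate missing portion identification data for readings that
    # match no compressed speech piece data.
    candidates, missing = {}, []
    for reading in message_readings:
        if reading in PIECE_DB:
            candidates[reading] = list(PIECE_DB[reading])
        else:
            missing.append(reading)
    return candidates, missing
```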
- the phoneme data utilizing computer restores the retrieved compressed speech piece data to the speech piece data as it was before compression (step S403). Then, by the same processing as that performed by the speech piece editing unit 8 described above, the restored speech piece data is converted so that the time length of the speech piece it represents matches the speed indicated by the utterance speed data (step S404). When no utterance speed data is supplied, the restored speech piece data need not be converted.
- the phoneme data utilizing computer predicts the prosody of the fixed message by analyzing the fixed message represented by the fixed message data with a prosody prediction method (step S405).
- in the same manner as the speech piece editing unit 8, the phoneme data utilizing computer then selects, from the speech piece data whose time lengths have been converted, the speech piece data representing the waveform closest to the waveform of each speech piece constituting the fixed message, one item per speech piece, according to the criteria indicated by the collation level data obtained from outside (step S406).
- specifically, the phoneme data utilizing computer identifies the speech piece data in accordance with, for example, the above-described conditions (1) to (3).
- speech piece data whose reading, as indicated by the phonetic characters, matches that of a speech piece in the fixed message is regarded as representing the waveform of that speech piece.
- speech piece data is regarded as representing the waveform of a speech piece in the fixed message only if the phonetic characters indicating the reading match and, in addition, the content of the pitch component data, which indicates the time variation of the frequency of the pitch component of the speech piece data, matches the prosody prediction result for that speech piece in the fixed message.
- speech piece data is regarded as representing the waveform of a speech piece in the fixed message only if the phonetic characters and accent representing the reading match and, in addition, the presence or absence of voicing or devoicing in the speech represented by the speech piece data agrees with the prosody prediction result for the fixed message.
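The three collation criteria above can be sketched as a single predicate parameterized by a collation level. The field names on `piece` and `target`, and the mapping of levels 1 to 3 onto the three conditions, are assumptions for illustration; the patent describes the conditions but not this data layout.

```python
def matches(piece, target, level):
    # Sketch of the three collation criteria: level 1 checks only the
    # reading; level 2 also requires the pitch component data to match
    # the prosody prediction; level 3 requires matching accent and
    # agreement on voicing/devoicing with the prediction.
    if piece["reading"] != target["reading"]:
        return False
    if level == 1:
        return True
    if level == 2:
        return piece["pitch_contour"] == target["predicted_pitch_contour"]
    return (piece["accent"] == target["accent"]
            and piece["voicing"] == target["predicted_voicing"])
```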
- when the phoneme data utilizing computer has generated missing portion identification data, it extracts from the fixed message data a phonetic character string representing the reading of the speech piece indicated by that data and, treating this string in the same manner as a phonetic character string represented by distribution character string data, performs the processing of step S302 described above to retrieve the phoneme data representing the waveform of the speech represented by each phonetic character in the string (step S407).
- the phoneme data utilizing computer combines the retrieved phoneme data and the speech piece data selected in step S406 in the order of the speech pieces in the fixed message indicated by the fixed message data, and outputs the result as data representing the synthesized speech (step S408).
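The final combination of step S408 can be sketched as walking the message in order and taking, for each speech piece, either the piece selected in step S406 or the fallback phoneme data retrieved in step S407. The dictionary-based parameters are assumptions made for illustration.

```python
def assemble_message(message_readings, selected_pieces, fallback_phonemes):
    # Step S408 sketch: in message order, use the speech piece selected in
    # step S406 where one exists, otherwise the phoneme data retrieved in
    # step S407 for the missing portion.
    output = []
    for reading in message_readings:
        if reading in selected_pieces:
            output.append(selected_pieces[reading])
        else:
            output.append(fallback_phonemes[reading])
    return output
```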
- a program that causes a personal computer to perform the functions of the main unit M and the voice piece registration unit R may be uploaded to, for example, a bulletin board system (BBS) on a communication line and distributed via the communication line.
- alternatively, carrier waves may be modulated by signals representing these programs, the resulting modulated waves transmitted, and a device that receives the modulated waves may demodulate them to restore the programs.
- the program excluding that part may be stored in the recording medium. In this case as well, in the present invention, the recording medium is assumed to store a program for executing each function or step executed by the computer.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Electrophonic Musical Instruments (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE04711759T DE04711759T1 (de) | 2003-02-17 | 2004-02-17 | Sprachsyntheseverarbeitungssystem |
US10/546,072 US20060195315A1 (en) | 2003-02-17 | 2004-02-17 | Sound synthesis processing system |
EP04711759A EP1596363A4 (en) | 2003-02-17 | 2004-02-17 | SPEECH SYNTHESIS PROCESSING SYSTEM |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2003038738 | 2003-02-17 | ||
JP2003-038738 | 2003-02-17 | ||
JP2004038858A JP4407305B2 (ja) | 2003-02-17 | 2004-02-16 | ピッチ波形信号分割装置、音声信号圧縮装置、音声合成装置、ピッチ波形信号分割方法、音声信号圧縮方法、音声合成方法、記録媒体及びプログラム |
JP2004-038858 | 2004-02-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2004072952A1 true WO2004072952A1 (ja) | 2004-08-26 |
Family
ID=32871204
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2004/001712 WO2004072952A1 (ja) | 2003-02-17 | 2004-02-17 | 音声合成処理システム |
Country Status (5)
Country | Link |
---|---|
US (1) | US20060195315A1 (ja) |
EP (1) | EP1596363A4 (ja) |
JP (1) | JP4407305B2 (ja) |
DE (1) | DE04711759T1 (ja) |
WO (1) | WO2004072952A1 (ja) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI235823B (en) * | 2004-09-30 | 2005-07-11 | Inventec Corp | Speech recognition system and method thereof |
US9672811B2 (en) * | 2012-11-29 | 2017-06-06 | Sony Interactive Entertainment Inc. | Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection |
JP6646001B2 (ja) * | 2017-03-22 | 2020-02-14 | 株式会社東芝 | 音声処理装置、音声処理方法およびプログラム |
TWI672690B (zh) * | 2018-03-21 | 2019-09-21 | 塞席爾商元鼎音訊股份有限公司 | 人工智慧語音互動之方法、電腦程式產品及其近端電子裝置 |
JP7427957B2 (ja) * | 2019-12-20 | 2024-02-06 | ヤマハ株式会社 | 音信号変換装置、楽器、音信号変換方法および音信号変換プログラム |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS63175899A (ja) * | 1987-01-16 | 1988-07-20 | シャープ株式会社 | 音声分析合成装置 |
JPS63287226A (ja) * | 1987-05-20 | 1988-11-24 | Fujitsu Ltd | 音声符号化伝送装置 |
JPH03233500A (ja) * | 1989-12-22 | 1991-10-17 | Oki Electric Ind Co Ltd | 音声合成方式およびこれに用いる装置 |
JPH05233565A (ja) * | 1991-11-12 | 1993-09-10 | Fujitsu Ltd | 音声合成システム |
JPH0723020A (ja) * | 1993-06-16 | 1995-01-24 | Fujitsu Ltd | 符号化制御方式 |
JPH0887297A (ja) * | 1994-09-20 | 1996-04-02 | Fujitsu Ltd | 音声合成システム |
JPH09232911A (ja) * | 1996-02-21 | 1997-09-05 | Oki Electric Ind Co Ltd | Iir型周期的時変フィルタとその設計方法 |
JPH11249677A (ja) * | 1998-03-02 | 1999-09-17 | Hitachi Ltd | 音声合成装置の韻律制御方法 |
JP2001249678A (ja) * | 2000-03-03 | 2001-09-14 | Nippon Telegr & Teleph Corp <Ntt> | 音声出力装置,音声出力方法および音声出力のためのプログラム記録媒体 |
JP2001306087A (ja) * | 2000-04-26 | 2001-11-02 | Ricoh Co Ltd | 音声データベース作成装置および音声データベース作成方法および記録媒体 |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4624012A (en) * | 1982-05-06 | 1986-11-18 | Texas Instruments Incorporated | Method and apparatus for converting voice characteristics of synthesized speech |
US4852168A (en) * | 1986-11-18 | 1989-07-25 | Sprague Richard P | Compression of stored waveforms for artificial speech |
DE3888547T2 (de) * | 1987-01-16 | 1994-06-30 | Sharp Kk | Gerät zur Sprachanalyse und -synthese. |
US5283833A (en) * | 1991-09-19 | 1994-02-01 | At&T Bell Laboratories | Method and apparatus for speech processing using morphology and rhyming |
US5390278A (en) * | 1991-10-08 | 1995-02-14 | Bell Canada | Phoneme based speech recognition |
DE69232112T2 (de) * | 1991-11-12 | 2002-03-14 | Fujitsu Ltd | Vorrichtung zur Sprachsynthese |
US6122616A (en) * | 1993-01-21 | 2000-09-19 | Apple Computer, Inc. | Method and apparatus for diphone aliasing |
JP3085631B2 (ja) * | 1994-10-19 | 2000-09-11 | 日本アイ・ビー・エム株式会社 | 音声合成方法及びシステム |
US5864812A (en) * | 1994-12-06 | 1999-01-26 | Matsushita Electric Industrial Co., Ltd. | Speech synthesizing method and apparatus for combining natural speech segments and synthesized speech segments |
US6052441A (en) * | 1995-01-11 | 2000-04-18 | Fujitsu Limited | Voice response service apparatus |
US5799276A (en) * | 1995-11-07 | 1998-08-25 | Accent Incorporated | Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals |
JP3349905B2 (ja) * | 1996-12-10 | 2002-11-25 | 松下電器産業株式会社 | 音声合成方法および装置 |
US6490562B1 (en) * | 1997-04-09 | 2002-12-03 | Matsushita Electric Industrial Co., Ltd. | Method and system for analyzing voices |
US6754630B2 (en) * | 1998-11-13 | 2004-06-22 | Qualcomm, Inc. | Synthesis of speech from pitch prototype waveforms by time-synchronous waveform interpolation |
EP1163663A2 (en) * | 1999-03-15 | 2001-12-19 | BRITISH TELECOMMUNICATIONS public limited company | Speech synthesis |
JP3728173B2 (ja) * | 2000-03-31 | 2005-12-21 | キヤノン株式会社 | 音声合成方法、装置および記憶媒体 |
JP2002091475A (ja) * | 2000-09-18 | 2002-03-27 | Matsushita Electric Ind Co Ltd | 音声合成方法 |
CN100568343C (zh) * | 2001-08-31 | 2009-12-09 | 株式会社建伍 | 生成基音周期波形信号的装置和方法及处理语音信号的装置和方法 |
- 2004
- 2004-02-16 JP JP2004038858A patent/JP4407305B2/ja not_active Expired - Lifetime
- 2004-02-17 US US10/546,072 patent/US20060195315A1/en not_active Abandoned
- 2004-02-17 WO PCT/JP2004/001712 patent/WO2004072952A1/ja active Application Filing
- 2004-02-17 EP EP04711759A patent/EP1596363A4/en not_active Withdrawn
- 2004-02-17 DE DE04711759T patent/DE04711759T1/de active Pending
Non-Patent Citations (1)
Title |
---|
See also references of EP1596363A4 * |
Also Published As
Publication number | Publication date |
---|---|
US20060195315A1 (en) | 2006-08-31 |
JP2004272236A (ja) | 2004-09-30 |
DE04711759T1 (de) | 2006-03-09 |
EP1596363A4 (en) | 2007-07-25 |
EP1596363A1 (en) | 2005-11-16 |
JP4407305B2 (ja) | 2010-02-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7647226B2 (en) | Apparatus and method for creating pitch wave signals, apparatus and method for compressing, expanding, and synthesizing speech signals using these pitch wave signals and text-to-speech conversion using unit pitch wave signals | |
US20070106513A1 (en) | Method for facilitating text to speech synthesis using a differential vocoder | |
CN100568343C (zh) | 生成基音周期波形信号的装置和方法及处理语音信号的装置和方法 | |
EP0380572A1 (en) | SPEECH SYNTHESIS FROM SEGMENTS OF DIGITAL COARTICULATED VOICE SIGNALS. | |
WO2006095925A1 (ja) | 音声合成装置、音声合成方法及びプログラム | |
WO2004109659A1 (ja) | 音声合成装置、音声合成方法及びプログラム | |
JPS5827200A (ja) | 音声認識装置 | |
WO2004072952A1 (ja) | 音声合成処理システム | |
JP4256189B2 (ja) | 音声信号圧縮装置、音声信号圧縮方法及びプログラム | |
JP4264030B2 (ja) | 音声データ選択装置、音声データ選択方法及びプログラム | |
JP2000132193A (ja) | 信号符号化装置及び方法、並びに信号復号装置及び方法 | |
JP4736699B2 (ja) | 音声信号圧縮装置、音声信号復元装置、音声信号圧縮方法、音声信号復元方法及びプログラム | |
JP2005018037A (ja) | 音声合成装置、音声合成方法及びプログラム | |
JPWO2007015489A1 (ja) | 音声検索装置及び音声検索方法 | |
JP3994332B2 (ja) | 音声信号圧縮装置、音声信号圧縮方法、及び、プログラム | |
JP3976169B2 (ja) | 音声信号加工装置、音声信号加工方法及びプログラム | |
JP3994333B2 (ja) | 音声辞書作成装置、音声辞書作成方法、及び、プログラム | |
JP2003216172A (ja) | 音声信号加工装置、音声信号加工方法及びプログラム | |
JP4209811B2 (ja) | 音声選択装置、音声選択方法及びプログラム | |
TW526466B (en) | Encoding and voice integration method of phoneme | |
JP4780188B2 (ja) | 音声データ選択装置、音声データ選択方法及びプログラム | |
Morris et al. | A new speech synthesis chip set | |
KR19980037321A (ko) | 텍스트 음성합성 장치 및 방법 | |
JPH0552520B2 (ja) | ||
JPH03189698A (ja) | 符号化装置及び符号化方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2004711759 Country of ref document: EP Ref document number: 2006195315 Country of ref document: US Ref document number: 10546072 Country of ref document: US |
|
WWP | Wipo information: published in national office |
Ref document number: 2004711759 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 10546072 Country of ref document: US |