JP2931059B2

JP2931059B2 - Speech synthesis method and device used for the same

Info

Publication number: JP2931059B2
Application number: JP2240243A
Authority: JP
Inventors: 隆矢頭
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1989-12-22
Filing date: 1990-09-11
Publication date: 1999-08-09
Anticipated expiration: 2014-08-09
Also published as: JPH03233500A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a speech synthesis method, and an apparatus for it, in which waveform-domain speech data is stored in a storage area in advance and is then read out from the storage area to edit and synthesize speech.

(Prior Art) Research on synthesis by rule, which outputs arbitrary speech according to rules, is active. The speech synthesis methods currently used most often for synthesis by rule are analysis-synthesis methods of the LPC (linear predictive coding) family, such as PARCOR (partial autocorrelation coefficient) and LSP (line spectrum pair), but these leave something to be desired in the clarity of the synthesized speech. As an attempt to improve clarity, methods that use speech segment waveforms extracted from the spectral envelope of speech have been proposed (for example, Reference I: JP-A-60-22194, and Reference II: Proceedings of the Acoustical Society of Japan (1-2-16), October 1983, p. 73).

The method of obtaining speech segment waveforms disclosed in these references is briefly described with reference to FIGS. 2(A) and 2(B). FIG. 2(A) shows the flow for extracting a speech segment, in particular a segment of voiced sound, and FIG. 2(B) shows the waveforms produced by that flow, each drawn beside the corresponding step. First, the sound uttered by a speaker is captured as the original speech with a microphone or the like (step S10). A fixed analysis interval is cut out of this original speech (S20), and its spectral envelope is extracted (S30). An inverse FFT (FFT: Fast Fourier Transform) is then applied to the spectral envelope with all phase components set to zero, which yields a symmetric waveform (S40). A simple nonlinear transformation is applied to correct the sharp peak that appears at the time origin of the symmetric waveform (S50). The resulting symmetric waveform is multiplied by a time window centered on the time origin, which trims the regions near both ends and fixes the segment length (S60). Finally, segment power is normalized so that all extracted segments have the same power (S70).
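
Steps S40, S60, and S70 above can be sketched in Python as follows. This is only an illustrative sketch, not the procedure of the cited references: the function name zero_phase_segment, the Hann window, and the half_len parameter are assumptions, and the nonlinear peak correction of S50 is omitted because the text does not specify it.

```python
import math


def zero_phase_segment(mags, half_len, power=1.0):
    """Build one symmetric segment from spectral-envelope magnitudes.

    mags     : envelope magnitudes for DFT bins 0..M (M = half the DFT size)
    half_len : samples kept on each side of the time origin (assumed parameter)
    power    : common power (sum of squares) given to every segment
    """
    m = len(mags) - 1
    size = 2 * m
    segment = []
    for n in range(-half_len, half_len + 1):
        # S40: with all phases zero the inverse DFT reduces to a cosine sum,
        # so the result is symmetric about n = 0.
        acc = mags[0] + mags[m] * math.cos(math.pi * n)
        acc += 2.0 * sum(
            mags[k] * math.cos(2.0 * math.pi * k * n / size) for k in range(1, m)
        )
        # S60: time window centered on the origin (a Hann window is assumed here).
        window = 0.5 * (1.0 + math.cos(math.pi * n / (half_len + 1)))
        segment.append(window * acc / size)
    # S70: scale so that every segment has the same power.
    energy = sum(v * v for v in segment)
    gain = math.sqrt(power / energy)
    return [gain * v for v in segment]
```

Because every segment built this way is zero-phase and has the same power, segments from different frames can be compared sample by sample.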

For synthesis by rule, one conceivable approach is to prepare many individual speech segment waveforms obtained in this way and use each as a basic unit of speech. To obtain high-quality synthesized speech, however, the flow of the spectrum within a reasonably continuous stretch of speech must be represented faithfully, so a CV syllable, a VCV phoneme chain, or the like is generally chosen as the basic unit. In that case, the analysis shown in FIGS. 2(A) and 2(B) is applied to the original CV or VCV speech at every fixed time (frame) period, and the resulting series of speech segments is treated as one group forming a basic unit of speech, encoded (S80 in FIG. 2(A)), and stored in the storage device of the speech synthesizer.

(Problems to be Solved by the Invention) Conventional speech synthesis using encoded data of the speech segment waveforms themselves is simple to process and yields clear synthesized speech, but because it handles waveforms directly as speech data, the amount of speech data to be stored is large.

Accordingly, an object of the present invention is to provide a speech synthesis method and apparatus that reduce the volume of speech data to be stored in the storage area and generate clear synthesized speech from a small amount of data.

(Means for Solving the Problems) To achieve this object, the method of the present invention is a speech synthesis method in which natural speech is analyzed at every fixed frame period, speech segment data on the speech segments extracted from the spectral envelope of the speech in each frame is stored in a storage device in advance, and speech is synthesized by synthesizing speech segments from the stored speech segment data. The method is characterized in that the speech segment waveforms are modified so that all speech segments have the same power and, for every frequency component of the waveform, the same phase characteristic; the difference between the speech segment waveforms extracted in two adjacent frames is obtained as a difference waveform; and each difference waveform is converted to encoded data and stored in the storage device as the speech segment data of the later of the two adjacent frames.

In carrying out the invention, the encoded data of the difference waveform of two adjacent frames preferably consists of the value obtained by encoding the difference waveform with a number of coding bits that depends on the dynamic range of its amplitude, together with that number of coding bits.

In carrying out the invention, the encoded data of the difference waveform is preferably stored in the storage device when the dynamic range of the difference waveform is smaller than the dynamic range of the speech segment extracted in the later of the two adjacent frames; when the dynamic range of the difference waveform is equal to or larger than that of the speech segment, the encoded data of the speech segment waveform is stored in the storage device instead of the encoded data of the difference waveform.

In carrying out the invention, the encoded data of the difference waveform of two adjacent frames may also be encoded with a number of coding bits that satisfies a predetermined quantization-error threshold for encoding that difference waveform.

In carrying out the invention, the encoded data of the difference waveform may also be stored in the storage device when the quantization error of encoding the difference waveform is smaller than the quantization error of encoding the speech segment extracted in the later of the adjacent frames with the same number of coding bits; when the quantization error of encoding the difference waveform is equal to or larger than that of encoding the speech segment, the encoded data of the speech segment is stored in the storage device instead of the encoded data of the difference waveform.

In carrying out the invention, when encoding the difference waveform of two adjacent frames, the number of coding bits is preferably determined adaptively for each difference waveform according to its properties, and the difference waveform encoded with that number of bits is combined with the number of coding bits to form the difference waveform encoded data.

In a preferred embodiment of the invention, it is determined for each frame which gives the better coding accuracy when encoded with the same number of coding bits: the speech segment waveform of that frame, or the difference waveform between the speech segment waveforms of that frame and the adjacent preceding frame. The encoded data with the better accuracy is stored in the storage device as the encoded data of the speech segment waveform of that frame.

In carrying out the invention, the encoded data stored in the storage device preferably includes a flag that identifies whether the data encodes a difference waveform or the speech segment waveform itself.

The apparatus of the present invention is a speech synthesizer comprising a storage device that stores speech segment data on the speech segments obtained by analyzing natural speech at every fixed frame period and extracting a segment from the spectral envelope of the speech in each frame, and synthesizing means that reads the speech segment data from the storage device and reproduces it to synthesize speech for output to an external device. The apparatus is characterized in that the speech segment waveforms are modified so that all speech segments have the same power and, for every frequency component of the waveform, the same phase characteristic; the speech segment data is the encoded data of a difference waveform, that is, the difference between the speech segment waveforms extracted in two adjacent frames; and the synthesizing means decodes the encoded data of the difference waveform read from the storage device to reproduce the speech segments.

In carrying out the invention, the encoded data preferably consists of the value obtained by encoding each difference waveform with a number of coding bits that depends on the dynamic range of its amplitude, together with that number of coding bits.

In carrying out the invention, the encoded data preferably consists of the value obtained by encoding each difference waveform with a number of coding bits that satisfies a predetermined quantization error, together with that number of coding bits.

In another form of the apparatus of the invention, a speech synthesizer comprises a storage device that stores speech segment data on the speech segments obtained by analyzing natural speech at every fixed frame period and extracting a segment from the spectral envelope of the speech in each frame, and synthesizing means that reads the speech segment data from the storage device and reproduces it to synthesize speech for output to an external device. The speech segment waveforms are modified so that all speech segments have the same power and, for every frequency component of the waveform, the same phase characteristic. When the dynamic range of the difference waveform, that is, the difference between the speech segment waveforms extracted in two adjacent frames, is smaller than the dynamic range of the speech segment extracted in the later of the two frames, the encoded data of the difference waveform is stored in the storage device as the speech segment data. When the dynamic range of the difference waveform is equal to or larger than that of the speech segment waveform, the encoded data of the speech segment waveform is stored in the storage device as the speech segment data. The speech segment data stored in the storage device includes a flag that identifies whether the data encodes the difference waveform or the speech segment waveform itself. The synthesizing means preferably reads the flag and the encoded data from the storage device and, according to the flag, switches between reproduction based on the difference waveform and reproduction based on the segment waveform.

In yet another form of the apparatus of the invention, a speech synthesizer comprises a storage device that stores speech segment data on the speech segments obtained by analyzing natural speech at every fixed frame period and extracting a segment from the spectral envelope of the speech in each frame, and synthesizing means that reads the speech segment data from the storage device and reproduces it to synthesize speech for output to an external device. The speech segment waveforms are modified so that all speech segments have the same power and, for every frequency component of the waveform, the same phase characteristic. When the quantization error of encoding the difference waveform, that is, the difference between the speech segment waveforms extracted in two adjacent frames, is smaller than the quantization error of encoding the speech segment extracted in the later of the adjacent frames with the same number of coding bits, the encoded data of the difference waveform is stored in the storage device as the speech segment data. When the quantization error of encoding the difference waveform is equal to or larger than that of encoding the speech segment, the encoded data of the speech segment waveform is stored in the storage device as the speech segment data. The speech segment data stored in the storage device includes a flag that identifies whether the data encodes the difference waveform or the speech segment waveform itself. The synthesizing means preferably reads the flag and the encoded data from the storage device and, according to the flag, switches between reproduction based on the difference waveform and reproduction based on the segment waveform.

(Operation) In the method and apparatus of the invention, the storage device does not hold the encoded data of the speech segment extracted for each frame from the spectral envelope of the original speech. Instead, it holds, as the speech segment data, the encoded data of the difference between the speech segment waveform of the previously extracted adjacent frame and that of the current frame. For synthesis, the speech segment data is read from the storage device and decoded, and the speech segment waveforms extracted from the spectral envelope of the speech are thereby recovered and concatenated to synthesize speech.

Consider these speech segment waveforms. The analysis frame period is usually chosen to be about 5 to 20 ms so that temporal changes in the speech spectrum are represented with reasonable accuracy, and over such short intervals the spectrum changes little between adjacent frames. As the segment extraction process described above shows (FIGS. 2(A) and 2(B)), the speech segments of every sound have the same power and the same phase characteristic, so the difference between the segment waveforms of adjacent frames is itself a waveform that expresses the spectral difference between those frames.

Moreover, because the speech segment waveforms of adjacent frames of continuous speech are very similar, the dynamic range of their difference waveform is far smaller than that of the original speech segment waveforms. The difference waveform can therefore be encoded with clearly fewer coding bits than the extracted segment waveform itself. Encoding and storing the difference waveform as the speech segment data thus greatly reduces the volume of speech segment data to be stored, compared with encoding and storing the segment waveforms as they are.

Unlike the power-normalized speech segment waveforms, however, the difference waveforms vary widely in characteristics such as dynamic range. The similarity of the speech segment waveforms of adjacent frames varies considerably with the kind of speech and with whether the frames lie in a steady or a transient portion, so always encoding with the same number of coding bits is inefficient. If the number of coding bits for the difference waveform is varied according to some evaluation criterion, for example its dynamic range, the encoding becomes still more efficient and the data volume is reduced further.

As described above, in most speech intervals the segment waveforms of adjacent frames are very similar, so encoding and storing the difference waveform in place of the speech waveform compresses the information substantially. In silent intervals, at phoneme transitions, and the like, however, the speech segment waveform can change abruptly between adjacent frames, and in some cases encoding the difference waveform may give lower coding accuracy than encoding the speech segment waveform itself.

In such cases it is better to encode and store the speech segment itself. Considering the overall compression effect, it is therefore advisable to store a mixture of both kinds of data in the storage device: encodings of the speech segment waveform itself and encodings of the difference waveform, that is, the difference between the speech segment waveforms of adjacent frames. To that end, the encoded data preferably includes a flag that identifies whether it represents a difference waveform or the segment waveform itself. Because the flag is stored in the storage device together with the speech segment data, both compression of the speech segment information and a reduction in coding error can be achieved. Because speech synthesis can then be carried out on the basis of the flag, the quality of the synthesized speech is also improved.
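
The switching between the two kinds of encoded data can be sketched in Python as follows. The sketch assumes that the dynamic range is measured as the peak absolute amplitude and that the flag is 1 for a difference waveform and 0 for the segment waveform itself; the text fixes neither choice, and the function names are hypothetical.

```python
FLAG_SEGMENT = 0      # stored data encodes the segment waveform itself (assumed value)
FLAG_DIFFERENCE = 1   # stored data encodes the difference waveform (assumed value)


def dynamic_range(waveform):
    # Assumption: dynamic range is taken as the peak absolute amplitude.
    return max(abs(v) for v in waveform)


def choose_mode(segment, previous):
    """Pick what to encode for the current frame and return (flag, waveform)."""
    diff = [s - p for s, p in zip(segment, previous)]
    if dynamic_range(diff) < dynamic_range(segment):
        return FLAG_DIFFERENCE, diff
    # Difference is no smaller than the segment (e.g. at a phoneme transition):
    # fall back to the segment waveform itself.
    return FLAG_SEGMENT, list(segment)


def reconstruct(flag, decoded, previous):
    """Decoder side: switch the reproduction path according to the flag."""
    if flag == FLAG_DIFFERENCE:
        return [p + d for p, d in zip(previous, decoded)]
    return list(decoded)
```

A frame that closely resembles its predecessor is stored as a difference, while a frame that changes abruptly is stored directly, and the decoder needs only the flag to tell the two apart.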

(Embodiments) Embodiments of the invention are described below with reference to the drawings.

FIG. 1(A) is a block diagram of a speech synthesizer used to explain the speech synthesis method and apparatus of the invention. FIG. 1(B) is a flow showing the basic process of speech segment waveform reproduction according to the invention, and FIG. 3 is a flow showing the basic process of speech segment waveform encoding used to explain the invention. FIGS. 4(A) and 4(B) show examples of speech segment waveforms and difference waveforms. FIG. 9 is a block diagram explaining how the original speech is captured and speech segment data is then stored in the storage device.

The speech synthesizer of the invention shown in FIG. 1(A) mainly comprises a storage device 100 that holds the speech segment data of the original speech, and synthesizing means 102 that reads the speech segment data from the storage device 100, edits and combines it, and synthesizes speech for output to an external device. The control signals and the like needed for the storage device 100 and the synthesizing means 102 to function correctly are supplied as appropriate by a control unit 104. The control unit 104 is conventional in this kind of apparatus, so its description is omitted.

First, the basic process of speech segment waveform encoding is described with reference to FIGS. 3 and 9.

In the block diagram of FIG. 9, the original speech uttered by a speaker is captured by an original speech input device 10 and converted to a suitable digital signal. A speech segment creation device 20 then obtains the speech segment waveforms, derives the required encoded data, and stores it in the storage device 100. The original speech input device 10 can be built in any suitable way from, for example, an acoustic-to-electric transducer such as a microphone, a filter, and an A/D converter. The speech segment creation device 20 can likewise be built in any suitable way using a central processing unit (CPU) or the like. The original speech input device 10, the speech segment creation device 20, and the storage device 100 operate under the control of control signals and the like from a control unit 30, which is conventional.

The basic process of speech segment waveform encoding shown in FIG. 3 is carried out in the speech segment creation device 20. Step 110 (steps are hereinafter denoted by S) is the speech segment extraction process: following the segment extraction process of FIG. 2 described above, S110 extracts the speech segment waveform a_j^i (j = 1, 2, ..., N) of frame i. Here the index i is the frame number, the index j is the sample number within the segment, and N is the segment length. FIG. 4(A) shows examples of the speech segment waveforms of frames 1 to 12 (i = 1, ..., 12). Next, S120 calculates the difference waveform b_j^i between the segment waveform a_j^i of the current frame i and the segment waveform A_j^{i-1} of the adjacent preceding frame (i-1) (hereinafter simply the previous frame), that is, b_j^i = a_j^i - A_j^{i-1}. The segment waveform A_j^{i-1} used here is the reproduced waveform of the previous frame (i-1) after encoding and decoding. For the first frame, A_j^{i-1} is taken to be zero for all j.

FIG. 4(B) shows examples of the difference waveforms corresponding to the speech segment waveforms of frames 1 to 12 described above.

As the examples in FIGS. 4(A) and 4(B) show, for speech segment waveforms extracted from actual speech and the difference waveforms between the segment waveforms of adjacent frames, the difference in amplitude dynamic range between the segment waveforms and the difference waveforms is unmistakable. Encoding and storing the difference waveforms in place of the segment waveforms clearly requires less storage than encoding the extracted segment waveforms themselves.

Next, S130 encodes the difference waveform calculated in S120 and stores the encoded data in the storage device 100 as the speech segment data of frame i. Once the encoded data has been stored in the storage device 100, S140 decodes the difference waveform encoded in S130 to obtain the decoded difference waveform B_j^i. S150 then adds the decoded difference waveform B_j^i to the reproduced waveform A_j^{i-1} of the previous frame (i-1) to calculate the reproduced waveform of the current frame i, A_j^i = A_j^{i-1} + B_j^i. S160 advances to the next frame. Steps S110 to S160 are repeated until the speech analysis interval ends, at which point the encoded data of the difference waveforms of all frames has been stored in the storage device 100 as speech segment data.
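
The loop S110 to S160 can be sketched in Python as follows. The uniform quantizer with a fixed step size stands in for the encoder of S130, which this passage leaves unspecified, and the function names are hypothetical. The point illustrated is that the difference is taken against the reproduced waveform of the previous frame, not the original one, so quantization error does not accumulate from frame to frame.

```python
def encode_frames(frames, step):
    """Encode a list of equal-length segment waveforms a^i as quantized differences."""
    n = len(frames[0])
    reproduced = [0.0] * n                       # A^{i-1}; all zero before the first frame
    stored = []
    for a in frames:                             # S110: segment of frame i (already extracted)
        b = [a[j] - reproduced[j] for j in range(n)]               # S120: difference waveform
        codes = [round(v / step) for v in b]                       # S130: encode (assumed quantizer)
        stored.append(codes)                                       #       and store
        decoded = [c * step for c in codes]                        # S140: decode -> B^i
        reproduced = [reproduced[j] + decoded[j] for j in range(n)]  # S150: A^i
    return stored                                # S160 is the loop advance


def decode_frames(stored, step):
    """Synthesis side: rebuild each reproduced waveform A^i from the stored codes."""
    n = len(stored[0])
    reproduced = [0.0] * n
    out = []
    for codes in stored:
        reproduced = [reproduced[j] + codes[j] * step for j in range(n)]
        out.append(reproduced)
    return out
```

The decoder repeats exactly the additions performed in S140 and S150, so encoder and decoder hold the same reproduced waveform at every frame.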

In the example of FIG. 3 described above, the difference waveform encoding in S130 is suited to methods such as PCM and log PCM, in which the quantization step size is fixed according to some criterion at least within a single speech segment. An example that uses PCM encoding is described below.

FIG. 5 shows the operation flow when the difference waveform is encoded by PCM. Steps shared with the flow of FIG. 3 carry the same reference numerals, and their detailed description is omitted. In the flow of FIG. 5, S132 and S134 correspond to S130 of FIG. 3. S132 evaluates the dynamic range of the difference waveform calculated in S120 and chooses the optimum number of coding bits n for that difference waveform according to the size of the dynamic range. How n is determined is described later. S134 then encodes the difference waveform b_j^i with the number of coding bits n chosen in S132 and stores the coded values, together with n, in the storage device 100 as the speech segment data (encoded data) of frame i. Once both the coded values and the number of coding bits n have been stored in the storage device 100, S140 decodes the difference waveform encoded in S134 to obtain the decoded difference waveform B_j^i. The subsequent processing is the same as described for the flow of FIG. 3.

The determination of the number of encoding bits in S132 of FIG. 5 is now described. FIGS. 6(A) and 6(B) illustrate the determination: FIG. 6(A) shows an example of a speech segment waveform, and FIG. 6(B) shows an example of a difference waveform converted to absolute values. Suppose that 8 bits (shown on the left of FIG. 6(B)) give the required quantization accuracy when the speech segment waveform is encoded directly, and let the corresponding dynamic range be D (shown on the right of FIG. 6(B)). The dynamic range of the difference waveform in this example falls within D/32 = D/2^5, so 8 - 5 = 3 encoding bits are enough to obtain the same quantization accuracy. In this way, S132 evaluates the dynamic range of the difference waveform and determines the number of encoding bits from it.
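The bit-count rule of S132 can be illustrated with a short sketch. The function name, the use of the peak absolute amplitude as the dynamic range, and the floor of one bit for an all-zero difference are assumptions made for this example; the text only states that a difference waveform confined to D/32 needs 3 bits when the full range D needs 8.

```python
import math


def bits_for_difference(diff, full_range, full_bits=8):
    """Choose the number of PCM bits for a difference waveform.

    full_range is the dynamic range D of a directly encoded segment and
    full_bits the bits it needs. Each halving of the range saves one bit
    at the same quantization step size.
    """
    peak = max(abs(x) for x in diff)          # dynamic range of |difference|
    if peak == 0:
        return 1                              # assumption: never below 1 bit
    saved = math.floor(math.log2(full_range / peak))
    return max(1, min(full_bits, full_bits - saved))


if __name__ == "__main__":
    D = 32.0
    print(bits_for_difference([0.5, -1.0, 0.25], D))   # range D/32
```

With a difference waveform at 1/16 or 1/64 of the full range, the same rule gives 4 and 2 bits respectively.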

FIGS. 7(A), 7(B) and 7(C) show, respectively, examples of speech segment waveforms extracted frame by frame from actual speech, the difference waveforms between adjacent frames, and the number of quantization steps determined by the method described with reference to FIG. 5 and FIGS. 6(A) and 6(B). To make the display easier to read, the speech segment waveforms are drawn at a reduced scale relative to the difference waveforms. The numbers of encoding bits were calculated assuming 8 bits for encoding the speech segment waveform directly. As FIG. 7(B) shows, the dynamic range of the difference waveforms is far smaller than that of the original speech segment waveforms, and in the example of FIGS. 7(A) to 7(C) it varies widely, from 1/16 to 1/64, as the numbers of bits indicate.

As already explained, however, the expected compression may not be achieved for unvoiced sounds or at phoneme transitions. When storing encoded data in the storage device 100, it is therefore advisable to determine whether it is more advantageous to encode the difference waveform or to encode and store the segment waveform itself, and to include in the encoded data a flag identifying the result of this determination.

This point is explained below.

In most speech intervals, as already illustrated in FIG. 7, the difference in amplitude dynamic range between the segment waveform and the difference waveform is evident, and in such intervals encoding the difference waveform is clearly more efficient.

In contrast, as FIGS. 8(A) and 8(B) illustrate with the speech segment waveform and the difference waveform of each frame, the difference waveform can be larger than the segment waveform, particularly where the phoneme changes (frame 5). To handle such cases, it is preferable to switch, according to the shape of the speech segment waveform, between storing the speech segment as a difference and storing it without taking the difference.

To illustrate this, FIG. 10 shows an embodiment of an encoding sequence for obtaining encoded data without loss of quantization accuracy. This encoding sequence is also carried out by the speech segment creation device 20 described earlier.

FIG. 10 is an operation flow showing the basic encoding process in which a decision on whether to encode the difference waveform or the segment waveform itself is combined with the basic encoding process of FIG. 3. In FIG. 10, steps shared with FIG. 3 carry the same reference numerals, and their detailed description is omitted. In S112, the dynamic range of the speech segment extracted in S110 is evaluated. As already explained for S60 of FIG. 2(A), the extracted speech segments have been power-normalized, so their dynamic ranges are roughly similar. However, depending on the spectral shape, some speech segment waveforms have a sharply peaked center and others do not, so it is preferable to evaluate the dynamic range of the speech segment waveform precisely as well.

For this evaluation, because the speech segment waveform is symmetric, the amplitude is clearly largest at the phase origin, that is, at the center of the waveform, so only the single center point needs to be examined. Let D_S be the resulting evaluation value of the dynamic range of the speech segment waveform. Next, S120 calculates the difference waveform, and S122 then evaluates its dynamic range. Here the center of the difference waveform does not necessarily have the maximum amplitude, so the entire difference waveform is examined. Let D_D be the resulting evaluation value of the dynamic range of the difference waveform. S124 compares D_S and D_D obtained in S112 and S122, and selects the encoding method according to the result.

This point is explained below.

If D_S > D_D, that is, if the evaluation value of the dynamic range of the segment is larger than that of the difference, the difference waveform is encoded in S130 and stored in the storage device 100, as in the sequence of FIG. 3. In this case, however, an identification flag indicating that the difference waveform was encoded is added to the encoded data. Processing from S140 onward is the same as in FIG. 3.

If, on the other hand, D_S ≤ D_D, that is, if the evaluation value of the dynamic range of the difference is equal to or larger than that of the original segment, the segment waveform itself is encoded in S132, and the encoded data is stored in the storage device 100 together with an identification flag indicating that the segment waveform itself was encoded. In this case the reproduced segment A_j is obtained directly by decoding the segment waveform in S142, so the waveform addition of S150 is unnecessary.
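The decision of S112 to S124 might look like the following sketch. The function name and the string labels used as the flag are hypothetical; the evaluation values follow the text, with D_S read from the center sample of the symmetric segment and D_D taken as the peak over the whole difference waveform.

```python
def choose_encoding(segment, prev_reproduced):
    """Decide whether to store the difference waveform or the segment itself.

    Returns (flag, waveform_to_encode), where flag is "difference" or
    "segment" (illustrative labels for the identification flag).
    """
    center = len(segment) // 2
    d_s = abs(segment[center])                                  # S112: D_S
    diff = [a - p for a, p in zip(segment, prev_reproduced)]    # S120
    d_d = max(abs(x) for x in diff)                             # S122: D_D
    if d_s > d_d:                                               # S124
        return "difference", diff
    return "segment", list(segment)      # D_S <= D_D: encode the segment


if __name__ == "__main__":
    print(choose_encoding([0.1, 1.0, 0.1], [0.1, 0.9, 0.1])[0])
    print(choose_encoding([0.1, 1.0, 0.1], [-0.2, -1.0, -0.2])[0])
```

The second call models a phoneme transition where the previous frame differs so much that the difference is larger than the segment, so the segment is stored directly.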

The method of providing the encoded segment data with a flag identifying whether it encodes the difference waveform or the segment itself has been described above as an extension of the flow of FIG. 3, but it can be added in much the same way to the example of FIG. 5, in which encoding is performed by PCM.

FIGS. 11(A) to 11(C) show the formats of the speech segment data stored in the storage device 100.

FIG. 11(A) shows the format of the data created by the basic speech segment waveform encoding process of FIG. 3, FIG. 11(B) the format of the data created when the encoding of FIG. 5 is performed by PCM, and FIG. 11(C) the format of the data generated by the process of FIG. 10.

In FIG. 11(A), the encoded speech segment data generated for each frame (here, difference waveforms) is stored sequentially.

In FIG. 11(B), the data representing the segment of the i-th frame consists of a pair: a field holding the number of encoding bits for that frame and a field holding the encoded difference-waveform data. Naturally, the size of the encoded data field varies with the number of encoding bits.

In FIG. 11(C), the data representing the segment of the i-th frame consists of a flag field and a data field. The flag field holds a flag identifying whether the data field of that frame encodes the difference waveform or the segment waveform itself.
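To make the three record layouts of FIG. 11 concrete, the sketch below models one frame record and computes its size under each layout. The field widths are assumptions for illustration only (a 4-bit field for the bit count in layout B and a 1-bit flag in layout C); the text does not give the actual widths.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameRecord:
    codes: List[int]            # encoded samples of one frame
    bits: int                   # encoding bits per sample
    is_difference: bool = True  # flag used only by layout "C"


def size_in_bits(record, layout, count_field_bits=4):
    """Size of one frame record under the layouts of FIG. 11(A) to (C)."""
    payload = record.bits * len(record.codes)
    if layout == "A":           # encoded data only, stored sequentially
        return payload
    if layout == "B":           # bit-count field + encoded data
        return count_field_bits + payload
    if layout == "C":           # flag field + data field
        return 1 + payload
    raise ValueError("unknown layout: " + layout)


if __name__ == "__main__":
    full = FrameRecord(codes=[1, 2, 3], bits=8)
    small = FrameRecord(codes=[1, 2, 3], bits=3)
    print(size_in_bits(full, "A"), size_in_bits(small, "B"))
```

Even with the extra bit-count field, the 3-bit record is smaller than the fixed 8-bit one as soon as a frame holds more than a few samples.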

Waveform encoding methods are not limited to those with fixed quantization values, such as PCM and log PCM. For more efficient encoding there are methods that adapt the quantization step size (adaptive PCM); differential PCM, which encodes not the waveform itself but the difference between adjacent samples (strictly, the difference between the encoded-and-decoded value of the previous sample and the value of the following sample); and adaptive differential PCM, which additionally adapts the quantization step size within differential PCM. With such methods, the encoding accuracy of a waveform does not necessarily correspond to its dynamic range. Consequently, when adaptive PCM, differential PCM, adaptive differential PCM or a similar method is used to encode the difference waveform or the speech segment waveform in the above embodiments, the dynamic range of the waveform is not an appropriate criterion either for determining the number of encoding bits, as was done for PCM encoding, or for deciding whether to encode the speech segment waveform directly or to encode the difference waveform in its place. The following therefore describes an example in which each criterion is replaced by the quantization error, which expresses the degree of error introduced by encoding one frame of waveform (speech segment waveform or difference waveform), and which is here measured as the intra-frame signal-to-noise ratio.
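The parenthetical definition of differential PCM above (the difference is taken against the encoded-and-decoded previous sample, not the original one) can be sketched as follows. The fixed `step` and the zero starting value are assumptions for the example; an adaptive variant would change `step` from sample to sample.

```python
def dpcm_encode(samples, step):
    """Differential PCM with a fixed step: encode each sample as the
    quantized difference from the previously reconstructed sample."""
    codes = []
    prev = 0.0                            # assumed initial reconstruction
    for s in samples:
        code = round((s - prev) / step)   # quantize the prediction error
        codes.append(code)
        prev += code * step               # track what the decoder will see
    return codes


def dpcm_decode(codes, step):
    """Rebuild the samples by accumulating the decoded differences."""
    out = []
    prev = 0.0
    for code in codes:
        prev += code * step
        out.append(prev)
    return out


if __name__ == "__main__":
    codes = dpcm_encode([0.0, 0.5, 0.75, 0.5], 0.25)
    print(codes, dpcm_decode(codes, 0.25))
```

Because the codes depend on the sample-to-sample change rather than on the peak amplitude, the dynamic range of the waveform no longer predicts the coding error, which is why a signal-to-noise criterion is used instead.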

FIG. 12 is the processing flow for the case where the determination of the number of encoding bits in S132 of FIG. 5, which varies the number of encoding bits of the difference waveform, is made using the intra-frame signal-to-noise ratio instead of the dynamic range of the waveform. The overall operation is almost the same as the flow of FIG. 5. In FIG. 12, S1320 to S1326 correspond to the bit-number determination step S132 of FIG. 5. Steps in FIG. 12 bearing the same numbers as in FIG. 5 perform exactly the same processing, and their description is omitted.

In determining the number of encoding bits, the number of encoding bits n is first initialized to n_min, the minimum number of bits that can be used for encoding (S1320). The difference waveform b_j^i is then encoded with n bits to obtain the encoded data L_j^i (S1321). The encoded data L_j^i is decoded to obtain the decoded difference waveform B_j^i (S1322). S1323 then checks whether the current n has reached the predetermined maximum number of encoding bits n_max; if n has already reached n_max, the number of encoding bits n and the encoded difference-waveform data L_j^i are fixed at this point. If n is less than n_max, the encoding accuracy is calculated in S1324. As stated above, the encoding accuracy is expressed as the intra-frame signal-to-noise ratio. Denoting it SN, SN is given by the following equation.

In step S1325, the intra-frame signal-to-noise ratio SN calculated here is compared with a predetermined threshold SN_th. If SN ≥ SN_th, sufficient encoding accuracy is deemed to have been obtained with the current n, and n and the encoded data L_j^i are stored in the storage device 100 as speech data in S141. If SN < SN_th, the encoding accuracy is insufficient, so the number of encoding bits is increased by one (S1326) and processing is repeated from S1321. The processing of S1320 to S1326 is repeated in this way until the desired encoding accuracy is obtained. However, because the aim of this invention is to reduce the amount of speech data to be stored, a maximum number of encoding bits n_max is set in S1323, and once n reaches n_max the amount of data is not increased further, whether or not the encoding accuracy has reached the desired value SN_th. If n_max is set to the number of bits needed to encode the speech segment waveform directly, cutting off further increases in S1323 does not make the quantization error larger than in the conventional method.
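The search loop S1320 to S1326 can be sketched as below. Several things here are assumptions, since the equation for SN is not reproduced in this text: the signal-to-noise ratio is taken in decibels as the ratio of the energy of the difference waveform to the energy of its coding error, and the encoder is a simple uniform quantizer over a hypothetical `full_scale` range. The names `pcm_encode`, `pcm_decode`, `snr_db` and `choose_bits` are illustrative.

```python
import math


def pcm_encode(samples, bits, full_scale):
    """Uniform n-bit quantizer over [-full_scale, full_scale)."""
    half = 2 ** (bits - 1)
    step = full_scale / half
    return [max(-half, min(half - 1, round(s / step))) for s in samples]


def pcm_decode(codes, bits, full_scale):
    step = full_scale / 2 ** (bits - 1)
    return [c * step for c in codes]


def snr_db(signal, decoded):
    """Intra-frame SNR in dB (assumed definition: signal energy / error energy)."""
    noise = sum((s - d) ** 2 for s, d in zip(signal, decoded))
    power = sum(s * s for s in signal)
    if noise == 0:
        return math.inf
    if power == 0:
        return -math.inf
    return 10 * math.log10(power / noise)


def choose_bits(diff, full_scale, sn_th, n_min=2, n_max=8):
    """Return (n, codes): the smallest n meeting sn_th, capped at n_max."""
    n = n_min                                         # S1320
    while True:
        codes = pcm_encode(diff, n, full_scale)       # S1321
        decoded = pcm_decode(codes, n, full_scale)    # S1322
        if n >= n_max:                                # S1323: stop at the cap
            return n, codes
        if snr_db(diff, decoded) >= sn_th:            # S1324, S1325
            return n, codes
        n += 1                                        # S1326


if __name__ == "__main__":
    print(choose_bits([0.5, -0.3, 0.0], 1.0, 20.0))
```

The cap is checked before the threshold, mirroring the order of S1323 and S1325, so the loop always terminates even when the threshold is unreachable.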

FIG. 13 shows the operation flow of the process of FIG. 10, which decides whether to encode the speech segment waveform directly or to encode the difference waveform in its place, when the decision criterion is changed from the dynamic range used in FIG. 10 to the intra-frame signal-to-noise ratio. Here, S1120 to S1122 correspond to S120 of FIG. 10, and S1220 to S1222 to S122 of FIG. 10. In S1240, the dynamic range D_S of the speech segment waveform used in S124 of FIG. 10 is replaced by the intra-frame signal-to-noise ratio SN_S obtained when the speech segment waveform is encoded and decoded, and the dynamic range D_D of the difference waveform is replaced by the intra-frame signal-to-noise ratio SN_D, relative to the speech segment waveform, obtained when the difference waveform is encoded and decoded. S1300 and S1320 correspond to S130 and S132 of FIG. 10, respectively, but because decoding has already been performed in S1121 and S1221, they perform only the part of S130 and S132 that stores the data in the storage device 100.
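A sketch of the comparison in S1240 follows. Since SN_S and SN_D are both measured relative to the speech segment waveform, the larger ratio belongs to the candidate with the smaller coding-error energy, so the sketch compares error energies directly. The quantizer is an assumption chosen only to make the example meaningful: its step size is scaled to the peak of the waveform being coded (a crude stand-in for an adaptive coder), and it assumes `bits` is at least 2.

```python
def roundtrip(samples, bits):
    """Encode and decode with a quantizer whose step is scaled to the
    peak of the input (assumed stand-in for an adaptive coder)."""
    peak = max(abs(s) for s in samples) or 1.0   # avoid a zero step
    top = 2 ** (bits - 1) - 1                    # symmetric code range
    step = peak / top
    return [max(-top, min(top, round(s / step))) * step for s in samples]


def noise_energy(samples, bits):
    """Energy of the coding error for one frame."""
    decoded = roundtrip(samples, bits)
    return sum((s - d) ** 2 for s, d in zip(samples, decoded))


def choose_by_quantization_error(segment, prev_reproduced, bits):
    """Pick the candidate with the smaller coding error at the same bit count."""
    diff = [a - p for a, p in zip(segment, prev_reproduced)]
    err_s = noise_energy(segment, bits)          # basis of SN_S
    err_d = noise_energy(diff, bits)             # basis of SN_D
    return "difference" if err_d < err_s else "segment"


if __name__ == "__main__":
    print(choose_by_quantization_error([0.1, 1.0, 0.1], [0.1, 0.9, 0.1], 3))
    print(choose_by_quantization_error([0.1, 1.0, 0.1], [-0.1, -1.0, -0.1], 3))
```

Ties go to the segment waveform, matching the "equal to or larger" condition under which the segment itself is stored.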

Next, the basic process of speech segment waveform reproduction in the speech synthesis device of this invention is described with reference to FIG. 1(B). As described above, the storage device 100 holds, as speech segment data, either encoded data of difference waveforms only or, in some cases, encoded data of speech segment waveforms themselves together with encoded data of difference waveforms. The synthesis means 102 first reads the encoded data from the storage device in S200, and then in S210 uses the identification flag of the encoded data to determine whether it encodes a difference waveform or the segment waveform itself. If it is determined to be a difference waveform, the encoded data is decoded in S220 to obtain the decoded difference waveform B_j^i. This decoding naturally uses the method that matches the encoding. In S230, the decoded difference waveform B_j^i obtained in S220 is added to the reproduced speech segment waveform A_j^(i-1) of the previous frame (i-1) to obtain the speech segment waveform A_j^i of frame i. If S210 determines that the data encodes the segment waveform itself, the speech segment itself is reproduced in the conventional manner (not shown). Thereafter, the frame is updated in S240 and the above processing is repeated for the required frames to reproduce the speech segments. The speech reproduced and synthesized in this way is then processed for output to a speaker, a computer, a display device or other external equipment. Although not mentioned in the descriptions of FIG. 3 and FIG. 1(B), the speech segment waveforms extracted by the process of FIG. 2 are symmetric, so it goes without saying that the actual processing of FIG. 3 and FIG. 1(B) needs to handle only half of the speech segment length.
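The reproduction steps S200 to S240 can be sketched as follows. The record shape (a boolean flag plus a list of integer codes), the uniform `step`, and the zero waveform assumed before the first frame are illustrative choices, not details given in the text.

```python
def reproduce(records, step, length):
    """Rebuild segment waveforms from stored (is_difference, codes) records."""
    prev = [0.0] * length                # A^(i-1); assumed zero at the start
    waveforms = []
    for is_difference, codes in records:                 # S200: read a record
        decoded = [c * step for c in codes]              # S220: decode
        if is_difference:                                # S210: check the flag
            current = [p + d for p, d in zip(prev, decoded)]   # S230: add
        else:
            current = decoded            # segment stored directly, no addition
        waveforms.append(current)
        prev = current                                   # S240: next frame
    return waveforms


if __name__ == "__main__":
    records = [(False, [0, 4, 0]), (True, [1, 2, 1])]
    print(reproduce(records, 0.25, 3))
```

Because the segments are symmetric, a real implementation could store and process only half of each waveform and mirror it on output.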

(Effects of the Invention) As is clear from the above description, this invention does not encode and store the segment waveforms extracted from the spectral envelope of speech as they are for use as speech segment data. Instead, it stores as speech data the encoded difference waveforms between the segment waveforms extracted in adjacent frames, which greatly reduces the storage capacity needed for speech data. Put another way, for the same storage capacity, far more precise encoding becomes possible, improving the quality of the synthesized speech.

Furthermore, when storing the difference waveform as speech data, the number of encoding bits is set appropriately on the basis of a criterion such as the dynamic range of the difference waveform or the quantization error incurred in encoding it. Encoding can therefore use an amount of information that is neither excessive nor insufficient, and high-quality synthesized speech can be obtained with a small storage capacity.

In addition, by including in the encoded data a flag indicating the method by which the speech segment waveform of each frame was encoded, the speech segment waveform can be represented with a small encoding error, further improving the quality of the synthesized speech.

[Brief description of the drawings]

FIG. 1(A) is a block diagram of the main part of a speech synthesis device, used to explain the speech synthesis method and speech synthesis device of this invention.
FIG. 1(B) is an operation flow diagram showing the basic process of speech segment waveform reproduction, used to explain this invention.
FIGS. 2(A) and 2(B) are operation flow diagrams showing the speech segment extraction process, used to explain the prior art and this invention.
FIG. 3 is an operation flow diagram showing the basic process of speech segment waveform encoding, used to explain this invention.
FIGS. 4(a) and 4(b) are plotter-drawn waveform diagrams showing examples of speech segment waveforms and difference waveforms of actual speech.
FIG. 5 is an operation flow diagram showing the basic process of encoding speech segment waveforms by PCM.
FIGS. 6(A) and 6(B) are explanatory diagrams for determining the number of encoding bits.
FIGS. 7(a) and 7(b) are plotter-drawn diagrams of the waveforms, showing the relationship between the speech segment waveforms, the difference waveforms and the numbers of encoding bits for actual speech.
FIG. 8 is a plotter-drawn waveform diagram showing the speech segment waveforms and difference waveforms at a phoneme transition.
FIG. 9 is a block diagram explaining how original speech is captured and speech segment data is stored in the storage device.
FIG. 10 is an operation flow diagram showing the basic encoding process, including the decision on whether to encode the difference waveform or the segment waveform itself when encoding a segment waveform.
FIGS. 11(A), 11(B) and 11(C) are explanatory diagrams of the encoded data.
FIG. 12 is an operation flow diagram showing the process of determining the number of encoding bits using the intra-frame signal-to-noise ratio.
FIG. 13 is an operation flow diagram of encoding when the intra-frame signal-to-noise ratio is used as the criterion for deciding whether to encode the segment waveform directly or to encode the difference waveform.
100: storage device; 102: synthesis means; 104: control unit; 106: data part; 108: flag part.

Continued from the front page (58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 - 9/18; JICST file (JOIS)

Claims

(57) [Claims]

1. A speech synthesis method in which speech segment data, relating to speech segments each extracted from the spectral envelope of the speech in a respective frame by analyzing natural speech at a fixed frame period, is stored in a storage device in advance, and speech is synthesized by synthesizing speech segments using the speech segment data stored in advance in the storage device, characterized in that: the speech segment waveforms are transformed so that the speech segments all have the same power and, for all frequency components of the speech segment waveforms, the same phase characteristic; the difference in speech segment waveform between the speech segments extracted in two adjacent frames is obtained as a difference waveform for each such pair; and each difference waveform is converted into encoded data and stored in the storage device as the speech segment data of the later of the two adjacent frames.

2. The speech synthesis method according to claim 1, characterized in that the encoded data of the difference waveform of two adjacent frames consists of the value obtained by encoding the difference waveform with a number of encoding bits that depends on the dynamic range of the amplitude of that difference waveform, together with that number of encoding bits.

3. The speech synthesis method according to claim 1, characterized in that the encoded data of the difference waveform is stored in the storage device on condition that the dynamic range of the difference waveform is smaller than the dynamic range of the speech segment extracted in the later of the two adjacent frames, and that, on condition that the dynamic range of the difference waveform is equal to or larger than the dynamic range of the speech segment, the encoded data of the speech segment waveform of that speech segment is stored in the storage device in place of the encoded data of the difference waveform.

4. The speech synthesis method according to claim 1, characterized in that the encoded data of the difference waveform of two adjacent frames is encoded with a number of encoding bits that satisfies a predetermined quantization-error threshold for the encoding of that difference waveform.

5. The speech synthesis method according to claim 1, characterized in that the encoded data of the difference waveform is stored in the storage device on condition that the quantization error incurred in encoding the difference waveform is smaller than the quantization error incurred when the speech segment extracted in the later of the adjacent frames is encoded with the same number of encoding bits, and that, on condition that the quantization error incurred in encoding the difference waveform is equal to or larger than the quantization error incurred in encoding the speech segment, the encoded data of the speech segment is stored in the storage device in place of the encoded data of the difference waveform.

6. The speech synthesis method according to claim 3 or claim 5, wherein the encoded data stored in the storage device includes a flag identifying whether the data is an encoding of the difference waveform or an encoding of the speech segment waveform itself.

7. The speech synthesis method according to claim 1, wherein, in encoding the difference waveform between two adjacent frames, the number of encoding bits is determined adaptively for each difference waveform according to the properties of that difference waveform, and the difference waveform data encoded with that number of encoding bits, together with the number of encoding bits itself, constitutes the difference-waveform encoded data.

8. The speech synthesis method according to claim 1, wherein, for each frame, it is determined which of the following has the higher encoding efficiency when both are encoded with the same number of encoding bits: the speech segment waveform of that frame, or the difference waveform between the speech segment waveform of that frame and that of the adjacent preceding frame; and the encoded data with the higher encoding efficiency is stored in the storage device as the encoded data of the speech segment waveform of that frame.

9. A speech synthesis apparatus comprising: a storage device storing speech segment data relating to speech segments, each extracted from the spectral envelope of the speech in a respective frame obtained by analyzing natural speech at a fixed frame period; and synthesis means for synthesizing speech for output to an external device by reading the speech segment data from the storage device and reproducing it, wherein the speech segment waveforms are transformed in advance so that all speech segments have the same power and so that, for all frequency components of the speech segment waveforms, all speech segments have the same phase characteristic; the speech segment data is encoded data of a difference waveform, the difference waveform being the difference in speech segment waveform between the speech segments extracted in two adjacent frames; and the synthesis means decodes the encoded data of the difference waveform read from the storage device to reproduce the speech segments.

10. The speech synthesis apparatus according to claim 9, wherein the encoded data consists of the values obtained by encoding each difference waveform with a number of encoding bits corresponding to the dynamic range of the amplitude of that difference waveform, together with that number of encoding bits.
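As an illustration of claim 10, the sketch below derives a bit count from the amplitude range of an integer-valued difference waveform and stores it alongside the samples. The mapping used (the narrowest signed two's-complement width that holds every sample) is an assumption, as are the names `bits_for_range` and `encode_adaptive` and the dictionary layout.

```python
def bits_for_range(diff):
    # Assumption: use the narrowest signed two's-complement width
    # that can represent every sample of the difference waveform.
    lo, hi = min(diff), max(diff)
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def encode_adaptive(diff):
    """Pair the per-waveform bit count with the encoded values."""
    bits = bits_for_range(diff)
    # Hypothetical record layout: the decoder needs 'bits' to unpack 'values'.
    return {"bits": bits, "values": list(diff)}
```

Storing the bit count with each record is what lets a decoder unpack records of varying width.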

11. The speech synthesis apparatus according to claim 9, wherein the encoded data consists of the values obtained by encoding each difference waveform with a number of encoding bits that satisfies a predetermined quantization error, together with that number of encoding bits.

12. A speech synthesis apparatus comprising: a storage device storing speech segment data relating to speech segments, each extracted from the spectral envelope of the speech in a respective frame obtained by analyzing natural speech at a fixed frame period; and synthesis means for synthesizing speech for output to an external device by reading the speech segment data from the storage device and reproducing it, wherein the speech segment waveforms are transformed in advance so that all speech segments have the same power and so that, for all frequency components of the speech segment waveforms, all speech segments have the same phase characteristic; under the condition that the dynamic range of a difference waveform, being the difference in speech segment waveform between the speech segments extracted in two adjacent frames, is smaller than the dynamic range of the speech segment waveform of the speech segment extracted in the subsequent one of the two adjacent frames, encoded data of the difference waveform is stored in the storage device as the speech segment data, and, under the condition that the dynamic range of the difference waveform is equal to or greater than the dynamic range of the speech segment waveform, encoded data of the speech segment waveform is stored in the storage device as the speech segment data; the speech segment data stored in the storage device includes a flag identifying whether the data is an encoding of the difference waveform or an encoding of the speech segment waveform itself; and the synthesis means reads the flag and the encoded data from the storage device and, according to the flag, reproduces the speech segment by switching between reproduction based on the difference waveform and reproduction based on the segment waveform.
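The flag-controlled reproduction recited in claim 12 can be sketched as a decoder loop. The record layout (a list of `(flag, samples)` pairs with the string flags "segment" and "diff") and the function name `reproduce` are hypothetical; the claim states only that a flag distinguishes the two kinds of data.

```python
def reproduce(records):
    """Rebuild segment waveforms from flagged records, frame by frame."""
    out = []
    prev = None
    for flag, samples in records:
        if flag == "segment":
            # The record holds the segment waveform itself.
            cur = list(samples)
        elif flag == "diff":
            # The record holds a difference against the preceding frame.
            if prev is None:
                raise ValueError("difference record has no preceding frame")
            cur = [p + d for p, d in zip(prev, samples)]
        else:
            raise ValueError("unknown flag: %r" % (flag,))
        out.append(cur)
        prev = cur
    return out
```

For example, a segment record followed by a difference record yields the second frame as the sum of the first frame and the stored difference.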

13. A speech synthesis apparatus comprising: a storage device storing speech segment data relating to speech segments, each extracted from the spectral envelope of the speech in a respective frame obtained by analyzing natural speech at a fixed frame period; and synthesis means for synthesizing speech for output to an external device by reading the speech segment data from the storage device and reproducing it, wherein the speech segment waveforms are transformed in advance so that all speech segments have the same power and so that, for all frequency components of the speech segment waveforms, all speech segments have the same phase characteristic; under the condition that the quantization error incurred in encoding a difference waveform, being the difference in speech segment waveform between the speech segments extracted in two adjacent frames, is smaller than the quantization error incurred in encoding, with the same number of encoding bits, the speech segment waveform of the speech segment extracted in the subsequent one of the adjacent frames, encoded data of the difference waveform is stored in the storage device as the speech segment data, and, under the condition that the quantization error in encoding the difference waveform is equal to or greater than the quantization error in encoding the speech segment, encoded data of the speech segment waveform is stored in the storage device as the speech segment data; the speech segment data stored in the storage device includes a flag identifying whether the data is an encoding of the difference waveform or an encoding of the speech segment waveform itself; and the synthesis means reads the flag and the encoded data from the storage device and, according to the flag, reproduces the speech segment by switching between reproduction based on the difference waveform and reproduction based on the segment waveform.