JP3459600B2

JP3459600B2 - Speech data amount reduction device for a speech synthesis device, and speech synthesis device

Info

Publication number: JP3459600B2
Application number: JP27423399A
Authority: JP
Inventors: ニック・キャンベル; 敏北川
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 1999-09-28
Filing date: 1999-09-28
Publication date: 2003-10-20
Anticipated expiration: 2019-09-28
Also published as: JP2001100775A

Abstract

PROBLEM TO BE SOLVED: To reduce the memory capacity needed to store a speech waveform database and to increase the search speed during speech synthesis. SOLUTION: A speech data amount reduction processing unit 45 calculates, on the basis of a list of biphones (pairs of phonemes), a prescribed similarity with respect to prosodic feature parameters and acoustic feature parameters for each pair of biphones. If the calculated similarity is equal to or higher than a first prescribed threshold and the number of identical biphones in the biphone list is equal to or larger than a second prescribed threshold, the speech segment data of the speech waveform signal for one of these biphones is deleted from the speech waveform database, reducing the amount of speech data. Speech synthesis is then performed using the text database and the speech waveform database whose data amount has been reduced.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech data amount reduction device for a speech synthesis device, and to the speech synthesis device itself. The speech synthesis device synthesizes speech for an arbitrary phoneme sequence by concatenating speech segments of speech waveform signals of natural utterances, using a speech waveform database consisting of speech segment data of speech waveform signals corresponding to phoneme labels.

[0002]

[Prior Art] For example, Japanese Unexamined Patent Publication No. 10-049193 discloses a speech synthesis device that synthesizes speech for an arbitrary phoneme sequence by concatenating speech segments of speech waveform signals of natural utterances, using a speech waveform database consisting of speech segment data of speech waveform signals corresponding to phoneme labels.

[0003]

[Problems to Be Solved by the Invention] Because this conventional speech synthesis device simply concatenates speech waveforms with minimal signal processing, it requires a large-scale speech waveform database from which to select speech segments having appropriate acoustic and prosodic parameters. Such a large-scale database increases the memory capacity needed to store it, and the correspondingly larger search space prevents any increase in the speed of searching for appropriate speech segments.

[0004] An object of the present invention is to solve the above problems by providing a speech data amount reduction device for a speech synthesis device, and a speech synthesis device, that reduce the memory capacity needed to store the speech waveform database and increase the search speed during speech synthesis.

[0005]

[Means for Solving the Problems] A speech data amount reduction device according to the present invention is intended for a speech synthesis device that includes a storage device storing a speech waveform database consisting of speech segment data of speech waveform signals corresponding to phoneme labels, and that synthesizes speech for an arbitrary phoneme sequence by concatenating speech segments of the speech waveform signals of natural utterances. The reduction device comprises: generation means for generating a list of the pairs of phonemes contained in the speech waveform database; and reduction means that, on the basis of the generated list, calculates a prescribed similarity with respect to prosodic feature parameters and acoustic feature parameters for each pair of phonemes and, when the calculated similarity is equal to or greater than a prescribed first threshold, reduces the amount of speech data by deleting from the speech waveform database the speech segment data of the speech waveform signal for one of the pairs of phonemes concerned, and that, when the calculated similarity is equal to or greater than the first threshold and the number of identical pairs of phonemes in the list is equal to or greater than a prescribed second threshold, deletes from the speech waveform database the speech segment data of the speech waveform signal for one of the pairs of phonemes concerned.

[0006]

[0007] In the above speech data amount reduction device, the similarity is preferably calculated using a linear combination of a similarity score for the prosodic feature parameters and a similarity score for the acoustic feature parameters, each weighted by a prescribed weighting coefficient.

[0008] Further, in the above speech data amount reduction device, the prosodic feature parameters preferably include the phoneme duration, the fundamental frequency F₀, the slope of the fundamental frequency F₀ with respect to elapsed time, and the power, and the acoustic feature parameters include spectrum information.

[0009] A speech synthesis device according to the present invention comprises speech synthesis means that, on the basis of the speech waveform database whose speech data amount has been reduced by the above speech data amount reduction device, performs speech synthesis by retrieving phoneme candidates from the speech waveform database for the phoneme sequence of an input natural-utterance sentence and concatenating them.

[0010] A speech synthesis device according to the present invention also comprises speech synthesis means that, on the basis of the speech waveform database whose speech data amount has been reduced by the above speech data amount reduction device, performs speech synthesis by retrieving phoneme candidates from the speech waveform database for the phoneme sequence of an input natural-utterance sentence and concatenating them so as to minimize a cost that includes the approximation cost between each target phoneme and its phoneme candidate and the approximation cost between phoneme candidates that are to be concatenated adjacently in time.

[0011]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0012] FIG. 1 is a block diagram of a speech data amount reduction processing device according to one embodiment of the present invention. The device reduces the amount of speech data in the text database (consisting of phoneme label data) and the speech waveform database that are supplied to the speech synthesis device of FIG. 2. To do so, it includes a speech data amount reduction processing unit 45. Using the speech data amount reduction process of FIG. 3, this unit deletes from the text database and the speech waveform database any speech data whose similarity is at or above a prescribed level, on the basis of an evaluation data matrix containing evaluation prosody data and bispectrum data (described in detail later) for each biphone (a pair of temporally adjacent phonemes).

[0013] That is, this embodiment discloses a method that uses physical measures of proximity in both the prosodic feature parameter domain and the spectral domain to reduce the number of speech segments in the speech waveform database, and hence its redundancy, while maintaining the quality of the synthesized speech that is output.

[0014] In this embodiment, the bispectrum is used as the acoustic feature parameter of the above physical measure. Because the power spectrum is determined solely by second-order statistics, it ignores higher-order information. If speech were Gaussian, it could be completely reconstructed from second-order statistics alone. In practice, however, speech is not Gaussian. For these two reasons, the inventors evaluate speech synthesis using the bispectrum as the acoustic feature measure for the similarity of speech segments.

[0015] The bispectrum used in this embodiment is described here with reference to Prior Art Document 1: C. L. Nikias et al., "Bispectrum Estimation: A Digital Signal Processing Framework", Proceedings of the IEEE 76, pp. 869-891, 1987. This embodiment assumes a real discrete process. If {X(k)} is a real, discrete, zero-mean stationary process, its power spectrum P(ω) is defined by the following equation.

[0016]

[Equation 1]

where

[Equation 2]
$$r(\tau) = E\{X(k)\,X(k+\tau)\}$$

is its autocorrelation sequence. If R(m, n) denotes the third-order moment sequence of {X(k)}, that is,

[Equation 3]
$$R(m, n) = E\{X(k)\,X(k+m)\,X(k+n)\}$$

then its bispectrum is defined by the following equation.

[0017]

[Equation 4]
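The images for Equations 1 and 4 are not reproduced in this text. As a hedged reconstruction, and assuming the patent follows the standard definitions of the cited Nikias reference (power spectrum as the Fourier transform of the autocorrelation sequence, bispectrum as the two-dimensional Fourier transform of the third-order moment sequence), they would read as follows. The summation limits and sign convention are assumptions, not taken from the patent text.

```latex
% Assumed form of Equation 1 (power spectrum)
P(\omega) = \sum_{\tau=-\infty}^{\infty} r(\tau)\, e^{-j\omega\tau}

% Assumed form of Equation 4 (bispectrum)
B(\omega_1, \omega_2) = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty}
    R(m, n)\, e^{-j(\omega_1 m + \omega_2 n)}
```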

[0018] Here, R(m, n) is the third-order autocorrelation function described above. Because the third-order moment and the third-order cumulant are identical, the bispectrum is the spectrum of the third-order cumulant. The physical significance of the power spectrum and the bispectrum becomes clear when they are expressed in terms of the components dZ(ω) of the Fourier-Stieltjes representation (also called the Cramér spectral representation) of X(k), given by the following equation.

[0019] For all k,

[Equation 5]

where

[Equation 6]
$$E\{dZ(\omega)\} = 0$$

[Equation 7]
$$E\{dZ(\omega_1)\,dZ^*(\omega_2)\} = \begin{cases} 0, & \omega_1 \neq \omega_2 \\ 2\pi P(\omega)\,d\omega, & \omega_1 = \omega_2 = \omega \end{cases}$$

and

[Equation 8]
$$E\{dZ(\omega_1)\,dZ(\omega_2)\,dZ^*(\omega_3)\} = \begin{cases} B(\omega_1, \omega_2)\,d\omega_1\,d\omega_2, & \omega_1 + \omega_2 = \omega_3 \\ 0, & \omega_1 + \omega_2 \neq \omega_3 \end{cases}$$
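The image for Equation 5 is not reproduced in this text. A plausible form, assuming the usual Cramér spectral representation with a 1/(2π) normalization, is sketched below. The normalization is an assumption, chosen because it is the one consistent with the 2πP(ω)dω factor in Equation 7: substituting it into r(τ) = E{X(k)X(k+τ)} recovers the inverse Fourier relation between r(τ) and P(ω).

```latex
% Assumed form of Equation 5 (Cramer spectral representation)
X(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{j\omega k}\, dZ(\omega)

% Consistency check with Equation 7 (X real, so X(k) = X^*(k)):
r(\tau) = E\{X^*(k)\, X(k+\tau)\}
        = \frac{1}{(2\pi)^2} \int\!\!\int e^{-j\omega_1 k}\, e^{j\omega_2 (k+\tau)}\,
          E\{dZ^*(\omega_1)\, dZ(\omega_2)\}
        = \frac{1}{2\pi} \int_{-\pi}^{\pi} P(\omega)\, e^{j\omega\tau}\, d\omega
```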

[0020] Accordingly, the power spectrum P(ω) represents the contribution to the mean product of two Fourier components whose frequencies are the same, whereas the bispectrum B(ω₁, ω₂) represents the contribution to the mean product of three Fourier components in which one frequency equals the sum of the other two. That is, the bispectrum represents the degree of contribution, or similarity, with respect to two spectrum sequences.

[0021] In this embodiment, the bispectrum is computed using a two-dimensional Fourier transform. Because the dimensionality of the bispectrum is generally high, a two-dimensional DCT is used to compress it into lower-order coefficients. Although this embodiment uses the bispectrum as the acoustic feature parameter, the present invention is not limited to this, and spectrum information may be used instead.
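The computation described here (third-order moments, a two-dimensional Fourier transform, then a two-dimensional DCT that keeps only low-order coefficients) can be sketched in plain Python as follows. This is an illustrative sketch only, not the patent's implementation: the function names, the lag window `max_lag`, the frequency grid size `n_freq`, the choice to take the bispectrum magnitude before the DCT, and the unnormalised DCT-II are all assumptions.

```python
import cmath
import math

def third_order_moments(x, max_lag):
    """Estimate R(m, n) = E{X(k) X(k+m) X(k+n)} for |m|, |n| <= max_lag."""
    n_samples = len(x)
    mean = sum(x) / n_samples
    x = [v - mean for v in x]  # enforce the zero-mean assumption
    lags = range(-max_lag, max_lag + 1)
    moments = {}
    for m in lags:
        for n in lags:
            # Only use k for which k, k+m and k+n are all valid indices.
            lo = max(0, -m, -n)
            hi = min(n_samples, n_samples - m, n_samples - n)
            total = sum(x[k] * x[k + m] * x[k + n] for k in range(lo, hi))
            moments[(m, n)] = total / n_samples
    return moments

def bispectrum(moments, n_freq):
    """2-D Fourier transform of R(m, n) on an n_freq x n_freq frequency grid."""
    omegas = [2 * math.pi * i / n_freq for i in range(n_freq)]
    return [[sum(r * cmath.exp(-1j * (w1 * m + w2 * n))
                 for (m, n), r in moments.items())
             for w2 in omegas] for w1 in omegas]

def dct2_lowpass(matrix, keep):
    """Unnormalised 2-D DCT-II, returning only the keep x keep low-order block."""
    rows, cols = len(matrix), len(matrix[0])
    return [[sum(matrix[i][j]
                 * math.cos(math.pi * (i + 0.5) * u / rows)
                 * math.cos(math.pi * (j + 0.5) * v / cols)
                 for i in range(rows) for j in range(cols))
             for v in range(keep)] for u in range(keep)]

def bispectral_features(x, max_lag=4, n_freq=8, keep=3):
    """Compressed bispectrum feature vector for one speech segment."""
    moments = third_order_moments(x, max_lag)
    magnitude = [[abs(b) for b in row] for row in bispectrum(moments, n_freq)]
    coeffs = dct2_lowpass(magnitude, keep)
    return [c for row in coeffs for c in row]
```

Two segments could then be compared by the Euclidean distance between their feature vectors, which is one reading of the distance S_BS used in the similarity score.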

[0022] In this embodiment, the prosodic feature parameters of the above physical measure are the duration (phoneme duration) of each speech segment, its maximum signal amplitude (or power), its mean fundamental frequency (pitch frequency) F₀, and the slope of F₀. A physical measure in the spectral domain is needed to determine the acoustic distance between a pair of speech segments, but if the diversity of the speech waveform database is to be preserved, the distance between the pair in terms of prosodic feature parameters must also be determined. Speech segments that are similar (that is, redundant) within prescribed thresholds in both their spectral and prosodic characteristics can be removed from the speech waveform database. In other words, the approach rests on the idea that a portion of the speech waveform database may be deleted if substantially the same speech waveform as before can still be synthesized from what remains. This reduces the size of the speech waveform database while leaving its coverage as speech waveform data substantially unchanged. In this embodiment, the above prosodic feature parameters of each phoneme-sized speech segment are used to measure prosodic similarity.

[0023] In this embodiment, the similarity score M given below is calculated for each pair of biphones, and the two biphones are judged to be similar to each other when M is at or below a prescribed threshold Mth. In addition, so that the number N of such biphones does not fall below a predetermined threshold Nth, a biphone is judged to be redundant (reproducible even after deletion) only when N is at or above Nth. In that case, the data for the phonemes of that biphone are deleted from the text database memory 22 and the speech waveform database memory 21, thereby reducing the amount of speech data.

[0024] The similarity score M for a pair of biphones is expressed by the following equation.

[0025]

[Equation 9]

[0026] Here:
(a) w₁: weighting coefficient for the prosodic feature parameter score; for example, 0.5.
(b) w₁₁: weighting coefficient for the fundamental frequency F₀ score; for example, 0.3.
(c) S_F0: absolute value of the difference in fundamental frequency F₀ between the pair of biphones.
(d) w₁₂: weighting coefficient for the power score; for example, 0.2.
(e) S_P: absolute value of the difference in power between the pair of biphones.
(f) w₁₃: weighting coefficient for the phoneme duration score; for example, 0.1.
(g) S_D: value of the ratio of the phoneme durations of the pair of biphones.
(h) w₁₄: weighting coefficient for the F₀ slope score; for example, 0.3.
(i) S_FI: score for the slope of the fundamental frequency F₀ of the pair of biphones, obtained from the combination of the change patterns of the speech waveforms of the two biphones.
<1> First, the change in F₀ is classified into three patterns: rising, level (meaning substantially no change), and falling.
<2-1> If speech waveform A and speech waveform B have the same pattern, the score is 0.
<2-2> If the patterns of speech waveform A and speech waveform B correspond to the following table, the score is 1.

[Table 1]
Waveform A    Waveform B
rising        level
level         rising
level         falling
falling       level

<2-2> If the patterns of speech waveform A and speech waveform B correspond to the following table, the score is 1.

[Table 2]
Waveform A    Waveform B
rising        falling
falling       rising

(j) w₂: weighting coefficient for the bispectrum; for example, 0.8.
(k) S_BS: distance between the vectors of the spectrum sequences of the pair of biphones, that is, the square root of the sum of the squares of the spectrum elements.

[0027] As is clear from Equation 9, the closer the similarity score M is to 0, the higher the degree of similarity.
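Equation 9 itself is not reproduced in this text. The sketch below assumes one plausible reading based on the weight definitions and on the "linear combination" wording of paragraph [0007]: M = w₁(w₁₁·S_F0 + w₁₂·S_P + w₁₃·S_D + w₁₄·S_FI) + w₂·S_BS. That form, the function names, and the reading of S_BS as a Euclidean distance between two feature vectors are assumptions; only the example weight values and the slope scoring rules come from the text.

```python
from math import sqrt

# Example weight values quoted in the text.
W1, W2 = 0.5, 0.8
W11, W12, W13, W14 = 0.3, 0.2, 0.1, 0.3

def f0_slope_score(pattern_a, pattern_b):
    """S_FI from two F0 change patterns: 'rising', 'level' or 'falling'."""
    if pattern_a == pattern_b:
        return 0
    # The text assigns a score of 1 to the combinations in both Table 1
    # (patterns one step apart) and Table 2 (opposite patterns).
    return 1

def bispectrum_distance(vec_a, vec_b):
    """S_BS read as the Euclidean distance between two feature vectors (assumed)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

def similarity_score(s_f0, s_p, s_d, s_fi, s_bs):
    """Assumed form of Equation 9: weighted prosodic score plus weighted S_BS."""
    prosodic = W11 * s_f0 + W12 * s_p + W13 * s_d + W14 * s_fi
    return W1 * prosodic + W2 * s_bs
```

Under this reading a smaller M means a closer match, which agrees with the statement that M near 0 indicates high similarity.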

[0028] Next, the configuration and operation of the speech data amount reduction processing device are described with reference to FIG. 1. In FIG. 1, the text database memory 22 stores a text database consisting of phoneme label data for transcriptions of natural utterances, and the speech waveform database memory 21 stores a speech waveform database consisting of the speech segments of the speech waveform signals that correspond to that phoneme label data. On the basis of the data in these two memories 21 and 22, the basic prosody data generation unit 40 detects, for each phoneme label, the fundamental frequency F₀ using, for example, LPC analysis, and detects the power by performing a prescribed speech waveform analysis. It stores the results as basic prosody data in the basic prosody data memory 50.

[0029] Next, on the basis of the phoneme label data in the text database memory 22 and the basic prosody data in the basic prosody data memory 50, the evaluation prosody data generation unit 41 calculates the slope of the fundamental frequency F₀ over time and the phoneme duration, in addition to the fundamental frequency F₀ and the power, and stores these four items as evaluation prosody data in the evaluation prosody data memory 51. The biphone list generation unit 42 counts the biphones (sequences of two temporally adjacent phonemes) on the basis of the phoneme label data in the text database memory 22, and stores each biphone together with its count in the biphone list memory 52. The bispectrum data generation unit 43, on the basis of the phoneme label data in the text database memory 22 and the speech waveform database in the speech waveform database memory 21, calculates the bispectrum data described above, namely the distance between the vectors of the spectrum sequences of a pair of biphones (the square root of the sum of the squares of the spectrum elements), and stores it in the bispectrum data memory 53.
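The counting performed by the biphone list generation unit 42 can be sketched as follows. The function name and the toy label sequences are hypothetical, and the sketch assumes that biphones are not formed across utterance boundaries.

```python
from collections import Counter

def build_biphone_list(utterances):
    """Count every pair of temporally adjacent phonemes (biphone).

    `utterances` is a list of phoneme-label sequences, one per utterance.
    """
    counts = Counter()
    for labels in utterances:
        counts.update(zip(labels, labels[1:]))
    return counts

# Hypothetical phoneme label data for two short utterances.
biphone_list = build_biphone_list([["a", "k", "a", "k", "a"], ["k", "a"]])
```

Each key is a (left phoneme, right phoneme) tuple and each value is the count N that is later compared with the threshold Nth.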

[0030] Further, on the basis of the evaluation prosody data in the evaluation prosody data memory 51, the biphones in the biphone list memory 52, and the bispectrum data in the bispectrum data memory 53, the evaluation data matrix generation unit 44 generates an evaluation data matrix by placing, for each biphone sequence in the text database, the four evaluation prosody data items and the one bispectrum data item side by side in the row direction, and stores the matrix in the evaluation data matrix memory 54. The speech data amount reduction processing unit 45 then executes the speech data amount reduction process of FIG. 3 on the basis of the evaluation data matrix in the evaluation data matrix memory 54 and the biphones in the biphone list memory 52. This reduces the text data in the text database memory 22 and the speech data of the speech waveform database in the speech waveform database memory 21, and the reduced data are copied to the text database memory 22a and the speech waveform database memory 21a. The speech synthesis device of FIG. 2 performs speech synthesis using the reduced data in the text database memory 22a and the speech waveform database memory 21a.

[0031] FIG. 3 is a flowchart of the data amount reduction process executed by the data amount reduction processing unit of FIG. 1. In FIG. 3, step S1 first searches the biphone list in the biphone list memory 52 for a pair of identical biphones, and step S2 determines whether such a pair exists. If YES, the process proceeds to step S3; if NO, it proceeds to step S9. In step S3, the similarity score M is calculated using Equation 9 from the evaluation prosody data and the bispectrum data for the retrieved pair of biphones.

[0032] In step S4, it is determined whether the pair of biphones is similar, specifically whether M ≤ Mth. If YES, the pair is judged to be similar and the process proceeds to step S5; if NO, the pair is judged not to be similar and the process proceeds to step S8. Here, Mth is a predetermined threshold, for example 0.1. In step S5, the number N of identical biphones is counted from the biphone list in the biphone list memory 52, and step S6 determines whether N ≥ Nth. If YES, the process proceeds to step S7; if NO, it proceeds to step S8. Here, Nth is a predetermined threshold, for example 50, and it also depends on the number of biphones in the original text database. The determination in step S6 is provided so that speech waveform data for a prescribed number of identical biphones is retained in the speech waveform database, keeping the quality of the synthesized speech at or above a prescribed level. In step S7, the phonemes of one biphone of the pair are deleted from the biphone list, and the label data and speech waveform data of the phonemes of that biphone are deleted from the text database and the speech waveform database, respectively; the process then proceeds to step S8. In step S8, a pair of biphones in a different combination is retrieved from the biphone list, and the process returns to step S2.

[0033] If the result in step S2 is NO, step S9 searches the biphone list for a different type of biphone, and step S10 determines whether one exists. If YES, the process returns to step S1; if NO, it proceeds to step S10a. In step S10a, the data in the text database memory 22 and the speech waveform database memory 21, whose speech data amount has now been reduced, are copied to the text database memory 22a and the speech waveform database memory 21a, respectively, and the speech data amount reduction process ends.
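The flow of steps S1 to S10a can be sketched as a nested loop over biphone types and over pairs of instances of the same type. This is a simplified illustration under stated assumptions: the data layout (a dict from biphone type to a list of instances), the `score` callback standing in for Equation 9, and the choice to delete the later instance of a similar pair are all hypothetical. The default thresholds are the example values Mth = 0.1 and Nth = 50 from the text.

```python
def reduce_database(instances_by_biphone, score, m_th=0.1, n_th=50):
    """Return a pruned copy of the database, following the flow of FIG. 3.

    instances_by_biphone: dict mapping a biphone type to its list of instances.
    score: function (instance_a, instance_b) -> similarity score M (lower = closer).
    """
    reduced = {}
    for biphone, instances in instances_by_biphone.items():  # S9/S10: next type
        kept = list(instances)
        i = 0
        while i < len(kept):
            j = i + 1
            while j < len(kept):                              # S1/S8: next pair
                similar = score(kept[i], kept[j]) <= m_th     # S3/S4
                if similar and len(kept) >= n_th:             # S5/S6: N >= Nth
                    del kept[j]                               # S7: drop one of the pair
                else:
                    j += 1
            i += 1
        reduced[biphone] = kept                               # S10a: copy the result
    return reduced
```

For example, with scalar stand-ins for instances and `score=lambda a, b: abs(a - b)`, near-duplicate instances are removed until the count for that biphone type would fall below `n_th`.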

[0034] In the embodiment of FIG. 3 described above, the label data and speech waveform data of the phonemes of a biphone are deleted when the determinations in both step S4 and step S6 are YES. The present invention is not limited to this, however, and the label data and speech waveform data of the phonemes of the biphone may be deleted when only the determination in step S4 is YES.

[0035] Next, a speech synthesis device is described that performs speech synthesis using the data in the text database memory 22a and the speech waveform database memory 21a, whose speech data amount has been reduced by the speech data amount reduction processing device of FIG. 1.

[0036] FIG. 2 is a block diagram of a speech synthesis device of the natural-utterance speech waveform signal concatenation type according to one embodiment of the present invention. Broadly, this embodiment is divided into the following four processing units.
(1) A speech analysis unit 10 that performs speech analysis of the speech waveform signal data of the speech waveform signal database in the speech waveform signal database memory 21a, specifically processing that includes generation of phoneme symbol sequences, alignment of phonemes, and extraction of feature parameters.
(2) A weighting coefficient learning unit 11 that determines optimum weighting coefficients by learning.
(3) A speech unit selection unit 12 that selects speech units on the basis of an input phoneme sequence and outputs index information of the speech waveform signal data corresponding to the input phoneme sequence.
(4) A speech synthesis unit 13 that, on the basis of the index information output from the speech unit selection unit 12, randomly accesses the speech waveform signal database in the speech waveform signal database memory 21a, reproduces the speech waveform signal of each phoneme candidate judged optimal, and outputs it to the speaker 14.

【００３７】 Specifically, the speech analysis unit 10 refers to the phoneme HMM memory 23 and, based on the speech segments of the input natural-speech waveform signal and the phoneme sequence corresponding to that signal, extracts and outputs index information for each phoneme in the speech waveform signal, a first acoustic feature parameter for each phoneme indicated by the index information, and a first prosodic feature parameter for each phoneme indicated by the index information. The feature parameter memory 30 stores the index information, the first acoustic feature parameters, and the first prosodic feature parameters output from the speech analysis unit 10. Next, the weight coefficient learning unit 11, based on the first acoustic feature parameters and the prosodic feature parameters stored in the feature parameter memory 30, calculates the acoustic distance in the second acoustic feature parameters between one target phoneme and the other phoneme candidates of the same phoneme type. By performing a predetermined statistical analysis for each second acoustic feature parameter on each phoneme candidate based on the calculated acoustic distances, it determines, for each target phoneme, a weight coefficient vector representing the degree of contribution of each second acoustic feature parameter for each phoneme candidate. The weight coefficient vector memory 31 stores the weight coefficient vector for each target phoneme in the second acoustic feature parameters, as determined by the weight coefficient learning unit 11, together with a weight coefficient vector, given in advance for each target phoneme, that represents the degree of contribution of the second prosodic feature parameters for each phoneme candidate. The speech unit selection unit 12 then, based on the weight coefficient vectors for each target phoneme stored in the weight coefficient vector memory 31 and the first prosodic feature parameters stored in the feature parameter memory 30, searches, for the phoneme sequence of an input natural-speech sentence, for the combination of phoneme candidates that minimizes a cost comprising a target cost, which represents the approximate cost between a target phoneme and a phoneme candidate, and a concatenation cost, which represents the approximate cost between two phoneme candidates to be concatenated adjacently; it outputs the index information of the combination found. Finally, the speech synthesis unit 13, based on the index information output from the speech unit selection unit 12, sequentially reads the speech segments of the speech waveform signal corresponding to that index information from the speech waveform signal database memory 21a, concatenates them, and outputs them to the loudspeaker 14. The speech synthesizer thereby synthesizes and outputs speech corresponding to the input phoneme sequence.

【００３８】 The processing of the speech analysis unit 10 must be performed once for every new speech waveform signal database. The processing of the weight coefficient learning unit 11 generally needs to be performed only once, and the optimal weight coefficients it obtains can be reused under different speech synthesis conditions. The processing of the speech unit selection unit 12 and the speech synthesis unit 13, in contrast, is executed each time the input phoneme sequence to be synthesized changes.

【００３９】 The speech synthesizer of this embodiment predicts all the feature parameters it needs from input given at whatever level, and selects from the speech waveform signal database in the speech waveform database memory 21a the sample (that is, the speech waveform signal of a phoneme candidate) closest to the desired speech features. At a minimum, processing is possible given only a sequence of phoneme labels, but higher-quality synthesized speech is obtained if the fundamental frequency F₀ and the phoneme durations are also given in advance. When only word information is given as input, the phoneme sequence must be predicted from a dictionary or rules, such as the phoneme hidden Markov models (hereinafter, a hidden Markov model is called an HMM) stored in the phoneme hidden Markov model memory 23. When no prosodic features are given, a standard prosody is generated from the known features of phonemes in various environments in the speech waveform signal database.

【００４０】 In this embodiment, any speech waveform signal database can be used as speech waveform signal data for synthesis, provided that text data describing the recorded content of the speech waveform signal database memory 21a at least orthographically exists, for example as the text database in the text database memory 22a. The quality of the output speech, however, depends heavily on the recording conditions, the phoneme balance of the speech waveform signal database, and similar factors. If the speech waveform signal database in the speech waveform database memory 21a is rich in content, more varied speech can be synthesized; conversely, if the database is poor, the synthesized speech has a strong sense of discontinuity and sounds choppy.

【００４１】 Next, phoneme labeling of natural speech is described. The quality of speech unit selection depends on how the phonemes in the speech waveform signal database are labeled and searched. In the preferred embodiment, the speech unit is the phoneme. First, the orthographic transcription attached to the recorded speech is converted into a phoneme sequence, which is then assigned to the speech waveform signal. Prosodic feature parameters are extracted on this basis. The input of the speech analysis unit 10 is the speech waveform signal data in the speech waveform database memory 21a accompanied by the phoneme transcription in the text database memory 22a, and its output is a feature vector or feature parameters. This feature vector is the basic unit representing a speech sample in the speech waveform signal database and is used to select the optimal speech units.

【００４２】 The first stage of the processing in the speech analysis unit 10 is the conversion of orthographic text into phoneme symbols, in order to describe how the orthographically written utterance is actually pronounced in the speech waveform signal data. The second stage associates each phoneme symbol with the speech waveform signal in order to determine the start and end times of each phoneme for measuring prosodic and acoustic features (hereinafter, this process is called phoneme alignment). The third stage generates a feature vector or feature parameters for each phoneme. The feature vector stores, as required items, the phoneme label, the start time (start position) of the phoneme in each file of the speech waveform signal database in the memory 30, the fundamental frequency F₀, the phoneme duration, and the power; as optional feature parameters it further stores information such as stress, accent type, position relative to prosodic boundaries, and spectral tilt. These feature parameters are summarized, for example, in the following table.

【００４３】[0043]

【表３】 [Table 3]
------------------------------------------------------------
Index information:
  index number (assigned to each file)
  start time (start position) of the phoneme in each file of the speech waveform signal database in the memory 30
------------------------------------------------------------
First acoustic feature parameters:
  12th-order mel-cepstral coefficients
  12th-order Δ mel-cepstral coefficients
  phoneme label
  distinctive features:
    vocalic (+) / non-vocalic (-)
    consonantal (+) / non-consonantal (-)
    interrupted (+) / continuant (-)
    checked (+) / unchecked (-)
    strident (+) / mellow (-)
    voiced (+) / unvoiced (-)
    compact (+) / diffuse (-)
    grave (+) / acute (-)
    flat (+) / plain (-)
    sharp (+) / plain (-)
    tense (+) / lax (-)
    nasal (+) / oral (-)
------------------------------------------------------------
First prosodic feature parameters:
  phoneme duration
  fundamental frequency F₀
  power
------------------------------------------------------------

【００４４】 Alternatively, the first acoustic feature parameters may preferably be formant parameters and vocal-tract source parameters. The start time (start position) in the index information, the first acoustic feature parameters, and the first prosodic feature parameters are stored in the feature parameter memory 30 for each phoneme. Each of the, for example, twelve distinctive-feature parameters attached to a phoneme label is given a value of (+) or (-). An example of the feature parameters output by the speech analysis unit 10 is shown in the following table. In the speech waveform signal database memory 21a, an index number is assigned to each file, where a file contains, for example, one paragraph consisting of several sentences, or a single sentence. To indicate the position of any phoneme within the file bearing a given index number, the start time of the phoneme, measured from the start of the file, and the duration of the phoneme are attached; together these identify the speech segment of the phoneme's speech waveform signal.

【００４５】[0045]

【表４】 [Table 4] Example of feature parameters output by the speech analysis unit 10
Index number: X0005
------------------------------------------------------------
Phoneme   Duration   Fundamental frequency   Power   ...
------------------------------------------------------------
#         120        90                      4.0     ...
s         175        98                      4.7     ...
ei        95         102                     6.5     ...
dh        30         114                     4.9     ...
ih        75         143                     6.9     ...
s         150        140                     5.7     ...
p         87         137                     5.1     ...
l         34         107                     4.9     ...
ii        150        98                      6.3     ...
z         140        87                      5.8     ...
#         253        87                      4.0     ...
------------------------------------------------------------
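The way a file index number, a start time, and a duration together identify a speech segment can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `SegmentIndex`, `sample_range`, and `read_segment`, the millisecond time unit, and the in-memory dictionary standing in for the database memory 21a are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentIndex:
    """Index information for one phoneme (hypothetical layout)."""
    file_index: str     # index number of the file, e.g. "X0005"
    start_ms: float     # start time measured from the start of the file (unit assumed)
    duration_ms: float  # phoneme duration (unit assumed)


def sample_range(seg: SegmentIndex, sample_rate_hz: int) -> tuple[int, int]:
    """Convert start time and duration into a half-open sample range."""
    start = round(seg.start_ms * sample_rate_hz / 1000)
    end = round((seg.start_ms + seg.duration_ms) * sample_rate_hz / 1000)
    return start, end


def read_segment(database: dict[str, list[float]],
                 seg: SegmentIndex,
                 sample_rate_hz: int) -> list[float]:
    """Random-access read of one speech segment from the waveform database."""
    start, end = sample_range(seg, sample_rate_hz)
    return database[seg.file_index][start:end]
```

Because each segment is addressed by file and time span only, the synthesizer can fetch units in any order without scanning the recordings.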

【００４６】 In Table 4, # denotes a pause. When selecting speech units, it is necessary to determine in advance how much each acoustic and prosodic feature parameter contributes for each phoneme; in the fourth stage, therefore, the weight coefficient of each feature parameter is determined using all the speech samples in the speech waveform signal database.

【００４７】 In the phoneme symbol sequence generation performed by the speech analysis unit 10, as noted above, any speech waveform signal database can be used as speech waveform signal data for synthesis in this embodiment, provided that at least an orthographic transcription of the recorded content exists. When only word information is given as input, the phoneme sequence must be predicted from a dictionary or rules. As for the phoneme alignment performed by the speech analysis unit 10, in read speech each word is usually pronounced close to its standard pronunciation, and hesitations and disfluencies are rare. For such speech waveform signal data, a simple dictionary lookup yields correct phoneme labels, which makes it possible to train the phoneme models of the phoneme HMM used for phoneme alignment.

【００４８】 Unlike in full speech recognition, training the phoneme models for phoneme alignment does not require strict separation of training and test speech waveform signal data; all the speech waveform signal data can be used for training. First, a model for another speaker is taken as the initial model. Allowing only the standard pronunciation or a limited set of pronunciation variants for every word, phonemes are aligned with the Viterbi training algorithm over all the speech waveform signal data so that appropriate segmentation is obtained, and the feature parameters are re-estimated. Pauses between words are handled by inter-word pause generation rules, but when alignment fails because of a pause inside a word, manual correction is required.

【００４９】 A choice must be made as to which phoneme labels to use for the phoneme transcription. If a phoneme set exists for which well-trained HMM models are available, it is advantageous to use it. Conversely, if the speech synthesizer has a complete dictionary, matching the labels of the speech waveform signal database exactly against the dictionary is also effective. Because there is latitude in the learning of the weight coefficients, the most important criterion is whether something equivalent to what the speech synthesizer later predicts can be matched in the speech waveform signal database. Subtle differences in pronunciation are captured automatically through the prosodic environment of the pronunciation, so there is no particular need to label phonemes by hand.

【００５０】 As the next stage of preprocessing, prosodic feature parameters are extracted to describe the articulatory characteristics of individual phonemes. Traditional phonetics classified speech sounds by features such as place and manner of articulation. In contrast, prosody-aware phonetics such as that of the Firthian school distinguishes clearly articulated or emphasized passages in order to capture the fine differences in voice quality that arise from differences in prosodic context. There are various ways to describe these differences; the following two are used here. At the low level, to obtain one-dimensional features, the power, the lengthening of the phoneme duration, and the fundamental frequency F₀ are each averaged over a phoneme. At the high level, prosodic boundaries and emphasized passages are marked, taking into account the above differences in prosodic features. These two kinds of features are closely related, so one can be predicted from the other, but both strongly influence the characteristics of each phoneme.

【００５１】 Just as there is freedom in how the phoneme set used to describe the speech waveform signal database is defined, there is freedom in how the prosodic feature parameters are described; the choice depends on the predictive capability of the speech synthesizer. If the speech waveform signal database is labeled in advance, the task of the speech synthesizer is to learn how to get from its internal representation to the real speech in the speech waveform signal database. If, on the other hand, the speech waveform signal database has no phoneme labels, one must begin by examining which feature parameters allow the speech synthesizer to predict the most appropriate speech units. This examination, and the learning that determines the optimal feature parameter weights, are carried out in the weight coefficient learning unit 11, which determines the weight coefficient for each feature parameter by learning.

【００５２】 Next, the weight coefficient learning process executed by the weight coefficient learning unit 11 is described. To select from the speech waveform signal database the sample best suited to the acoustic and prosodic environment of a given target speech, it is first necessary to determine, for each phonemic and prosodic environment, which features contribute and by how much. This is because the kinds of feature parameter that matter vary with the nature of the phoneme. For example, the fundamental frequency F₀ is extremely effective for selecting voiced sounds but has almost no effect on the selection of unvoiced sounds. Likewise, the influence of the acoustic features of a fricative varies with the types of the preceding and following phonemes. How much weight to place on each feature in order to select the optimal phoneme is determined automatically by the optimal weight determination process, that is, the weight coefficient learning process.

【００５３】 The first step in the optimal weight coefficient determination executed by the weight coefficient learning unit 11 is to list the features used when selecting the optimal sample from among all applicable utterance samples in the speech waveform signal database. Here, phonemic features such as place and manner of articulation are used, together with prosodic feature parameters such as the fundamental frequency F₀, phoneme duration, and power of the preceding phoneme, the current phoneme, and the following phoneme. Specifically, the second prosodic feature parameters, described in detail below, are used. In the second step, to determine for each phoneme which feature parameters matter, and how much, when choosing the best candidate, one speech sample (or phoneme speech waveform signal) is taken in turn, its acoustic distance to all other phoneme samples, including the difference in phoneme duration, is computed, and the top N2 most similar speech samples, that is, the speech segments of the speech waveform signals of the N2-best phoneme candidates, are selected.
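The N2-best selection of the second step can be sketched as below. The description states only that the acoustic distance includes the phoneme duration difference; the particular distance used here (Euclidean distance over a feature vector plus a weighted absolute duration difference) and the names `Sample`, `acoustic_distance`, `duration_weight`, and `n_best` are assumptions made for illustration.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One phoneme token from the database (hypothetical layout)."""
    label: str                   # phoneme label
    cepstrum: tuple[float, ...]  # acoustic feature vector, e.g. mel-cepstral coefficients
    duration: float              # phoneme duration


def acoustic_distance(a: Sample, b: Sample, duration_weight: float = 1.0) -> float:
    """Assumed distance: spectral Euclidean distance plus weighted duration difference."""
    spectral = math.dist(a.cepstrum, b.cepstrum)
    return spectral + duration_weight * abs(a.duration - b.duration)


def n_best(reference: Sample, samples: list[Sample], n2: int) -> list[Sample]:
    """Return the n2 samples of the same phoneme type closest to the reference."""
    candidates = [s for s in samples
                  if s is not reference and s.label == reference.label]
    candidates.sort(key=lambda s: acoustic_distance(reference, s))
    return candidates[:n2]
```

Restricting the candidates to the same phoneme label mirrors the description's comparison of one target phoneme with other candidates of the same phoneme type.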

【００５４】 In the third step, linear regression analysis is performed using those similar speech samples to obtain weight coefficients indicating the importance of each feature parameter in various acoustic and prosodic environments. As prosodic feature parameters in this linear regression analysis, for example, the following feature parameters (hereinafter, the second prosodic feature parameters) are used:
(1) the first prosodic feature parameters of the phoneme immediately preceding the current phoneme (hereinafter, the preceding phoneme);
(2) the first prosodic feature parameters of the phoneme label of the phoneme immediately following the current phoneme (hereinafter, the following phoneme);
(3) the phoneme duration of the current phoneme;
(4) the fundamental frequency F₀ of the current phoneme;
(5) the fundamental frequency F₀ of the preceding phoneme; and
(6) the fundamental frequency F₀ of the following phoneme.
Although the preceding phoneme is defined here as the single phoneme immediately before the current phoneme, it is not limited to this and may include phonemes several positions earlier. Similarly, although the following phoneme is defined as the single phoneme immediately after the current phoneme, it may include phonemes several positions later. The fundamental frequency F₀ of the following phoneme may also be omitted. Although the above embodiment obtains the weight coefficients by linear regression analysis, the invention is not limited to this; the weight coefficients may be obtained by various other statistical analyses, for example statistical analysis using a predetermined neural network.
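One way to picture the regression step is an ordinary least-squares fit in which the acoustic distance of each N2-best candidate is explained by its per-feature differences from the target, with the fitted coefficients serving as weights. The description does not spell out the regression set-up, so the formulation below (no intercept, normal equations solved by Gaussian elimination) and the names `solve_linear_system` and `regression_weights` are assumptions made for illustration only.

```python
from __future__ import annotations


def solve_linear_system(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are collinear or constant")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def regression_weights(feature_diffs: list[list[float]],
                       distances: list[float]) -> list[float]:
    """Least-squares fit of distances ~ sum_j w_j * feature_diffs[k][j]."""
    p = len(feature_diffs[0])
    xtx = [[sum(row[i] * row[j] for row in feature_diffs) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(feature_diffs, distances))
           for i in range(p)]
    return solve_linear_system(xtx, xty)
```

A feature whose differences track the acoustic distance closely receives a large coefficient, which matches the stated aim of measuring how much each feature parameter contributes.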

【００５５】 Next, the processing of the speech unit selection unit 12, which selects natural speech samples, is described. In conventional speech synthesizers, a phoneme sequence is determined for the target utterance, and target values of F₀ and phoneme duration are then computed for prosody control. In this embodiment, by contrast, prosody is computed only in order to select the optimal speech samples; it is not controlled directly.

【００５６】 FIG. 4 shows the processing of the speech unit selection unit 12 of FIG. 2 in this speech synthesizer. The inputs are the phoneme sequence of the target utterance, the weight vector over the features obtained for each phoneme, and the feature vectors representing all samples in the speech waveform signal database. The output is index information giving the positions of phoneme samples in the speech waveform signal database: the start position and duration of each speech unit whose speech waveform segments are to be concatenated (a speech unit is specifically a phoneme, although a run of several consecutive phonemes may be selected together and form a single speech unit).

【００５７】 The optimal speech units are found as the path minimizing the sum of the target cost, which represents the approximate cost of the difference from the target utterance, and the concatenation cost, which represents the approximate cost of the discontinuity between adjacent speech units. The well-known Viterbi training algorithm is used for the path search. For a target speech $t_1^n = (t_1, \ldots, t_n)$, minimizing the sum of the target and concatenation costs selects the combination of speech units $u_1^n = (u_1, \ldots, u_n)$ in the speech waveform signal database whose features are close to the target speech and whose inter-unit discontinuities are small. Indicating the positions of these speech units in the speech waveform signal database makes it possible to synthesize speech with arbitrary utterance content.
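The path search described here can be sketched as a Viterbi-style dynamic program over per-position candidate lists. This is a generic illustration under stated assumptions: the function name `select_units`, the representation of candidates as one list per target position, and the two cost callables are hypothetical and are supplied by the caller.

```python
from __future__ import annotations

from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # target specification t_i
U = TypeVar("U")  # database speech unit u_i


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    concat_cost: Callable[[U, U], float],
) -> tuple[list[U], float]:
    """Return the unit sequence minimizing total target + concatenation cost."""
    if (not targets or len(targets) != len(candidates)
            or any(len(c) == 0 for c in candidates)):
        raise ValueError("need one non-empty candidate list per target")
    # best[k]: minimal cost of any path ending in candidate k of the current position
    best = [target_cost(targets[0], u) for u in candidates[0]]
    back: list[list[int]] = []
    for i in range(1, len(targets)):
        new_best: list[float] = []
        pointers: list[int] = []
        for u in candidates[i]:
            joins = [best[k] + concat_cost(prev, u)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(joins)), key=joins.__getitem__)
            new_best.append(joins[k_min] + target_cost(targets[i], u))
            pointers.append(k_min)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best)), key=best.__getitem__)
    total = best[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)], total
```

The dynamic program evaluates each pair of adjacent candidates once, so the search cost grows linearly with the utterance length rather than exponentially with the number of candidate combinations.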

【００５８】 As shown in FIG. 4, the selection cost of a speech unit consists of the target cost $C^t(u_i, t_i)$ and the concatenation cost $C^c(u_{i-1}, u_i)$. The target cost $C^t(u_i, t_i)$ is the predicted difference between a speech unit (phoneme candidate) $u_i$ in the speech waveform signal database and the speech unit (target phoneme) $t_i$ to be realized as synthesized speech. The concatenation cost $C^c(u_{i-1}, u_i)$ is the predicted discontinuity arising at the join between the concatenated units (two phonemes to be joined) $u_{i-1}$ and $u_i$. The conventional ATR ν-Talk speech synthesis system, researched and put to practical use by the present applicant, for example, took a similar approach in minimizing target and concatenation costs; what is new in the speech synthesizer of this embodiment is that prosodic feature parameters are used directly for unit selection.

【００５９】 Next, the cost calculation is described. The target cost is a weighted sum of the differences between the elements of the feature vector of the speech unit to be realized and those of the feature vector of a candidate speech unit selected from the speech waveform signal database. Given a weight coefficient $w^t_j$ for each target sub-cost $C^t_j(t_i, u_i)$, the target cost $C^t(t_i, u_i)$ can be calculated by the following equation.

【００６０】[0060]

【数１０】 [Equation 10]
$$C^t(t_i, u_i) = \sum_{j=1}^{p} w^t_j \, C^t_j(t_i, u_i)$$

【００６１】 Here, the differences between the elements of the feature vectors are expressed by $p$ target sub-costs $C^t_j(t_i, u_i)$ (where $j$ is a natural number from 1 to $p$). In the preferred embodiment, the dimensionality $p$ of the feature vector is variable in the range from 20 to 30. In a more preferred embodiment, $p = 30$, and the feature vector or feature parameters indexed by $j$ in the target sub-costs $C^t_j(t_i, u_i)$ and the weight coefficients $w^t_j$ are the second prosodic feature parameters described above.
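The weighted sum of target sub-costs can be illustrated with a short sketch. The 0/1 sub-cost for matching versus mismatching features follows the description's statement that the target cost includes elements equal to 0 when features match and 1 when they do not; the function names and the representation of features as plain sequences are assumptions made for the example.

```python
from __future__ import annotations

from typing import Sequence


def mismatch_sub_costs(target_features: Sequence[object],
                       candidate_features: Sequence[object]) -> list[float]:
    """Sub-cost per feature: 0.0 where the features match, 1.0 where they differ."""
    if len(target_features) != len(candidate_features):
        raise ValueError("feature vectors must have the same dimensionality p")
    return [0.0 if t == u else 1.0
            for t, u in zip(target_features, candidate_features)]


def target_cost(sub_costs: Sequence[float], weights: Sequence[float]) -> float:
    """C^t(t_i, u_i) = sum over j = 1..p of w^t_j * C^t_j(t_i, u_i)."""
    if len(sub_costs) != len(weights):
        raise ValueError("need exactly one weight per sub-cost")
    return sum(w * c for w, c in zip(weights, sub_costs))
```

For instance, a candidate that differs from the target in only one distinctive feature incurs exactly the weight learned for that feature.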

【００６２】 The concatenation cost $C^c(u_{i-1}, u_i)$ is likewise expressed as a weighted sum of $q$ concatenation sub-costs $C^c_j(u_{i-1}, u_i)$ (where $j$ is a natural number from 1 to $q$). The concatenation sub-costs can be determined from the acoustic features of the speech units $u_{i-1}$ and $u_i$ to be joined. In the preferred embodiment, three concatenation sub-costs are used, that is, $q = 3$: (1) the cepstral distance at the phoneme join point, (2) the absolute difference in log power, and (3) the absolute difference in fundamental frequency F₀. These three acoustic feature parameters, together with the phoneme label of the preceding phoneme and the phoneme label of the following phoneme, are called the third acoustic feature parameters. The weight $w^c_j$ of each concatenation sub-cost $C^c_j(u_{i-1}, u_i)$ is given in advance empirically (or experimentally), in which case the concatenation cost $C^c(u_{i-1}, u_i)$ can be calculated by the following equation.

【００６３】[0063]

【数１１】 [Equation 11]
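The equation image is not reproduced in this text. Going by the surrounding definitions (a weighted sum of q concatenation sub-costs with weights w^c_j), Equation 11 presumably has the following form. This is a reconstruction from the prose, not a copy of the original figure.

```latex
C^{c}(u_{i-1}, u_i) = \sum_{j=1}^{q} w^{c}_{j}\, C^{c}_{j}(u_{i-1}, u_i)
```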

【００６４】[0064] If the phoneme candidates u_{i-1} and u_i are consecutive speech units in the speech waveform signal database, the join is natural and the concatenation cost is 0. In the preferred embodiment, the concatenation cost is determined on the basis of the first acoustic feature parameters and the first prosodic feature parameters in the feature parameter memory 30. Because it handles the three third acoustic feature parameters described above, which are continuous quantities, it takes an arbitrary analog value, for example from 0 to 1. The target cost, on the other hand, handles the 30 second acoustic feature parameters described above, which indicate, for example, whether the distinctive features of the preceding or succeeding phoneme match, and therefore includes elements expressed as digital values, for example 0 (when the features match) or 1 (when they do not). The cost of concatenating N speech units is then the sum of the target costs and concatenation costs of the individual speech units, and is expressed by the following equation.

【００６５】[0065]

【数１２】 [Equation 12]
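The equation image is not reproduced in this text. From the prose (the sum of the target cost and concatenation cost of each speech unit, plus the joins to and from the pause S named in the next paragraph), Equation 12 presumably has the following form. This is a reconstruction and should be checked against the original figure.

```latex
C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{t}(t_i, u_i)
                + \sum_{i=2}^{n} C^{c}(u_{i-1}, u_i)
                + C^{c}(S, u_1) + C^{c}(u_n, S)
```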

【００６６】[0066] Here, S denotes a pause, and C^c(S, u_1) and C^c(u_n, S) denote the concatenation costs of the join from the pause to the first speech unit and from the last speech unit to the pause, respectively. As this expression makes clear, the present embodiment treats a pause in exactly the same way as any other phoneme in the speech waveform signal database. Expressing the above equation directly in terms of the sub-costs gives the following equation.

【００６７】[0067]

【数１３】 [Equation 13]
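The equation image is not reproduced in this text. Substituting the weighted sub-cost sums for the target cost and the concatenation cost into the total cost presumably gives the following form. This is a reconstruction from the prose, not a copy of the original figure.

```latex
C(t_1^n, u_1^n) = \sum_{i=1}^{n} \sum_{j=1}^{p} w^{t}_{j}\, C^{t}_{j}(t_i, u_i)
                + \sum_{i=2}^{n} \sum_{j=1}^{q} w^{c}_{j}\, C^{c}_{j}(u_{i-1}, u_i)
                + C^{c}(S, u_1) + C^{c}(u_n, S)
```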

【００６８】[0068] The speech unit selection process determines the combination of speech units /u_1^n that minimizes the total cost given by the above equation. Because an overline cannot be written in the specification of a Japanese application, a slash (/) is used here in place of the overline.

【００６９】[0069]

【数１４】[Equation 14] /u_1^n = \min_{u_1, u_2, \ldots, u_n} C(t_1^n, u_1^n)

【００７０】[0070] In Equation 14, the function min returns the combination of phoneme candidates (that is, the phoneme sequence candidate) u_1, u_2, ..., u_n = /u_1^n that minimizes its argument C(t_1^n, u_1^n).
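To make the minimization concrete, the following is a minimal Python sketch of the total cost and of the selection in Equation 14, carried out by exhaustive enumeration. It is an illustration only: the names `total_cost` and `select_units`, the toy unit tuples, and the toy cost functions are all assumptions and do not appear in the specification. Enumeration is practical only for very small candidate sets.

```python
from itertools import product

PAUSE = "S"  # stands for the pause S in the text


def total_cost(targets, units, target_cost, concat_cost):
    """Target costs + concatenation costs between neighbours + pause joins."""
    cost = sum(target_cost(t, u) for t, u in zip(targets, units))
    cost += sum(concat_cost(a, b) for a, b in zip(units, units[1:]))
    cost += concat_cost(PAUSE, units[0]) + concat_cost(units[-1], PAUSE)
    return cost


def select_units(targets, candidates, target_cost, concat_cost):
    """Return the candidate combination with the smallest total cost."""
    return min(
        product(*candidates),
        key=lambda units: total_cost(targets, units, target_cost, concat_cost),
    )


# Hypothetical toy data: a unit is (label, position in database, F0 in Hz),
# a target is (label, desired F0 in Hz).
targets = [("a", 120.0), ("r", 118.0)]
candidates = [
    [("a", 10, 119.0), ("a", 40, 150.0)],
    [("r", 11, 117.0), ("r", 77, 118.0)],
]


def target_cost(t, u):
    return abs(t[1] - u[2]) / 100.0  # toy choice: F0 mismatch only


def concat_cost(a, b):
    if a == PAUSE or b == PAUSE:
        return 0.0  # toy choice: joins with a pause are free
    if b[1] == a[1] + 1:
        return 0.0  # consecutive units in the database join at zero cost
    return 0.5 + abs(a[2] - b[2]) / 100.0


best = select_units(targets, candidates, target_cost, concat_cost)
```

With this toy data the consecutive pair at positions 10 and 11 is selected, because its zero concatenation cost outweighs the slightly better F0 match of the unit at position 77.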

【００７１】[0071] The weighting coefficient learning process in the weighting coefficient learning unit 11 of FIG. 2 is described below. The weights of the target sub-costs are determined by linear regression analysis based on acoustic distance. The learning process can determine a different weighting coefficient for every phoneme, a weighting coefficient for each phoneme category (for example, all nasals), or a single weighting coefficient common to all phonemes; here, separate weighting coefficients are used for each phoneme. Each token (or speech sample) in the database in the feature parameter memory 30 is described by a set of first acoustic feature parameters and first prosodic feature parameters relating to the acoustic features of that token. The weighting coefficients are learned in order to determine the strength of the relationship (the degree of contribution) between each of the first acoustic and first prosodic feature parameters and the difference, or acoustic distance, in the second acoustic feature parameters of the phoneme in the token or context. The flow of processing in the linear regression analysis is as follows.

【００７２】[0072]
<1> For every sample in the speech waveform signal database that belongs to the phoneme type (or phoneme category) currently being learned, repeat the following four steps (a) to (d).
(a) Regard the selected speech sample as the target utterance.
(b) Calculate the acoustic distance between that speech sample and every other sample in the speech waveform signal database that belongs to the same phoneme type (category).
(c) Select the top N1 (for example, N1 = 20) best phoneme candidates, that is, those closest to the target phoneme.
(d) Obtain the target sub-costs C^t_j(t_i, u_i) for the target phoneme t_i itself and the top N1 samples selected in (c).
<2> Obtain the acoustic distances and the target sub-costs C^t_j(t_i, u_i) for all target phonemes t_i and their top N1 best samples.
<3> Perform linear regression analysis on the p target sub-costs to predict the degree of contribution of each of the first acoustic and first prosodic feature parameters representing the target phoneme, and thereby obtain the linear weighting coefficients of the p target sub-costs for that phoneme type (category). These weighting coefficients are used to calculate the cost described above.
Steps <1> to <3> are then repeated for all phoneme types (categories).
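The regression in step <3> can be sketched as an ordinary least-squares fit. The Python below is a minimal sketch under stated assumptions: the regression has no intercept term, it is solved through the normal equations, and the names `solve` and `learn_weights` and the toy data are hypothetical. The specification does not say how the regression is computed.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def learn_weights(sub_costs, distances):
    """Least-squares w with sum_j w[j] * sub_costs[k][j] ~ distances[k]."""
    p = len(sub_costs[0])
    xtx = [[sum(row[i] * row[j] for row in sub_costs) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * d for row, d in zip(sub_costs, distances))
           for i in range(p)]
    return solve(xtx, xty)


# Hypothetical toy data: each row holds p = 2 target sub-costs for one
# (target, candidate) pair; distances holds the measured acoustic distances.
sub_costs = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 0]]
distances = [0.5, 2.0, 2.5, 0.0, 2.0]
weights = learn_weights(sub_costs, distances)
```

In this toy case the distances were generated with weights 2.0 and 0.5, so the fit recovers those values. A larger weight means that a mismatch in that feature predicts a larger acoustic distance.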

【００７３】[0073] The purpose of the weighting coefficient learning unit 11 is to determine what weighting coefficient should be applied to each target sub-cost in order to select the speech sample that would be closest if the acoustic distance to the target speech unit could be obtained directly. An advantage of the present embodiment is that the speech segments of the speech waveform signals in the speech waveform signal database can be used directly.

【００７４】[0074] In the speech synthesis apparatus of FIG. 2 configured as described above, the speech analysis unit 10, the weighting coefficient learning unit 11, the speech unit selection unit 12, and the speech synthesis unit 13 are each implemented by a digital computer or arithmetic control device such as a microprocessing unit (MPU), whereas the text database memory 22a, the phoneme HMM memory 23, the feature parameter memory 30, and the weighting coefficient vector memory 31 are implemented by a storage device such as a hard disk. In the preferred embodiment, the speech waveform signal database memory 21a is a storage device in CD-ROM form. The processing in each of the processing units 10 to 13 of the speech synthesis apparatus of FIG. 2 is described below.

【００７５】[0075] FIG. 5 is a flowchart of the speech analysis process executed by the speech analysis unit 10 of FIG. 1. In FIG. 5, first, in step S11, the speech waveform signal of natural utterances is input from the speech waveform signal database memory 21a and A/D converted into digital speech waveform signal data, and the text data transcribing the spoken sentences of that speech waveform signal is input from the text database in the text database memory 22a. The text data is optional; if it is absent, it may be obtained by recognizing the speech waveform signal with a known speech recognition apparatus. The A/D-converted digital speech waveform signal data is divided into speech segments of, for example, 10 milliseconds each. In step S12, it is determined whether the phoneme sequence has been predicted. If it has not, in step S13 the phoneme sequence is predicted, for example using the phoneme HMMs in the phoneme HMM memory 23, and stored, after which the process proceeds to step S14. If in step S12 the phoneme sequence has been predicted or given in advance, or phoneme labels have been assigned manually, the process proceeds directly to step S14.

【００７６】[0076] In step S14, the start position and end position of each phoneme segment within a file consisting of one or more sentences of the speech waveform signal are recorded, and an index number is assigned to the file. Next, in step S15, the first acoustic feature parameters of each phoneme segment are extracted, for example using a known pitch extraction method. In step S16, phoneme labeling is performed on each phoneme segment, and the phoneme label and the corresponding first acoustic feature parameters are recorded. In step S17, the first acoustic feature parameters of each phoneme segment, the phoneme label, and the first prosodic feature parameters for the phoneme label are stored in the feature parameter memory 30 together with the index number of the file, the start position in the file, and the duration. Finally, in step S18, index information comprising the index number of the file, the start position in the file, and the duration is assigned to each phoneme segment and stored in the feature parameter memory 30, and the speech analysis process ends.
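One way to picture what steps S14 to S18 store for each phoneme segment is the record layout sketched below in Python. The class and field names, the use of seconds as the unit, and the example feature keys are all assumptions made for illustration; the specification lists only the kinds of information stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentIndex:
    """Index information assigned in step S18 (field names are hypothetical)."""
    file_index: int   # index number given to the waveform file in step S14
    start: float      # start position in the file, assumed here to be seconds
    duration: float   # duration, assumed here to be seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


@dataclass
class PhonemeEntry:
    """One phoneme segment as stored in step S17."""
    label: str        # phoneme label from step S16
    index: SegmentIndex
    acoustic: dict    # first acoustic feature parameters
    prosodic: dict    # first prosodic feature parameters


# A plain list stands in for the feature parameter memory 30.
feature_memory = [
    PhonemeEntry("a", SegmentIndex(3, 0.25, 0.5), {"f0": 120.0}, {"stress": 0}),
]
```

Keeping only the file index, start, and duration in the index means the waveform itself never has to be copied; later stages look the samples up when they are needed.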

【００７７】[0077] FIGS. 6 and 7 are flowcharts of the weighting coefficient learning process executed by the weighting coefficient learning unit 11 of FIG. 2. In FIG. 6, first, in step S21, one phoneme type is selected from the feature parameter memory 30. Next, in step S22, the second acoustic feature parameters are extracted from the first acoustic feature parameters of a phoneme having the selected phoneme type and taken as the second acoustic feature parameters of the target phoneme. In step S23, the Euclidean cepstral distance, which is the acoustic distance in the second acoustic feature parameters between the target phoneme and each remaining phoneme of the same phoneme type, and the base-2 logarithm of the phoneme duration are calculated. In step S24, it is determined whether steps S22 and S23 have been carried out for all of the remaining phonemes. If not, another remaining phoneme is selected in step S25 and the processing from step S23 is repeated.

【００７８】[0078] If the processing is complete in step S24, then in step S26 the top N1 best phoneme candidates are selected on the basis of the distances and durations obtained in step S23. Next, in step S27, the selected top N1 best phoneme candidates are ranked from first to N1-th. In step S28, a scale-converted value is calculated for each of the N1 ranked best phoneme candidates by subtracting an intermediate value from each distance. In step S29, it is determined whether steps S22 to S28 have been completed for all phoneme types and phonemes. If not, another phoneme type or phoneme is selected in step S30 and the processing from step S22 is repeated. If the processing is complete in step S29, the process proceeds to step S31 of FIG. 7.

【００７９】[0079] In FIG. 7, one phoneme type is selected in step S31. Next, in step S32, the second acoustic feature parameters of each phoneme are extracted for the selected phoneme type. In step S33, linear regression analysis is performed on the basis of the scale-converted values for the selected phoneme type to calculate the degree of contribution of each second acoustic feature parameter to the scale-converted values, and the calculated degrees of contribution are stored in the weighting coefficient vector memory 31 as the weighting coefficients for each target phoneme. In step S34, it is determined whether steps S32 and S33 have been completed for all phoneme types. If not, another phoneme type is selected in step S35 and the processing from step S32 is repeated. If the processing is complete in step S34, the weighting coefficient learning process ends. The degree of contribution of each second prosodic feature parameter is given in advance empirically (or experimentally) and is stored in the weighting coefficient vector memory 31 as a weighting coefficient vector for each target phoneme.

【００８０】[0080] FIG. 8 is a flowchart of the speech unit selection process executed by the speech unit selection unit 12 of FIG. 2. In FIG. 8, first, in step S41, the first phoneme of the input phoneme sequence is selected. Next, in step S42, the weighting coefficient vector of a phoneme having the same phoneme type as the selected phoneme is read from the weighting coefficient vector memory 31, and the target sub-costs and the necessary feature parameters are read from the feature parameter memory 30 and listed. In step S43, it is determined whether all phonemes have been processed. If not, the next phoneme is selected in step S44 and step S42 is repeated. If all phonemes have been processed in step S43, the process proceeds to step S45.

【００８１】[0081] In step S45, the total cost of each phoneme candidate is calculated for the input phoneme sequence using Equation 4. Next, in step S46, the top N2 best phoneme candidates are selected for each target phoneme on the basis of the calculated costs. In step S47, a Viterbi search using Equation 5 finds the index information of the combination of phoneme candidates that minimizes the total cost, together with the start time and duration of each of those phonemes; these are output to the speech synthesis unit 13, and the speech unit selection process ends.
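The Viterbi search of step S47 can be sketched as dynamic programming over the candidate lists. The Python below is a minimal sketch, not the embodiment's implementation: the function name `viterbi_select`, the toy unit tuples, and the toy cost tables are hypothetical, and the pause joins at the start and end of the utterance are left out for brevity.

```python
def viterbi_select(targets, candidates, target_cost, concat_cost):
    """Return (minimum total cost, best unit sequence) by dynamic programming."""
    # Each entry of best is (cost of the cheapest path ending in a candidate,
    # that path), one entry per candidate of the current target.
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for t, cands in zip(targets[1:], candidates[1:]):
        new_best = []
        for u in cands:
            # Cheapest way to reach u from any candidate of the previous target.
            cost, path = min(
                ((c + concat_cost(p[-1], u), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(t, u), path + [u]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Hypothetical toy data: a unit is (label, position in the database).
targets = ["a", "r", "a"]
candidates = [
    [("a", 5), ("a", 20)],
    [("r", 6), ("r", 21)],
    [("a", 7), ("a", 30)],
]
TOY_TARGET_COST = {5: 0.2, 20: 0.0, 6: 0.1, 21: 0.3, 7: 0.0, 30: 0.1}


def target_cost(t, u):
    return TOY_TARGET_COST[u[1]]


def concat_cost(a, b):
    return 0.0 if b[1] == a[1] + 1 else 1.0  # consecutive units join for free


cost, path = viterbi_select(targets, candidates, target_cost, concat_cost)
```

The search keeps only the cheapest path into each candidate, so its work grows with the number of targets times the square of the candidates per target rather than with the number of full combinations. That is presumably why step S46 first prunes each target to its top N2 candidates.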

【００８２】[0082] The speech synthesis unit 13 then accesses the speech waveform signal database memory 21a on the basis of the index information output from the speech unit selection unit 12 and the start time and duration of each phoneme, reads out the digital speech waveform signal data of the selected phoneme candidates, sequentially D/A converts the data, and outputs the resulting analog speech signal through the loudspeaker 14. Synthesized speech corresponding to the input phoneme sequence is thereby output from the loudspeaker 14.
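The read-out and concatenation performed by the speech synthesis unit 13 can be sketched as follows in Python. This is an illustration under assumptions: the function name `concatenate_segments`, the in-memory `waveforms` dictionary standing in for the speech waveform signal database memory 21a, and the use of seconds for start and duration are all hypothetical, and the D/A conversion and loudspeaker output are not modelled.

```python
def concatenate_segments(waveforms, selections, sample_rate):
    """Cut each selected segment out of its file and join the pieces in order."""
    output = []
    for file_index, start, duration in selections:
        first = round(start * sample_rate)
        last = first + round(duration * sample_rate)
        output.extend(waveforms[file_index][first:last])
    return output


# Hypothetical toy data: two "files" of integer samples at 8 samples per second.
waveforms = {0: list(range(100)), 1: list(range(1000, 1100))}
# Each selection is (file index, start in seconds, duration in seconds).
selections = [(1, 0.5, 0.25), (0, 0.0, 0.5)]
samples = concatenate_segments(waveforms, selections, sample_rate=8)
```

Note that the samples are copied unchanged: no pitch, duration, or spectral modification is applied, in keeping with the embodiment's aim of avoiding signal processing.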

【００８３】[0083] As described above, the speech synthesis apparatus of the present embodiment uses a large-scale database of natural speech and keeps processing to a minimum in order to maximize the naturalness of the output speech. The present embodiment consists of four processing units 10 to 13.
<Speech analysis unit 10> A processing unit that takes as input arbitrary speech waveform signal data accompanied by an orthographic transcription, and provides, for every phoneme in the speech waveform signal database, a feature vector describing its properties.
<Weighting coefficient learning unit 11> A processing unit that uses the feature vectors of the speech waveform signal database and the original waveforms of the speech waveform signal database to determine, as weight vectors, the optimal weighting coefficients of the feature parameters for selecting the speech units best suited to synthesizing the target speech.
<Speech unit selection unit 12> A processing unit that creates index information into the speech waveform signal database memory 21a from the feature vectors of all phonemes in the speech waveform signal database, the weight vectors, and a description of the utterance content of the target speech.
<Speech synthesis unit 13> A processing unit that, in accordance with the created index information, accesses non-contiguously the speech segments of the speech waveform signal data in the speech waveform signal database in the speech waveform database memory 21a, concatenates the speech segments of the target speech waveform signal, D/A converts them, and outputs them to the loudspeaker 14 to synthesize speech.

【００８４】[0084] The present embodiment requires no compression of the speech waveform signal and no modification of the speech fundamental frequency F₀ or of phoneme durations; instead, the speech samples must be labeled carefully and the best ones selected from a large-scale speech waveform signal database. The basic unit of the speech synthesis method of the present embodiment is the phoneme, which is generated by a dictionary or a text-to-phoneme conversion program, and the speech waveform signal database is required to contain sufficient variation of each phoneme, even for one and the same phoneme. The speech unit selection process selects from the speech waveform signal database the combination of phoneme samples that fits the target prosodic environment and has the least discontinuity between adjacent speech units when joined. To this end, the optimal weighting coefficient of each feature parameter is determined for each phoneme.

【００８５】[0085] The speech synthesis apparatus of the present embodiment has the following features.
<Use of prosodic information as a unit selection criterion> On the view that spectral features are inseparable from prosodic features, prosodic features were introduced into the selection criteria for speech units.
<Automatic learning of weighting coefficients for acoustic and prosodic feature parameters> How much each of the various features, such as phoneme environment, acoustic features, and prosodic features, contributes to speech unit selection is determined automatically using all of the speech samples in the speech waveform signal database, and a speech synthesis apparatus based on a text database and a speech waveform database was constructed.
<Direct concatenation of speech waveform signals> By selecting the optimal speech samples from a large-scale speech waveform signal database through the automatic learning described above, an arbitrary-speech synthesis apparatus that uses no signal processing at all was constructed.
<Treating the speech waveform signal database as external information> By treating the speech waveform signal database entirely as external information, a speech synthesis apparatus was constructed that can be used for any language and any speaker simply by replacing the speech waveform signal data stored on a CD-ROM or the like.

【００８６】[0086]

【実施例】EXAMPLES The following describes the results of an experiment in which the present inventors used the speech data amount reduction apparatus of FIG. 1 of the present embodiment to reduce the amount of speech data in a speech waveform database, and then performed speech synthesis with the speech synthesis apparatus of FIG. 2 on the basis of the reduced speech waveform database.

【００８７】[0087] First, the phoneme balance survey carried out by the present inventors is described. The survey was based on the label data of a Japanese speaker used in the text database and speech waveform database owned by the present applicant (hereinafter referred to as CHATR). In this survey, biphone data combining phonemes two at a time was created from the phoneme sequences extracted from the phoneme label data. For example, if the utterance content of the label data is 「あらゆる」 (arayuru, "every"), the extracted phoneme sequence is

【数１５】[Equation 15] # a r a y u r u #
Here, # denotes the utterance start symbol or the utterance end symbol. The biphone list obtained by combining the phonemes of this sequence two at a time is

【数１６】[Equation 16] #_a a_r r_a a_y y_u u_r r_u u_#
which comprises eight types.
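The construction of a biphone list from a phoneme sequence can be sketched in a few lines of Python. The function name `make_biphones` is hypothetical; the input is the phoneme sequence of Equation 15, with # as the utterance boundary symbol.

```python
def make_biphones(phonemes):
    """Pair each phoneme with its successor, joined by an underscore."""
    return [f"{a}_{b}" for a, b in zip(phonemes, phonemes[1:])]


phonemes = "# a r a y u r u #".split()
biphones = make_biphones(phonemes)
```

A sequence of nine symbols yields eight adjacent pairs, which here are all distinct and match the list in Equation 16.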

【００８８】[0088] Next, an example of creating a biphone list is described. The result is shown in the following table.

【００８９】[0089]

【表５】[Table 5] Example of biphone list creation
―――――――――――――――――――――――――――――――――――
Utterance content: あらゆる現実を全て自分の方へねじ曲げたのだ。 (roughly, "twisted all of reality toward oneself")
―――――――――――――――――――――――――――――――――――
Phoneme sequence:
# a r a y u r u g e N j i ts u o # s u b e t e # j i b u N n o h o o e # n e j i m a g e t a n o d a #
―――――――――――――――――――――――――――――――――――
Biphone list:
#_a a_r r_a a_y y_u u_r r_u u_g g_e e_N N_j j_i i_ts ts_u u_o o_#
#_s s_u u_b b_e e_t t_e e_#
#_j j_i i_b b_u u_N N_n n_o o_h h_o o_o o_e e_#
#_n n_e e_j j_i i_m m_a a_g g_e e_t t_a a_n n_o o_d d_a a_#
―――――――――――――――――――――――――――――――――――

【００９０】[0090] In this case, the biphone list contains 50 biphones of 44 types, so 6 of the biphones are duplicates. The biphones grouped by number of occurrences are listed in the following table.

【００９１】[0091]

【表６】[Table 6] Biphone list by number of occurrences
―――――――――――――――――――――――――――――――――――
1 occurrence: #_a a_r r_a a_y y_u u_r r_u u_g e_N N_j i_ts ts_u u_o o_# #_s s_u u_b b_e t_e #_j i_b b_u u_N N_n o_h h_o o_o o_e #_n n_e e_j i_m m_a a_g t_a a_n o_d d_a a_#
―――――――――――――――――――――――――――――――――――
2 occurrences: g_e, e_t, e_#, n_o
―――――――――――――――――――――――――――――――――――
3 occurrences: j_i
―――――――――――――――――――――――――――――――――――
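The tallies in Tables 5 and 6 can be reproduced with a short Python sketch. The variable names are hypothetical; the phoneme sequence is the one given in Table 5.

```python
from collections import Counter, defaultdict

sequence = (
    "# a r a y u r u g e N j i ts u o # s u b e t e # "
    "j i b u N n o h o o e # n e j i m a g e t a n o d a #"
).split()

# Adjacent pairs of phonemes, including pairs that span a # boundary.
biphones = [f"{a}_{b}" for a, b in zip(sequence, sequence[1:])]
counts = Counter(biphones)

# Group the biphone types by how often each occurs.
by_frequency = defaultdict(list)
for biphone, n in counts.items():
    by_frequency[n].append(biphone)
```

The 51 symbols give 50 biphones of 44 types: 39 types occur once, four occur twice (g_e, e_t, e_#, n_o), and j_i occurs three times, in agreement with the tables.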

【００９２】[0092] Next, a biphone list was created for a Japanese speaker MYA in CHATR, and the phoneme balance was examined by tallying the occurrence frequencies as follows. Information on the speech waveform database of speaker MYA is shown in the following table.

【００９３】[0093]

【表７】[Table 7] Information on the speech waveform database of speaker MYA
――――――――――――――――――
Utterance content      Number of files
――――――――――――――――――
503 sentences          503 files
22 sentences           22 files
Travel conversation    675 files
――――――――――――――――――
Total                  1200 files
――――――――――――――――――

【００９４】[0094] The results of the phoneme balance survey for the Japanese speaker MYA are shown below.

【００９５】[0095]

【表８】[Table 8] Biphone list tally
――――――――――――――――――――――――――――
Number of biphones: 60,405 (excluding labeling errors)
Number of biphone types: 415 (excluding labeling errors)
――――――――――――――――――――――――――――

【００９６】[0096]

【表９】[Table 9] The 10 most frequent biphones
―――――――――――――
Biphone   Count
―――――――――――――
o_o       1436
a_#       1326
d_e       992
m_a       988
a_i       970
n_o       955
k_a       886
t_o       786
g_a       772
i_m       744
―――――――――――――

【００９７】[0097]

【表１０】[Table 10] Biphones that occur once: 42 types
―――――――――――――――――――――――――――――――――――
I_g I_n U_b U_r d_y f_o t_y
I_kk a_ff a_gg ch_e dd_a e_dd e_gg e_hh e_ss ff_u gg_a gg_u hh_a hh_i hh_o i_hh i_jj jj_i o_hh o_zz pp_U ss_o ss_u tt_U u_dd zz_u
U_tts cch_I cch_e cch_u e_cch o_tts ssh_I ssh_U tts_U
―――――――――――――――――――――――――――――――――――

【００９８】[0098] The detailed biphone list results of the phoneme balance survey for the Japanese speaker MYA are shown below.

【００９９】[0099]

【表１１】[Table 11] Biphones that occur 500 times or more (biphone / count)
―――――――――――――――――――――――――――――――――――
o_o 1436   a_# 1326   d_e 992   m_a 988   a_i 970   n_o 955
k_a 886   t_o 786   g_a 772   i_m 744   a_r 738   t_a 716
w_a 696   n_i 678   a_s 651   o_# 604   o_k 603   n_a 594
s_U 586   r_u 582   k_u 581   o_n 573   y_o 563   a_k 551
r_i 544   i_# 536   o_r 517   k_o 507   sh_i 506   r_a 506
u_u 504   e_# 503   a_N 503
―――――――――――――――――――――――――――――――――――

[0100]

[Table 12] Biphones with 100 or more and fewer than 500 occurrences (biphone / count)
―――――――――――――――――――――――――――――――――――
#_k 490  t_e 482  m_o 465  u_n 452  r_e 444  i_n 440  sh_I 424  e_N 415
s_a 412  o_m 405  a_n 397  s_u 394  I_t 394  i_r 365  a_sh 364  #_h 363
ts_u 358  h_a 356  #_s 356  g_o 355  k_i 349  o_d 348  e_s 348  o_t 340
e_r 340  i_N 332  y_u 326  d_a 324  U_# 317  u_# 313  d_o 312  u_r 311
y_a 310  o_sh 309  ch_i 299  s_o 299  s_e 288  i_t 286  a_t 281  a_m 280
U_k 279  j_i 276  h_o 274  o_g 271  N_d 267  k_e 262  n_e 261  m_i 259
o_N 254  e_k 254  e_e 248  i_k 243  sh_o 242  r_o 239  o_y 238  N_n 238
#_n 237  i_g 234  #_o 233  u_g 232  u_k 230  i_d 227  u_d 222  e_t 218
a_d 218  i_i 217  o_h 216  o_s 214  a_g 210  o_i 205  e_n 205  m_e 200
#_# 199  I_k 198  #_m 192  tt_e 188  z_a 188  u_m 188  k_y 188  #_g 187
a_tt 181  e_d 180  e_g 178  #_i 173  b_a 170  u_N 169  e_sh 167  e_i 164
a_a 162  #_d 160  #_j 159  #_a 159  #_t 157  z_u 156  e_m 155  N_k 153
o_ch 152  j_u 152  #_y 152  i_sh 149  o_z 145  b_u 140  i_o 139  #_sh 137
i_h 137  u_i 132  N_g 131  o_w 129  i_s 129  tt_a 127  k_U 126  e_w 125
u_o 124  i_w 124  e_o 123  a_ts 118  j_o 115  a_w 114  g_e 112  o_b 111
o_j 110  a_o 110  u_t 109  U_t 109  k_I 108  a_e 106  N_t 106  u_s 102
―――――――――――――――――――――――――――――――――――

[0101]

[Table 13] Biphones with 50 or more and fewer than 100 occurrences (biphone / count)
―――――――――――――――――――――――――――――――――――
u_y 99  m_u 99  i_y 99  u_sh 98  h_i 97  h_I 97  a_ch 95  i_ch 93
ts_U 92  ch_I 92  o_a 92  N_j 92  f_u 91  u_b 90  g_u 90  g_i 90
U_s 90  #_ch 88  i_b 83  sh_a 82  f_U 81  b_e 81  a_h 81  o_tt 80
N_w 80  i_ts 78  u_j 77  N_sh 76  a_y 76  b_i 75  #_w 74  a_z 73
a_b 73  o_e 71  i_a 71  b_o 71  sh_u 70  i_tt 69  e_b 68  N_b 68
N_# 68  tt_o 67  i_z 67  #_b 67  a_j 66  i_j 65  e_a 65  #_r 65
#_f 65  u_w 63  u_h 63  o_ts 62  u_z 62  h_y 61  N_m 61  r_y 60
e_ts 57  ch_o 57  h_e 56  N_s 55  e_y 51
―――――――――――――――――――――――――――――――――――

[0102]

[Table 14] Biphones with 10 or more and fewer than 50 occurrences (biphone / count)
―――――――――――――――――――――――――――――――――――
z_e 47  u_e 47  o_f 47  N_r 47  N_o 47  I_ts 46  ch_u 45  i_e 45
i_pp 42  z_o 42  N_h 42  #_z 41  n_y 40  #_u 40  a_u 39  #_ts 38
o_u 34  n_u 34  j_a 34  g_y 34  pp_a 33  u_ts 32  p_a 32  I_s 32
#_e 32  kk_a 30  e_h 29  N_y 29  u_ch 28  e_tt 26  U_sh 26  i_u 26
N_p 26  i_kk 25  b_y 25  e_z 23  e_j 23  a_f 23  a_kk 22  U_n 21
u_tt 20  i_f 20  #_p 20  kk_u 19  o_kk 18  e_kk 18  u_f 18  i_ssh 17
pp_o 17  N_ch 17  I_ch 17  o_p 17  m_y 17  u_a 16  p_u 16  N_z 16
N_i 16  kk_o 15  ch_a 15  p_e 15  j_e 15  N_e 15  I_# 15  o_pp 14
t_i 14  U_ts 13  p_o 13  ssh_i 12  pp_u 12  kk_i 12  w_o 12  p_i 12
kk_y 11  a_pp 11  U_d 11
―――――――――――――――――――――――――――――――――――

[0103]

[Table 15] Biphones with 1 or more and fewer than 10 occurrences (biphone / count)
―――――――――――――――――――――――――――――――――――
u_kk 9  ss_e 9  i_p 9  e_u 9  N_f 9  ssh_a 8  a_ssh 8  u_pp 8
pp_i 8  u_p 8  r_U 8  f_i 8  pp_y 7  e_ch 7  U_tt 7  p_y 7
e_f 7  e_ssh 6  cch_i 6  kk_e 6  i_ss 6  N_ts 6  e_p 6  d_i 6
a_p 6  U_z 6  U_p 6  U_g 6  tts_u 5  ssh_o 5  o_cch 5  I_tt 5
I_sh 5  f_e 5  u_ssh 4  i_tts 4  i_cch 4  cch_a 4  u_ss 4  tt_i 4
p_U 4  U_h 4  N_u 4  N_a 4  a_cch 3  ss_a 3  pp_e 3  kk_U 3
dd_o 3  a_ss 3  U_kk 3  I_pp 3  j_I 3  f_a 3  I_d 3  u_cch 2
o_ssh 2  cch_o 2  sh_e 2  i_dd 2  hh_e 2  ff_e 2  e_pp 2  e_ff 2
a_hh 2  w_e 2  p_I 2  d_u 2  U_j 2  I_p 2  I_m 2  I_h 2
I_b 2  tts_U 1  ssh_U 1  ssh_I 1  o_tts 1  e_cch 1  cch_u 1  cch_e 1
cch_I 1  U_tts 1  zz_u 1  u_dd 1  tt_U 1  ss_u 1  ss_o 1  pp_U 1
o_zz 1  o_hh 1  jj_i 1  i_jj 1  i_hh 1  hh_o 1  hh_i 1  hh_a 1
gg_u 1  gg_a 1  ff_u 1  e_ss 1  e_hh 1  e_gg 1  e_dd 1  dd_a 1
ch_e 1  a_gg 1  a_ff 1  I_kk 1  t_y 1  f_o 1  d_y 1  U_r 1
U_b 1  I_n 1  I_g 1
―――――――――――――――――――――――――――――――――――

[0104] As shown above, the distribution of biphones for a given Japanese speaker is highly uneven. On this basis, the present inventors reduced the amount of speech data using the process of FIG. 3 without degrading the quality of the synthesized speech.
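The uneven distribution reported in Tables 9 to 15 comes from counting adjacent phoneme pairs over the labelled corpus. The following is a minimal sketch of such a count, not the procedure used in the patent: the function names, the toy utterances, and the reading of `#` as a pause or boundary label are assumptions made for illustration. The band edges mirror the frequency ranges used in Tables 11 to 15.

```python
from collections import Counter


def count_biphones(utterances):
    """Count adjacent phoneme pairs, written as 'left_right', over all utterances."""
    counts = Counter()
    for phonemes in utterances:
        for left, right in zip(phonemes, phonemes[1:]):
            counts[f"{left}_{right}"] += 1
    return counts


def frequency_bands(counts, edges=(1, 10, 50, 100, 500)):
    """Group biphones by the largest band edge that does not exceed their count."""
    bands = {}
    for biphone, n in counts.items():
        lower = max(e for e in edges if e <= n)
        bands.setdefault(lower, []).append(biphone)
    return bands


# Hypothetical phoneme label sequences; '#' is assumed to mark a pause.
toy = [
    ["#", "k", "o", "N", "n", "i", "ch", "i", "w", "a", "#"],
    ["#", "k", "o", "N", "b", "a", "N", "w", "a", "#"],
]
counts = count_biphones(toy)
bands = frequency_bands(counts)
print(counts.most_common(3))
```

With a real corpus, the band keyed by 1 would hold the rare biphones of Table 15 and the band keyed by 500 the frequent ones of Table 11.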

[0105] The experiment on reducing the speech waveform database used speech waveform data from a male speaker, MYA (not a professional announcer), reading a phonemically balanced sentence set (number of sentences n = 525) and a set of travel dialogue texts (number of sentences n = 675). Following the redundancy measure described above, that is, the similarity criterion, the present inventors reduced the speech waveform database to two-thirds and to one-half of its original size, that is, from 1200 utterances to 806 and 625 utterances, respectively. These databases are referred to here as the first reduced MYA and the second reduced MYA. After the amount of speech data was reduced, the scores of the resulting speech waveform databases were measured in terms of the distances for the prosodic feature parameters and the acoustic feature parameters. Table 16 shows the counts of speech files in each of the three speech waveform databases. Table 17 shows the scores for the prosodic feature parameters, and Table 18 shows the scores for the acoustic feature parameters.
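As a quick arithmetic check of the reduction targets stated above, the utterance counts can be turned into fractions of the original database. This is only an illustrative sketch; the variable names are not from the source.

```python
original_utterances = 1200
reduced_utterances = {"first reduced MYA": 806, "second reduced MYA": 625}

# Fraction of the original database that remains after each reduction.
remaining = {name: n / original_utterances for name, n in reduced_utterances.items()}

for name, fraction in remaining.items():
    print(f"{name}: {fraction:.3f} of the original utterances")
```

The first value is close to two-thirds and the second is slightly above one-half, which agrees with the description.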

[0106]

[Table 16] Reduction of the speech waveform database
―――――――――――――――――――――――――――――――――――
Database            Speech segments  Sentences in speech waveform DB  Sentences tested
―――――――――――――――――――――――――――――――――――
MYA                 1200             525/675                          1198
First reduced MYA   806              474/332                          804
Second reduced MYA  625              338/237                          623
―――――――――――――――――――――――――――――――――――
(Note) DB denotes database. The same applies below.

[0107]

[Table 17] Scores based on the prosodic feature parameters
―――――――――――――――――――――――――――――――――――
Database            Total score  Sentences in speech waveform DB  Mean score
―――――――――――――――――――――――――――――――――――
MYA                 9560         1198                             * 7.980518
First reduced MYA   6983         804                              8.685627
Second reduced MYA  5605         623                              8.997557
―――――――――――――――――――――――――――――――――――
(Note) * indicates the lowest score.

[0108]

[Table 18] Scores based on the bispectrum
―――――――――――――――――――――――――――――――――――
Database            Total score  Sentences in speech waveform DB  Mean score
―――――――――――――――――――――――――――――――――――
MYA                 15020        1198                             12.537770
First reduced MYA   9757         804                              * 12.136119
Second reduced MYA  7571         623                              12.152687
―――――――――――――――――――――――――――――――――――
(Note) * indicates the lowest score.

[0109] As is clear from Tables 17 and 18, reducing the amount of speech data in the speech waveform database causes the score for the variability of the prosodic feature parameters to deteriorate by about 0.7 points, whereas the score for the smoothness of the acoustic feature parameters improves by 0.4 points.
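The two figures quoted above can be reproduced from the mean scores in Tables 17 and 18. The sketch below is only an arithmetic illustration: the dictionary layout and names are assumptions, and the values are those printed in the Japanese tables. The reading that lower scores are better is an inference from the tables marking the lowest score with an asterisk and from the text calling the drop in the bispectrum score an improvement.

```python
mean_scores = {
    "prosodic": {"MYA": 7.980518, "first": 8.685627, "second": 8.997557},
    "bispectrum": {"MYA": 12.537770, "first": 12.136119, "second": 12.152687},
}


def change_from_full(scores):
    """Difference between each reduced database and the full MYA database."""
    return {
        name: round(value - scores["MYA"], 3)
        for name, value in scores.items()
        if name != "MYA"
    }


deltas = {kind: change_from_full(scores) for kind, scores in mean_scores.items()}
print(deltas)
```

The differences for the first reduced database are roughly +0.7 (prosodic) and -0.4 (bispectrum), so the figures in the text appear to refer to that database.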

[0110] As described above, according to the present embodiment, a predetermined similarity with respect to the prosodic feature parameters and the acoustic feature parameters is calculated for each pair of biphones on the basis of the biphone list. When the calculated similarity is equal to or greater than a predetermined first threshold and the number of identical biphones in the biphone list is equal to or greater than a predetermined second threshold, the speech segment data of the speech waveform signal for the phonemes of one biphone of that pair is deleted from the speech waveform database, thereby reducing the amount of speech data. Consequently, the memory capacity required to store the speech waveform database can be reduced and the search speed during speech synthesis can be increased, without substantially degrading the quality of the synthesized speech.
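The two-threshold rule summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing of FIG. 3: the `Segment` type, the placeholder `similarity` function (an inverse Euclidean distance), and the choice to decrement the count after each deletion are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    seg_id: int
    biphone: str
    features: tuple  # hypothetical normalized prosodic/acoustic features


def similarity(a, b):
    """Placeholder similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    dist = sum((x - y) ** 2 for x, y in zip(a.features, b.features)) ** 0.5
    return 1.0 / (1.0 + dist)


def prune(segments, sim_threshold, min_count):
    """Drop one segment of each near-duplicate pair while the biphone is plentiful."""
    counts = Counter(s.biphone for s in segments)
    groups = {}
    for s in segments:
        groups.setdefault(s.biphone, []).append(s)
    removed = set()
    for biphone, group in groups.items():
        for a, b in combinations(group, 2):
            if a.seg_id in removed or b.seg_id in removed:
                continue
            # First threshold: similarity; second threshold: number of tokens left.
            if similarity(a, b) >= sim_threshold and counts[biphone] >= min_count:
                removed.add(b.seg_id)
                counts[biphone] -= 1  # assumption: counts track deletions
    return [s for s in segments if s.seg_id not in removed]


db = [
    Segment(1, "k_a", (0.00,)),
    Segment(2, "k_a", (0.01,)),
    Segment(3, "k_a", (0.02,)),
    Segment(4, "t_y", (5.00,)),
]
kept = prune(db, sim_threshold=0.9, min_count=3)
print([s.seg_id for s in kept])
```

Because the count is decremented after each deletion in this sketch, a biphone never falls below `min_count - 1` tokens, and a biphone that occurs only once, such as those in Table 10, is never removed.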

[0111]

[Effects of the Invention] As described in detail above, according to the speech data amount reduction device for a speech synthesis device of the present invention, a predetermined similarity with respect to the prosodic feature parameters and the acoustic feature parameters is calculated for each phoneme pair on the basis of the list of phoneme pairs, and when the calculated similarity is equal to or greater than a predetermined first threshold, the speech segment data of the speech waveform signal for one of the phoneme pairs concerned is deleted from the speech waveform database, thereby reducing the amount of speech data. Consequently, the memory capacity required to store the speech waveform database can be reduced and the search speed during speech synthesis can be increased, without substantially degrading the quality of the synthesized speech.

[0112] Furthermore, when the calculated similarity is equal to or greater than the predetermined first threshold and the number of identical biphones in the biphone list is equal to or greater than a predetermined second threshold, the speech segment data of the speech waveform signal for one of the phoneme pairs concerned is deleted from the speech waveform database, thereby reducing the amount of speech data. Consequently, the memory capacity required to store the speech waveform database can be reduced and the search speed during speech synthesis can be increased, while speech waveform data for a predetermined number of identical biphones is retained in the speech waveform database, so that the quality of the synthesized speech is kept at or above a predetermined level.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of a speech data amount reduction processing device according to an embodiment of the present invention.

FIG. 2 is a block diagram of a natural-speech waveform signal concatenation type speech synthesis device according to an embodiment of the present invention.

FIG. 3 is a flowchart showing the speech data amount reduction processing executed by the speech data amount reduction processing unit of FIG. 1.

FIG. 4 is a model diagram showing the definition of the speech unit selection cost calculated by the speech unit selection unit of FIG. 2.

FIG. 5 is a flowchart of the speech analysis processing executed by the speech analysis unit of FIG. 2.

FIG. 6 is a flowchart of the first part of the weighting coefficient learning processing executed by the weighting coefficient learning unit of FIG. 2.

FIG. 7 is a flowchart of the second part of the weighting coefficient learning processing executed by the weighting coefficient learning unit of FIG. 2.

FIG. 8 is a flowchart of the speech unit selection processing executed by the speech unit selection unit of FIG. 2.
[Explanation of Reference Numerals]

10 ... speech analysis unit
11 ... weighting coefficient learning unit
12 ... speech unit selection unit
13 ... speech synthesis unit
14 ... speaker
21, 21a ... speech waveform signal database memory
22, 22a ... text database memory
23 ... phoneme HMM memory
30 ... feature parameter memory
31 ... weighting coefficient vector
40 ... basic prosody data generation unit
41 ... evaluation prosody data generation unit
42 ... biphone list generation unit
43 ... bispectrum data generation unit
44 ... evaluation data matrix generation unit
45 ... speech data reduction processing unit
50 ... basic prosody data memory
51 ... evaluation prosody data memory
52 ... biphone list memory
53 ... bispectrum data memory
54 ... evaluation data matrix memory

Continuation of front page
(56) References: JP-A-11-95796 (JP, A); JP-A-10-39889 (JP, A); JP-A-8-248975 (JP, A); JP-A-56-51800 (JP, A); JP-A-7-84590 (JP, A); JP-A-10-49193 (JP, A)
(58) Fields searched (Int. Cl.⁷, DB name): G10L 13/06

Claims

(57) [Claims]

1. A speech data amount reduction device for a speech synthesis device that comprises a storage device storing a speech waveform database consisting of speech segment data of speech waveform signals corresponding to phoneme labels, and that synthesizes an arbitrary phoneme sequence by concatenating speech segments of the natural-speech waveform signals, the speech data amount reduction device comprising: generating means for generating a list of the phoneme pairs contained in the speech waveform database; and reducing means for calculating, on the basis of the generated list of phoneme pairs, a predetermined similarity with respect to prosodic feature parameters and acoustic feature parameters for each phoneme pair, for reducing the amount of speech data by deleting from the speech waveform database, when the calculated similarity is equal to or greater than a predetermined first threshold, the speech segment data of the speech waveform signal for one of the phoneme pairs concerned, and for deleting from the speech waveform database, when the calculated similarity is equal to or greater than the predetermined first threshold and the number of identical phoneme pairs in the list of phoneme pairs is equal to or greater than a predetermined second threshold, the speech segment data of the speech waveform signal for one of the phoneme pairs concerned.

2. The speech data amount reduction device for a speech synthesis device according to claim 1, wherein the similarity is calculated using a linear combination of a similarity score for the prosodic feature parameters and a similarity score for the acoustic feature parameters, each weighted by a predetermined weighting coefficient.
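The linear combination named in claim 2 can be illustrated with a short sketch. The function name, argument names, and the default weights of 0.5 are assumptions for illustration; this section of the source does not give the actual weighting coefficients.

```python
def combined_similarity(prosodic_score, acoustic_score,
                        w_prosodic=0.5, w_acoustic=0.5):
    """Weighted sum of the two similarity scores (weights are hypothetical)."""
    return w_prosodic * prosodic_score + w_acoustic * acoustic_score


def exceeds_first_threshold(prosodic_score, acoustic_score, threshold):
    """True when the combined similarity reaches the first threshold."""
    return combined_similarity(prosodic_score, acoustic_score) >= threshold


print(combined_similarity(0.8, 0.6))
```

Changing the weights shifts how strongly prosodic or acoustic resemblance drives the deletion decision.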

3. The speech data amount reduction device for a speech synthesis device according to claim 1 or 2, wherein the prosodic feature parameters include phoneme duration, speech fundamental frequency F₀, the slope of the speech fundamental frequency F₀ with respect to elapsed time, and power.

4. The speech data amount reduction device for a speech synthesis device according to any one of claims 1 to 3, wherein the acoustic feature parameters include spectrum information.

5. A speech synthesis device comprising speech synthesis means for performing speech synthesis, on the basis of a speech waveform database whose amount of speech data has been reduced by the speech data amount reduction device for a speech synthesis device according to any one of claims 1 to 4, by retrieving phoneme candidates from the speech waveform database for the phoneme sequence of an input natural utterance sentence and concatenating them.

6. A speech synthesis device comprising speech synthesis means for performing speech synthesis, on the basis of a speech waveform database whose amount of speech data has been reduced by the speech data amount reduction device for a speech synthesis device according to any one of claims 1 to 4, by retrieving phoneme candidates from the speech waveform database for the phoneme sequence of an input natural utterance sentence and concatenating them so as to minimize a cost that includes an approximation cost between a target phoneme and a phoneme candidate and an approximation cost between phoneme candidates that are to be concatenated adjacently in time.
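The cost minimization described in claim 6 is commonly solved by dynamic programming over the candidate lattice. The following is a minimal sketch of that general technique, assuming caller-supplied `target_cost` and `join_cost` functions. It is not the selection procedure of FIG. 8, and all names and the toy costs are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (path, total) minimizing the summed target and join costs."""
    # best[i][j] = (cumulative cost, index of best predecessor) for candidate j
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev = [best[i - 1][k][0] + join_cost(p, c)
                    for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(prev)), key=prev.__getitem__)
            row.append((prev[k_min] + target_cost(targets[i], c), k_min))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path)), total


# Toy example: numbers stand in for units; the join is free only for consecutive values.
path, total = select_units(
    targets=[1, 2],
    candidates=[[1, 3], [4]],
    target_cost=lambda t, c: abs(t - c),
    join_cost=lambda a, b: 0 if b == a + 1 else 10,
)
print(path, total)
```

In the toy example the candidate closest to the first target is rejected because it joins poorly with the only available second candidate, which shows how the two cost terms trade off. A smaller database shortens every candidate list, which is how the reduction can speed up this search.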