JP2004117662A

JP2004117662A - Voice synthesizing system

Info

Publication number: JP2004117662A
Application number: JP2002279089A
Authority: JP
Inventors: Toshimitsu Minowa; 蓑輪　利光; Hirofumi Nishimura; 西村　洋文; Akira Mochizuki; 望月　亮; Toshiyuki Isono; 礒野　敏幸
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2002-09-25
Filing date: 2002-09-25
Publication date: 2004-04-15

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesizing system capable of synthesizing a voice of high quality.

SOLUTION: The voice synthesizing system is provided with an attribute vector giving means 111 for giving an attribute vector to each element, an element classifying means 112 for classifying elements into a plurality of clusters on the basis of scalar quantities which respective elements have, an attribute vector classifying means 113 for classifying attribute vectors by clusters, a cluster division means 114 for dividing clusters correspondingly to classifications of attribute vectors, a representative element determination means 115 for determining a representative element of each cluster, a representative attribute vector determination means 116 for determining a representative attribute vector of each cluster, a target attribute vector generation means 123 for generating a target attribute vector, a representative attribute vector selection means 124 for selecting a representative attribute vector most approximating the target attribute vector from representative attribute vectors, and a synthesizing means 126 for using representative elements corresponding to the selected representative attribute vector to synthesize a voice.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、音声を合成するための音声合成システム、音声合成辞書構築装置、音声合成装置、音声合成方法、音声合成辞書構築プログラム、および音声合成プログラムに関する。
【０００２】
【従来の技術】
従来の音声合成システムは、一般に、韻律素片や音声波形素片などの複数の素片が韻律条件に基づいて複数のクラスタに分類され、クラスタ毎に各クラスタを代表する代表素片が決定され、決定された複数の代表素片が韻律条件とともに音声合成辞書に予め登録されるようになっていた。
【０００３】
また、従来の音声合成システムは、音声合成する際、音声合成したいテキストの韻律条件と一致する韻律条件の代表素片を音声合成辞書内において検索し、韻律条件が一致する代表素片が音声合成辞書内に存在する場合には、韻律条件が一致する代表素片を選択し、韻律条件が一致する代表素片が音声合成辞書内に存在しない場合には、テキストの韻律条件に最も近い韻律条件の代表素片を選択し、音声を合成するようになっていた（例えば、特許文献１を参照）。
【０００４】
【特許文献１】
特開２０００−２５０５７０号公報（第１−６頁、図１）
【０００５】
【発明が解決しようとする課題】
しかしながら、このような従来の音声合成システムでは、音声合成辞書にほぼ同じ代表素片が複数登録されてしまうという問題があった。また、代表素片が代表するクラスタの分散が大きくなって、選択された代表素片間のばらつきが大きくなり、聴覚上ばらついた印象の音声が合成されてしまうという問題があった。
【０００６】
本発明はこのような問題を解決するためになされたもので、冗長な代表素片を無くすとともに、選択される代表素片間のばらつきを小さくして高品質の音声を合成することができる音声合成システム、音声合成辞書構築装置、音声合成装置、音声合成方法、音声合成辞書構築プログラム、および音声合成プログラムを提供するものである。
【０００７】
【課題を解決するための手段】
本発明の音声合成システムは、複数の韻律素片および複数の音声波形素片を含む母集合における各素片に関する複数の属性を有した属性ベクトルを前記各素片に付与する属性ベクトル付与手段と、前記各素片が有するスカラ量に基づいて前記素片を複数のクラスタに分類する素片分類手段と、前記クラスタ毎に前記属性ベクトルを分類する属性ベクトル分類手段と、前記属性ベクトルの分類に対応して前記クラスタを分割するクラスタ分割手段と、前記クラスタ毎に代表素片を決定する代表素片決定手段と、前記クラスタ毎に代表属性ベクトルを決定する代表属性ベクトル決定手段と、前記代表素片と前記代表属性ベクトルとを用いて音声を合成する際、目標となる複数の属性を有した目標属性ベクトルを生成する目標属性ベクトル生成手段と、前記代表属性ベクトルの中から前記目標属性ベクトルに最も近いものを選択する代表属性ベクトル選択手段と、選択された前記代表属性ベクトルに対応する前記代表素片を用いて音声を合成する合成手段とを備える構成を有している。
【０００８】
この構成により、冗長な代表素片を無くして記憶容量を節約することができるとともに、代表素片が代表するクラスタの分散を小さくして代表素片間のばらつきを小さくすることにより高品質の音声を合成することができる。
【０００９】
本発明の音声合成システムは、前記クラスタにおけるクラスタ歪の大小を判定するクラスタ歪判定手段を備え、前記素片の分類と前記属性ベクトルの分類とを繰り返し、前記クラスタ歪判定手段は、前記クラスタ歪が予め決められた閾値以下になったとき、分類を停止させるようにした構成を有している。
【００１０】
この構成により、詳細なクラスタが得られることになり、代表素片が代表するクラスタの分散が小さくなって、高品質の音声を合成することができる。
【００１１】
本発明の音声合成システムは、前記クラスタ歪判定手段が、前記クラスタ歪の変化量が予め決められた閾値以下になったとき、分類を停止させるようにした構成を有している。
【００１２】
この構成により、冗長なクラスタリングを行なわないで済むとともに、最適な代表素片が音声合成辞書に登録され、高品質の音声合成をすることができる。
【００１３】
本発明の音声合成システムは、前記素片分類手段が、前記音声波形素片、アクセント句毎の基本周波数パターン、アクセント句毎の基本周波数の最大値、音素毎の継続時間長、および音素毎のパワーパターンの少なくとも一つを前記素片として分類するようにした構成を有している。
【００１４】
この構成により、音声合成に用いられる各種の素片について適当な代表素片が音声合成辞書に登録されて、音声合成の際に適当な代表素片が選択されることになり、高品質の音声を合成することができる。
【００１５】
本発明の音声合成システムは、前記素片分類手段が、聴覚的な検知限に基づいて前記素片を分類するようにした構成を有している。
【００１６】
この構成により、聴覚的に無駄のない代表素片が音声合成辞書に登録されて、音声合成の際に適当な代表素片が選択されることになり、高品質の音声を合成することができる。
【００１７】
本発明の音声合成システムは、前記素片分類手段が、前記素片に関して表示し、前記素片を分類するための指示入力を受け付けるようにした構成を有している。
【００１８】
この構成により、視覚的に確認あるいは指示された適当な代表素片が音声合成辞書に登録されて、音声合成の際に適当な代表素片が選択されることになり、高品質の音声を合成することができる。
【００１９】
本発明の音声合成システムは、前記属性ベクトル分類手段が、ＬＢＧアルゴリズムおよびＫ平均アルゴリズムの何れかを用いて前記属性ベクトルを分類するようにした構成を有している。
【００２０】
この構成により、ＬＢＧアルゴリズムの場合には、クラスタ歪が予め設定した閾値を下回るクラスタを得ることができるので高品質の音声を合成することができ、Ｋ平均アルゴリズムの場合には、クラスタ数を予め設定できるので高品質の音声を合成するのに際して必要なハードウエア資源を設計し易くすることができる。
【００２１】
本発明の音声合成システムは、前記代表属性ベクトル選択手段が、ユークリッド距離およびマハラノビス距離の何れかを用いて前記目標属性ベクトルと前記代表属性ベクトルとの距離を算出するようにした構成を有している。
【００２２】
この構成により、目標属性ベクトルと代表属性ベクトルとの近さを精密に評価することができ、高品質の音声を合成することができる。
【００２３】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を用いて説明する。
【００２４】
（第１の実施の形態）
図１は、本発明の第１の実施の形態の音声合成システムを示すブロック図である。
【００２５】
図１において、音声合成システムは、複数の韻律素片および複数の音声波形素片を含む音声コーパス１０１と、音声合成に用いる代表素片が保存される音声合成辞書１０２と、音声コーパス１０１内の素片を分類して音声合成に用いる代表素片を決定し音声合成辞書１０２を構築する音声合成辞書構築装置１１０と、入力されたテキストに対応する音声を音声合成辞書１０２内の代表素片を用いて合成する音声合成装置１２０とを備える。
【００２６】
音声合成辞書構築装置１１０は、音声コーパス１０１内の各素片に関する複数の属性を有した属性ベクトルを各素片に付与する属性ベクトル付与手段１１１と、音声コーパス１０１内の各素片が有するスカラ量に基づいて音声コーパス１０１内の素片を複数のクラスタに分類する素片分類手段１１２と、素片分類手段１１２によって生成されたクラスタ毎に属性ベクトルを分類する属性ベクトル分類手段１１３と、素片分類手段１１２によって生成された各クラスタを属性ベクトル分類手段１１３による属性ベクトルの分類に対応して分割するクラスタ分割手段１１４と、クラスタ毎に代表素片を決定する代表素片決定手段１１５と、クラスタ毎に代表属性ベクトルを決定する代表属性ベクトル決定手段１１６と、代表属性ベクトルと代表素片とを関連付けて音声合成辞書１０２に登録する登録手段１１７とを備える。
【００２７】
音声合成装置１２０は、音声合成されるテキストが入力されるテキスト入力手段１２１と、テキスト入力手段１２１に入力されたテキストについて言語解析および意味解析を行う解析手段１２２と、テキスト入力手段１２１に入力されたテキストについてパラ言語情報その他の追加情報が入力される追加情報入力手段１３１と、解析手段１２２における解析結果と追加情報入力手段１３１に入力された追加情報とに基づいて、音声合成の目標となる素片毎に音声合成の目標となる複数の属性を有した目標属性ベクトルを生成する目標属性ベクトル生成手段１２３と、目標属性ベクトルと音声合成辞書１０２に登録された代表属性ベクトルとの近さを算出して代表属性ベクトルの中から目標属性ベクトルに最も近いものを選択する代表属性ベクトル選択手段１２４と、選択された代表属性ベクトルに対応する代表素片を音声合成辞書１０２から取得する代表素片取得手段１２５と、代表素片取得手段１２５によって取得された代表素片を変形して接続することにより音声を合成する合成手段１２６と、合成手段１２６によって合成された音声を出力する音声出力手段１２７とを備える。
【００２８】
以下、本実施の形態の音声合成システムの動作について説明する。
【００２９】
まず、図２のフローチャートを用いて、音声合成辞書構築装置１１０の動作について説明する。
【００３０】
複数の韻律素片と複数の音声波形素片とを含む母集合が音声コーパス１０１に予め記憶されている。ここで、韻律素片には、アクセント句の基本周波数パターン、アクセント句の基本周波数の最大値、音素毎の継続時間長、および、音素毎のパワーパターンが含まれる。また、音声波形素片には、音素を構成する波形データが含まれる。
【００３１】
このような音声コーパス１０１内の韻律素片および音声波形素片のそれぞれに、属性ベクトル付与手段１１１によって、属性ベクトルが付与される（Ｓ２０１）。ここで、属性ベクトルには、アクセント句のモーラ数、アクセント型、品詞、アクセント句の位置、アクセント句の先行音素、アクセント句の後続音素、係り受けや係り先などの構文情報、その他の素片の属性が要素として含まれる。なお、アクセント句のモーラ数、アクセント型、品詞、およびアクセント句の位置といった属性は、先行アクセント句、当該アクセント句、および後続アクセント句に分けて与えられる。さらに、属性ベクトルには、優しい口調や哀惜口調といった口調、アナウンサ様式やコマーシャル様式といった発話様式、喜びや怒りといった感情、その他のパラ言語情報を示す属性が要素として含まれる。
【００３２】
属性ベクトルａ_ｋは、数式１で定義される。
【数式１】

ただし、
ｋ＝１，…，Ｎ
ここで、識別番号ｋは、分類の対象となる素片を識別する番号である。また、要素δ_ｋｉは、識別番号ｋの素片が注目する属性を有することを示す「１」、および、識別番号ｋの素片が注目する属性を有しないことを示す「０」の何れかの値が与えられる。なお、要素δ_ｋｉは、通常、言語解析や意味解析の結果により値が与えられるが、一部、操作などによって人が適宜に判断することにより値が与えられる場合がある。
【００３３】
次に、素片分類手段１１２によって、音声コーパス１０１内の韻律素片および音声波形素片がそれぞれの物理的な量を示すスカラ量に基づいて複数のクラスタに物理的に分類される（Ｓ２０２）。具体的には、聴覚的な検知限に基づいて素片が分類される。ここで、聴覚的な検知限とは、人の聴覚によって音が区別される尺度である。アクセント句の基本周波数の最大値の場合には例えば０．１ｏｃｔａｖｅといった値が用いられ、音素の継続時間長の場合には例えば５ｍｓといった値が用いられ、音声波形素片の場合には例えば３ｄＢ程度のスペクトル差を用いる。
【００３４】
次に、属性ベクトル分類手段１１３によって、クラスタ毎に属性ベクトルが分類される（Ｓ２０３）。
【００３５】
属性ベクトルの分類には、例えば、Ｋ平均（Ｋ−ｍｅａｎｓ）アルゴリズムや、ＬＢＧアルゴリズム、またはＬＢＧアルゴリズムに似た下記（Ｓ２０３１からＳ２０３４まで）に示すアルゴリズムを用いる。
【００３６】
まず、クラスタの辺縁の近傍で仮のセントロイド（ｃｅｎｔｒｏｉｄ）を２個決める（Ｓ２０３１）。次に、２個のセントロイドのどちらに近いかを判定して集合を２分割する（Ｓ２０３２）。次に、分割された集合のそれぞれにおいて改めてセントロイドを求め、歪み（ｄｉｓｔｏｒｔｉｏｎ）を計算する。（Ｓ２０３３）。ここで、計算された歪みが閾値以下であるか、歪みの変化量が閾値以下である場合には集合の分割を停止し、それ以外の場合には集合を２分割するステップ（Ｓ２０３２）に戻って集合の分割を繰り返す（Ｓ２０３４）。
【００３７】
次に、クラスタ分割手段１１４によって、属性ベクトルの分類に対応して各クラスタが分割される（Ｓ２０４）。ここで、属性ベクトルが互いに異なる集合に分類された素片同士は、互いに異なるクラスタに分類される。
【００３８】
次に、代表素片決定手段１１５によって、クラスタ毎に代表素片が決定される（Ｓ２０５）。例えば、クラスタ毎にセントロイドが代表素片とされる。
【００３９】
次に、代表属性ベクトル決定手段１１６によって、クラスタ毎に代表属性ベクトルが決定される（Ｓ２０６）。
【００４０】
代表属性ベクトルは、各クラスタがどのような特徴を強く持つかを示すため「特徴ベクトル」あるいは「説明ベクトル」ともいう。クラスタＣ_ｌの代表属性ベクトルｆ_ｌは、数式２に示すように、正規化しておく。
【数式２】

ただし、
【数式３】

ｌ＝１，２，…，Ｌ
ここで、識別番号ｌは、クラスタを識別する番号である。また、要素ｓ_ｉは、クラスタＣ_ｌ内の各素片に付与された各属性ベクトルａ_ｋのｉ番目の要素δ_ｋｉの総和である。また、正規化係数ｒ_ｉは、代表属性ベクトルｆ_ｌの要素ｓ_ｉを正規化するための係数であって、例えばクラスタＣ_ｌ内の素片の総数である。
【００４１】
次に、登録手段１１７によって、代表属性ベクトルと代表素片とが関連付けられて音声合成辞書１０２に登録される（Ｓ２０７）。
【００４２】
以上のように、音声コーパス１０１内の複数の素片は、物理空間と属性ベクトル空間との相互の分類がされた状態となり、物理空間と属性ベクトル空間との両方において互いに異なる複数の代表素片が生成される。具体的には、スカラ量および代表属性ベクトルの両方において互いに異なる複数の代表素片が生成された状態となる。また、初段の物理空間における分類（Ｓ２０２）によって生成された初期のクラスタの分散が大きかった場合、初期のクラスタを説明する互いに距離が離れた複数の代表属性ベクトルが得られたことになる。
【００４３】
以下、図３のフローチャートを用いて、音声合成装置１２０の動作について説明する。
【００４４】
まず、テキスト入力手段１２１によって、音声合成するためのテキストが入力される（Ｓ３０１）。また、必要に応じて、追加情報入力手段１３１によって、パラ言語情報その他の追加情報が入力される。
【００４５】
次に、解析手段１２２によって、入力されたテキストについて言語解析および意味解析が行なわれる（Ｓ３０２）。
【００４６】
次に、目標属性ベクトル生成手段１２３によって、音声合成の目標となる素片毎に目標属性ベクトルが生成される（Ｓ３０３）。具体的には、言語解析および意味解析の解析結果と、口調、発話様式および感情を特定するパラ言語情報とに基づいて目標属性ベクトルが作成される。
【００４７】
次に、代表属性ベクトル選択手段１２４によって、生成された目標属性ベクトルと音声合成辞書１０２に記憶された複数の代表属性ベクトルとについて近さが計算され、代表属性ベクトルの中から目標属性ベクトルに最も近いものが選択される（Ｓ３０４）。具体的には、数式４に示すように、目標属性ベクトルｇ_ｊと各クラスタの代表属性ベクトルｆ_ｌとの内積ｐ_ｌを計算し、数式５に示すように、内積ｐ_ｌの総和で各内積ｐ_ｌを正規化し、正規化された内積ｗ_ｌで、目標属性ベクトルｇ_ｊと代表属性ベクトルｆ_ｌとの近さが評価される。
【数式４】

ただし、
【数式５】

【００４８】
なお、内積を求める代わりに、ユークリッド距離またはマハラノビス距離を計算してもよい。内積は、一般には、計算対象となる二つの属性ベクトルの中で「０」以外の要素を対象とした距離評価となるが、音声の場合、「０」であることも重要な情報を担う場合があるので、ユークリッド距離やマハラノビス距離により、「０」となる要素も距離に反映させるとよい。
【００４９】
次に、代表素片取得手段１２５によって、目標属性ベクトルに最も近い代表属性ベクトルに対応する代表素片が音声合成辞書１０２から取り出される（Ｓ３０５）。
【００５０】
次に、合成手段１２６によって、代表素片を用いて音声が合成され（Ｓ３０６）、音声出力手段１２７によって、合成音声が出力される（Ｓ３０７）。
【００５１】
なお、図２および図３に示した処理は、それぞれプログラムによってコンピュータに実行させるようにしてもよい。
【００５２】
以上説明したように、本実施の形態の音声合成システムは、複数の韻律素片および複数の音声波形素片からなる母集合が記憶された音声コーパス１０１と、音声コーパス１０１内の各素片に関する複数の属性を有した属性ベクトルを音声コーパス１０１内の各素片に付与する属性ベクトル付与手段１１１と、音声コーパス１０１内の各素片が有するスカラ量に基づいて音声コーパス１０１内の素片を複数のクラスタに分類する素片分類手段１１２と、音声コーパス１０１内のクラスタ毎に属性ベクトルを分類する属性ベクトル分類手段１１３と、属性ベクトル分類手段１１３における属性ベクトルの分類に対応して音声コーパス１０１内のクラスタを分割するクラスタ分割手段１１４と、音声コーパス１０１内のクラスタ毎に代表素片を決定する代表素片決定手段１１５と、音声コーパス１０１内のクラスタ毎に代表属性ベクトルを決定する代表属性ベクトル決定手段１１６と、代表素片決定手段１１５によって決定された代表素片と代表属性ベクトル決定手段１１６によって決定された代表属性ベクトルとを関連付けて音声合成辞書１０２に登録する登録手段１１７と、音声合成辞書１０２内の代表素片と代表属性ベクトルとを用いて音声を合成する際、音声合成に用いる素片毎に目標となる複数の属性を有した目標属性ベクトルを生成する目標属性ベクトル生成手段１２３と、音声合成辞書１０２内の代表属性ベクトルの中から目標属性ベクトルに最も近いものを選択する代表属性ベクトル選択手段１２４と、代表属性ベクトル選択手段１２４によって選択された代表属性ベクトルに対応する代表素片を用いて音声を合成する合成手段１２６とを備えるので、冗長な代表素片を無くすとともに代表素片が代表するクラスタの分散を小さくして代表素片間のばらつきを小さくすることができ、したがって、多数の制御要因に基づく各目標ベクトルに最適な代表素片を選択することができることになり、高品質の音声を合成することができる。
【００５３】
（第２の実施の形態）
図４は、本発明の第２の実施の形態の音声合成システムを示すブロック図である。図４において、図１に示された第１の実施の形態の音声合成システムの構成要素と同じ構成要素には同じ符号を付してある。
【００５４】
本実施の形態の音声合成辞書構築装置４１０は、音声コーパス１０１内のクラスタにおけるクラスタ歪の大小を判定するクラスタ歪判定手段４４１を備え、素片分類手段１１２と属性ベクトル分類手段１１３とクラスタ分割手段１１４とに、属性ベクトルの分類と素片の分類とを繰り返えさせ、クラスタ歪が予め決められた閾値以下になったと判定したとき、素片分類手段１１２と属性ベクトル分類手段１１３とクラスタ分割手段１１４とにおける分類を停止させるようになっている。また、クラスタ歪判定手段４４１が、クラスタ歪の変化量が予め決められた閾値以下になったと判定したときも、素片の分類と属性ベクトルの分類とを停止させるようになっている。
【００５５】
次に、図５のフローチャートを用いて、音声合成辞書構築装置４１０の動作について説明する。
【００５６】
複数の韻律素片と複数の音声波形素片とを含む母集合が音声コーパス１０１に予め記憶されている。
【００５７】
このような音声コーパス１０１内の韻律素片および音声波形素片のそれぞれに、属性ベクトル付与手段１１１によって、属性ベクトルが付与される（Ｓ５０１）。
【００５８】
次に、素片分類手段１１２によって、音声コーパス１０１内の韻律素片および音声波形素片がそれぞれのスカラ量に基づいて複数のクラスタに分類される（Ｓ５０２）。例えばＬＢＧアルゴリズムを用いると最初に分割されたクラスタの数は２つとなる。
【００５９】
次に、属性ベクトル分類手段１１３によって、クラスタ毎に属性ベクトルが分類される（Ｓ５０３）。
【００６０】
次に、クラスタ分割手段１１４によって、属性ベクトルの分類に対応して各クラスタが分割される（Ｓ５０４）。
【００６１】
次に、クラスタ歪判定手段４４１によって、各クラスタにおけるクラスタ歪の大小が判定される（Ｓ５０５）。具体的には、代表素片とクラスタ内の各素片との距離を計算し、その総和（自乗誤差）をクラスタ歪とし、このクラスタ歪みが、聴覚的な検知限などの予め決められた閾値以下か否かを判定する。クラスタ歪がクラスタ歪比較用の閾値を越えるときには、素片の分類と属性ベクトルの分類と（Ｓ５０２からＳ５０４）とを繰り返し、クラスタ歪がクラスタ歪比較用の閾値以下になったときには、分類が停止される。このようなクラスタ歪自体と閾値との比較に併せて、クラスタ歪の変化量と予め決められた変化量比較用の閾値とを比較し、クラスタ歪の変化量が変化量比較用の閾値以下になったとき、分類を停止させる。
【００６２】
次に、代表素片決定手段１１５によって、クラスタ毎に代表素片が決定される（Ｓ５０６）。
【００６３】
次に、代表属性ベクトル決定手段１１６によって、クラスタ毎に代表属性ベクトルが決定される（Ｓ５０７）。
【００６４】
次に、登録手段１１７によって、代表属性ベクトルと代表素片とが関連付けられて音声合成辞書１０２に登録される（Ｓ５０８）。
【００６５】
なお、図５に示した処理は、プログラムによってコンピュータに実行させるようにしてもよい。
【００６６】
以上説明したように、本実施の形態の音声合成システムは、クラスタ歪の大小を判定するクラスタ歪判定手段４４１を備え、素片の分類と属性ベクトルの分類とを繰り返し、クラスタ歪判定手段４４１は、クラスタ歪が予め決められた閾値以下になったとき、分類を停止させるようにしたので、詳細なクラスタが得られることになり、代表素片が代表するクラスタの分散が小さくなって、高品質の音声を合成することができる。また、クラスタ歪判定手段４４１が、クラスタ歪の変化量が予め決められた閾値以下になったとき、分類を停止させるようにすることにより、冗長なクラスタリングを行なわないで済むとともに、最適な代表素片が音声合成辞書１０２に登録されることになり、高品質の音声合成をすることができる。
【００６７】
なお、前述の第１の実施の形態および第２の実施の形態において、素片の母集合が登録された音声コーパスと代表素片が登録された音声合成辞書とを別にして構成した例について説明したが、音声コーパスと音声合成辞書とを一体にして構成してもよい。
【００６８】
また、音声合成辞書は、音声合成装置に組み込まれていてもよい。
【００６９】
また、素片分類手段１１２は、音声波形素片または韻律素片を視覚的なパターンなどで画面に表示し、素片を分類するための指示入力を受け付けるようにしてもよい。
【００７０】
【発明の効果】
本発明によれば、冗長な代表素片をなくすとともに、代表素片間のばらつきを小さくして高品質の音声を合成することができるという優れた効果を有する音声合成システム、音声合成辞書構築装置、音声合成装置、音声合成方法、音声合成辞書構築プログラムおよび音声合成プログラムを提供することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態の音声合成システムを示すブロック図
【図２】本発明の第１の実施の形態の音声合成システムにおける音声合成辞書構築処理を示すフローチャート
【図３】本発明の第１の実施の形態および第２の実施の形態の音声合成システムにおける音声合成処理を示すフローチャート
【図４】本発明の第２の実施の形態の音声合成システムを示すブロック図
【図５】本発明の第２の実施の形態の音声合成システムにおける音声合成辞書構築処理を示すフローチャート
【符号の説明】
１０１　音声コーパス
１０２　音声合成辞書
１１０、４１０　音声合成辞書構築装置
１１１　属性ベクトル付与手段
１１２　素片分類手段
１１３　属性ベクトル分類手段
１１４　クラスタ分割手段
１１５　代表素片決定手段
１１６　代表属性ベクトル決定手段
１１７　登録手段
４４１　クラスタ歪判定手段
１２０　音声合成装置
１２１　テキスト入力手段
１２２　解析手段
１２３　目標属性ベクトル生成手段
１２４　代表属性ベクトル選択手段
１２５　代表素片取得手段
１２６　合成手段
１２７　音声出力手段
１３１　追加情報入力手段
[0001]
[Technical field of the invention]
The present invention relates to a speech synthesis system for synthesizing speech, a speech synthesis dictionary construction device, a speech synthesis device, a speech synthesis method, a speech synthesis dictionary construction program, and a speech synthesis program.
[0002]
[Prior art]
In a conventional speech synthesis system, generally, a plurality of segments such as a prosodic segment and a speech waveform segment are classified into a plurality of clusters based on a prosodic condition, and a representative segment representing each cluster is determined for each cluster. The plurality of determined representative segments are registered in advance in the speech synthesis dictionary together with the prosody conditions.
[0003]
Further, when synthesizing speech, the conventional speech synthesis system searches the speech synthesis dictionary for a representative segment whose prosodic condition matches that of the text to be synthesized. If such a representative segment exists in the dictionary, it is selected. If no such representative segment exists, the representative segment whose prosodic condition is closest to that of the text is selected, and speech is synthesized from it (see, for example, Patent Document 1).
[0004]
[Patent Document 1]
JP-A-2000-250570 (pages 1-6, FIG. 1)
[0005]
[Problems to be solved by the invention]
However, such a conventional speech synthesis system has the problem that multiple, nearly identical representative segments end up registered in the speech synthesis dictionary. A further problem is that the variance of the cluster represented by each representative segment becomes large, so the selected representative segments vary widely and the synthesized speech sounds perceptually uneven.
[0006]
The present invention has been made to solve these problems. It provides a speech synthesis system, a speech synthesis dictionary construction device, a speech synthesis device, a speech synthesis method, a speech synthesis dictionary construction program, and a speech synthesis program that eliminate redundant representative segments and reduce the variation among the selected representative segments, and can thereby synthesize high-quality speech.
[0007]
[Means for Solving the Problems]
The speech synthesis system of the present invention comprises: attribute vector assigning means for assigning, to each segment in a population including a plurality of prosodic segments and a plurality of speech waveform segments, an attribute vector having a plurality of attributes relating to that segment; segment classification means for classifying the segments into a plurality of clusters based on a scalar quantity of each segment; attribute vector classification means for classifying the attribute vectors within each cluster; cluster dividing means for dividing the clusters in accordance with the classification of the attribute vectors; representative segment determining means for determining a representative segment for each cluster; representative attribute vector determining means for determining a representative attribute vector for each cluster; target attribute vector generating means for generating, when speech is synthesized using the representative segments and the representative attribute vectors, a target attribute vector having a plurality of target attributes; representative attribute vector selecting means for selecting, from the representative attribute vectors, the one closest to the target attribute vector; and synthesizing means for synthesizing speech using the representative segment corresponding to the selected representative attribute vector.
[0008]
With this configuration, storage capacity can be saved by eliminating redundant representative segments, and high-quality speech can be synthesized because the variance of the cluster represented by each representative segment, and hence the variation among representative segments, is reduced.
[0009]
The speech synthesis system of the present invention further comprises cluster distortion determining means for determining the magnitude of the cluster distortion of the clusters. The classification of the segments and the classification of the attribute vectors are repeated, and the cluster distortion determining means stops the classification when the cluster distortion becomes equal to or less than a predetermined threshold.
[0010]
With this configuration, a detailed cluster can be obtained, the variance of the cluster represented by the representative segment is reduced, and high-quality speech can be synthesized.
[0011]
The speech synthesis system according to the present invention has a configuration in which the cluster distortion determination means stops the classification when the amount of change in the cluster distortion becomes equal to or less than a predetermined threshold.
[0012]
With this configuration, it is not necessary to perform redundant clustering, and the optimal representative segment is registered in the speech synthesis dictionary, so that high-quality speech synthesis can be performed.
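The two stopping conditions described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `cluster_distortion` and `should_stop` and their parameters are hypothetical, and the distortion is taken to be the sum of squared distances between a cluster's representative and its members, as the second embodiment describes.

```python
from typing import Optional, Sequence


def cluster_distortion(representative: Sequence[float],
                       members: Sequence[Sequence[float]]) -> float:
    """Sum of squared Euclidean distances from the representative to each member."""
    return sum(
        sum((x - r) ** 2 for x, r in zip(member, representative))
        for member in members
    )


def should_stop(distortion: float,
                previous: Optional[float],
                distortion_threshold: float,
                change_threshold: float) -> bool:
    """Return True when either stopping condition holds.

    Condition 1: the distortion itself is at or below its threshold.
    Condition 2: the change since the previous pass is at or below its threshold.
    """
    if distortion <= distortion_threshold:
        return True
    if previous is not None and abs(previous - distortion) <= change_threshold:
        return True
    return False
```

The second condition is what avoids redundant clustering passes: once another pass no longer reduces the distortion appreciably, further splitting is skipped even though the absolute threshold has not been reached.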
[0013]
In the speech synthesis system of the present invention, the segment classification means classifies, as the segments, at least one of the speech waveform segments, the fundamental frequency pattern of each accent phrase, the maximum value of the fundamental frequency of each accent phrase, the duration of each phoneme, and the power pattern of each phoneme.
[0014]
With this configuration, an appropriate representative segment is registered in the speech synthesis dictionary for each kind of segment used in speech synthesis, and an appropriate representative segment is selected at synthesis time, so that high-quality speech can be synthesized.
[0015]
The speech synthesis system according to the present invention has a configuration in which the segment classification means classifies the segments based on an auditory detection limit.
[0016]
With this configuration, representative segments with no perceptual redundancy are registered in the speech synthesis dictionary, and an appropriate representative segment is selected at synthesis time, so that high-quality speech can be synthesized.
[0017]
In the speech synthesis system of the present invention, the segment classification means displays information about the segments and accepts instruction input for classifying the segments.
[0018]
With this configuration, appropriate representative segments that have been visually checked or specified are registered in the speech synthesis dictionary, and an appropriate representative segment is selected at synthesis time, so that high-quality speech can be synthesized.
[0019]
The speech synthesis system according to the present invention has a configuration in which the attribute vector classifying means classifies the attribute vector using one of an LBG algorithm and a K-means algorithm.
[0020]
With this configuration, when the LBG algorithm is used, clusters whose cluster distortion is below a preset threshold can be obtained, so that high-quality speech can be synthesized. When the K-means algorithm is used, the number of clusters can be set in advance, which makes it easier to design the hardware resources needed to synthesize high-quality speech.
[0021]
In the speech synthesis system of the present invention, the representative attribute vector selecting means calculates the distance between the target attribute vector and each representative attribute vector using either the Euclidean distance or the Mahalanobis distance.
[0022]
With this configuration, the closeness between the target attribute vector and the representative attribute vector can be accurately evaluated, and high-quality speech can be synthesized.
[0023]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0024]
(First Embodiment)
FIG. 1 is a block diagram showing a speech synthesis system according to a first embodiment of the present invention.
[0025]
In FIG. 1, the speech synthesis system includes a speech corpus 101 containing a plurality of prosodic segments and a plurality of speech waveform segments; a speech synthesis dictionary 102 in which the representative segments used for speech synthesis are stored; a speech synthesis dictionary construction device 110 that classifies the segments in the speech corpus 101, determines the representative segments to be used for speech synthesis, and constructs the speech synthesis dictionary 102; and a speech synthesizer 120 that synthesizes speech corresponding to input text using the representative segments in the speech synthesis dictionary 102.
[0026]
The speech synthesis dictionary construction device 110 includes: attribute vector assigning means 111 that assigns to each segment in the speech corpus 101 an attribute vector having a plurality of attributes relating to that segment; segment classification means 112 that classifies the segments in the speech corpus 101 into a plurality of clusters based on a scalar quantity of each segment; attribute vector classification means 113 that classifies the attribute vectors within each cluster generated by the segment classification means 112; cluster dividing means 114 that divides each cluster generated by the segment classification means 112 in accordance with the classification of the attribute vectors by the attribute vector classification means 113; representative segment determining means 115 that determines a representative segment for each cluster; representative attribute vector determining means 116 that determines a representative attribute vector for each cluster; and registration means 117 that associates each representative attribute vector with its representative segment and registers them in the speech synthesis dictionary 102.
[0027]
The speech synthesizer 120 includes: text input means 121 to which the text to be synthesized is input; analysis means 122 that performs linguistic analysis and semantic analysis on the text input to the text input means 121; additional information input means 131 to which paralinguistic information and other additional information about that text is input; target attribute vector generating means 123 that, based on the analysis results of the analysis means 122 and the additional information input to the additional information input means 131, generates for each segment targeted in speech synthesis a target attribute vector having a plurality of target attributes; representative attribute vector selecting means 124 that calculates the closeness between the target attribute vector and the representative attribute vectors registered in the speech synthesis dictionary 102 and selects the representative attribute vector closest to the target attribute vector; representative segment acquisition means 125 that acquires from the speech synthesis dictionary 102 the representative segment corresponding to the selected representative attribute vector; synthesizing means 126 that synthesizes speech by modifying and concatenating the representative segments acquired by the representative segment acquisition means 125; and speech output means 127 that outputs the speech synthesized by the synthesizing means 126.
[0028]
Hereinafter, the operation of the speech synthesis system according to the present embodiment will be described.
[0029]
First, the operation of the speech synthesis dictionary construction device 110 will be described with reference to the flowchart of FIG. 2.
[0030]
A population set including a plurality of prosodic segments and a plurality of speech waveform segments is stored in the speech corpus 101 in advance. Here, the prosodic segments include the fundamental frequency pattern of the accent phrase, the maximum value of the fundamental frequency of the accent phrase, the duration of each phoneme, and the power pattern of each phoneme. The speech waveform segment includes waveform data constituting a phoneme.
[0031]
An attribute vector is assigned to each of the prosodic segments and speech waveform segments in the speech corpus 101 by the attribute vector assigning means 111 (S201). The elements of the attribute vector include the number of morae of the accent phrase, the accent type, the part of speech, the position of the accent phrase, the phoneme preceding the accent phrase, the phoneme following the accent phrase, syntactic information such as dependency relations and dependency targets, and other attributes of the segment. Attributes such as the number of morae, the accent type, the part of speech, and the position of the accent phrase are given separately for the preceding accent phrase, the current accent phrase, and the following accent phrase. The attribute vector further includes, as elements, attributes indicating paralinguistic information: tone of voice, such as a gentle tone or a sorrowful tone; speaking style, such as an announcer style or a commercial style; emotion, such as joy or anger; and so on.
[0032]
The attribute vector a_k is defined by Formula 1.
[Formula 1]

where
k = 1, ..., N
Here, the identification number k identifies a segment to be classified. Each element δ_ki takes the value “1”, indicating that the segment with identification number k has the attribute in question, or “0”, indicating that it does not. The value of δ_ki is normally assigned from the results of linguistic or semantic analysis, but in some cases it is assigned by a person exercising judgment, for example through manual operation.
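As an illustration of the 0/1 encoding described above, the following sketch builds an attribute vector from the set of attributes a segment has. The attribute inventory `ATTRIBUTES`, its ordering, and the function name `attribute_vector` are assumptions made for this example; the source names the kinds of attributes but does not fix a concrete list.

```python
from typing import List, Set

# Hypothetical attribute inventory. Element i of every attribute vector
# answers "does this segment have ATTRIBUTES[i]?" with 1 or 0.
ATTRIBUTES: List[str] = [
    "mora_count=3",
    "mora_count=4",
    "accent_type=0",
    "accent_type=1",
    "pos=noun",
    "pos=verb",
    "tone=gentle",
    "emotion=joy",
]


def attribute_vector(present: Set[str],
                     inventory: List[str] = ATTRIBUTES) -> List[int]:
    """Return the binary vector (delta_k1, delta_k2, ...) for one segment."""
    unknown = present - set(inventory)
    if unknown:
        raise ValueError(f"attributes not in inventory: {sorted(unknown)}")
    return [1 if name in present else 0 for name in inventory]


a_k = attribute_vector({"mora_count=4", "accent_type=1", "pos=noun", "emotion=joy"})
print(a_k)  # [0, 1, 0, 1, 1, 0, 0, 1]
```

Categorical attributes such as the part of speech are expanded here into one element per value, which is one straightforward way to satisfy the requirement that every element be “1” or “0”.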
[0033]
Next, the segment classification means 112 classifies the prosodic segments and speech waveform segments in the speech corpus 101 physically into a plurality of clusters, based on scalar quantities representing their respective physical quantities (S202). Specifically, the segments are classified on the basis of auditory detection limits. Here, an auditory detection limit is a measure of the difference at which human hearing can tell sounds apart. For the maximum value of the fundamental frequency of an accent phrase, a value such as 0.1 octave is used; for the duration of a phoneme, a value such as 5 ms is used; and for a speech waveform segment, a spectral difference of about 3 dB is used.
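One simple way to picture classification by detection limit is to place scalar values in bins whose width equals the limit, so that values in the same bin differ by less than the limit. The source does not say how the limit is applied, so the fixed-width binning below, together with the names `classify_by_detection_limit` and `octaves`, is only an assumed sketch.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def octaves(f0_hz: float, reference_hz: float) -> float:
    """Express a fundamental frequency in octaves relative to a reference."""
    return math.log2(f0_hz / reference_hz)


def classify_by_detection_limit(values: Sequence[float],
                                limit: float) -> Dict[int, List[int]]:
    """Group segment indices into bins of width `limit` (assumed scheme)."""
    clusters: Dict[int, List[int]] = defaultdict(list)
    for index, value in enumerate(values):
        clusters[math.floor(value / limit)].append(index)
    return dict(clusters)


# Phoneme durations in milliseconds, binned with the 5 ms limit from the text.
durations_ms = [42.0, 44.0, 47.5, 61.0, 63.9]
print(classify_by_detection_limit(durations_ms, 5.0))
# {8: [0, 1], 9: [2], 12: [3, 4]}
```

For the maximum fundamental frequency, the values would first be converted with `octaves` and then binned with a limit of 0.1.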
[0034]
Next, the attribute vectors are classified for each cluster by the attribute vector classifying means 113 (S203).
[0035]
The attribute vectors are classified using, for example, the K-means algorithm, the LBG algorithm, or the following LBG-like algorithm (S2031 to S2034).
[0036]
First, two provisional centroids are chosen near the edge of the cluster (S2031). Next, the set is divided into two by determining which of the two centroids each member is closer to (S2032). Next, a centroid is recomputed for each of the divided sets, and the distortion is calculated (S2033). If the calculated distortion is equal to or less than a threshold, or the amount of change in the distortion is equal to or less than a threshold, the division of the set is stopped; otherwise, the process returns to the step of dividing the set into two (S2032) and the division is repeated (S2034).
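Steps S2031 to S2034 can be sketched as a two-centroid refinement loop. Several details are assumptions not fixed by the source: the provisional centroids are taken to be the member farthest from the overall mean and the member farthest from that one, the distortion is the total squared distance to the assigned centroid, and `max_iter` is a safety bound added for the example.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: Sequence[Vector]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def binary_split(vectors: Sequence[Vector],
                 distortion_threshold: float,
                 change_threshold: float,
                 max_iter: int = 100) -> Tuple[List[Vector], List[Vector]]:
    # S2031: pick two provisional centroids near the edge of the cluster.
    center = mean(vectors)
    first = max(vectors, key=lambda v: sq_dist(v, center))
    second = max(vectors, key=lambda v: sq_dist(v, first))
    centroids = [list(first), list(second)]
    previous: Optional[float] = None
    groups: List[List[Vector]] = [[], []]
    for _ in range(max_iter):
        # S2032: assign every member to the nearer centroid.
        groups = [[], []]
        for v in vectors:
            nearer = 0 if sq_dist(v, centroids[0]) <= sq_dist(v, centroids[1]) else 1
            groups[nearer].append(v)
        # S2033: recompute the centroids and the distortion.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
        distortion = sum(sq_dist(v, centroids[i])
                         for i, g in enumerate(groups) for v in g)
        # S2034: stop on small distortion or small change, otherwise repeat.
        if distortion <= distortion_threshold:
            break
        if previous is not None and abs(previous - distortion) <= change_threshold:
            break
        previous = distortion
    return groups[0], groups[1]


vectors = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
group_a, group_b = binary_split(vectors, distortion_threshold=0.0, change_threshold=1e-9)
```

With these four vectors, the two members sharing the first two attributes end up in one group and the remaining two in the other.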
[0037]
Next, each cluster is divided by the cluster dividing means 114 in accordance with the classification of the attribute vector (S204). Here, segments whose attribute vectors are classified into different sets are classified into different clusters.
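The cluster division in S204 amounts to refining each physical cluster by the attribute-vector class of its members. A minimal sketch, assuming each segment already carries a physical cluster label and an attribute class label (the function name `split_clusters` and the label representation are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def split_clusters(physical_cluster: Sequence[Hashable],
                   attribute_class: Sequence[Hashable]
                   ) -> Dict[Tuple[Hashable, Hashable], List[int]]:
    """Map (physical cluster, attribute class) to the segment indices it contains."""
    if len(physical_cluster) != len(attribute_class):
        raise ValueError("one label of each kind is required per segment")
    refined: Dict[Tuple[Hashable, Hashable], List[int]] = defaultdict(list)
    for k, key in enumerate(zip(physical_cluster, attribute_class)):
        refined[key].append(k)
    return dict(refined)


# Five segments: the first three share a physical cluster, but segment 1
# fell into a different attribute class, so it is split off.
print(split_clusters([0, 0, 0, 1, 1], [0, 1, 0, 0, 0]))
# {(0, 0): [0, 2], (0, 1): [1], (1, 0): [3, 4]}
```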
[0038]
Next, the representative segment determining means 115 determines a representative segment for each cluster (S205). For example, a centroid is used as a representative segment for each cluster.
[0039]
Next, the representative attribute vector is determined for each cluster by the representative attribute vector determining means 116 (S206).
[0040]
The representative attribute vector is also called a “feature vector” or an “explanation vector”, because it indicates which characteristics each cluster exhibits strongly. The representative attribute vector f_l of cluster C_l is normalized as shown in Formula 2.
[Formula 2]

where
[Formula 3]

l = 1,2, ..., L
Here, the identification number l identifies a cluster. The element s_i is the sum of the i-th elements δ_ki of the attribute vectors a_k assigned to the segments in cluster C_l. The normalization coefficient r_i is a coefficient for normalizing the element s_i of the representative attribute vector f_l; for example, it is the total number of segments in cluster C_l.
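The equation images for Formulas 1 to 3 are not reproduced in this text. Based only on the definitions given in the surrounding paragraphs, they plausibly take the following form; this is a reconstruction, and the symbol M for the number of attributes is an assumption introduced here.

```latex
\begin{align}
a_k &= (\delta_{k1}, \delta_{k2}, \ldots, \delta_{kM}),
      \qquad \delta_{ki} \in \{0, 1\}, \quad k = 1, \ldots, N \\
s_i &= \sum_{k \,:\, \text{segment } k \in C_l} \delta_{ki} \\
f_l &= \left( \frac{s_1}{r_1}, \frac{s_2}{r_2}, \ldots, \frac{s_M}{r_M} \right),
      \qquad l = 1, 2, \ldots, L
\end{align}
```

When r_i is the number of segments in C_l, the i-th element of f_l is simply the fraction of segments in the cluster that have attribute i.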
[0041]
Next, the registration unit 117 associates the representative attribute vector with the representative segment and registers them in the speech synthesis dictionary 102 (S207).
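Steps S205 to S207 can be pictured together as follows. The sketch assumes that each segment is represented by a numeric feature list plus its binary attribute vector, takes the centroid as the representative segment (the example given in the text), normalizes the summed attribute vector by the cluster size, and stores the pairs in a plain list. The function name `build_dictionary` and the data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[Sequence[float], Sequence[int]]  # (features, attribute vector)
Entry = Tuple[List[float], List[float]]          # (representative attribute vector, representative segment)


def build_dictionary(clusters: Sequence[Sequence[Segment]]) -> List[Entry]:
    """Compute one (representative attribute vector, representative segment) pair per cluster."""
    dictionary: List[Entry] = []
    for members in clusters:
        n = len(members)
        features = [f for f, _ in members]
        attributes = [a for _, a in members]
        # S205: representative segment, here the centroid of the members.
        rep_segment = [sum(column) / n for column in zip(*features)]
        # S206: element-wise sum of attribute vectors, normalized by cluster size.
        rep_attributes = [sum(column) / n for column in zip(*attributes)]
        # S207: register the pair together.
        dictionary.append((rep_attributes, rep_segment))
    return dictionary


clusters = [
    [([100.0], [1, 0]), ([110.0], [1, 1])],
    [([200.0], [0, 1])],
]
print(build_dictionary(clusters))
# [([1.0, 0.5], [105.0]), ([0.0, 1.0], [200.0])]
```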
[0042]
As described above, the segments in the speech corpus 101 are classified in both the physical space and the attribute vector space, and representative segments are generated that differ from one another in both spaces. Specifically, the representative segments differ from one another in both scalar quantity and representative attribute vector. Moreover, if the variance of an initial cluster produced by the first-stage classification in the physical space (S202) is large, a plurality of mutually distant representative attribute vectors describing that initial cluster are obtained.
[0043]
Next, the operation of the speech synthesizer 120 will be described with reference to the flowchart of FIG. 3.
[0044]
First, a text for speech synthesis is input by the text input unit 121 (S301). Further, the paralinguistic information and other additional information are input by the additional information input means 131 as needed.
[0045]
Next, language analysis and semantic analysis are performed on the input text by the analysis unit 122 (S302).
[0046]
Next, the target attribute vector generation unit 123 generates a target attribute vector for each segment that is a target of speech synthesis (S303). Specifically, the target attribute vector is created based on the analysis results of the language analysis and the semantic analysis and the paralinguistic information for specifying the tone, the speech style, and the emotion.
[0047]
Next, the representative attribute vector selecting means 124 calculates the closeness between the generated target attribute vector and each of the representative attribute vectors stored in the speech synthesis dictionary 102, and selects the representative attribute vector closest to the target attribute vector (S304). Specifically, the inner product p_l of the target attribute vector g_j and the representative attribute vector f_l of each cluster is calculated as shown in Formula 4, each inner product p_l is normalized by the sum of the inner products as shown in Formula 5, and the closeness between g_j and f_l is evaluated using the normalized inner product w_l.
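The equation images for Formulas 4 and 5 are not reproduced in this text. Read from the description in this paragraph, the inner product and its normalization would plausibly be written as below; the component notation g_{ji} and f_{li} and the summation index l' are assumptions introduced for the reconstruction.

```latex
\begin{align}
p_l &= g_j \cdot f_l = \sum_{i} g_{ji}\, f_{li} \\
w_l &= \frac{p_l}{\sum_{l'=1}^{L} p_{l'}}
\end{align}
```

The representative attribute vector with the largest w_l is then the one regarded as closest to the target.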
(Equation 4)
$$p_l = g_j \cdot f_l$$
and
(Equation 5)
$$w_l = \frac{p_l}{\sum_{l'} p_{l'}}$$
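The selection in step S304 can be illustrated with a minimal sketch, assuming the attribute vectors are plain lists of non-negative numbers. The function names below are hypothetical and do not appear in the specification; the code simply computes the inner product of the target vector with each representative vector, normalizes by the sum of the inner products, and picks the largest normalized value.

```python
def inner_product(g, f):
    """Inner product p_l of a target vector g and a representative vector f."""
    return sum(a * b for a, b in zip(g, f))


def normalized_inner_products(target, representatives):
    """Normalize each inner product by the sum of all inner products (w_l)."""
    p = [inner_product(target, f) for f in representatives]
    total = sum(p)
    if total <= 0:
        # Assumption of this sketch: attributes are non-negative and at
        # least one representative overlaps with the target.
        raise ValueError("sum of inner products must be positive")
    return [p_l / total for p_l in p]


def select_representative(target, representatives):
    """Return the index of the representative with the largest w_l."""
    w = normalized_inner_products(target, representatives)
    return max(range(len(w)), key=lambda l: w[l])
```

The returned index would then be used to look up the corresponding representative segment in the dictionary, as in step S305.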
[0048]
Instead of the inner product, the Euclidean distance or the Mahalanobis distance may be calculated. The inner product generally evaluates closeness using only the elements that are nonzero in both of the attribute vectors being compared. In speech, however, an element equal to "0" may also carry important information. Using the Euclidean distance or the Mahalanobis distance allows elements that are "0" to be reflected in the distance.
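The difference between the two measures can be seen in a small sketch, assuming list-valued attribute vectors and hypothetical function names. Only the Euclidean distance is shown; a Mahalanobis distance would additionally need a covariance estimate, which is not sketched here.

```python
import math


def euclidean_distance(g, f):
    """Euclidean distance between a target vector g and a representative f."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g, f)))


def select_by_distance(target, representatives):
    """Return the index of the representative with the smallest distance."""
    d = [euclidean_distance(target, f) for f in representatives]
    return min(range(len(d)), key=lambda l: d[l])
```

For a target of `[1.0, 0.0]` and representatives `[1.0, 1.0]` and `[1.0, 0.0]`, both inner products equal 1, so the inner product cannot separate them. The Euclidean distance takes the zero in the second element into account and prefers the second representative.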
[0049]
Next, the representative segment acquisition unit 125 retrieves, from the speech synthesis dictionary 102, the representative segment corresponding to the representative attribute vector closest to the target attribute vector (S305).
[0050]
Next, the synthesizing unit 126 synthesizes a voice using the representative segment (S306), and the synthesized voice is output by the voice output unit 127 (S307).
[0051]
Note that the processing shown in FIGS. 2 and 3 may be executed by a computer using a program.
[0052]
As described above, the speech synthesis system according to the present embodiment includes: the speech corpus 101, which stores a mother set including a plurality of prosodic segments and a plurality of speech waveform segments; an attribute vector assigning unit 111 that assigns, to each segment in the speech corpus 101, an attribute vector having a plurality of attributes relating to that segment; a unit classification unit 112 that classifies the segments in the speech corpus 101 into a plurality of clusters based on the scalar amount of each segment; an attribute vector classification unit 113 that classifies the attribute vectors for each cluster; a cluster dividing means 114 that divides the clusters in accordance with the classification of the attribute vectors by the attribute vector classification unit 113; a representative segment determining means 115 that determines a representative segment for each cluster; a representative attribute vector determining means 116 that determines a representative attribute vector for each cluster; a registration means 117 that registers the representative segment determined by the representative segment determining means 115 and the representative attribute vector determined by the representative attribute vector determining means 116 in the speech synthesis dictionary 102 in association with each other; a target attribute vector generating unit 123 that, when speech is synthesized using the representative segments and the representative attribute vectors in the speech synthesis dictionary 102, generates a target attribute vector having a plurality of target attributes for each segment to be used; a representative attribute vector selecting unit 124 that selects, from the representative attribute vectors in the speech synthesis dictionary 102, the one closest to the target attribute vector; and a synthesizing unit 126 that synthesizes speech using the representative segment corresponding to the representative attribute vector selected by the representative attribute vector selecting unit 124. This configuration eliminates redundant representative segments and reduces the variance of the cluster represented by each representative segment, thereby reducing variation among the representative segments. It is therefore possible to select the optimal representative segment for each target vector on the basis of a large number of control factors, and to synthesize high-quality speech.
[0053]
(Second embodiment)
FIG. 4 is a block diagram showing a speech synthesis system according to the second embodiment of the present invention. In FIG. 4, components identical to those of the speech synthesis system according to the first embodiment shown in FIG. 1 are denoted by the same reference numerals.
[0054]
The speech synthesis dictionary construction apparatus 410 according to the present embodiment includes a cluster distortion determination unit 441 that determines the magnitude of the cluster distortion of each cluster in the speech corpus 101. The unit classification unit 112, the attribute vector classification unit 113, and the cluster dividing means 114 repeat the classification of the segments and the classification of the attribute vectors, and when the cluster distortion determination unit 441 determines that the cluster distortion has become equal to or less than a predetermined threshold, the classification by these units is stopped. The classification of the segments and of the attribute vectors is also stopped when the cluster distortion determination unit 441 determines that the amount of change in the cluster distortion has become equal to or less than a predetermined threshold.
[0055]
Next, the operation of the speech synthesis dictionary construction apparatus 410 will be described using the flowchart of FIG. 5.
[0056]
A population set including a plurality of prosodic segments and a plurality of speech waveform segments is stored in the speech corpus 101 in advance.
[0057]
An attribute vector is assigned to each of the prosodic segments and the speech waveform segments in the speech corpus 101 by the attribute vector assigning unit 111 (S501).
[0058]
Next, the prosodic segments and the speech waveform segments in the speech corpus 101 are classified into a plurality of clusters by the unit classification unit 112 based on their respective scalar amounts (S502). For example, when the LBG algorithm is used, the initial division yields two clusters.
[0059]
Next, the attribute vector is classified by the attribute vector classifying unit 113 for each cluster (S503).
[0060]
Next, each cluster is divided by the cluster dividing means 114 in accordance with the classification of the attribute vector (S504).
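Steps S502 to S504 can be sketched as a two-stage split: first cluster the segments by scalar amount, then subdivide each resulting cluster according to its attribute vectors. The sketch below is illustrative only. It assumes each segment is a `(scalar, attribute_tuple)` pair and uses a tiny deterministic k-means for both stages; the function names, the default of two clusters per stage, and the initialization scheme are all assumptions, not details taken from the specification.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Tiny k-means returning one cluster label per point."""
    ordered = sorted(points)
    # Deterministic init: spread the centroids evenly over the sorted points.
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def two_stage_split(segments, k_scalar=2, k_attr=2):
    """Cluster by scalar amount, then divide each cluster by attribute vector.

    Returns a dict mapping (scalar_label, attribute_label) to segment indices.
    """
    scalar_labels = kmeans([(s,) for s, _ in segments], k_scalar)
    clusters = {}
    for c in sorted(set(scalar_labels)):
        idx = [i for i, lab in enumerate(scalar_labels) if lab == c]
        attr_points = [tuple(segments[i][1]) for i in idx]
        attr_labels = kmeans(attr_points, min(k_attr, len(idx)))
        for i, a in zip(idx, attr_labels):
            clusters.setdefault((c, a), []).append(i)
    return clusters
```

Segments that are close in scalar amount but carry different attribute vectors end up in different final clusters, which is the effect the cluster division in S504 is meant to achieve.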
[0061]
Next, the magnitude of the cluster distortion in each cluster is determined by the cluster distortion determining unit 441 (S505). Specifically, the distance between the representative segment and each segment in the cluster is calculated, and the sum of these distances (the squared error) is taken as the cluster distortion. It is then determined whether the cluster distortion is equal to or less than a predetermined threshold, such as one based on the auditory detection limit. While the cluster distortion exceeds this threshold, the classification of the segments and the classification of the attribute vectors (S502 to S504) are repeated; once the cluster distortion is equal to or less than the threshold, the classification is stopped. In addition to comparing the cluster distortion itself with a threshold, the amount of change in the cluster distortion is compared with a separate predetermined threshold, and the classification is stopped when the amount of change is equal to or less than that threshold.
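The determination in S505 can be sketched as follows, assuming segments are represented as numeric tuples and that the cluster distortion is the sum of squared distances to the representative segment. The function names and threshold parameters are hypothetical; the specification does not give concrete threshold values.

```python
def cluster_distortion(representative, members):
    """Sum of squared distances between the representative and each member."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(representative, m))
        for m in members
    )


def should_stop(distortion, previous_distortion,
                distortion_threshold, change_threshold):
    """Decide whether to stop repeating the classification (S502 to S504).

    Stops when the distortion itself is small enough, or when it has
    changed by no more than change_threshold since the previous pass.
    """
    if distortion <= distortion_threshold:
        return True
    if previous_distortion is not None:
        if abs(previous_distortion - distortion) <= change_threshold:
            return True
    return False
```

Passing `None` as `previous_distortion` on the first pass skips the change-based test, since no earlier value exists to compare against.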
[0062]
Next, the representative segment determining means 115 determines a representative segment for each cluster (S506).
[0063]
Next, the representative attribute vector determining means 116 determines a representative attribute vector for each cluster (S507).
[0064]
Next, the registration unit 117 registers the representative attribute vector and the representative segment in the speech synthesis dictionary 102 in association with each other (S508).
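Steps S506 to S508 can be sketched as building a list of dictionary entries, one per cluster. The way a representative is chosen here is an assumption made for illustration: the representative segment is taken to be the member whose scalar amount is closest to the cluster mean, and the representative attribute vector is taken to be the element-wise mean of the cluster's attribute vectors. The entry layout and the function name are likewise hypothetical.

```python
def build_dictionary(segments, clusters):
    """Register one (representative attribute vector, segment) pair per cluster.

    segments: list of (scalar, attribute_tuple) pairs.
    clusters: dict mapping a cluster key to a list of segment indices.
    """
    dictionary = []
    for members in clusters.values():
        scalars = [segments[i][0] for i in members]
        mean_scalar = sum(scalars) / len(scalars)
        # Assumed rule: the member nearest the mean scalar represents the cluster.
        rep_index = min(members, key=lambda i: abs(segments[i][0] - mean_scalar))
        attrs = [segments[i][1] for i in members]
        # Assumed rule: the mean attribute vector represents the cluster.
        rep_attr = tuple(sum(col) / len(attrs) for col in zip(*attrs))
        dictionary.append({"attribute_vector": rep_attr, "segment": rep_index})
    return dictionary
```

Storing the two values in one entry keeps the association required by S508, so that selecting a representative attribute vector at synthesis time immediately yields its representative segment.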
[0065]
The process illustrated in FIG. 5 may be executed by a computer using a program.
[0066]
As described above, the speech synthesis system according to the present embodiment includes the cluster distortion determining unit 441 that determines the magnitude of the cluster distortion, and repeats the classification of the segments and the classification of the attribute vectors until the cluster distortion determining unit 441 finds that the cluster distortion has become equal to or less than a predetermined threshold, at which point the classification is stopped. Finer clusters are thereby obtained, the variance of the cluster represented by each representative segment is reduced, and high-quality speech can be synthesized. In addition, because the cluster distortion determination unit 441 stops the classification when the amount of change in the cluster distortion becomes equal to or less than a predetermined threshold, redundant clustering is avoided, the optimal representative segments are registered in the speech synthesis dictionary 102, and high-quality speech synthesis can be performed.
[0067]
In the first and second embodiments described above, the speech corpus, in which the mother set of segments is registered, and the speech synthesis dictionary, in which the representative segments are registered, are configured separately. However, the speech corpus and the speech synthesis dictionary may instead be configured integrally.
[0068]
Further, the speech synthesis dictionary may be incorporated in the speech synthesis device.
[0069]
The unit classification unit 112 may display the speech waveform segments or the prosodic segments on a screen, for example as visual patterns, and may receive an instruction input for classifying the segments.
[0070]
[Effects of the Invention]
As described above, according to the present invention, it is possible to provide a speech synthesis system, a speech synthesis dictionary construction apparatus, a speech synthesis method, a speech synthesis dictionary construction program, and a speech synthesis program that have the excellent effect of eliminating redundant representative segments and reducing variation among the representative segments so as to synthesize high-quality speech.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a speech synthesis system according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing a speech synthesis dictionary construction process in the speech synthesis system according to the first embodiment of the present invention.
FIG. 3 is a flowchart showing a speech synthesis process in the speech synthesis system according to the first and second embodiments of the present invention.
FIG. 4 is a block diagram showing a speech synthesis system according to a second embodiment of the present invention.
FIG. 5 is a flowchart showing a speech synthesis dictionary construction process in the speech synthesis system according to the second embodiment of the present invention.
101 Speech corpus
102 Speech synthesis dictionary
110, 410 Speech synthesis dictionary construction device
111 Attribute vector assigning means
112 Segment classification means
113 Attribute vector classification means
114 Cluster division means
115 Representative segment determination means
116 Representative attribute vector determination means
117 Registration means
441 Cluster distortion determination means
120 Voice synthesis device
121 Text input means
122 Analysis means
123 Target attribute vector generation means
124 Representative attribute vector selection means
125 Representative segment acquisition means
126 Synthesis means
127 Voice output means
131 Additional information input means

Claims

A speech synthesis system comprising: attribute vector assigning means for assigning, to each segment in a mother set including a plurality of prosodic segments and a plurality of speech waveform segments, an attribute vector having a plurality of attributes relating to that segment; segment classification means for classifying the segments into a plurality of clusters based on a scalar amount of each segment; attribute vector classification means for classifying the attribute vectors for each cluster; cluster division means for dividing the clusters in accordance with the classification of the attribute vectors; representative segment determination means for determining a representative segment for each cluster; representative attribute vector determination means for determining a representative attribute vector for each cluster; target attribute vector generation means for generating a target attribute vector having a plurality of target attributes when speech is synthesized using the representative segments and the representative attribute vectors; representative attribute vector selection means for selecting, from among the representative attribute vectors, the one closest to the target attribute vector; and synthesizing means for synthesizing speech using the representative segment corresponding to the selected representative attribute vector.

The speech synthesis system according to claim 1, further comprising cluster distortion determination means for determining the magnitude of cluster distortion in the clusters, wherein the classification of the segments and the classification of the attribute vectors are repeated, and the cluster distortion determination means stops the classification when the cluster distortion becomes equal to or less than a predetermined threshold.

The speech synthesis system according to claim 2, wherein the cluster distortion determination means stops the classification when the amount of change in the cluster distortion becomes equal to or less than a predetermined threshold.

The speech synthesis system according to any one of claims 1 to 3, wherein the segment classification means classifies, as the segments, at least one of the speech waveform segments, a fundamental frequency pattern for each accent phrase, a maximum value of the fundamental frequency for each accent phrase, a duration for each phoneme, and a power pattern for each phoneme.

The speech synthesis system according to any one of claims 1 to 4, wherein the segment classification means classifies the segments based on an auditory detection limit.

The speech synthesis system according to any one of claims 1 to 5, wherein the segment classification means displays information on the segments and receives an instruction input for classifying the segments.

The speech synthesis system according to any one of claims 1 to 6, wherein the attribute vector classification means classifies the attribute vectors using either the LBG algorithm or the K-means algorithm.

The speech synthesis system according to any one of claims 1 to 7, wherein the representative attribute vector selection means calculates the distance between the target attribute vector and each representative attribute vector using either the Euclidean distance or the Mahalanobis distance.

A speech synthesis dictionary construction apparatus comprising: attribute vector assigning means for assigning, to each segment in a mother set including a plurality of prosodic segments and a plurality of speech waveform segments, an attribute vector having a plurality of attributes relating to that segment; segment classification means for classifying the segments into a plurality of clusters based on a scalar amount of each segment; attribute vector classification means for classifying the attribute vectors for each cluster; cluster division means for dividing the clusters in accordance with the classification of the attribute vectors; representative segment determination means for determining a representative segment for each cluster; and representative attribute vector determination means for determining a representative attribute vector for each cluster, wherein the apparatus constructs a speech synthesis dictionary that includes the representative segments and the representative attribute vectors used for speech synthesis.

A speech synthesis apparatus that synthesizes speech using the representative segments and the representative attribute vectors included in the speech synthesis dictionary constructed by the speech synthesis dictionary construction apparatus according to claim 9, the speech synthesis apparatus comprising: target attribute vector generation means for generating a target attribute vector having a plurality of target attributes; representative attribute vector selection means for selecting, from among the representative attribute vectors, the one closest to the target attribute vector; and synthesizing means for synthesizing speech using the representative segment corresponding to the selected representative attribute vector.

A speech synthesis method comprising: classifying the segments in a mother set including a plurality of prosodic segments and a plurality of speech waveform segments into a plurality of clusters based on a scalar amount of each segment and an attribute vector associated with each segment; determining and storing a representative segment and a representative attribute vector for each cluster; generating, when speech is synthesized, a target attribute vector having a plurality of target attributes; selecting, from among the representative attribute vectors, the one closest to the target attribute vector; and synthesizing speech using the representative segment corresponding to the selected representative attribute vector.

A speech synthesis method comprising the steps of: assigning, to each segment in a mother set including a plurality of prosodic segments and a plurality of speech waveform segments, an attribute vector having a plurality of attributes relating to that segment; classifying the segments into a plurality of clusters based on a scalar amount of each segment; classifying the attribute vectors for each cluster; dividing the clusters in accordance with the classification of the attribute vectors; determining a representative segment for each cluster; determining a representative attribute vector for each cluster; generating a target attribute vector having a plurality of target attributes when speech is synthesized using the representative segments and the representative attribute vectors; selecting, from among the representative attribute vectors, the one closest to the target attribute vector; and synthesizing speech using the representative segment corresponding to the selected representative attribute vector.

A speech synthesis dictionary construction program that causes a computer to execute the steps of: assigning, to each segment in a mother set including a plurality of prosodic segments and a plurality of speech waveform segments, an attribute vector having a plurality of attributes relating to that segment; classifying the segments into a plurality of clusters based on a scalar amount of each segment; classifying the attribute vectors for each cluster; dividing the clusters in accordance with the classification of the attribute vectors; determining a representative segment for each cluster; and determining a representative attribute vector for each cluster, thereby constructing a speech synthesis dictionary that includes the representative segments and the representative attribute vectors used for speech synthesis.

A speech synthesis program that causes a computer to execute, when speech is synthesized using the representative segments and the representative attribute vectors included in the speech synthesis dictionary constructed by the speech synthesis dictionary construction program according to claim 13, the steps of: generating a target attribute vector having a plurality of target attributes; selecting, from among the representative attribute vectors, the one closest to the target attribute vector; and synthesizing speech using the representative segment corresponding to the selected representative attribute vector.