JP2004279436A

JP2004279436A - Speech synthesizer and computer program

Info

Publication number: JP2004279436A
Application number: JP2003066521A
Authority: JP
Inventors: Campbell Nick; ニックキャンベル
Original assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Current assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Priority date: 2003-03-12
Filing date: 2003-03-12
Publication date: 2004-10-07
Anticipated expiration: 2023-03-12
Also published as: JP3706112B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer capable of synthesizing natural-sounding speech by using a spontaneous speech DB.

SOLUTION: The speech synthesizer includes: a balanced-sentence speech DB 34 labeled with linguistic information; a spontaneous speech DB 42; a waveform generation part 36 and a speech signal synthesis part 38, which synthesize a speech signal from the balanced-sentence speech DB 34 for an input XML 30 annotated with non-verbal information; a candidate selection part 44, which selects from the spontaneous speech DB 42 a plurality of similar speech data candidates for each part of an acoustic target 40 output by the speech signal synthesis part 38; a filter part 48, which calculates prosodic features of the candidates in the candidate array 46 and selects those that best match the non-verbal information given in the input XML 30; and a waveform generation part 52, which produces a synthesized speech signal 54 by concatenating the selected spontaneous speech data.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は音声合成技術に関し、特に、自然発話の音声データベースから、自然に聞こえる音声を合成するための技術に関する。
【０００２】
【従来の技術】
音声合成とは、もともと自然なものという事はできない。しかし、自然に聞こえる音声を合成する技術に関する需要は存在する。たとえば何らかの原因で発話を行なう事ができない人のためのコミュニケーションの補助、音声から音声への自動翻訳、電話を介した音声による情報提供、又は顧客からの電話による問合せに対する対応などにおいてそうした音声合成技術が必要とされる。
【０００３】
自然に聞こえる音声を合成しようとする場合、話の内容に従って異なるトーンの音声を使い分ける必要がある。そのためには、音声合成に使用される音声を要素に細分し、それぞれにその要素がどの様な場合に用いられる音声であるかを表すラベルを付ける必要がある。
【０００４】
現在のところ、そうした自然に聞こえる音声合成を行なうために使用可能と思われる、大規模な自然発話音声のコーパスがいくつか存在する。しかし、コーパスに含まれる音声を分割して各々にラベル付けを行なう作業は膨大なものとなる。また、自然発話の音響的特徴をモデル化する事に関連してまだ解決されていない多くの問題が存在する。
【０００５】
一方、音素バランス文を読上げた音声からなる音声データベース（以下これを「バランス文音声ＤＢ」と呼ぶ。）では、そうしたラベル付けは比較的容易である。バランス文音声ＤＢは、全ての音素及び全ての韻律をデータベース化している。
【０００６】
従来、バランス文音声ＤＢを用いた音声合成技術として、たとえば非特許文献１また非特許文献２で紹介されたＣＨＡＴＲと呼ばれる、音素を選択して連結するものが存在する。
【０００７】
音素を連結する事による音声合成の標準的な方法は、非特許文献１又は非特許文献２に記載された様に２段階を経る。第１の段階では、合成すべきテキスト（ターゲット）に従った音素上の及び韻律上の制約を反映した目的コスト関数を用いて、音声の各区間ごとに適切な候補をいくつかのコーパスから選択する。第２の段階では、合成後の音声をできるだけ滑らかにする様に、連結のためのコストを最小化する様、各区間の候補の中から一つずつを選択し、それらを連結して音声合成を行なう。
【０００８】
このプロセスのターゲットは、通常は、所望の出力音声を音素的に及び韻律的に表した、予め知られた記号表現（アルファ−ニューメリック）である。
【０００９】
【非特許文献１】
キャンベル、Ｗ．Ｎ．、ブラック、Ａ．Ｗ．、「ＣＨＡＴＲ多言語音声再配列合成システム、ＩＥＩＣＥ技報ＳＰ９６−７，４５−５２，１９９６（Ｃａｍｐｂｅｌｌ，Ｗ．Ｎ．”ＣＨＡＴＲａｍｕｌｔｉｌｉｎｇｕａｌｓｐｅｅｃｈｒｅ−ｓｅｑｕｅｎｃｉｎｇｓｙｎｔｈｅｓｉｓｓｙｓｔｅｍ”、ＴｅｃｈｎｉｃａｌＲｅｐｏｒｔｏｆＩＥＩＣＥＳＰ９６−７，４５−５２，１９９６）
【非特許文献２】
キャンベル、Ｗ．Ｎ．、「ＣＨＡＴＲ合成のための音声コーパスの処理」、音声処理に関する国際会議予稿集１８３−１８６，１９９７（Ｃａｍｐｂｅｌｌ，Ｗ．Ｎ．，”ＰｒｏｃｅｓｓｉｎｇａＳｐｅｅｃｈＣｏｒｐｕｓｆｏｒＣＨＡＴＲＳｙｎｔｈｅｓｉｓ”，ＰｒｏｃｅｅｄｉｎｇｓｏｆｔｈｅＩｎｔｅｒｎａｔｉｏｎａｌＣｏｎｆｅｒｅｎｃｅｏｎＳｐｅｅｃｈＰｒｏｃｅｓｓｉｎｇ１８３−１８６，１９９７）
【非特許文献３】
Ｐ．アルク及びＥ．ヴィルクマン、「逆フィルタリングにより推定した、声門容積速度波形のキャラクタリゼーションのための振幅ドメイン指数」、ＳｐｅｅｃｈＣｏｍｍ．，第１８巻、第２号、ｐｐ．１３１−１３８，１９９６（Ｐ．ＡｌｋｕａｎｄＥ．Ｖｉｌｋｍａｎ，”Ａｍｐｌｉｔｕｄｅｄｏｍａｉｎｑｕｏｔｉｅｎｔｆｏｒｃｈａｒａｃｔｅｒｉｚａｔｉｏｎｏｆｔｈｅｇｌｏｔｔａｌｖｅｌｏｃｉｔｙｗａｖｅｆｏｒｍｅｓｔｉｍａｔｅｄｂｙｉｎｖｅｒｓｅｆｉｌｔｅｒｉｎｇ”，ＳｐｅｅｃｈＣｏｍｍ．，ｖｏｌ．１８、ｎｏ．２、ｐｐ．１３１−１３８、１９９６）
【非特許文献４】
Ｐ．アルク、Ｔ．ベックストローム、及びＥ．ヴィルクマン、「声門気流のパラメータ化のための正規化振幅指数」、Ｊ．Ａｃｏｕｓｔ．Ｓｏｃ．Ａｍ．，ｖｏｌ．１１２，ｎｏ．２，ｐｐ．７０１−７１０，２００２（Ｐ．Ａｌｋｕ、Ｔ．Ｂａｅｃｋｓｔｒｏｅｍ、ａｎｄＥ．Ｖｉｌｋｍａｎ、”Ｎｏｒｍａｌｉｚｅｄａｍｐｌｉｔｕｄｅｑｕｏｔｉｅｎｔｆｏｒｐａｒａｍｅｔｒｉｚａｔｉｏｎｏｆｔｈｅｇｌｏｔｔａｌｆｌｏｗ”、Ｊ．Ａｃｏｕｓｔ．Ｓｏｃ．Ａｍ．，ｖｏｌ．１１２，ｎｏ．２，ｐｐ．７０１−７１０，２００２）
【発明が解決しようとする課題】
今日まで「コーパスベースの」音声合成として行なわれてきた研究の大部分は、実際は「データベースの」音声合成についてのものであったといわれている。その相違は、発話スタイルをどの程度カバーしているか、どの様な種類の発話スタイルに関するものであるか、という点である。
【００１０】
「コーパス」とは、ある言語を多少とも代表するテキスト又は音声の集合であり、ある言語に関する言語学的説明のための出発点又はある言語についての仮説を検証するための手段として使用できるものの事をいう。この場合、実際に使用されている言語の真正の実例についてのシステマチックな研究のためには、その集合が、ある言語の状態又は変化を特徴付ける様に選ばれた、自然発生的な言語（すなわちテキスト又は音声）の集まりである事が重要である。
【００１１】
ある特定の言語学的特徴を示す目的のために書いたテキストは、通常は言語学的研究のための真のコーパスに含ませるべきではないと考えられる。なぜならそれらは、「真正の」ものという基準を満たさず、従って「自然発生的なもの」でもないためである。
【００１２】
しかし、今までのところ、音声合成の研究に使用されてきたデータベースの大多数は、特定の目的のために設計されたものであり、通常は職業的なアナウンサが注意をはらって読んだものをスタジオ録音したものからなっている。それらは「使用されている音声」を代表するものではなく、常日頃経験する、言葉を用いた生活で出会う様な自然な発話スタイル及び発話状況に応じた変化形を含んでいるものでもない。
【００１３】
バランス音声ＤＢは、詳細にラベル付けを行なう事が可能である。しかしバランス音声ＤＢに含まれる音声は、話し言葉のフォーマルな言語学的特徴の多くの例を含んではいるが、話し言葉による社会的、相互作用的な機能という局面での特徴についてはほとんど含んでいない。バランス音声ＤＢを用いて音声合成を行なった場合、その結果得られる合成音声は硬い発音となり、自然な音声として聞こえるものではない。
【００１４】
もしも音声合成をより自然な形で行なう方向で発展させるのであれば、話し言葉による相互作用の全ての局面を表す事ができるコーパスであって、かつ話者の状態、態度、及び意図など、話し言葉をその意図に沿って解釈するための手掛かりを提供する非言語的情報をも含んだコーパスに基づいた研究を行なう事が必要である。
【００１５】
これを解決するために、自然発話ＤＢを用いる事が考えられる。しかし自然発話ＤＢを音声合成に用いようとすると、前述した通りラベル付けの作業が膨大となり、さらにラベル付けのための音響的な特徴をモデル化する事も困難であるという問題がある。そのため、従来は、自然発話音声ＤＢを用いて自然に聞こえる音声合成を行なう事が困難であるという問題点があった。
【００１６】
本発明はこの様な問題を解決するためになされたものであって、自然発話音声ＤＢを用いて自然に聞こえる音声合成を行なう事ができる音声合成装置を提供する事を目的とする。
【００１７】
この発明の他の目的は、自然発話音声ＤＢのラベル付けを行なう事なく、自然発話音声ＤＢを用いて自然に聞こえる音声合成を行なう事ができる音声合成装置を提供する事である。
【００１８】
この発明のさらに他の目的は、最初のターゲットから何らかの手段で音響的ターゲットを生成し、この音響ターゲットに類似した音声を自然発話音声ＤＢから抽出する事により、自然に聞こえる音声合成を行なう事ができる音声合成装置を提供する事である。
【００１９】
この発明の別の目的は、ターゲットの非言語的、パラ言語的特徴に沿った発話スタイルで、自然に聞こえる音声合成を行なう事ができる音声合成装置を提供する事である。
【００２０】
【課題を解決するための手段】
本発明の第１の局面に係る音声合成装置は、予め言語情報についてのラベル付けがされた朗読音声データからなる朗読音声データベースと、自然発話音声データからなる自然発話音声データベースと、非言語情報が予め付与されたテキスト情報を受け、朗読音声データベースからテキスト情報に付与された非言語情報と合致する言語情報が付与された音声データを抽出する事により、テキスト情報に対応する音声信号を合成するための音声合成手段と、自然発話音声データベースから音声信号の各部分について、各部分との間に定義される距離の小さいものから順番に自然発話音声データを複数個選択するための候補選択手段と、音声信号の各部分について、自然発話音声データベースから、候補選択手段により選択された複数個の自然発話データの各々について予め定められた韻律的特徴を算出し、テキスト情報に付与されている非言語情報と合致するものを選択するためのフィルタ手段と、フィルタ手段により選択された自然発話データに基づいて音声信号を合成するための手段とを含む。
【００２１】
好ましくは、テキスト情報に予め付与されている非言語情報は、予め定められた韻律的特徴を示す特徴ベクトルであり、フィルタ手段は、候補選択手段により選択された複数個の自然発話データの各々について予め定められた韻律的特徴を示す特徴ベクトルを算出し、テキスト情報に予め付与されている特徴ベクトルとの類似度が最も高いものを選択するための手段を含む。
【００２２】
さらに好ましくは、予め定められた韻律的特徴は、正規化振幅指数、音声信号のパワー、音声信号の持続時間、及び基本周波数のうち少なくとも一つを含む。
【００２３】
候補選択手段は、音声信号の各部分について、自然発話音声データベースから、各部分との間でＤＰ（ＤｙｎａｍｉｃＰｒｏｇｒａｍｍｉｎｇ）マッチングにより算出されるＤＰ距離が予め定められたしきい値より小さなものを選択するための手段を含んでもよい。
【００２４】
候補選択手段は、音声信号の各部分について、自然発話音声データベースから、各部分との間でＤＰマッチングにより算出されるＤＰ距離の小さなものから順番に予め定められた複数個だけ選択するための手段を含んでもよい。
【００２５】
本願発明の第２の局面は、コンピュータにより実行されると、当該コンピュータを上記したいずれかの音声合成装置として動作させるコンピュータプログラムに関する。
【００２６】
【発明の実施の形態】
−使用した自然発話ＤＢとその特徴−
スタジオ録音された音声と、日頃親しんでいる音声との間の最も大きな相違は、日頃親しんでいる音声で経験する発話スタイルが非常に大きな範囲にわたっているという点である。これは、話者が、発話時にその状況における発話のフォーマルさを示すために、喉頭部の設定を種々に変化させるためと思われる。
【００２７】
出願人において作成した音声コーパスの話者の一人について、１００時間以上の録音を行なって得た音声データを、発話サイズのチャンク（かたまり）に分割した。これらチャンクについてさらに、発話スタイルの特徴を３段階で示す様にラベル付けした。ラベルは以下の３種類である。
【００２８】
（ａ）話者の状態（感情及び態度）
（ｂ）話のスタイル（友好的、丁寧、柔らか、ためらいがち、など）
（ｃ）各発話の間の話者の声の調子（ブレシー、暖か、緊張気味など）
なおここでブレシー（ｂｒｅａｔｈｙ）とは気息性という事を意味し、典型的には丁寧でやさしく話すときの話し方の特徴である。この逆はプレスト（ｐｒｅｓｓｅｄ）という。
【００２９】
これら３つのラベルからなるベクトルを、音声から抽出した音響的特徴（ピッチ、パワー、話す速度、気息性の度合いなど）と組合せた。さらに、この結果得られる多次元空間の複雑さを軽減するために主成分分析（ＰＣＡ）を行なった。ＰＣＡ分析の第１次元は話者と相手との間の関係（仲のよさ）によく対応し、第２次元は発話内容（誠実さ）によく対応し、第３次元は話者の態度（熱意）によく対応した。
【００３０】
これは、相手との関係及び対話の目的に応じて、話者がその声の質、ピッチの幅、及びその表現を変化させているためだと思われる。別の人には別の話し方で話すというのは常識に適っている。しかし、音声関連の分野では、家族、友人、仕事上の知人、他人、及び機械などに対して人が話すときの発話スタイル及び音声の特徴がどの様に相違するかについては、ほとんどデータが蓄積されてこなかった。
【００３１】
実施の形態の説明をする前に、その背景となる上記した発話スタイル及び音声の特徴の相違について説明する。図１に、二人の話者（ＦＩＡ及びＦＡＮ）についての正規化振幅指数（ＮｏｒｍａｌｉｓｅｄＡｍｐｌｉｔｕｄｅＱｕｏｔｉｅｎｔ：ＮＡＱ）の分布を示す。ＮＡＱとは、振幅係数（ＡｍｐｌｉｔｕｄｅＱｕｏｔｉｅｎｔ：ＡＱ）を基本周波数ｆ０で正規化したものである。
【００３２】
ＡＱとは、非特許文献３においてアルク（Ａｌｋｕ）により示されているものであって、音声信号から声道の影響を除去するために、最適化した、時間的に変化するフォルマントを用いて音声を逆フィルタリングする事により得られる声門（声帯）気流の波形の微分の推定値であり、その波形のピークツーピークの振幅の最大値を、波形の微分のサイクルツーサイクルの振幅の最小値で除したものである。ＡＱは声門の発音のモード（「声の質」）を示す。
【００３３】
ＡＱは、そのままでは発話波形の基本周期と弱い相関を持つが、基本周波数ｆ０で除する事によりその影響を削減できる。その結果得られるのがＮＡＱである。
【００３４】
図１の下半分に示すのは、二人の日本人の女性話者（ＦＩＡ，ＦＡＮ）の発話について測定したＮＡＱのヒストグラムである。図１の上半分には、非特許文献４でＡｌｋｕらにより、５人の男性話者および５人の女性話者について報告された「プレスト」、「地声（通常）」、及び「ブレシー（気息性）」に関する測度と比較したものである。図１から、個人により多少の変動はあるが、分布全体の形状は類似したものである事、及びそれが前記文献に記載された「プレスト」、「地声（通常）」、及び「ブレシー（気息性）」という範囲に当てはまる事が分かる。話者ＦＡＮのデータに見られる歪みは、以下に説明する様によりくだけた（プレストな）発話スタイルが優勢である事により説明できる。
【００３５】
以下、この変動がランダムなものではない事、この変動が発話の非言語的特徴、たとえば対話相手との関係、発言の意図、及び発話スタイルなどとの相関により最もよく説明できる事、ならびにそのためこの変動を韻律的パラメータとして考えるべき事を示す。
【００３６】
出願人は、約２５０時間の音声データを収集し、聞き取りによりテキスト入力を行った。そのうち約１００時間分について発話スタイルと発話とその目的との間の関係という特徴に関するラベル付けを行なった。音声の音響的測定を行ない、知覚上の属性と物理的属性との間の相関に関する分析を行なった。
【００３７】
以下の説明では、一人の日本人女性話者から得られたデータについての検討をする。この女性は、頭部に装着した高性能なマイクを用いて毎日の会話を録音した。分析はこの女性の発話に対してのみ行なわれたが、ときには相手の発言もラベル付けを行なう作業者に聞き取れた。
【００３８】
データは、音響的及び知覚的なラベルを適切に付す事ができた１３，６０４発話からなる。「発話」とは、文書化の担当者にとっては、知覚できる切れ目のない音声部分の事をいい、おそらくは「イントネーションフレーズ」に対応するものである。その長さは単一シラブルから３５シラブルまでにわたっている。
【００３９】
データはＣＲＡＮのパブリックドメインの統計ソフトウェアパッケージ「Ｒ」を用いて分析された。相手（「誰に」）、発話スタイル（「どの様に」）、及び発話活動（「何のために」）からなる特徴集合を生成し、ＮＡＱと音声の基本周波数ｆ０という測度と照合する事により何らかの相関があるかどうかについて検討した。
【００４０】
対話の相手は次の表１に記載の様にグループ分けした。
【００４１】
【表１】

発話スタイルについては本実施の形態では簡略化し、「家族」、「友人」及び「他人」、さらに自分に対する発話という分類の各々について「丁寧」、「親しい」、及び「くだけた」というグループに分けた。全部で２４の発話カテゴリがあったが、ここではそのうちの次の５つについて論じる。すなわち「情報の提供」、「あいづち」、「情報の要求」、「つぶやき」、及び「繰返しの要求」である。
【００４２】
−発話の韻律とＮＡＱ−
正規化前には、ＡＱは基本周波数ｆ０とｒ＝−０．４０６の相関を有していた。正規化（ＮＡＱ＝ｌｏｇ（ＡＱ）＋ｌｏｇ（ｆ０））により得られたＮＡＱは基本周波数ｆ０とｒ＝０．１８２の相関を有していた。
【００４３】
図２は、家族に対する発話についてのＮＡＱと基本周波数ｆ０とを示す。図２において、ｍ１、ｍ２、ｍ３、ｍ４、ｍ５、ｍ６、及びｍ８は、それぞれ母、父、娘、夫、姉、姉の子、及び叔母を示す。図２から、いくつか興味ある傾向がわかる。すなわち、話者（女性）の娘（１歳）に対する発話が、基本周波数ｆ０及び気息性のいずれにおいても最も高い値を示している。気息性から、家族の序列が次の様に定まる。すなわち、娘＞姉の子＞父＞母＝姉＞叔母＞夫という順序である。この順序が、家族内での対話において、「気配り」をされている程度を示すという事が可能かも知れない。ラベル付け作業者も、この結果は発話を聞いているときの印象と一致している事を確認した。
【００４４】
図３は、対話の相手によるＮＡＱと基本周波数ｆ０とを示す。図３において「ｆ」は友人を示す。「ｍ」は家族、「ｔ」は他人を示す。興味深いのは、友人に対する「ａ」（注意深い発話）に関するＮＡＱの値は高く「ｂ」（親しい会話）及び「ｃ」（くだけた会話）の間では違いが見られないのに対して、家族間の会話ではこの関係が逆転している事である。すなわち、注意深い会話と親しい会話との間では違いが見られないのに対して、くだけた会話ではＮＡＱの値はかなり低くなっている。他人との会話については、くだけた会話はないが、注意深い会話及び親しい会話は予想した通りのＮＡＱの相違を示した。
【００４５】
図４は、発話とその目的についての相違について論ずる。既に述べた事から、注意深い会話においては、より「手ごろな」会話と比較してＮＡＱの値が高くなる事が予測される。図４は、この予想が正しい事を示す。図４は５つのカテゴリ（つぶやき（「？」）、間投詞（「Ｉ」）、情報の提供（「ｅ」）、情報の要求（「ｒｅ」）、及び繰返しの要求（「ｒｚ」））についてのＮＡＱと基本周波数ｆ０とを示す。
【００４６】
図４を参照して、情報の提供のＮＡＱの値は、情報の要求についての値よりもかなり低い。また、繰返しの要求のＮＡＱの値が最も高い。「つぶやき」については他とは別カテゴリであると考えられるが、それは図４によっても裏打ちされる。すなわち、つぶやきについてはｆ０がきわだって低く、気息性（高ＮＡＱ値）の声質を示している。
【００４７】
以上から、ＮＡＱにより測定した声質が、会話の相手、発話スタイル、及び発話の目的と大きな相関を持っている事が分かる。ＮＡＱは、会話においてはらう「注意」の程度によって一定の変化をし、基本周波数とは独立に変化する。従って、この声質を、基本周波数ｆ０、発話の長さ、及び振幅とともに韻律的特徴と考える事ができ、意味上の非言語的な相違を示すために音声合成において制御すべきものと考える。
【００４８】
−音声合成装置の構成−
上に述べた考え方に従い、ＮＡＱにより測定した声質を制御することにより、意味上の非言語的な相違が反映された音声合成を行なう音声合成装置の実施の形態について以下説明する。
【００４９】
図５に、この一実施の形態に係る音声合成装置のブロック図を示す。図５を参照して、この音声合成装置は、入力される音声合成の対象となるテキスト及び非言語情報を表す属性などを含む入力ＸＭＬ（ＥｘｔｅｎｄｅｄＭａｒｋ−ＵｐＬａｎｇｕａｇｅ）文３０を前処理し、音声合成のターゲットとなるテキストを作成する前処理部３２と、予め準備された特定の話者のバランス文音声ＤＢ３４と、前処理部３２により生成されたターゲットテキストに対し、バランス文音声ＤＢから適切な音素列を選択し連結する事により、入力ＸＭＬ３０に対する音声波形データを生成するための波形生成部３６と、波形生成部３６により生成された音声波形データに基づいて音声信号を合成するための音声信号合成部３８とを含む。
【００５０】
波形生成部３６及び音声信号合成部３８にはいずれも従来の音声合成技術を用いる事ができる。バランス文音声ＤＢ３４の音声は自然な音声ではないので、生成される音声は生硬で、自然とはいえない音声となる。ただし、バランス文音声ＤＢ３４に含まれる各音素については、音素バランス文の朗読文から得られたものなので、適切にラベル付けをする事が可能である。その結果、音声信号合成部３８から出力される音声信号は、生硬ではあるが、入力ＸＭＬ３０で指定された非言語情報に比較的よくあった音声信号となる。
【００５１】
本実施の形態に係る装置は、この様に音声信号合成部３８の出力として得られた音声信号を、自然な音声合成のための音響的ターゲット４０としてさらに自然発話音声データを用いて音声合成を行ない、自然な発話に近い合成音声信号５４を得る点にある。そのために本実施の形態の装置は、上記した各構成要素に加えて、バランス文音声ＤＢ３４の話者と同じ話者（又はよく似た声を出す人）の自然な発話を集める事により予め準備された自然発話音声ＤＢ４２を用いる。自然発話音声ＤＢ４２は、上記した話者の自然発話を収集する事により得られたもので、様々な状況での音声データを集めてある。ただし、この自然発話音声ＤＢ４２内の音声データには、上記した非言語情報に合わせて音声を抽出するためのラベル付けなどはしていない。自然発話についてそうしたラベル付けをする事が、従来の技術の説明で述べた様に困難だからである。
【００５２】
この装置はさらに、音響的ターゲット４０の各時間期間について、ＤＰマッチングによって自然発話音声ＤＢ４２の中から比較的近い（ＤＰ距離が小さい、すなわち類似度が高い）音声データを音声合成のための候補として複数個選択し、候補列４６として出力するための候補選択部４４と、候補列４６内の各候補について所定の韻律的属性を求め、その部分について入力ＸＭＬ３０で指定された非言語的情報と合致した韻律的属性を示すもののみを選択するためのフィルタ部４８とを含む。ここで使用される時間期間は、可変長である。
【００５３】
フィルタ部４８が各候補列から求める韻律的属性としては、よく知られている基本周波数ｆ０、音声データのパワー、発話の長さに加えて、上記したＮＡＱを含む。たとえばこれら各要素について、入力ＸＭＬ３０では各発話単位（たとえば文）について予め特徴ベクトル（又は特徴ベクトルを計算するための情報）が非言語情報として付与されている。各候補についてもこれらの情報を計算する事ができ、比較のための特徴ベクトルを作成する事ができる。フィルタ部４８は、各候補について計算された特徴ベクトルと、入力ＸＭＬ３０でその発話単位について付与されていた特徴ベクトルとの間の距離を計算し、最も小さな距離を示した候補であって、かつ連結したときになめらかに連結できる様な候補を選択する。フィルタ部４８は、この様にして最終的に音声合成をするための最終音声データ列５０を出力する。
【００５４】
この装置はさらに、最終音声データ列５０に基づいて波形生成を行なうための波形生成部５２を含む。波形生成部５２が出力する合成音声信号５４は、自然発話音声ＤＢ４２から抽出した音声データに基づいて合成されており、かつその各発話単位は入力ＸＭＬ３０においてその発話単位に付与されていた非言語情報によく合致したものとなる。従って、合成音声信号５４は、自然に聞こえる音声であって、かつ指定された発話モードによく合致したものとなる。
【００５５】
−音声合成装置の動作−
この装置は以下の様に動作する。入力ＸＭＬ３０が前処理部３２に与えられると、前処理部３２は音声合成すべきテキストを各発話単位で作成し、かつ入力ＸＭＬ３０において各発話単位に付与されていた非言語情報を抽出する。波形生成部３６は、バランス文を朗読した音声から作成した朗読音声データベースであるバランス文音声ＤＢ３４から、前処理部３２によって与えられたテキストを合成するための音声データをバランス文音声ＤＢ３４から抽出する。波形生成部３６はこの際、前処理部３２から与えられた非言語情報と一致するラベルが付された音声データを抽出する。波形生成部３６はさらに、抽出した音声データを従来の技術に従ってなめらかに連結し、音声信号合成部３８に与える。
【００５６】
音声信号合成部３８は、この音声データ列に基づいて、従来の技術に従って音声合成を行ない、自然発話音声合成のための音響的ターゲット４０を出力し候補選択部４４に与える。この音響的ターゲット４０の例を図６に示す。図６に示す例では、音響的ターゲット４０は時間期間９２，９４，９６及び９８を含む。この期間は可変長である。またこれらの時間期間は互いに一部重複していてもよい。図５を参照して、候補選択部４４は、図６に示す各区間９２，９４，９６及び９８について、自然発話音声ＤＢ４２からＤＰマッチングにより音響的ターゲット４０の波形と類似した音声データ候補列１１２，１１４，１１６，１１８をそれぞれ抽出する。音声データ候補列１１２、１１４，１１６，１１８の各々は複数の音声データ候補を含む。本実施の形態では、候補選択部４４は、ＤＰ距離の小さなものから順番に所定の複数個を候補として選択する。候補選択部４４はこれら音声データ候補列１１２、１１４、１１６、１１８を図５に示す候補列４６としてフィルタ部４８に与える。
【００５７】
フィルタ部４８は、たとえば図６に示す時間期間９２について、音声データ候補列１１２に含まれる各候補の特徴ベクトルを算出する。そしてこの特徴ベクトルと、入力ＸＭＬ３０において付与されていた特徴ベクトルとを比較して、その間で計算されるコサイン尺度（すなわち類似度）が小さなものであって、かつ連続する期間の音声データと滑らかに連結できる様な候補１３２を選択する。同様にフィルタ部４８は、時間期間９４，９６，９８等についても複数の候補から候補１３４、１３６、１３８を抽出する。これらが図５に示す最終音声データ列５０となる。
【００５８】
波形生成部５２はこれら最終音声データ列５０を滑らかに連結した合成音声信号５４を出力する。
【００５９】
以上説明した本実施の形態の装置によれば、一旦バランス文音声ＤＢ３４を用いて音響的ターゲット４０を生成し、この音響的ターゲット４０に近く、かつ入力ＸＭＬ３０に付与されていた非言語的特徴と一致した韻律的特徴を示す音声データを自然発話音声ＤＢ４２から抽出する事ができる。この音声データ列から合成した合成音声信号５４を得る事ができる。そのため、合成音声信号５４は、自然に聞こえる音声であってかつ最初に指定された非言語的特徴によく合致したものとなる。また、自然発話音声ＤＢ４２からの抽出のために、自然発話音声ＤＢ４２中の音声データに予めラベル付けをしておく必要はない。バランス文音声ＤＢ３４のラベル付けだけをしておけばよく、これは容易に行なう事ができる。
【００６０】
上記した実施の形態では、候補選択部４４は、ＤＰ距離の小さなものから順番に所定の複数個を選択する。しかし本発明はその様な実施の形態には限定されない。たとえば、候補選択部４４は、ＤＰ距離が所定のしきい値より小さなもののみを候補として選択する様にしてもよい。また、ＤＰ距離の小さなものから順番に、かつ所定のしきい値より小さなもののみを選択する様にしてもよい。
【００６１】
なお、ここに説明した実施の形態の装置は１又は複数のコンピュータ及び当該１又は複数のコンピュータ上で実行されるソフトウェアにより実現する事ができる。そのソフトウェアの制御構造は、図５に示したブロック図とよく対応している。そのため、ここではその詳細は説明しない。当業者であれば、上記した説明からソフトウェアをどの様に構成すればよいかは明らかであろう。
【００６２】
今回開示された実施の形態は単に例示であって、本発明が上記した実施の形態のみに制限されるわけではない。本発明の範囲は、発明の詳細な説明の記載を参酌した上で、特許請求の範囲の各請求項によって示され、そこに記載された文言と均等の意味及び範囲内でのすべての変更を含む。
【図面の簡単な説明】
【図１】本発明の一実施の形態の装置の原理を説明するための図である。
【図２】家族に対するＮＡＱと基本周波数ｆ０とを示すための図である。
【図３】相手の種類によるＮＡＱと基本周波数ｆ０とを示すための図である。
【図４】発話の目的によるＮＡＱと基本周波数ｆ０とを示すための図である。
【図５】本発明の一実施の形態の装置のブロック図である。
【図６】本発明の一実施の形態の装置の動作を説明するための図である。
【符号の説明】
３０入力ＸＭＬ、３２前処理部、３４バランス文音声ＤＢ、３６波形生成部、３８音声信号合成部、４０音響的ターゲット、４２自然発話音声ＤＢ、４４候補選択部、４６候補列、４８フィルタ部、５０最終音声データ列、５２波形生成部、５４合成音声信号
[0001]
[Technical field of the invention]
The present invention relates to speech synthesis, and more particularly to a technique for synthesizing natural-sounding speech from a database of spontaneous speech.
[0002]
[Prior art]
Synthesized speech is, by its nature, not natural. Nevertheless, there is demand for techniques that synthesize natural-sounding speech. Such techniques are needed, for example, as a communication aid for people who for some reason cannot speak, in automatic speech-to-speech translation, in providing information by voice over the telephone, and in responding to telephone inquiries from customers.
[0003]
To synthesize natural-sounding speech, voices of different tones must be used according to the content of what is said. To do so, the speech used for synthesis must be subdivided into elements, and each element must be given a label indicating the circumstances in which that speech is to be used.
[0004]
At present there are several large corpora of spontaneous speech that could potentially be used for such natural-sounding speech synthesis. However, segmenting the speech in a corpus and labeling each segment is an enormous task. In addition, many problems related to modeling the acoustic features of spontaneous speech remain unsolved.
[0005]
In contrast, such labeling is relatively easy for a speech database consisting of read-aloud phoneme-balanced sentences (hereinafter called a "balanced-sentence speech DB"). A balanced-sentence speech DB covers all phonemes and all prosodic patterns.
[0006]
As a conventional speech synthesis technique using a balanced-sentence speech DB, there is, for example, a technique called CHATR, introduced in Non-Patent Document 1 and Non-Patent Document 2, which selects and concatenates phonemes.
[0007]
The standard method of speech synthesis by phoneme concatenation has two stages, as described in Non-Patent Document 1 or Non-Patent Document 2. In the first stage, appropriate candidates for each segment of the speech are selected from several corpora using a target cost function that reflects the phonemic and prosodic constraints imposed by the text to be synthesized (the target). In the second stage, one candidate is selected for each segment so as to minimize the concatenation cost, making the synthesized speech as smooth as possible, and the selected candidates are concatenated to synthesize the speech.
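The two stages can be sketched as follows. This is only an illustrative outline, not the CHATR implementation: the unit representation, the cost functions, and the names `select_candidates` and `best_sequence` are assumptions made for the example. Stage 1 keeps the candidates with the lowest target cost for each segment, and stage 2 uses dynamic programming to choose one candidate per segment so that the summed concatenation cost is minimal.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")  # a candidate unit taken from a corpus (hypothetical type)
T = TypeVar("T")  # the target specification for one segment (hypothetical type)


def select_candidates(target: T, corpus: Sequence[U],
                      target_cost: Callable[[T, U], float],
                      n_best: int) -> List[U]:
    """Stage 1: keep the n_best corpus units with the lowest target cost."""
    return sorted(corpus, key=lambda u: target_cost(target, u))[:n_best]


def best_sequence(candidates: Sequence[Sequence[U]],
                  join_cost: Callable[[U, U], float]) -> List[U]:
    """Stage 2: choose one candidate per segment, minimising total join cost."""
    # cost[i][j]: minimal accumulated join cost of a path ending in candidates[i][j]
    cost = [[0.0] * len(candidates[0])]
    back: List[List[int]] = []
    for i in range(1, len(candidates)):
        row_cost, row_back = [], []
        for cur in candidates[i]:
            options = [cost[i - 1][k] + join_cost(prev, cur)
                       for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[k_best])
            row_back.append(k_best)
        cost.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [j]
    for row_back in reversed(back):
        j = row_back[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

In this sketch the target cost is used only for pruning in stage 1; a practical system might also carry it into the stage 2 total.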
[0008]
The target of this process is usually a known symbolic (alphanumeric) representation that describes the desired output speech phonemically and prosodically.
[0009]
[Non-Patent Document 1]
Campbell, W. N. and Black, A. W., "CHATR: a multilingual speech re-sequencing synthesis system", Technical Report of IEICE, SP96-7, pp. 45-52, 1996.
[Non-Patent Document 2]
Campbell, W. N., "Processing a Speech Corpus for CHATR Synthesis", Proceedings of the International Conference on Speech Processing, pp. 183-186, 1997.
[Non-Patent Document 3]
P. Alku and E. Vilkman, "Amplitude domain quotient for characterization of the glottal volume velocity waveform estimated by inverse filtering", Speech Comm., vol. 18, no. 2, pp. 131-138, 1996.
[Non-Patent Document 4]
P. Alku, T. Baeckstroem, and E. Vilkman, "Normalized amplitude quotient for parametrization of the glottal flow", J. Acoust. Soc. Am., vol. 112, no. 2, pp. 701-710, 2002.
[Problems to be solved by the invention]
It is said that most of the research carried out to date under the name of "corpus-based" speech synthesis has in fact concerned "database-based" speech synthesis. The difference lies in how widely speaking styles are covered and in what kinds of speaking style are involved.
[0010]
A "corpus" is a collection of text or speech that is more or less representative of a language and that can be used as a starting point for a linguistic description of the language or as a means of testing hypotheses about it. For a systematic study of authentic instances of language in actual use, it is important that the collection consist of naturally occurring language (that is, text or speech) selected so as to characterize a state or change of the language.
[0011]
Texts written for the purpose of demonstrating particular linguistic features should normally not be included in a true corpus for linguistic research, because they do not meet the criterion of being "authentic" and are therefore not "naturally occurring" either.
[0012]
However, the great majority of databases used in speech synthesis research to date were designed for a specific purpose and usually consist of studio recordings of material read carefully by professional announcers. They are not representative of "speech in use", nor do they contain the natural speaking styles and situation-dependent variations encountered in everyday spoken life.
[0013]
A balanced-sentence speech DB can be labeled in detail. However, although the speech it contains includes many examples of the formal linguistic features of spoken language, it contains almost none of the features related to the social and interactive functions of spoken language. Speech synthesized from a balanced-sentence speech DB therefore sounds stiff and is not perceived as natural speech.
[0014]
If speech synthesis is to develop in a more natural direction, research must be based on a corpus that can represent all aspects of spoken interaction and that also contains non-verbal information, such as the speaker's state, attitude, and intention, which provides clues for interpreting spoken words as they were intended.
[0015]
One conceivable solution is to use a spontaneous speech DB. However, using a spontaneous speech DB for speech synthesis raises the problems described above: the labeling work becomes enormous, and modeling the acoustic features needed for labeling is difficult. For this reason, it has conventionally been difficult to synthesize natural-sounding speech using a spontaneous speech DB.
[0016]
The present invention has been made to solve these problems, and its object is to provide a speech synthesizer that can synthesize natural-sounding speech using a spontaneous speech DB.
[0017]
Another object of the present invention is to provide a speech synthesizer that can synthesize natural-sounding speech using a spontaneous speech DB without labeling the spontaneous speech DB.
[0018]
Still another object of the present invention is to provide a speech synthesizer that can synthesize natural-sounding speech by generating an acoustic target from the initial target by some means and extracting speech similar to this acoustic target from a spontaneous speech DB.
[0019]
A further object of the present invention is to provide a speech synthesizer that can synthesize natural-sounding speech in a speaking style that follows the non-verbal and paralinguistic features of the target.
[0020]
[Means for Solving the Problems]
A speech synthesizer according to a first aspect of the present invention includes: a read speech database consisting of read speech data labeled in advance with linguistic information; a spontaneous speech database consisting of spontaneous speech data; speech synthesis means for receiving text information to which non-verbal information has been attached in advance and synthesizing a speech signal corresponding to the text information by extracting, from the read speech database, speech data labeled with linguistic information that matches the non-verbal information attached to the text information; candidate selection means for selecting, for each part of the speech signal, a plurality of items of spontaneous speech data from the spontaneous speech database in ascending order of a distance defined between the item and that part; filter means for calculating, for each part of the speech signal, predetermined prosodic features of each of the plurality of items of spontaneous speech data selected by the candidate selection means, and selecting the item that matches the non-verbal information attached to the text information; and means for synthesizing a speech signal based on the spontaneous speech data selected by the filter means.
[0021]
Preferably, the non-verbal information attached in advance to the text information is a feature vector representing predetermined prosodic features, and the filter means includes means for calculating, for each of the plurality of items of spontaneous speech data selected by the candidate selection means, a feature vector representing the predetermined prosodic features and selecting the item whose feature vector is most similar to the feature vector attached in advance to the text information.
[0022]
More preferably, the predetermined prosodic features include at least one of the normalized amplitude quotient, the power of the speech signal, the duration of the speech signal, and the fundamental frequency.
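The filter step described above can be illustrated with a short sketch. It assumes that each candidate has already been reduced to a four-component vector (NAQ, power, duration, f0) whose components have been scaled to comparable ranges, and it uses cosine similarity as the similarity measure; the scaling, the component order, and the function names are assumptions made for the example rather than details fixed by the text.

```python
import math
from typing import List, Sequence

# Assumed component order of every feature vector in this sketch.
FEATURES = ("naq", "power", "duration", "f0")


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_most_similar(target_vec: Sequence[float],
                      candidate_vecs: List[Sequence[float]]) -> int:
    """Return the index of the candidate most similar to the target vector."""
    return max(range(len(candidate_vecs)),
               key=lambda i: cosine_similarity(target_vec, candidate_vecs[i]))
```

Without some form of scaling, a component with large raw values such as f0 in hertz would dominate the similarity, which is why the sketch assumes pre-scaled inputs.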
[0023]
The candidate selection means may include means for selecting, for each part of the speech signal, items from the spontaneous speech database whose DP distance to that part, calculated by DP (Dynamic Programming) matching, is smaller than a predetermined threshold.
[0024]
The candidate selection means may include means for selecting, for each part of the speech signal, a predetermined number of items from the spontaneous speech database in ascending order of the DP distance to that part calculated by DP matching.
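Both selection variants (a threshold on the DP distance, and the N closest items) can be sketched with a basic dynamic time warping routine. The sketch below is an assumption-laden illustration: it compares one-dimensional sequences of numbers, uses the absolute difference as the local cost, and normalises by the combined length, none of which is specified in the text. The names `dp_distance` and `select_by_dp` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def dp_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """DP-matching (dynamic time warping) distance between two non-empty
    sequences, normalised by the sum of their lengths."""
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m] / (n + m)


def select_by_dp(target: Sequence[float],
                 database: Sequence[Sequence[float]],
                 n_best: Optional[int] = None,
                 threshold: Optional[float] = None) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs in ascending order of DP distance,
    optionally keeping only distances below threshold and/or the first n_best."""
    scored = sorted((dp_distance(target, seg), idx)
                    for idx, seg in enumerate(database))
    if threshold is not None:
        scored = [(dist, idx) for dist, idx in scored if dist < threshold]
    if n_best is not None:
        scored = scored[:n_best]
    return [(idx, dist) for dist, idx in scored]
```

Passing both `n_best` and `threshold` combines the two criteria: the closest items are kept, but only those under the threshold.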
[0025]
A second aspect of the present invention relates to a computer program which, when executed by a computer, causes the computer to operate as any of the above-described speech synthesizers.
[0026]
[Embodiments of the invention]
-The spontaneous speech DB used and its characteristics-
The greatest difference between studio-recorded speech and the speech we hear every day is that the speaking styles encountered in everyday speech span a very wide range. This is thought to be because speakers vary their laryngeal settings in various ways to signal the formality of the utterance in the given situation.
[0027]
For one of the speakers in the speech corpus created by the applicant, speech data obtained from more than 100 hours of recordings was divided into utterance-sized chunks. These chunks were then labeled so as to describe speaking-style features at three levels. The three types of label are as follows.
[0028]
(a) The speaker's state (emotion and attitude)
(b) Speaking style (friendly, polite, soft, hesitant, etc.)
(c) The speaker's tone of voice during each utterance (breathy, warm, tense, etc.)
Here "breathy" refers to breathiness, which is typically characteristic of polite, gentle speech. Its opposite is called "pressed".
[0029]
Vectors consisting of these three labels were combined with acoustic features extracted from the speech (pitch, power, speaking rate, degree of breathiness, etc.). Principal component analysis (PCA) was then performed to reduce the complexity of the resulting multidimensional space. The first dimension of the PCA corresponded well to the relationship between the speaker and the listener (closeness), the second to the content of the utterance (sincerity), and the third to the speaker's attitude (enthusiasm).
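As a rough illustration of the dimensionality reduction step, the sketch below finds the first principal component of a set of numeric feature vectors by power iteration on the covariance matrix. How the categorical labels were encoded numerically is not stated in the text, so the example simply assumes that every row is already a numeric vector; the function name is hypothetical, and a statistics package would normally be used instead.

```python
import math
from typing import List, Sequence


def first_principal_component(rows: Sequence[Sequence[float]],
                              iterations: int = 200) -> List[float]:
    """Unit vector along the direction of greatest variance in rows.

    Assumes at least two rows. Uses power iteration starting from the first
    coordinate axis; if that start vector happens to be orthogonal to the
    principal direction, the result is not reliable.
    """
    n, dim = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(dim)]
    centered = [[r[k] - means[k] for k in range(dim)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    v = [1.0] + [0.0] * (dim - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v
```

Further components could be obtained by removing the projection onto this vector from the data and repeating the procedure.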
[0030]
This is thought to be because the speaker changes voice quality, pitch range, and manner of expression according to the relationship with the listener and the purpose of the conversation. It is common sense that people speak differently to different people. In speech-related fields, however, hardly any data has been accumulated on how speaking style and voice characteristics differ when people talk to family, friends, business acquaintances, strangers, machines, and so on.
[0031]
Before describing the embodiment, the differences in speaking style and voice characteristics that form its background are explained. FIG. 1 shows the distribution of the Normalised Amplitude Quotient (NAQ) for two speakers (FIA and FAN). NAQ is the Amplitude Quotient (AQ) normalized by the fundamental frequency f0.
[0032]
AQ was introduced by Alku in Non-Patent Document 3. It is computed from an estimate of the derivative of the glottal (vocal fold) airflow waveform, obtained by inverse filtering the speech with optimized, time-varying formants to remove the influence of the vocal tract: the maximum peak-to-peak amplitude of the waveform is divided by the minimum cycle-to-cycle amplitude of its derivative. AQ indicates the mode of glottal phonation ("voice quality").
[0033]
As it stands, AQ is weakly correlated with the fundamental period of the speech waveform, but this influence can be reduced by normalizing with the fundamental frequency f0. The result is NAQ.
[0034]
The lower half of FIG. 1 shows histograms of NAQ measured on the utterances of two Japanese female speakers (FIA and FAN). The upper half of FIG. 1 shows, for comparison, the measures for "pressed", "modal (normal)", and "breathy" phonation reported by Alku et al. in Non-Patent Document 4 for five male and five female speakers. FIG. 1 shows that, although there is some variation between individuals, the overall shape of the distributions is similar and falls within the "pressed", "modal (normal)", and "breathy" ranges described in that document. The skew seen in the data of speaker FAN can be explained, as described below, by the predominance of a more casual (pressed) speaking style.
[0035]
The following shows that this variation is not random, that it is best explained by its correlation with non-verbal features of the utterance, such as the relationship with the conversation partner, the intention of the utterance, and the speaking style, and that it should therefore be treated as a prosodic parameter.
[0036]
The applicant collected about 250 hours of speech data and transcribed it by listening. About 100 hours of this data was labeled for speaking style and for the relationship between each utterance and its purpose. Acoustic measurements of the speech were made, and the correlation between perceptual and physical attributes was analyzed.
[0037]
The following discussion examines data obtained from one Japanese female speaker. She recorded her everyday conversations using a high-quality head-mounted microphone. Only her own utterances were analyzed, although the labelers could sometimes also hear what her conversation partners said.
[0038]
The data consists of 13,604 utterances to which acoustic and perceptual labels could be properly assigned. An "utterance" is, for the transcriber, a stretch of speech with no perceptible break, and probably corresponds to an "intonation phrase". Utterance length ranges from a single syllable to 35 syllables.
[0039]
The data was analyzed using the statistical software package "R", which is freely available from CRAN. A feature set consisting of the conversation partner ("to whom"), the speaking style ("how"), and the speech act ("what for") was generated and compared against the NAQ and fundamental frequency f0 measures to examine whether any correlation existed.
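The kind of comparison described here can be sketched as a grouped summary: each utterance carries its three categorical features plus the two measures, and the mean NAQ and mean f0 are computed per category. This is not the applicant's R analysis; the record layout, the field names, and the function `mean_by` are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Utterance:
    partner: str  # "to whom", e.g. "family" (hypothetical field name)
    style: str    # "how", e.g. "casual" (hypothetical field name)
    act: str      # "what for", e.g. "giving information" (hypothetical field name)
    naq: float    # normalized amplitude quotient of the utterance
    f0: float     # fundamental frequency of the utterance


def mean_by(utterances: Iterable[Utterance],
            key: str) -> Dict[str, Tuple[float, float]]:
    """Group utterances by one categorical field and return, for each group,
    the pair (mean NAQ, mean f0)."""
    groups: Dict[str, List[Utterance]] = defaultdict(list)
    for u in utterances:
        groups[getattr(u, key)].append(u)
    return {k: (mean(x.naq for x in g), mean(x.f0 for x in g))
            for k, g in groups.items()}
```

Calling `mean_by(data, "partner")`, `mean_by(data, "style")`, and `mean_by(data, "act")` gives one summary per feature, which can then be plotted as NAQ against f0.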
[0040]
The conversation partners were grouped as shown in Table 1 below.
[0041]
[Table 1]

In this embodiment the speaking style is simplified: each of the categories "family", "friends", "others", and speech addressed to oneself is divided into the groups "polite", "friendly", and "casual". There were 24 speech-act categories in total, of which the following five are discussed here: "giving information", "back-channel responses (aizuchi)", "requesting information", "muttering", and "requesting repetition".
[0042]
-Prosody of utterances and NAQ-
Before normalization, AQ had a correlation of r = −0.406 with the fundamental frequency f0. The NAQ obtained by normalization (NAQ = log(AQ) + log(f0)) had a correlation of r = 0.182 with the fundamental frequency f0.
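The normalization formula and the correlation check can be reproduced with a few lines of code. The sketch below assumes natural logarithms (the text does not state the base) and uses the Pearson correlation coefficient; the function names are hypothetical, and no measured data from the document is reproduced.

```python
import math
from typing import Sequence


def naq(aq: float, f0: float) -> float:
    """NAQ = log(AQ) + log(f0), with natural logarithms assumed."""
    return math.log(aq) + math.log(f0)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences.

    Both sequences must have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Given per-utterance lists `aq_values` and `f0_values`, comparing `pearson_r(f0_values, aq_values)` with `pearson_r(f0_values, [naq(a, f) for a, f in zip(aq_values, f0_values)])` shows how much of the dependence on f0 the normalization removes.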
[0043]
FIG. 2 shows NAQ and fundamental frequency f0 for utterances addressed to family members. In FIG. 2, m1, m2, m3, m4, m5, m6, and m8 denote the mother, father, daughter, husband, older sister, older sister's child, and aunt, respectively. FIG. 2 reveals several interesting tendencies. Utterances by the (female) speaker to her daughter (one year old) show the highest values for both fundamental frequency f0 and breathiness. Breathiness gives the following ordering of family members: daughter > sister's child > father > mother = older sister > aunt > husband. This order may indicate the degree of "care" taken with each person in conversations within the family. The labelers confirmed that this result matched their impressions from listening to the utterances.
[0044]
FIG. 3 shows the NAQ and the fundamental frequency f0 by type of conversation partner. In FIG. 3, "f" indicates a friend, "m" indicates family, and "t" indicates another person. Interestingly, for friends the value of NAQ is high for "a" (attentive conversation) and there is no difference between "b" (close conversation) and "c" (casual conversation), whereas for family this relationship is reversed. That is, for family there is no difference between attentive and close conversation, while the value of NAQ is considerably lower in casual conversation. There were no casual conversations with others, but attentive and close conversations with others showed the expected difference in NAQ.
[0045]
FIG. 4 examines how speech differs according to its purpose. From what has already been described, the value of NAQ is expected to be higher in attentive conversation than in more casual conversation. FIG. 4 shows that this prediction is correct. FIG. 4 shows the NAQ and the fundamental frequency f0 for five categories: murmur ("?"), interjection ("I"), information provision ("e"), information request ("re"), and repetition request ("rz").
[0046]
Referring to FIG. 4, the value of the NAQ for information provision is significantly lower than the value for information requests. Further, the value of the NAQ for repetition requests is the highest. Murmurs are considered to be a different category from the others, and this is also supported by FIG. 4. That is, for murmurs f0 is extremely low, and the voice quality is breathy (high NAQ value).
[0047]
From the above, it can be seen that the voice quality measured by the NAQ correlates strongly with the conversation partner, the utterance style, and the purpose of the utterance. The NAQ varies consistently with the degree of "attentiveness" in the conversation, and varies independently of the fundamental frequency. This voice quality can therefore be regarded as a prosodic feature alongside the fundamental frequency f0, the length of the utterance, and the amplitude, and should be controlled in speech synthesis to express semantic, non-linguistic differences.
[0048]
-Configuration of speech synthesizer-
An embodiment of a speech synthesis apparatus that performs speech synthesis reflecting semantic non-linguistic differences by controlling voice quality measured by NAQ according to the above-described concept will be described below.
[0049]
FIG. 5 shows a block diagram of a speech synthesizer according to this embodiment. Referring to FIG. 5, the speech synthesis apparatus includes: a pre-processing unit 32 that pre-processes an input XML (Extensible Markup Language) text 30, which contains the input text to be synthesized and attributes indicating non-verbal information, and creates the text that is the target of speech synthesis; a balanced sentence voice DB 34 of a specific speaker, prepared in advance; a waveform generation unit 36 that generates speech waveform data for the input XML 30 by selecting appropriate phoneme sequences from the balanced sentence voice DB 34 for the target text created by the pre-processing unit 32 and connecting them; and a speech signal synthesis unit 38 that synthesizes a speech signal based on the speech waveform data generated by the waveform generation unit 36.
[0050]
A conventional speech synthesis technique can be used for both the waveform generation unit 36 and the speech signal synthesis unit 38. Since the speech in the balanced sentence voice DB 34 is not natural speech, the generated speech is stiff and can hardly be called natural. However, since each phoneme in the balanced sentence voice DB 34 is obtained from read-aloud recordings of phonetically balanced sentences, the phonemes can be labeled appropriately. As a result, the speech signal output from the speech signal synthesis unit 38 is stiff but relatively well matched to the non-verbal information specified by the input XML 30.
[0051]
The apparatus according to the present embodiment uses the speech signal output by the speech signal synthesis unit 38 as an acoustic target 40 for natural-speech synthesis, and performs speech synthesis again using naturally uttered speech data. The aim is to obtain a synthesized speech signal 54 that is close to a natural utterance. To this end, in addition to the components described above, the apparatus uses a naturally uttered voice DB 42 prepared in advance by collecting natural utterances of the same speaker as the balanced sentence voice DB 34 (or of a person with a very similar voice). The naturally uttered voice DB 42 contains this speaker's natural utterances, collected in a variety of situations. However, the speech data in the naturally uttered voice DB 42 is not labeled in a way that would allow speech to be extracted according to the non-verbal information described above. This is because such labeling of natural speech is difficult, as explained in the description of the related art.
[0052]
The apparatus further includes: a candidate selection unit 44 that, for each time period of the acoustic target 40, uses DP matching to select from the naturally uttered voice DB 42 a plurality of pieces of speech data that are relatively close to that period (small DP distance, that is, high similarity) as candidates for speech synthesis, and outputs them as a candidate string 46; and a filter unit 48 that obtains predetermined prosodic attributes for each candidate in the candidate string 46 and selects only those whose prosodic attributes match the non-verbal information specified for that part in the input XML 30. The time periods used here are of variable length.
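The candidate selection step can be sketched as follows. This is a minimal sketch that implements DP matching as classic dynamic time warping over sequences of frame feature vectors. The frame representation, the Euclidean local cost, and the normalization by total length are assumptions made for illustration, since the text does not specify them. The names dp_distance and select_candidates are hypothetical.

```python
import math

def _frame_cost(u, v):
    # Euclidean distance between two feature frames (assumed local cost).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def dp_distance(a, b):
    """Length-normalized DP (dynamic time warping) distance between frame sequences."""
    if not a or not b:
        raise ValueError("sequences must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = _frame_cost(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def select_candidates(target, segments, k):
    """Return (index, distance) for the k database segments closest to the target period."""
    scored = sorted((dp_distance(target, seg), idx) for idx, seg in enumerate(segments))
    return [(idx, dist) for dist, idx in scored[:k]]
```

Because the warping path may stretch or compress time, a database segment of a different length from the target period can still have a small distance, which fits the use of variable-length time periods.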
[0053]
The prosodic attributes that the filter unit 48 obtains from each candidate include the NAQ described above in addition to the well-known fundamental frequency f0, power of the speech data, and utterance length. For example, in the input XML 30, a feature vector over these elements (or information for calculating the feature vector) is given in advance as non-verbal information for each utterance unit (for example, a sentence). The same information can be calculated for each candidate to create a feature vector for comparison. The filter unit 48 calculates the distance between the feature vector calculated for each candidate and the feature vector assigned to the utterance unit in the input XML 30, and selects the candidate that shows the smallest distance and that can be smoothly connected when concatenated. In this way, the filter unit 48 outputs a final speech data sequence 50 from which the speech is finally synthesized.
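The distance comparison performed by the filter unit can be illustrated with the short sketch below. It assumes that the four attributes have already been scaled to comparable ranges and uses Euclidean distance; both are assumptions, as are the names Prosody and pick_closest. The additional condition that the chosen candidate can be smoothly connected to its neighbors is omitted because the text does not specify how it is evaluated.

```python
import math
from typing import NamedTuple, Sequence

class Prosody(NamedTuple):
    """Prosodic feature vector; values are assumed to be pre-scaled."""
    f0: float
    power: float
    duration: float
    naq: float

def feature_distance(u: Prosody, v: Prosody) -> float:
    # Euclidean distance between two prosodic feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def pick_closest(target: Prosody, candidates: Sequence[Prosody]) -> int:
    """Return the index of the candidate whose feature vector is nearest the target."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return min(range(len(candidates)),
               key=lambda i: feature_distance(target, candidates[i]))
```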
[0054]
The apparatus further includes a waveform generation unit 52 that generates a waveform based on the final speech data sequence 50. The synthesized speech signal 54 output from the waveform generation unit 52 is synthesized from speech data extracted from the naturally uttered voice DB 42, and each speech unit matches well the non-verbal information assigned to that unit in the input XML 30. The synthesized speech signal 54 therefore sounds natural and matches the specified manner of speaking well.
[0055]
-Operation of speech synthesizer-
This apparatus operates as follows. When the input XML 30 is given to the pre-processing unit 32, the pre-processing unit 32 creates the text to be synthesized for each utterance unit and extracts the non-verbal information assigned to each utterance unit in the input XML 30. The waveform generation unit 36 extracts speech data for synthesizing the text given by the pre-processing unit 32 from the balanced sentence voice DB 34, which is a read-speech database created from recordings of balanced sentences read aloud. In doing so, the waveform generation unit 36 extracts speech data whose labels match the non-verbal information given by the pre-processing unit 32. The waveform generation unit 36 then smoothly connects the extracted speech data according to a conventional technique and provides the result to the speech signal synthesis unit 38.
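The pre-processing step can be pictured with the following sketch, which splits an input XML document into utterance units and collects the attributes attached to each. The element name "utterance" and the attribute names "partner", "style", and "act" are hypothetical, loosely modeled on the "to whom", "how", and "what" feature set described earlier; the text does not define the actual schema of the input XML 30.

```python
import xml.etree.ElementTree as ET

def preprocess(xml_text):
    """Return a list of (text, attributes) pairs, one per utterance unit."""
    root = ET.fromstring(xml_text)
    units = []
    for elem in root.iter("utterance"):  # hypothetical element name
        text = "".join(elem.itertext()).strip()
        units.append((text, dict(elem.attrib)))
    return units

# Hypothetical input; attribute names and values are illustrative only.
SAMPLE = """<speech>
  <utterance partner="friend" style="casual" act="information-request">Are you coming tomorrow?</utterance>
  <utterance partner="family" style="attentive" act="information-provision">Dinner is at seven.</utterance>
</speech>"""

if __name__ == "__main__":
    for text, attrs in preprocess(SAMPLE):
        print(text, attrs)
```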
[0056]
The speech signal synthesis unit 38 performs speech synthesis based on the speech data sequence according to a conventional technique, outputs an acoustic target 40 for natural-speech synthesis, and supplies the acoustic target 40 to the candidate selection unit 44. An example of this acoustic target 40 is shown in FIG. 6. In the example shown in FIG. 6, the acoustic target 40 includes time periods 92, 94, 96, and 98. These periods are of variable length and may partially overlap one another. Referring to FIG. 5, the candidate selection unit 44 performs DP matching against the naturally uttered voice DB 42 for each of the time periods 92, 94, 96, and 98 shown in FIG. 6, and extracts speech data candidate strings 112, 114, 116, and 118, each similar to the waveform of the acoustic target 40 in the corresponding period. Each of the speech data candidate strings 112, 114, 116, and 118 includes a plurality of speech data candidates. In the present embodiment, the candidate selection unit 44 selects a predetermined number of candidates in ascending order of DP distance. The candidate selection unit 44 provides the speech data candidate strings 112, 114, 116, and 118 to the filter unit 48 as the candidate string 46 shown in FIG. 5.
[0057]
For the time period 92 shown in FIG. 6, for example, the filter unit 48 calculates a feature vector for each candidate included in the speech data candidate string 112. Each feature vector is compared with the feature vector given in the input XML 30, and the filter unit 48 selects a candidate 132 that is closest to that feature vector under the cosine measure (similarity) calculated between the two vectors and that can be smoothly connected to the speech data of the adjacent periods. Similarly, the filter unit 48 extracts candidates 134, 136, and 138 from the pluralities of candidates for the time periods 94, 96, and 98, respectively. Together these form the final speech data sequence 50 shown in FIG. 5.
[0058]
The waveform generator 52 outputs a synthesized voice signal 54 in which the final voice data sequence 50 is smoothly connected.
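The text states only that the selected speech data are connected smoothly and does not name a method. Purely as an illustration, the sketch below joins sample sequences with a linear cross-fade over a fixed number of overlapping samples. The cross-fade itself, the function name crossfade_concat, and the overlap parameter are all assumptions.

```python
def crossfade_concat(segments, overlap):
    """Concatenate sample lists, linearly cross-fading up to `overlap` samples at each join."""
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    out = []
    for seg in segments:
        seg = list(seg)
        # Never fade over more samples than either side actually has.
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming segment rises across the join
            out[start + i] = (1.0 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])
    return out
```

With an overlap of zero the function reduces to plain concatenation, so the effect of the cross-fade can be compared directly against an unsmoothed join.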
[0059]
According to the apparatus of the present embodiment described above, the acoustic target 40 is first generated using the balanced sentence voice DB 34, and speech data that is close to the acoustic target 40 and exhibits prosodic features matching the non-verbal features assigned in the input XML 30 can then be extracted from the naturally uttered voice DB 42. A synthesized speech signal 54 synthesized from that speech data sequence can be obtained. The synthesized speech signal 54 therefore sounds natural and matches well the non-verbal features that were initially specified. Further, extracting the speech data from the naturally uttered voice DB 42 does not require the speech data in the naturally uttered voice DB 42 to be labeled in advance. Only the balanced sentence voice DB 34 needs to be labeled, and this can be done easily.
[0060]
In the embodiment described above, the candidate selection unit 44 selects a predetermined number of candidates in ascending order of DP distance. However, the present invention is not limited to such an embodiment. For example, the candidate selection unit 44 may select as candidates only those whose DP distance is smaller than a predetermined threshold. Alternatively, only candidates with smaller DP distances may be selected, in ascending order of DP distance.
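The two selection policies mentioned here, a fixed number of candidates and a distance threshold, differ only in the final cut applied to the scored candidates. The following sketch shows both, assuming the DP distances have already been computed and are supplied as (candidate_id, distance) pairs; the function names are hypothetical.

```python
def select_top_k(scored, k):
    """Keep the k candidates with the smallest DP distance, nearest first."""
    return sorted(scored, key=lambda item: item[1])[:k]

def select_below_threshold(scored, threshold):
    """Keep every candidate whose DP distance is below the threshold, nearest first."""
    kept = [item for item in scored if item[1] < threshold]
    return sorted(kept, key=lambda item: item[1])
```

A fixed count always yields candidates for the filter unit to choose from, whereas a threshold can return an empty list for a period with no sufficiently similar data, so a caller using the threshold policy would need to handle that case.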
[0061]
The apparatus according to the embodiment described here can be realized by one or more computers and software executed on them. The control structure of the software corresponds closely to the block diagram shown in FIG. 5, so its details are not described here. It will be clear to those skilled in the art from the above description how to configure the software.
[0062]
The embodiment disclosed here is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all changes within the meaning and range equivalent to the wording recited therein.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of an apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing NAQ and a fundamental frequency f0 for a family.
FIG. 3 is a diagram showing NAQ and a fundamental frequency f0 according to the type of a partner.
FIG. 4 is a diagram showing NAQ and fundamental frequency f0 according to the purpose of speech.
FIG. 5 is a block diagram of an apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining the operation of the device according to the embodiment of the present invention.
[Explanation of symbols]
30 input XML, 32 preprocessing section, 34 balanced sentence speech DB, 36 waveform generation section, 38 speech signal synthesis section, 40 acoustic target, 42 naturally uttered speech DB, 44 candidate selection section, 46 candidate sequence, 48 filter section, 50 final voice data sequence, 52 waveform generator, 54 synthesized voice signal

Claims

A speech synthesizer comprising:
a reading-speech database consisting of reading-speech data labeled in advance with linguistic information;
a naturally uttered speech database consisting of naturally uttered speech data;
speech synthesis means for receiving text information to which non-verbal information has been assigned in advance, and synthesizing a speech signal corresponding to the text information by extracting, from the reading-speech database, speech data labeled with linguistic information that matches the non-verbal information assigned to the text information;
candidate selection means for selecting, for each part of the speech signal, a plurality of pieces of naturally uttered speech data from the naturally uttered speech database in ascending order of a distance defined with respect to that part;
filter means for calculating, for each part of the speech signal, a predetermined prosodic feature for each of the plurality of pieces of naturally uttered speech data selected from the naturally uttered speech database by the candidate selection means, and selecting the one that matches the non-verbal information assigned to the text information; and
means for synthesizing a speech signal based on the naturally uttered speech data selected by the filter means.

The speech synthesizer according to claim 1, wherein the non-verbal information assigned in advance to the text information is a feature vector indicating the predetermined prosodic feature, and
the filter means includes means for calculating, for each of the plurality of pieces of naturally uttered speech data selected by the candidate selection means, a feature vector indicating the predetermined prosodic feature, and selecting the one having the highest similarity to the feature vector assigned in advance to the text information.

The speech synthesizer according to claim 2, wherein the predetermined prosodic feature includes at least one of a normalized amplitude quotient, the power of the speech signal, the duration of the speech signal, and the fundamental frequency.

The speech synthesizer according to any one of claims 1 to 3, wherein the candidate selection means includes means for selecting, for each part of the speech signal, those pieces of data in the naturally uttered speech database whose DP distance to that part, calculated by DP (Dynamic Programming) matching, is smaller than a predetermined threshold.

The speech synthesizer according to any one of claims 1 to 3, wherein the candidate selection means includes means for selecting, for each part of the speech signal, only a predetermined number of pieces of data from the naturally uttered speech database in ascending order of the DP distance to that part calculated by DP (Dynamic Programming) matching.

A computer program which, when executed by a computer, causes the computer to operate as the speech synthesizer according to any one of claims 1 to 5.