JP2004361766A - Speaking speed conversion apparatus, speaking speed conversion method, and program

Info

Publication number: JP2004361766A
Application number: JP2003161588A
Authority: JP
Inventors: Yasushi Sato; 寧佐藤
Original assignee: Kenwood KK
Current assignee: Kenwood KK
Priority date: 2003-06-06
Filing date: 2003-06-06
Publication date: 2004-12-24
Anticipated expiration: 2023-06-06
Also published as: JP4411017B2

Abstract

PROBLEM TO BE SOLVED: To provide a speaking speed conversion apparatus and the like for obtaining synthesized speech that is easy to hear even when the environment changes.
SOLUTION: When supplied with data representing a fixed-form message, a sound piece editing section 8 retrieves from a sound piece database 10 the sound piece data of sound pieces whose readings match those of the sound pieces in the fixed-form message, determines an utterance speed, and has a speaking speed conversion section 11 convert the sound piece data to match the determined speed. The utterance speed is determined based on the speed of a moving body detected by a speed detecting section 12. The sound piece editing section 8 also performs prosody prediction on the fixed-form message and, from the retrieved sound piece data, selects one item per sound piece that best matches each sound piece in the fixed-form message. Data representing the synthesized speech is generated by joining the selected sound piece data with waveform data that an acoustic processing section 4 supplies for any sound pieces for which no data could be selected.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、話速変換装置、話速変換方法及びプログラムに関する。
【０００２】
【従来の技術】
音声を合成する手法として、録音編集方式と呼ばれる手法がある。録音編集方式は、駅の音声案内システムや、車載用のナビゲーション装置などに用いられている。
録音編集方式は、単語と、この単語を読み上げる音声を表す音声データとを対応付けておき、音声合成する対象の文章を単語に区切ってから、これらの単語に対応付けられた音声データを取得してつなぎ合わせる、という手法である（例えば、特許文献１参照）。
【０００３】
【特許文献１】
特開平１０−４９１９３号公報
【０００４】
【発明が解決しようとする課題】
しかし、音声データを単につなぎ合わせた場合、合成音声の発声スピード（発声する時間の長さ）は、あらかじめ用意されている音声データの発声スピードにより決まる値になる。一方、人の聴覚の特性は、音声を聞く人がいる場所や、移動速度、周囲の状況、また車内などにおいても、車のいる場所、車の速度、車の周囲の状況、車内の状況などの諸環境の条件に大きく影響されるので、これらの要因が変化すれば、同一の音声でも聞こえ方が大きく変化する。
【０００５】
この発明は、上記実状に鑑みてなされたものであり、環境が変化しても聴き取りやすい合成音声を得るための話速変換装置、話速変換方法及びプログラムを提供することを目的とする。
【０００６】
【課題を解決するための手段】
上記目的を達成すべく、この発明の第１の観点にかかる話速変換装置は、
音声の波形を表す音声データを取得する音声データ取得手段と、
移動体の速度又は加速度を検出して、当該速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成手段と、
取得した音声データのスピードを、生成された話速設定データにより表されるスピードに基づいて変換する音声データ変換手段と、
を備えることを特徴とする。
【０００７】
前記音声データ変換手段は、取得した音声データをサンプリングする変換を行うことにより、変換後の音声データが表す音声のスピードが、話速設定データにより表されるスピードとなるようにするものであってもよい。
【０００８】
前記音声データ変換手段は、取得した音声データが表す波形のうち実質的に無音状態を表している部分を特定し、当該部分の時間長を変化させる変換を行うことにより、音声データが表す音声のスピードを、話速設定データにより表されるスピードに基づいたスピードとなるようにするものであってもよい。
【０００９】
前記話速設定データ生成手段は、移動体の加速度のピークを検出し、検出した最新のピークを所定の複数のランクのいずれかに分類して記憶し、過去に検出されたピークが、最新のピークが分類されたランクと同一のランクへと過去どのような頻度で分類されたかを特定して、特定した結果に基づいて話速設定データを生成するものであってもよい。
【００１０】
また、この発明の第２の観点にかかる話速変換装置は、
同一の読みの語句を互いに異なるスピードで発声する複数の音声の波形を表す複数の音声データを記憶する音声データ記憶手段と、
移動体の速度又は加速度を検出して、当該速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成手段と、
前記音声データ変換手段が記憶する音声データのうち、表す音声のスピードが、生成された話速設定データにより表されるスピードに最も近いものを選択する音声データ選択手段と、
を備えることを特徴とする。
【００１１】
また、この発明の第３の観点にかかる話速変換装置は、
音声の波形を表す音声データを取得する音声データ取得手段と、
外部の状況を表す物理量を検出して、当該物理量に基づいて、音声が満たすべき条件を表す条件設定データを生成する条件設定データ生成手段と、
取得した音声データを変換して、変換後の音声データが表す音声が、生成された条件設定データにより表される条件を満たすようにする音声データ変換手段と、
を備えることを特徴とする。
【００１２】
前記条件設定データ生成手段は、車両の移動速度、当該車両のブレーキの動き、及び／又は当該車両のハンドルの動きを表す物理量を検出するものであってもよい。
【００１３】
また、この発明の第４の観点にかかる話速変換方法は、
音声の波形を表す音声データを取得し、
移動体の速度又は加速度を検出して、当該速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成し、
取得した音声データのスピードを、生成された話速設定データにより表されるスピードに基づいて変換する、
ことを特徴とする。
【００１４】
また、この発明の第５の観点にかかる話速変換方法は、
同一の読みの語句を互いに異なるスピードで発声する複数の音声の波形を表す複数の音声データを記憶し、
移動体の速度又は加速度を検出して、当該速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成し、
前記音声データ変換手段が記憶する音声データのうち、表す音声のスピードが、生成された話速設定データにより表されるスピードに最も近いものを選択する、
ことを特徴とする。
【００１５】
また、この発明の第６の観点にかかる話速変換方法は、
音声の波形を表す音声データを取得し、
外部の状況を表す物理量を検出して、当該物理量に基づいて、音声が満たすべき条件を表す条件設定データを生成し、
取得した音声データを変換して、変換後の音声データが表す音声が、生成された条件設定データにより表される条件を満たすようにする、
ことを特徴とする。
【００１６】
また、この発明の第７の観点にかかるプログラムは、
移動体の速度又は加速度を検出する装置を備えたコンピュータを、
音声の波形を表す音声データを取得する音声データ取得手段と、
移動体の速度又は加速度を検出して、当該速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成手段と、
取得した音声データのスピードを、生成された話速設定データにより表されるスピードに基づいて変換する音声データ変換手段と、
して機能させるためのものであることを特徴とする。
【００１７】
また、この発明の第８の観点にかかるプログラムは、
移動体の速度又は加速度を検出する装置を備えたコンピュータを、
同一の読みの語句を互いに異なるスピードで発声する複数の音声の波形を表す複数の音声データを記憶する音声データ記憶手段と、
移動体の速度又は加速度を検出して、当該速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成手段と、
前記音声データ変換手段が記憶する音声データのうち、表す音声のスピードが、生成された話速設定データにより表されるスピードに合致しているものを選択する音声データ選択手段と、
して機能させるためのものであることを特徴とする。
【００１８】
また、この発明の第９の観点にかかるプログラムは、
外部の状況を表す物理量を検出する装置を備えたコンピュータを、
音声の波形を表す音声データを取得する音声データ取得手段と、
外部の状況を表す物理量を検出して、当該物理量に基づいて、音声が満たすべき条件を表す条件設定データを生成する条件設定データ生成手段と、
取得した音声データを変換して、変換後の音声データが表す音声が、生成された条件設定データにより表される条件を満たすようにする音声データ変換手段と、
して機能させるためのものであることを特徴とする。
【００１９】
【発明の実施の形態】
以下、この発明の実施の形態を、車両などの移動体に搭載されて利用される音声合成システムを例とし、図面を参照して説明する。
図１は、この発明の実施の形態に係る音声合成システムの構成を示す図である。図示するように、この音声合成システムは、本体ユニットＭと、音片登録ユニットＲとにより構成されている。
【００２０】
本体ユニットＭは、言語処理部１と、一般単語辞書２と、ユーザ単語辞書３と、音響処理部４と、検索部５と、伸長部６と、波形データベース７と、音片編集部８と、検索部９と、音片データベース１０と、話速変換部１１と、速度検出部１２とにより構成されている。
【００２１】
言語処理部１、音響処理部４、検索部５、伸長部６、音片編集部８、検索部９及び話速変換部１１は、いずれも、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）やＤＳＰ（ＤｉｇｉｔａｌＳｉｇｎａｌＰｒｏｃｅｓｓｏｒ）等のプロセッサや、このプロセッサが実行するためのプログラムを記憶するメモリなどより構成されており、それぞれ後述する処理を行う。
なお、言語処理部１、音響処理部４、検索部５、伸長部６、音片編集部８、検索部９及び話速変換部１１の一部又は全部の機能を単一のプロセッサが行うようにしてもよい。
【００２２】
一般単語辞書２は、ＰＲＯＭ（ＰｒｏｇｒａｍｍａｂｌｅＲｅａｄＯｎｌｙＭｅｍｏｒｙ）やハードディスク装置等の不揮発性メモリより構成されている。一般単語辞書２には、表意文字（例えば、漢字など）を含む単語等と、この単語等の読みを表す表音文字（例えば、カナや発音記号など）とが、この音声合成システムの製造者等によって、あらかじめ互いに対応付けて記憶されている。
【００２３】
ユーザ単語辞書３は、ＥＥＰＲＯＭ（ＥｌｅｃｔｒｉｃａｌｌｙＥｒａｓａｂｌｅ／ＰｒｏｇｒａｍｍａｂｌｅＲｅａｄＯｎｌｙＭｅｍｏｒｙ）やハードディスク装置等のデータ書き換え可能な不揮発性メモリと、この不揮発性メモリへのデータの書き込みを制御する制御回路とにより構成されている。なお、プロセッサがこの制御回路の機能を行ってもよく、言語処理部１、音響処理部４、検索部５、伸長部６、音片編集部８、検索部９及び話速変換部１１の一部又は全部の機能を行うプロセッサがユーザ単語辞書３の制御回路の機能を行うようにしてもよい。
ユーザ単語辞書３は、表意文字を含む単語等と、この単語等の読みを表す表音文字とを、ユーザの操作に従って外部より取得し、互いに対応付けて記憶する。ユーザ単語辞書３には、一般単語辞書２に記憶されていない単語等とその読みを表す表音文字とが格納されていれば十分である。
【００２４】
波形データベース７は、ＰＲＯＭやハードディスク装置等の不揮発性メモリより構成されている。波形データベース７には、表音文字と、この表音文字が表す単位音声の波形を表す波形データをエントロピー符号化して得られる圧縮波形データとが、この音声合成システムの製造者等によって、あらかじめ互いに対応付けて記憶されている。単位音声は、規則合成方式の手法で用いられる程度の短い音声であり、具体的には、音素や、ＶＣＶ（Ｖｏｗｅｌ−Ｃｏｎｓｏｎａｎｔ−Ｖｏｗｅｌ）音節などの単位で区切られる音声である。なお、エントロピー符号化される前の波形データは、例えば、ＰＣＭ（ＰｕｌｓｅＣｏｄｅＭｏｄｕｌａｔｉｏｎ）されたデジタル形式のデータからなっていればよい。
【００２５】
音片データベース１０は、ＰＲＯＭやハードディスク装置等の不揮発性メモリより構成されている。
音片データベース１０には、例えば、図２に示すデータ構造を有するデータが記憶されている。すなわち、図示するように、音片データベース１０に格納されているデータは、ヘッダ部ＨＤＲ、インデックス部ＩＤＸ、ディレクトリ部ＤＩＲ及びデータ部ＤＡＴの４種に分かれている。
【００２６】
なお、音片データベース１０へのデータの格納は、例えば、この音声合成システムの製造者によりあらかじめ行われ、及び／又は、音片登録ユニットＲが後述する動作を行うことにより行われる。
【００２７】
ヘッダ部ＨＤＲには、音片データベース１０を識別するデータや、インデックス部ＩＤＸ、ディレクトリ部ＤＩＲ及びデータ部ＤＡＴのデータ量、データの形式、著作権等の帰属などを示すデータが格納される。
【００２８】
データ部ＤＡＴには、音片の波形を表す音片データをエントロピー符号化して得られる圧縮音片データが格納されている。
なお、音片とは、音声のうち音素１個以上を含む連続した１区間をいい、通常は単語１個分又は複数個分の区間からなる。
また、エントロピー符号化される前の音片データは、上述の圧縮波形データの生成のためエントロピー符号化される前の波形データと同じ形式のデータ（例えば、ＰＣＭされたデジタル形式のデータ）からなっていればよい。
【００２９】
ディレクトリ部ＤＩＲには、個々の圧縮音片データについて、
（Ａ）この圧縮音片データが表す音片の読みを示す表音文字を表すデータ（音片読みデータ）、
（Ｂ）この圧縮音片データが格納されている記憶位置の先頭のアドレスを表すデータ、
（Ｃ）この圧縮音片データのデータ長を表すデータ、
（Ｄ）この圧縮音片データが表す音片の発声スピード（再生した場合の時間長）を表すデータ（スピード初期値データ）、
（Ｅ）この音片のピッチ成分の周波数の時間変化を表すデータ（ピッチ成分データ）、
が、互いに対応付けられた形で格納されている。（なお、音片データベース１０の記憶領域にはアドレスが付されているものとする。）
【００３０】
なお、図２は、データ部ＤＡＴに含まれるデータとして、読みが「サイタマ」である音片の波形を表す、データ量１４１０ｈバイトの圧縮音片データが、アドレス００１Ａ３６Ａ６ｈを先頭とする論理的位置に格納されている場合を例示している。（なお、本明細書及び図面において、末尾に“ｈ”を付した数字は１６進数を表す。）
【００３１】
なお、上述の（Ａ）〜（Ｅ）のデータの集合のうち少なくとも（Ａ）のデータ（すなわち音片読みデータ）は、音片読みデータが表す表音文字に基づいて決められた順位に従ってソートされた状態で（例えば、表音文字がカナであれば、五十音順に従って、アドレス降順に並んだ状態で）、音片データベース１０の記憶領域に格納されている。
【００３２】
インデックス部ＩＤＸには、ディレクトリ部ＤＩＲのデータのおおよその論理的位置を音片読みデータに基づいて特定するためのデータが格納されている。具体的には、例えば、音片読みデータがカナを表すものであるとして、カナ文字と、先頭１字がこのカナ文字であるような音片読みデータがどのような範囲のアドレスにあるかを示すデータとが、互いに対応付けて格納されている。
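Paragraphs [0029] to [0032] describe directory entries holding items (A) to (E), kept sorted by reading, plus an index that maps the first kana of a reading to the range where matching entries live. A minimal sketch of that lookup follows. The class and function names are hypothetical, list positions stand in for storage addresses, Python's default string ordering stands in for gojuon order, and every value other than the "サイタマ" address and length from Fig. 2 is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirEntry:
    reading: str       # (A) sound piece reading data
    address: int       # (B) start address of the compressed sound piece data
    length: int        # (C) data length in bytes
    duration_ms: int   # (D) initial speed value (playback duration)
    pitch: tuple       # (E) pitch component data (frequency over time)


def build_index(entries):
    """Map each first character to the [start, end) range of sorted entries."""
    index = {}
    for pos, entry in enumerate(entries):
        first = entry.reading[0]
        start = index[first][0] if first in index else pos
        index[first] = (start, pos + 1)
    return index


def find(entries, index, reading):
    """Return every entry whose reading matches, scanning only the indexed range."""
    if not reading:
        return []
    start, end = index.get(reading[0], (0, 0))
    return [e for e in entries[start:end] if e.reading == reading]
```

Returning all matches rather than the first one mirrors the later retrieval step, in which every candidate for a sound piece is fetched before one is selected.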
【００３３】
なお、一般単語辞書２、ユーザ単語辞書３、波形データベース７及び音片データベース１０の一部又は全部の機能を単一の不揮発性メモリが行うようにしてもよい。
【００３４】
速度検出部１２は、例えば、速度センサより構成される。速度検出部１２は、この音声合成システムが搭載されている移動体の移動速度を検出し、検出した移動速度を示すデータを生成して音片編集部８へと供給する。
【００３５】
音片登録ユニットＲは、図１に示すように、収録音片データセット記憶部１３と、音片データベース作成部１４と、圧縮部１５とにより構成されている。なお、音片登録ユニットＲは音片データベース１０とは着脱可能に接続されていてもよく、この場合は、音片データベース１０に新たにデータを書き込むときを除いては、音片登録ユニットＲを本体ユニットＭから切り離した状態で本体ユニットＭに後述の動作を行わせてよい。
【００３６】
収録音片データセット記憶部１３は、ハードディスク装置等のデータ書き換え可能な不揮発性メモリより構成されている。
収録音片データセット記憶部１３には、音片の読みを表す表音文字と、この音片を人が実際に発声したものを集音して得た波形を表す音片データとが、この音声合成システムの製造者等によって、あらかじめ互いに対応付けて記憶されている。なお、この音片データは、例えば、ＰＣＭされたデジタル形式のデータからなっていればよい。
【００３７】
音片データベース作成部１４及び圧縮部１５は、ＣＰＵ等のプロセッサや、このプロセッサが実行するためのプログラムを記憶するメモリなどより構成されており、このプログラムに従って後述する処理を行う。
【００３８】
なお、音片データベース作成部１４及び圧縮部１５の一部又は全部の機能を単一のプロセッサが行うようにしてもよく、また、言語処理部１、音響処理部４、検索部５、伸長部６、音片編集部８、検索部９及び話速変換部１１の一部又は全部の機能を行うプロセッサが音片データベース作成部１４や圧縮部１５の機能を更に行ってもよい。また、音片データベース作成部１４や圧縮部１５の機能を行うプロセッサが、収録音片データセット記憶部１３の制御回路の機能を兼ねてもよい。
【００３９】
音片データベース作成部１４は、収録音片データセット記憶部１３より、互いに対応付けられている表音文字及び音片データを読み出し、この音片データが表す音声のピッチ成分の周波数の時間変化と、発声スピードとを特定する。
発声スピードの特定は、例えば、この音片データのサンプル数を数えることにより特定すればよい。
【００４０】
一方、ピッチ成分の周波数の時間変化は、例えば、この音片データにケプストラム解析を施すことにより特定すればよい。具体的には、例えば、音片データが表す波形を時間軸上で多数の小部分へと区切り、得られたそれぞれの小部分の強度を、元の値の対数（対数の底は任意）に実質的に等しい値へと変換し、値が変換されたこの小部分のスペクトル（すなわち、ケプストラム）を、高速フーリエ変換の手法（あるいは、離散的変数をフーリエ変換した結果を表すデータを生成する他の任意の手法）により求める。そして、このケプストラムのピークを与える周波数のうちの最小値を、この小部分におけるピッチ成分の周波数として特定する。
【００４１】
なお、ピッチ成分の周波数の時間変化は、例えば、特開２００３−１０８１７２号公報に開示された手法に従って音片データをピッチ波形データへと変換してから、このピッチ波形データに基づいて特定するようにすると良好な結果が期待できる。具体的には、音片データをフィルタリングしてピッチ信号を抽出し、抽出されたピッチ信号に基づいて、音片データが表す波形を単位ピッチ長の区間へと区切り、各区間について、ピッチ信号との相関関係に基づいて位相のずれを特定して各区間の位相を揃えることにより、音片データをピッチ波形信号へと変換すればよい。そして、得られたピッチ波形信号を音片データとして扱い、ケプストラム解析を行う等することにより、ピッチ成分の周波数の時間変化を特定すればよい。
【００４２】
一方、音片データベース作成部１４は、収録音片データセット記憶部１３より読み出した音片データを圧縮部１５に供給する。
圧縮部１５は、音片データベース作成部１４より供給された音片データをエントロピー符号化して圧縮音片データを作成し、音片データベース作成部１４に返送する。
【００４３】
音片データの発声スピード及びピッチ成分の周波数の時間変化を特定し、この音片データがエントロピー符号化され圧縮音片データとなって圧縮部１５より返送されると、音片データベース作成部１４は、この圧縮音片データを、データ部ＤＡＴを構成するデータとして、音片データベース１０の記憶領域に書き込む。
【００４４】
また、音片データベース作成部１４は、書き込んだ圧縮音片データが表す音片の読みを示すものとして収録音片データセット記憶部１３より読み出した表音文字を、音片読みデータとして音片データベース１０の記憶領域に書き込む。
また、書き込んだ圧縮音片データの、音片データベース１０の記憶領域内での先頭のアドレスを特定し、このアドレスを上述の（Ｂ）のデータとして音片データベース１０の記憶領域に書き込む。
また、この圧縮音片データのデータ長を特定し、特定したデータ長を、（Ｃ）のデータとして音片データベース１０の記憶領域に書き込む。
また、この圧縮音片データが表す音片の発声スピード及びピッチ成分の周波数の時間変化を特定した結果を示すデータを生成し、スピード初期値データ及びピッチ成分データとして音片データベース１０の記憶領域に書き込む。
【００４５】
次に、この音声合成システムの動作を説明する。
まず、言語処理部１が、この音声合成システムに音声を合成させる対象としてユーザが用意した、表意文字を含む文章（フリーテキスト）を記述したフリーテキストデータを外部から取得したとして説明する。
【００４６】
なお、言語処理部１がフリーテキストデータを取得する手法は任意であり、例えば、図示しないインターフェース回路を介して外部の装置やネットワークから取得してもよいし、図示しない記録媒体ドライブ装置にセットされた記録媒体（例えば、フロッピー（登録商標）ディスクやＣＤ−ＲＯＭなど）から、この記録媒体ドライブ装置を介して読み取ってもよい。また、言語処理部１の機能を行っているプロセッサが、自ら実行している他の処理で用いたテキストデータを、フリーテキストデータとして、言語処理部１の処理へと引き渡すようにしてもよい。
【００４７】
フリーテキストデータを取得すると、言語処理部１は、このフリーテキストに含まれるそれぞれの表意文字について、その読みを表す表音文字を、一般単語辞書２やユーザ単語辞書３を検索することにより特定する。そして、この表意文字を、特定した表音文字へと置換する。そして、言語処理部１は、フリーテキスト内の表意文字をすべて表音文字へと置換した結果得られる表音文字列を、音響処理部４へと供給する。
【００４８】
音響処理部４は、言語処理部１より表音文字列を供給されると、この表音文字列に含まれるそれぞれの表音文字について、当該表音文字が表す単位音声の波形を検索するよう、検索部５に指示する。
【００４９】
検索部５は、この指示に応答して波形データベース７を検索し、表音文字列に含まれるそれぞれの表音文字が表す単位音声の波形を表す圧縮波形データを索出する。そして、索出された圧縮波形データを伸長部６へと供給する。
【００５０】
伸長部６は、検索部５より供給された圧縮波形データを、圧縮される前の波形データへと復元し、検索部５へと返送する。検索部５は、伸長部６より返送された波形データを、検索結果として音響処理部４へと供給する。
音響処理部４は、検索部５より供給された波形データを、言語処理部１より供給された表音文字列内での各表音文字の並びに従った順序で、音片編集部８へと供給する。
【００５１】
音片編集部８は、音響処理部４より波形データを供給されると、この波形データを、供給された順序で互いに結合し、合成音声を表すデータ（合成音声データ）として出力する。フリーテキストデータに基づいて合成されたこの合成音声は、規則合成方式の手法により合成された音声に相当する。
【００５２】
なお、音片編集部８が合成音声データを出力する手法は任意であり、例えば、図示しないＤ／Ａ（Ｄｉｇｉｔａｌ−ｔｏ−Ａｎａｌｏｇ）変換器やスピーカを介して、この合成音声データが表す合成音声を再生するようにしてもよい。また、図示しないインターフェース回路を介して外部の装置やネットワークに送出してもよいし、図示しない記録媒体ドライブ装置にセットされた記録媒体へ、この記録媒体ドライブ装置を介して書き込んでもよい。また、音片編集部８の機能を行っているプロセッサが、自ら実行している他の処理へと、合成音声データを引き渡すようにしてもよい。
【００５３】
次に、音響処理部４が、外部より配信された、表音文字列を表すデータ（配信文字列データ）を取得したとする。（なお、音響処理部４が配信文字列データを取得する手法も任意であり、例えば、言語処理部１がフリーテキストデータを取得する手法と同様の手法で配信文字列データを取得すればよい。）
【００５４】
この場合、音響処理部４は、配信文字列データが表す表音文字列を、言語処理部１より供給された表音文字列と同様に扱う。この結果、配信文字列データが表す表音文字列に含まれる表音文字に対応する圧縮波形データが検索部５により索出され、圧縮される前の波形データが伸長部６により復元される。復元された各波形データは音響処理部４を介して音片編集部８へと供給され、音片編集部８が、この波形データを、配信文字列データが表す表音文字列内での各表音文字の並びに従った順序で互いに結合し、合成音声データとして出力する。配信文字列データに基づいて合成されたこの合成音声データも、規則合成方式の手法により合成された音声を表す。
【００５５】
次に、音片編集部８が、定型メッセージデータを取得したとする。なお、定型メッセージデータは、定型メッセージを表音文字列として表すデータである。
音片編集部８が定型メッセージデータを取得する手法は任意であり、例えば、言語処理部１がフリーテキストデータを取得する手法と同様の手法で定型メッセージデータを取得すればよい。
【００５６】
定型メッセージデータが音片編集部８に供給されると、音片編集部８は、定型メッセージに含まれる音片の読みを表す表音文字に合致する表音文字が対応付けられている圧縮音片データをすべて索出するよう、検索部９に指示する。
【００５７】
検索部９は、音片編集部８の指示に応答して音片データベース１０を検索し、該当する圧縮音片データと、該当する圧縮音片データに対応付けられている上述の音片読みデータ、スピード初期値データ及びピッチ成分データとを索出し、索出された圧縮音片データを伸長部６へと供給する。１個の音片につき複数の圧縮音片データが該当する場合も、該当する圧縮音片データすべてが、音声合成に用いられるデータの候補として索出される。一方、圧縮音片データを索出できなかった音片があった場合、検索部９は、該当する音片を識別するデータ（以下、欠落部分識別データと呼ぶ）を生成する。
【００５８】
伸長部６は、検索部９より供給された圧縮音片データを、圧縮される前の音片データへと復元し、検索部９へと返送する。検索部９は、伸長部６より返送された音片データと、索出された音片読みデータ、スピード初期値データ及びピッチ成分データとを、検索結果として話速変換部１１へと供給する。また、欠落部分識別データを生成した場合は、この欠落部分識別データも話速変換部１１へと供給する。
【００５９】
一方、音片編集部８は、移動体の移動速度を表すデータを速度検出部１２より供給されると、このデータが表す移動速度に基づいて、定型メッセージの発声スピード（この定型メッセージを発声する時間長）を決定する。そして、音片編集部８は、話速変換部１１に対し、話速変換部１１に供給された音片データを変換して、当該音片データが表す音片の時間長を、決定した発声スピードに合致するスピード（又は、決定した発声スピードに一定程度以上近いスピード）にすることを指示する。
【００６０】
なお、移動体の移動速度と発声スピードとの対応関係は任意であり、例えば音片編集部８は、移動体の移動速度が大きいほど発声スピードが速くなるように決定すればよい。
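Paragraph [0060] leaves the mapping from vehicle speed to utterance speed open, requiring only that faster travel may yield faster speech. One possible mapping, sketched under assumptions: a clamped linear ramp from a rate factor of 1.0 at standstill to 1.5 at 100 km/h. The function names, the units, and all constants are hypothetical and do not come from the patent.

```python
def utterance_rate_factor(speed_kmh, min_factor=1.0, max_factor=1.5,
                          full_speed_kmh=100.0):
    """Return a speech-rate multiplier that grows with vehicle speed (clamped)."""
    if full_speed_kmh <= 0:
        raise ValueError("full_speed_kmh must be positive")
    ratio = max(0.0, min(speed_kmh / full_speed_kmh, 1.0))
    return min_factor + (max_factor - min_factor) * ratio


def target_duration(original_duration_s, speed_kmh):
    """Duration the message should take once the rate factor is applied."""
    return original_duration_s / utterance_rate_factor(speed_kmh)
```

Clamping keeps the speech from becoming unintelligibly fast at very high speeds; a real system would presumably tune the ceiling by listening tests.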
【００６１】
話速変換部１１は、音片編集部８の指示に応答し、検索部９より供給された音片データを指示に合致するように変換して、音片編集部８に供給する。具体的には、例えば、検索部９より供給された音片データの元の時間長を、索出されたスピード初期値データに基づいて特定した上、この音片データをリサンプリングして、この音片データのサンプル数を、音片編集部８の指示したスピードに合致する時間長にすればよい。
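Paragraph [0061] says the conversion can be done by resampling the sound piece data so that its sample count corresponds to the instructed duration. The patent does not name an interpolation method; the sketch below assumes simple linear interpolation, and the function names are hypothetical. Note that plain resampling played back at the original sampling rate also shifts pitch, a side effect the text does not discuss.

```python
def target_sample_count(n_samples, original_duration_s, target_duration_s):
    """Sample count needed for the target duration at an unchanged sampling rate."""
    if original_duration_s <= 0 or target_duration_s <= 0:
        raise ValueError("durations must be positive")
    return max(1, round(n_samples * target_duration_s / original_duration_s))


def resample_linear(samples, target_len):
    """Resample a sequence to target_len points by linear interpolation."""
    n = len(samples)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one input sample and a positive length")
    if n == 1 or target_len == 1:
        return [float(samples[0])] * target_len
    step = (n - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), n - 2)      # keep lo + 1 inside the input
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out
```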
【００６２】
また、話速変換部１１は、検索部９より供給された音片読みデータ及びピッチ成分データも音片編集部８に供給し、欠落部分識別データを検索部９より供給された場合は、更にこの欠落部分識別データも音片編集部８に供給する。
【００６３】
なお、発声スピードデータが音片編集部８に供給されていない場合、音片編集部８は、話速変換部１１に対し、話速変換部１１に供給された音片データを変換せずに音片編集部８に供給するよう指示すればよく、話速変換部１１は、この指示に応答し、検索部９より供給された音片データをそのまま音片編集部８に供給すればよい。
【００６４】
音片編集部８は、話速変換部１１より音片データ、音片読みデータ及びピッチ成分データを供給されると、供給された音片データのうちから、定型メッセージを構成する音片の波形に最もよく近似できる波形を表す音片データを、音片１個につき１個ずつ選択する。
【００６５】
音片データを選択する基準は任意であり、例えば、音片編集部８は、定型メッセージについて韻律予測を行った上、定型メッセージ内のそれぞれの音片について、話速変換部１１より供給された音片データのうちから、ピッチ成分の周波数の時間変化が韻律予測の結果との間で最も高い相関を示すものを１個ずつ選択するようにすればよい。
【００６６】
具体的には、まず音片編集部８は、定型メッセージデータが表す定型メッセージに、例えば「藤崎モデル」や「ＴｏＢＩ（ＴｏｎｅａｎｄＢｒｅａｋＩｎｄｉｃｅｓ）」等の韻律予測の手法に基づいた解析を加えることにより、この定型メッセージ内の各音片のピッチ成分の周波数の時間変化を予測し、予測結果を表す関数を特定する。一方で、音片編集部８が、話速変換部１１より供給された音片データのピッチ成分の周波数の時間変化を表す関数を、話速変換部１１より供給されたピッチ成分データに基づいて特定する。
【００６７】
そして、音片編集部８は、定型メッセージ内のそれぞれの音片について、この音片のピッチ成分の周波数の時間変化の予測結果を表す関数と、この音片と読みが合致する音片の波形を表す各音片データのピッチ成分の周波数の時間変化を表す関数との相関係数を求め、最も高い相関係数を与えた音片データを選択する。
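Paragraphs [0065] to [0067] select, for each sound piece, the candidate whose pitch-frequency contour correlates best with the contour predicted by prosody analysis. A minimal sketch of that selection, assuming both contours have already been sampled at the same number of time points (the patent speaks of functions and does not fix a sampling scheme) and using the Pearson correlation coefficient; the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                      # a flat contour carries no shape information
    return cov / math.sqrt(var_x * var_y)


def select_best(predicted, candidate_contours):
    """Index of the candidate contour that correlates best with the prediction."""
    return max(range(len(candidate_contours)),
               key=lambda i: pearson(predicted, candidate_contours[i]))
```

Because correlation ignores scale and offset, a candidate spoken at a uniformly higher pitch can still win if its contour has the right shape.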
【００６８】
一方、音片編集部８は、話速変換部１１より欠落部分識別データも供給されている場合には、欠落部分識別データが示す音片の読みを表す表音文字列を定型メッセージデータより抽出して音響処理部４に供給し、この音片の波形を合成するよう指示する。
【００６９】
指示を受けた音響処理部４は、音片編集部８より供給された表音文字列を、配信文字列データが表す表音文字列と同様に扱う。この結果、この表音文字列に含まれる表音文字が示す音声の波形を表す圧縮波形データが検索部５により索出され、この圧縮波形データが伸長部６により元の波形データへと復元され、検索部５を介して音響処理部４へと供給される。音響処理部４は、この波形データを音片編集部８へと供給する。
【００７０】
音片編集部８は、音響処理部４より波形データを返送されると、この波形データと、話速変換部１１より供給された音片データのうち音片編集部８が特定したものとを、定型メッセージデータが示す定型メッセージ内での各音片の並びに従った順序で互いに結合し、合成音声を表すデータとして出力する。
【００７１】
なお、話速変換部１１より供給されたデータに欠落部分識別データが含まれていない場合は、音響処理部４に波形の合成を指示することなく直ちに、音片編集部８が選択した音片データを、定型メッセージデータが示す定型メッセージ内での各音片の並びに従った順序で互いに結合し、合成音声を表すデータとして出力すればよい。
【００７２】
以上説明した、この音声合成システムでは、合成音声の発声スピードは、この音声合成システムが搭載されている移動体の移動速度に応じて変化する。従って、例えばこの音声合成システムをカーナビゲーション装置におけるナビゲーション音声の発生に用いた場合に、車両の移動速度が大きいほど音片編集部８が音片データの発声スピードを速くするよう決定することによって、車両の走行状況に応じた適正な速さでのナビゲーション音声が得られる。例えば、車両が高速で交差点に近づいている場合、車両のスピードに応じた話速で必要な情報を発声することにより、搭乗者は、交差点に進入する前に必要な情報を聴取することができる。また、その他、移動体の速度が変化しても聴き取りやすい合成音声を容易に得ることができる。
【００７３】
なお、この音声合成システムの構成は上述のものに限られない。
例えば、波形データや音片データはＰＣＭ形式のデータである必要はなく、データ形式は任意である。
また、波形データベース７や音片データベース１０は波形データや音片データを必ずしもデータ圧縮された状態で記憶している必要はない。波形データベース７や音片データベース１０が波形データや音片データをデータ圧縮されていない状態で記憶している場合、本体ユニットＭは伸長部６を備えている必要はない。
【００７４】
また、音片データベース作成部１４は、図示しない記録媒体ドライブ装置にセットされた記録媒体から、この記録媒体ドライブ装置を介して、音片データベース１０に追加する新たな圧縮音片データの材料となる音片データや表音文字列を読み取ってもよい。
また、音片登録ユニットＲは、必ずしも収録音片データセット記憶部１３を備えている必要はない。
【００７５】
また、音片データベース作成部１４は、マイクロフォン、増幅器、サンプリング回路、Ａ／Ｄ（Ａｎａｌｏｇ−ｔｏ−Ｄｉｇｉｔａｌ）コンバータ及びＰＣＭエンコーダなどを備えていてもよい。この場合、音片データベース作成部１４は、収録音片データセット記憶部１３より音片データを取得する代わりに、自己のマイクロフォンが集音した音声を表す音声信号を増幅し、サンプリングしてＡ／Ｄ変換した後、サンプリングされた音声信号にＰＣＭ変調を施すことにより、音片データを作成してもよい。
【００７６】
また、ピッチ成分データは音片データが表す音片のピッチ長の時間変化を表すデータであってもよい。この場合、音片編集部８は、音片のピッチ長の時間変化を韻律予測を行うことにより予測し、予測結果と、この音片と読みが合致する音片の波形を表す音片データのピッチ長の時間変化を表すピッチ成分データとの相関を求めるようにすればよい。
【００７７】
また、音片編集部８は、例えば、言語処理部１と共にフリーテキストデータを取得し、このフリーテキストデータが表すフリーテキストに含まれる音片の波形に近い波形を表す音片データを、定型メッセージに含まれる音片の波形に近い波形を表す音片データを選択する処理と実質的に同一の処理を行うことによって選択して、音声の合成に用いてもよい。
この場合、音響処理部４は、音片編集部８が選択した音片データが表す音片については、この音片の波形を表す波形データを検索部５に索出させなくてもよい。なお、音片編集部８は、音響処理部４が合成しなくてよい音片を音響処理部４に通知し、音響処理部４はこの通知に応答して、この音片を構成する単位音声の波形の検索を中止するようにすればよい。
【００７８】
また、音片編集部８は、例えば、音響処理部４と共に配信文字列データを取得し、この配信文字列データが表す配信文字列に含まれる音片の波形に近い波形を表す音片データを、定型メッセージに含まれる音片の波形に近い波形を表す音片データを選択する処理と実質的に同一の処理を行うことによって選択して、音声の合成に用いてもよい。この場合、音響処理部４は、音片編集部８が選択した音片データが表す音片については、この音片の波形を表す波形データを検索部５に索出させなくてもよい。
【００７９】
また、音片編集部８は、音響処理部４より返送された波形データを話速変換部１１に供給することにより、当該波形データが表す波形の時間長を、発声スピードデータが示すスピードに合致させる（又は、当該スピードに一定程度以上近いスピードにする）ようにしてもよい。こうすることにより、音響処理部４が規則合成方式の手法により合成した音声の発声スピードも、移動体の移動速度に応じて変化する。
【００８０】
また、この音声合成システムは音片データベース１０を複数備えていてもよく、この場合は、各音片データベース１０が、互いに重複しない異なった範囲の発声スピードに対応付けられていてもよい。この場合は、例えば、各音片データベース１０が記憶している圧縮音片データの読みの組み合わせは共通しているものとし、一方で、ある範囲の発声スピードに対応付けられている音片データベース１０内のある読みの圧縮音片データは、より高い（速い）発声スピードに対応付けられている音片データベース１０内の同一の読みの圧縮音片データより低い（遅い）発声スピードで読み上げられた音声を表しているものとなっていればよい。
【００８１】
この音声合成システムが上述のように音片データベース１０を複数備えている場合、音片編集部８は、例えば、決定した音声スピードに基づいて、どの音片データベース１０を用いるかを決定し、決定した音片データベース１０を示すデータを検索部９に供給すればよい。そして、検索部９は、このデータが示す音片データベース１０から、圧縮音片データの索出を行えばよい。
【００８２】
また、話速変換部１１は、音片データをリサンプリングする代わりに、音片データのうち実質的に無音状態を表している部分を特定し、特定した部分の時間長を調整することにより、当該音片データが表す音片の時間長を、発声スピードデータが示すスピードに合致させてもよい。
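Paragraph [0082] offers an alternative to resampling: find the substantially silent portions and adjust only their length. The sketch below assumes that "substantially silent" means a run of samples whose absolute amplitude stays at or below a threshold, and it replaces each such run with exact zeros of the scaled length. The threshold, the minimum run length, and the function name are hypothetical.

```python
def stretch_silence(samples, factor, threshold=0.01, min_run=3):
    """Scale the length of near-silent runs by factor, leaving speech untouched."""
    if factor < 0:
        raise ValueError("factor must be non-negative")
    out = []
    i, n = 0, len(samples)
    while i < n:
        if abs(samples[i]) <= threshold:
            j = i
            while j < n and abs(samples[j]) <= threshold:
                j += 1                               # extend the quiet run
            run = samples[i:j]
            if len(run) >= min_run:
                out.extend([0.0] * max(1, round(len(run) * factor)))
            else:
                out.extend(run)                      # too short to count as a pause
            i = j
        else:
            out.append(samples[i])
            i += 1
    return out
```

Unlike resampling, this leaves the voiced portions, and therefore the pitch, unchanged, at the cost of a limited adjustment range when a sound piece contains few pauses.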
【００８３】
また、この音声合成システムが外部の状況について検出する物理量は必ずしも移動体の速度である必要はなく、その他の任意の物理量を表すものであってよい。
【００８４】
従って、速度検出部１２は、例えば、加速度センサ等より構成されていてもよく、速度検出部１２は、この音声合成システムが搭載されている移動体の加速度を検出し、検出した加速度を示すデータを生成して音片編集部８へと供給するようにしてもよい。また、検出した、加速度を積分するための積分回路等を更に備えていてもよく、この場合は、検出した加速度を積分した結果を示すデータを生成し、このデータを音片編集部８へと供給するようにしてもよい。
【００８５】
また、速度検出部１２は、この音声合成システムが搭載されている移動体の加速度のピークを検出し、検出したつど、検出された最新のピークの値を示すデータを生成して、音片編集部８へと供給するようにしてもよい。
一方で音片編集部８は、加速度のピークの値を示すデータを速度検出部１２より供給されるつど、このデータが示すピークの値を所定の複数のランクのいずれかに分類し、最新のピークの値が分類されたランクと同一のランクへと過去どのような頻度でピークの値が分類されたかを特定して、特定した結果に従って発声スピードを決定するようにしてもよい。
【００８６】
なお、この場合、速度検出部１２は、例えば、加速度センサが生成する信号をデジタル形式のデータに変換してピークを検出するためのＡ／Ｄ（Ａｎａｌｏｇ−ｔｏ−Ｄｉｇｉｔａｌ）変換器や論理回路などを備えていればよい。
【００８７】
また、この場合、音片編集部８は、例えばＰＲＯＭ等からなる不揮発性メモリを更に備えるものとし、図３（ａ）にデータ構造を示すテーブルをあらかじめ記憶し、このテーブルを参照することにより発声スピードを決定すればよい。図示するように、このテーブルは、発声スピードを、検出された加速度のピークが属するランクと、このランクに属するピークが検出された頻度とに対応付けた形で格納していればよい。
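Paragraphs [0085] to [0087] classify each new acceleration peak into a rank, look up how often past peaks fell into that same rank, and use the rank and frequency to read an utterance speed from the table of Fig. 3(a). The figure's contents are not reproduced in the text, so the rank boundaries, the two-way frequency bucketing, and the table passed to the lookup below are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter


class PeakRankTracker:
    """Classify peaks into ranks and report how often each rank occurred before."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)   # rank k covers values below boundaries[k]
        self.counts = Counter()
        self.total = 0

    def classify(self, peak):
        return bisect_right(self.boundaries, peak)

    def observe(self, peak):
        """Return (rank, past frequency of that rank), then record the peak."""
        rank = self.classify(peak)
        freq = self.counts[rank] / self.total if self.total else 0.0
        self.counts[rank] += 1
        self.total += 1
        return rank, freq


def lookup_rate(table, rank, freq, freq_split=0.5):
    """Read an utterance-rate factor keyed by rank and a coarse frequency bucket."""
    bucket = "frequent" if freq >= freq_split else "rare"
    return table[(rank, bucket)]
```

The frequency is computed before the new peak is counted, so it reflects only past peaks, which is one reading of the phrase "how frequently in the past".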
【００８８】
また、この音声合成システムは、例えば自動車に搭載されて利用されるものである場合、この自動車のブレーキの踏み込みの量のピークを検出して検出結果を表すデータを音片編集部８へ供給するブレーキ用のセンサを更に備えていてもよい。また、この自動車のハンドルの角速度のピークを検出して検出結果を表すデータを音片編集部８へ供給するハンドル用のセンサを更に備えていてもよい。
【００８９】
この場合、音片編集部８は、例えば、ブレーキの踏み込みの量のピークの値を示すデータや、ハンドルの角速度のピークの値を示すデータを供給されるつど、これらのデータが示すピークの値を、ブレーキの踏み込みの量及びハンドルの角速度についてそれぞれ定められた複数のランクのいずれかに分類し、最新のピークの値が分類されたランクと同一のランクへと過去どのような頻度でピークの値が分類されたかを特定する。
【００９０】
そして、音片編集部８は、自動車の加速度のピークの値ｐａと、ブレーキの踏み込みの量のピークの値ｐｂと、ハンドルの角速度のピークの値ｐωとより、数式１の右辺の値αを求める。
【００９１】
【数１】
α＝（Ｗ_Ａ１・ｐａ）＋（Ｗ_Ａ２・ｐｂ）＋（Ｗ_Ａ３・ｐω）
（ただし、Ｗ_Ａ１、Ｗ_Ａ２及びＷ_Ａ３は所定の係数）
【００９２】
また、音片編集部８は、自動車の加速度の最新のピークの値が分類されたランクと同一のランクへと過去どのような頻度でピークの値が分類されたかを示す値ｆａと、ブレーキの踏み込みの量の最新のピークの値が分類されたランクと同一のランクへと過去どのような頻度でピークの値が分類されたかを示す値ｆｂと、ハンドルの角速度のピークの値が分類されたランクと同一のランクへと過去どのような頻度でピークの値が分類されたかを示す値ｆωとより、数式２の右辺の値βを求める。
【００９３】
【数２】
β＝（Ｗ_Ｂ１・ｆａ）＋（Ｗ_Ｂ２・ｆｂ）＋（Ｗ_Ｂ３・ｆω）
（ただし、Ｗ_Ｂ１、Ｗ_Ｂ２及びＷ_Ｂ３は所定の係数）
【００９４】
一方でこの場合、音片編集部８は、例えば、図３（ｂ）にデータ構造を示すような、発声スピードをα及びβの値に対応付けた形で格納するテーブルをあらかじめ記憶しているものとし、このテーブルを参照することにより発声スピードを決定すればよい。
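Formulas 1 and 2 combine the three peak values (acceleration, brake depression, steering angular velocity) into α and the three past-frequency values into β, each as a weighted sum, and Fig. 3(b) maps the pair to an utterance speed. The weighted sums below follow the formulas as written; the table layout (rows of upper bounds scanned in order) and every numeric value in the usage example are assumptions, since the figure's contents do not appear in the text.

```python
def weighted_sum(weights, values):
    """Compute sum(w_i * v_i), as in Formula 1 (alpha) and Formula 2 (beta)."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * v for w, v in zip(weights, values))


def lookup_rate(table, alpha, beta):
    """Return the rate of the first row whose alpha and beta bounds both hold."""
    for alpha_max, beta_max, rate in table:
        if alpha <= alpha_max and beta <= beta_max:
            return rate
    raise LookupError("no table row covers the given alpha and beta")
```

For example, with hypothetical weights `(0.5, 0.3, 0.2)` and peaks `(2.0, 1.0, 5.0)`, `weighted_sum` gives an α of about 2.3; a final table row bounded by `float("inf")` on both columns can serve as a catch-all.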
【００９５】
また、この音声合成システムは、音声の発声スピードを、現在時刻に基づいて変化させてもよい。この場合は、例えば音片編集部８が、水晶発振器などからなるタイマを備え、現在日時を示すデータをこのタイマから連続的に取得し、取得したデータに基づいて音片データの発声スピードを決定するなどすればよい。
【００９６】
また、この音声合成システムが外部の状況についての物理量の検出結果に応じて変化させる対象は必ずしも音声の発声スピードである必要はなく、音声を特徴付けるその他の任意の要素であってよい。
【００９７】
従って、この音声合成システムは、例えば、音片編集部８は、検索部９より供給される音片データや音響処理部４から音片編集部８を介して供給される波形データの振幅を変化させてもよい。
【００９８】
また、この音声合成システムは、移動体の内部あるいは外部の騒音のレベルを検出し検出結果を表すデータを音片編集部８へ供給するため、例えばマイクロホンやレベル検出回路などを備えていてもよい。この場合、音片編集部８は、例えばこのデータが表す騒音のレベルに基づいて合成音声の振幅を決定し、決定した振幅に合致するように音片データや波形データを変換すればよい。このような構成を有していれば、この音声合成システムは、騒音レベルが高いほど合成音声の振幅を大きくする等して、周囲の騒音が大きくても合成音声の聞きやすさを保つことができる。
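Paragraph [0098] raises the amplitude of the synthesized speech as the detected noise level rises. A minimal sketch of one such rule, assuming the noise level arrives in decibels and samples are floats in the range -1.0 to 1.0; the ramp end points, the maximum gain, and the function names are hypothetical.

```python
def gain_for_noise(noise_db, quiet_db=40.0, loud_db=80.0, max_gain=4.0):
    """Linear ramp from gain 1.0 at quiet_db up to max_gain at loud_db (clamped)."""
    if loud_db <= quiet_db:
        raise ValueError("loud_db must exceed quiet_db")
    ratio = max(0.0, min((noise_db - quiet_db) / (loud_db - quiet_db), 1.0))
    return 1.0 + (max_gain - 1.0) * ratio


def apply_gain(samples, gain, limit=1.0):
    """Scale samples by gain and hard-clip them to the range [-limit, limit]."""
    return [max(-limit, min(limit, s * gain)) for s in samples]
```

Hard clipping is the simplest safeguard against overflow; it distorts loud passages, so a limiter or compressor may be preferable in practice.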
【００９９】
また、この音声合成システムは、移動体の内部あるいは外部の騒音が占有する帯域を検出し検出結果を表すデータを音片編集部８へ供給するため、例えばマイクロホンやフーリエ変換装置などを備えていてもよい。この場合、音片編集部８は、例えば、このデータが表す騒音の占有帯域を合成音声の減衰帯域として決定し（あるいはその他、騒音の占有帯域に基づいて合成音声の減衰帯域を決定し）、決定した減衰帯域内のスペクトル成分を音片データや波形データから除去するようにしてもよい。このような構成を有していれば、この音声合成システムは、騒音が占める帯域と合成音声が占める帯域との重複を回避するなどして、周囲の騒音が大きくても合成音声の聞きやすさを保つことができる。
【０１００】
また、この音声合成システムでは、例えば、音片編集部８が音片データや波形データの声質を変化させるようにしてもよい。
具体的には、例えば音片編集部８は、音片データを、この音片データが表す音片のピッチ成分（基本周波数成分）及び高調波成分の時間変化を表すサブバンドデータへと変換し、得られたサブバンドデータを更に変換し、音片の波形を表すデータへと戻す。ただし、サブバンドデータを波形を表すデータへと変換する際、音片編集部８は、このサブバンドデータが表わすそれぞれの成分を、元来表している周波数と異なる周波数（例えば、元来の周波数の２倍の周波数）の成分の時間変化を表すものとして解釈して変換を行う。
【０１０１】
このような構成を有していれば、この音声合成システムは、例えばカーナビゲーション用の合成音声として、夜間には音声のピッチを昼間より高めにして眠気を誘いにくい合成音声を生成する、等することにより自動車を運転する時間帯に適した合成音声を得ることもできる。
【０１０２】
以上、この発明の実施の形態を説明したが、この発明にかかる話速変換装置は、専用のシステムによらず、通常のコンピュータシステムを用いて実現可能である。
例えば、速度センサ（又はその他任意の物理量を検出するためのセンサ）が接続されたパーソナルコンピュータに上述の言語処理部１、一般単語辞書２、ユーザ単語辞書３、音響処理部４、検索部５、伸長部６、波形データベース７、音片編集部８、検索部９、音片データベース１０及び話速変換部１１の動作を実行させるためのプログラムを格納した媒体（ＣＤ−ＲＯＭ、ＭＯ、フロッピー（登録商標）ディスク等）から該プログラムをインストールすることにより、上述の処理を実行する本体ユニットＭを構成することができる。
また、パーソナルコンピュータに上述の収録音片データセット記憶部１３、音片データベース作成部１４及び圧縮部１５の動作を実行させるためのプログラムを格納した媒体から該プログラムをインストールすることにより、上述の処理を実行する音片登録ユニットＲを構成することができる。
【０１０３】
そして、これらのプログラムを実行し本体ユニットＭや音片登録ユニットＲとして機能するパーソナルコンピュータが、図１の音声合成システムの動作に相当する処理として、例えば、図４〜図６に示す処理を行うものとする。
図４は、このパーソナルコンピュータがフリーテキストデータを取得した場合の処理を示すフローチャートである。
図５は、このパーソナルコンピュータが配信文字列データを取得した場合の処理を示すフローチャートである。
図６は、このパーソナルコンピュータが定型メッセージデータ及び発声スピードデータを取得した場合の処理を示すフローチャートである。
【０１０４】
すなわち、このパーソナルコンピュータが、外部より、上述のフリーテキストデータを取得すると（図４、ステップＳ１０１）、このフリーテキストデータが表すフリーテキストに含まれるそれぞれの表意文字について、その読みを表す表音文字を、一般単語辞書２やユーザ単語辞書３を検索することにより特定し、この表意文字を、特定した表音文字へと置換する（ステップＳ１０２）。なお、このパーソナルコンピュータがフリーテキストデータを取得する手法は任意である。
【０１０５】
そして、このパーソナルコンピュータは、フリーテキスト内の表意文字をすべて表音文字へと置換した結果を表す表音文字列が得られると、この表音文字列に含まれるそれぞれの表音文字について、当該表音文字が表す単位音声の波形を波形データベース７より検索し、表音文字列に含まれるそれぞれの表音文字が表す単位音声の波形を表す圧縮波形データを索出する（ステップＳ１０３）。
【０１０６】
次に、このパーソナルコンピュータは、索出された圧縮波形データを、圧縮される前の波形データへと復元し（ステップＳ１０４）、復元された波形データを、表音文字列内での各表音文字の並びに従った順序で互いに結合し、合成音声データとして出力する（ステップＳ１０５）。なお、このパーソナルコンピュータが合成音声データを出力する手法は任意である。
【０１０７】
また、このパーソナルコンピュータが、外部より、上述の配信文字列データを任意の手法で取得すると（図５、ステップＳ２０１）、この配信文字列データが表す表音文字列に含まれるそれぞれの表音文字について、当該表音文字が表す単位音声の波形を波形データベース７より検索し、表音文字列に含まれるそれぞれの表音文字が表す単位音声の波形を表す圧縮波形データを索出する（ステップＳ２０２）。
【０１０８】
次に、このパーソナルコンピュータは、索出された圧縮波形データを、圧縮される前の波形データへと復元し（ステップＳ２０３）、復元された波形データを、表音文字列内での各表音文字の並びに従った順序で互いに結合し、合成音声データとしてステップＳ１０５の処理と同様の処理により出力する（ステップＳ２０４）。
【０１０９】
一方、このパーソナルコンピュータが、外部より、上述の定型メッセージデータ及び発声スピードデータを任意の手法により取得すると（図６、ステップＳ３０１）、まず、この定型メッセージデータが表す定型メッセージに含まれる音片の読みを表す表音文字に合致する表音文字が対応付けられている圧縮音片データをすべて索出する（ステップＳ３０２）。
【０１１０】
また、ステップＳ３０２では、該当する圧縮音片データに対応付けられている上述の音片読みデータ、スピード初期値データ及びピッチ成分データも索出する。なお、１個の音片につき複数の圧縮音片データが該当する場合は、該当する圧縮音片データすべてを索出する。一方、圧縮音片データを索出できなかった音片があった場合は、上述の欠落部分識別データを生成する。そして、このパーソナルコンピュータは、索出された圧縮音片データを、圧縮される前の音片データへと復元する（ステップＳ３０３）。
【０１１１】
一方、このパーソナルコンピュータは、移動体の移動速度（又はその他任意の物理量）を表すデータを速度センサ（例えば、車両の車速を表す車速パルスを供給する装置：図示せず）等より供給されると、このデータが表す移動速度等に基づいて、定型メッセージの発声スピードを決定する（ステップＳ３０４）。なお、移動体の移動速度等と発声スピードとの対応関係は任意である。
【０１１２】
次に、このパーソナルコンピュータは、ステップＳ３０３で復元された音片データを、上述の音片編集部８が行う処理と同様の処理により変換して、当該音片データが表す音片の時間長を、ステップＳ３０４で決定した発声スピード（又は、決定した発声スピードに一定程度以上近いスピード）にする（ステップＳ３０５）。なお、発声スピードデータが供給されていない場合は、復元された音片データを変換しなくてもよい。
【０１１３】
次に、このパーソナルコンピュータは、音片の時間長が変換された音片データのうちから、定型メッセージを構成する音片の波形に最も近い波形を表す音片データを、上述の音片編集部８が行う処理と同様の処理を行うことにより、音片１個につき１個ずつ選択する（ステップＳ３０６）。
【０１１４】
ステップＳ３０６でこのパーソナルコンピュータは、例えば、定型メッセージデータが表す定型メッセージに韻律予測の手法に基づいた解析を加えることにより、この定型メッセージの韻律を予測し、一方で、索出された音片データのピッチ成分の周波数の時間変化を表す関数をピッチ成分データに基づいて特定して、定型メッセージ内のそれぞれの音片について、この音片のピッチ成分の周波数の時間変化の予測結果を表す関数と、この音片と読みが合致する音片の波形を表す各音片データのピッチ成分の周波数の時間変化を表す関数との相関係数を求め、最も高い相関係数を与えた音片データを選択するようにすればよい。
【０１１５】
一方、このパーソナルコンピュータは、欠落部分識別データを生成した場合、欠落部分識別データが示す音片の読みを表す表音文字列を定型メッセージデータより抽出し、この表音文字列につき、音素毎に、配信文字列データが表す表音文字列と同様に扱って上述のステップＳ２０２〜Ｓ２０３の処理を行うことにより、この表音文字列内の各表音文字が示す音声の波形を表す波形データを復元する（ステップＳ３０７）。
【０１１６】
そして、このパーソナルコンピュータは、復元した波形データと、ステップＳ３０６で選択した音片データとを、定型メッセージデータが示す定型メッセージ内での各音片の並びに従った順序で互いに結合し、合成音声を表すデータとして出力する（ステップＳ３０８）。
【０１１７】
なお、パーソナルコンピュータに本体ユニットＭの機能を行わせるプログラムは、パーソナルコンピュータに複数の音片データベース１０の機能を行わせてもよい。この場合、このパーソナルコンピュータは、ステップＳ３０２の処理を開始するまでにステップＳ３０４の処理を完了するものとし、一方でステップＳ３０２においては、例えば、決定した音声スピードに基づいて、どの音片データベース１０を用いるかを決定し、決定した音片データベース１０より圧縮音片データの索出を行えばよい。そして、ステップＳ３０５の処理は省略するようにすればよい。
【０１１８】
パーソナルコンピュータに複数の音片データベース１０の機能を行わせる場合、各音片データベース１０は、例えば、互いに重複しない異なった範囲の発声スピードに対応付けられているものとし、各音片データベース１０が記憶している圧縮音片データの読みの組み合わせは共通しており、一方で、ある範囲の発声スピードに対応付けられている音片データベース１０内のある読みの圧縮音片データは、より高い（速い）発声スピードに対応付けられている音片データベース１０内の同一の読みの圧縮音片データより低い（遅い）発声スピードで読み上げられた音声を表しているものになっているものとする。
【０１１９】
また、パーソナルコンピュータに本体ユニットＭや音片登録ユニットＲの機能を行わせるプログラムは、例えば、通信回線の掲示板（ＢＢＳ）にアップロードし、これを通信回線を介して配信してもよく、また、これらのプログラムを表す信号により搬送波を変調し、得られた変調波を伝送し、この変調波を受信した装置が変調波を復調してこれらのプログラムを復元するようにしてもよい。
そして、これらのプログラムを起動し、ＯＳの制御下に、他のアプリケーションプログラムと同様に実行することにより、上述の処理を実行することができる。
【０１２０】
なお、ＯＳが処理の一部を分担する場合、あるいは、ＯＳが本願発明の１つの構成要素の一部を構成するような場合には、記録媒体には、その部分を除いたプログラムを格納してもよい。この場合も、この発明では、その記録媒体には、コンピュータが実行する各機能又はステップを実行するためのプログラムが格納されているものとする。
【０１２１】
【発明の効果】
以上説明したように、この発明によれば、環境が変化しても聴き取りやすい合成音声を得るための話速変換装置、話速変換方法及びプログラムが実現される。
【図面の簡単な説明】
【図１】この発明の実施の形態に係る音声合成システムの構成を示すブロック図である。
【図２】音片データベースのデータ構造を模式的に示す図である。
【図３】（ａ）は、車両の加速度に基づいて発声スピードを決定するために用いるテーブルのデータ構造を示す図であり、（ｂ）は、自動車の加速度、ブレーキの踏み込みの量、及びハンドルの角速度に基づいて発声スピードを決定するために用いるテーブルのデータ構造を示す図である。
【図４】この発明の実施の形態に係る音声合成システムの機能を行うパーソナルコンピュータがフリーテキストデータを取得した場合の処理を示すフローチャートである。
【図５】この発明の実施の形態に係る音声合成システムの機能を行うパーソナルコンピュータが配信文字列データを取得した場合の処理を示すフローチャートである。
【図６】この発明の実施の形態に係る音声合成システムの機能を行うパーソナルコンピュータが定型メッセージデータ及び発声スピードデータを取得した場合の処理を示すフローチャートである。
【符号の説明】
Ｍ本体ユニット
１言語処理部
２一般単語辞書
３ユーザ単語辞書
４音響処理部
５検索部
６伸長部
７波形データベース
８音片編集部
９検索部
１０音片データベース
１１話速変換部
１２速度検出部
Ｒ音片登録ユニット
１３収録音片データセット記憶部
１４音片データベース作成部
１５圧縮部
ＨＤＲヘッダ部
ＩＤＸインデックス部
ＤＩＲディレクトリ部
ＤＡＴデータ部
[0001]
[Technical field of the invention]
The present invention relates to a speech speed conversion device, a speech speed conversion method, and a program.
[0002]
[Prior art]
One method of synthesizing speech is called the recording and editing method. It is used in voice guidance systems at stations, in-vehicle navigation devices, and the like.
In the recording and editing method, each word is associated in advance with voice data representing a voice reading that word aloud. The sentence to be synthesized is divided into words, and the voice data associated with those words is retrieved and joined together (see, for example, Patent Document 1).
[0003]
[Patent Document 1]
JP-A-10-49193
[0004]
[Problems to be solved by the invention]
However, when voice data is simply joined together, the utterance speed (the length of time taken to utter) of the synthesized voice is fixed by the utterance speed of the voice data prepared in advance. Human hearing, on the other hand, is strongly affected by environmental conditions such as the listener's location, moving speed, and surroundings and, in a car, the car's location, its speed, the conditions around it, and the conditions inside it. When these factors change, the same voice can sound very different.
[0005]
The present invention has been made in view of the above situation, and has as its object to provide a speech speed conversion device, a speech speed conversion method, and a program for obtaining a synthesized voice that is easy to hear even when the environment changes.
[0006]
[Means for Solving the Problems]
To achieve the above object, a speech speed conversion device according to a first aspect of the present invention comprises:
voice data acquisition means for acquiring voice data representing a voice waveform;
speech speed setting data generating means for detecting the speed or acceleration of a moving body and generating, based on that speed or acceleration, speech speed setting data that specifies a voice speed; and
voice data conversion means for converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data.
[0007]
The voice data conversion means may perform a conversion that resamples the acquired voice data so that the speed of the voice represented by the converted voice data becomes the speed represented by the speech speed setting data.
[0008]
The voice data conversion means may identify a portion of the waveform represented by the acquired voice data that represents a substantially silent state, and perform a conversion that changes the time length of that portion, so that the speed of the voice represented by the voice data becomes a speed based on the speed represented by the speech speed setting data.
[0009]
The speech speed setting data generating means may detect peaks in the acceleration of the moving body, classify the most recently detected peak into one of a plurality of predetermined ranks and store it, determine how frequently peaks detected in the past were classified into the same rank as the latest peak, and generate the speech speed setting data based on the result.
[0010]
Further, a speech speed conversion device according to a second aspect of the present invention comprises:
voice data storage means for storing a plurality of voice data representing the waveforms of a plurality of voices that utter a phrase with the same reading at mutually different speeds;
speech speed setting data generating means for detecting the speed or acceleration of a moving body and generating, based on that speed or acceleration, speech speed setting data that specifies a voice speed; and
voice data selection means for selecting, from among the voice data stored by the voice data conversion means, the voice data whose voice speed is closest to the speed represented by the generated speech speed setting data.
[0011]
Further, a speech speed conversion device according to a third aspect of the present invention comprises:
voice data acquisition means for acquiring voice data representing a voice waveform;
condition setting data generating means for detecting a physical quantity representing an external situation and generating, based on that physical quantity, condition setting data representing a condition that the voice should satisfy; and
voice data conversion means for converting the acquired voice data so that the voice represented by the converted voice data satisfies the condition represented by the generated condition setting data.
[0012]
The condition setting data generating means may detect a physical quantity representing the moving speed of a vehicle, the movement of the vehicle's brake, and/or the movement of the vehicle's steering wheel.
[0013]
Further, a speech speed conversion method according to a fourth aspect of the present invention comprises:
acquiring voice data representing a voice waveform;
detecting the speed or acceleration of a moving body and generating, based on that speed or acceleration, speech speed setting data that specifies a voice speed; and
converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data.
[0014]
Further, a speech speed conversion method according to a fifth aspect of the present invention comprises:
storing a plurality of voice data representing the waveforms of a plurality of voices that utter a phrase with the same reading at mutually different speeds;
detecting the speed or acceleration of a moving body and generating, based on that speed or acceleration, speech speed setting data that specifies a voice speed; and
selecting, from among the voice data stored by the voice data conversion means, the voice data whose voice speed is closest to the speed represented by the generated speech speed setting data.
[0015]
Further, a speech speed conversion method according to a sixth aspect of the present invention comprises:
acquiring voice data representing a voice waveform;
detecting a physical quantity representing an external situation and generating, based on that physical quantity, condition setting data representing a condition that the voice should satisfy; and
converting the acquired voice data so that the voice represented by the converted voice data satisfies the condition represented by the generated condition setting data.
[0016]
Further, a program according to a seventh aspect of the present invention is characterized by causing
a computer equipped with a device for detecting the speed or acceleration of a moving object
to function as:
Audio data acquisition means for acquiring audio data representing an audio waveform;
Speech speed setting data generating means for detecting the speed or acceleration of the moving object and, based on the speed or acceleration, generating speech speed setting data that specifies the speed of the voice; and
Voice data conversion means for converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data.
[0017]
A program according to an eighth aspect of the present invention is characterized by causing
a computer equipped with a device for detecting the speed or acceleration of a moving object
to function as:
Voice data storage means for storing a plurality of voice data representing the waveforms of a plurality of voices uttering the same phrase at different speeds;
Speech speed setting data generating means for detecting the speed or acceleration of the moving object and, based on the speed or acceleration, generating speech speed setting data that specifies the speed of the voice; and
Voice data selection means for selecting, from among the voice data stored by the voice data storage means, the voice data representing the voice whose speed best matches the speed represented by the generated speech speed setting data.
[0018]
A program according to a ninth aspect of the present invention is characterized by causing
a computer equipped with a device for detecting a physical quantity representing an external situation
to function as:
Audio data acquisition means for acquiring audio data representing an audio waveform;
Condition setting data generating means for detecting a physical quantity representing an external situation and, based on the physical quantity, generating condition setting data representing a condition to be satisfied by the voice; and
Voice data conversion means for converting the acquired voice data so that the voice represented by the converted voice data satisfies the condition represented by the generated condition setting data.
[0019]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings, taking a voice synthesis system mounted and used on a moving body such as a vehicle as an example.
FIG. 1 is a diagram showing a configuration of a speech synthesis system according to an embodiment of the present invention. As shown in the figure, the speech synthesis system includes a main unit M and a speech unit registration unit R.
[0020]
The main unit M includes a language processing unit 1, a general word dictionary 2, a user word dictionary 3, a sound processing unit 4, a search unit 5, a decompression unit 6, a waveform database 7, a speech unit editing unit 8, a search unit 9, a speech unit database 10, a speech speed conversion unit 11, and a speed detection unit 12.
[0021]
The language processing unit 1, the sound processing unit 4, the search unit 5, the decompression unit 6, the speech unit editing unit 8, the search unit 9, and the speech speed conversion unit 11 are each composed of a processor such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor), a memory for storing a program to be executed by this processor, and the like.
Note that a single processor may perform part or all of the functions of the language processing unit 1, the sound processing unit 4, the search unit 5, the decompression unit 6, the speech unit editing unit 8, the search unit 9, and the speech speed conversion unit 11.
[0022]
The general word dictionary 2 is composed of a nonvolatile memory such as a PROM (Programmable Read Only Memory) or a hard disk device. In the general word dictionary 2, words and the like including ideographic characters (for example, kanji) and phonograms (for example, kana or phonetic symbols) representing the readings of those words and the like are stored in association with each other.
[0023]
The user word dictionary 3 is composed of a data-rewritable nonvolatile memory such as an EEPROM (Electrically Erasable/Programmable Read Only Memory) or a hard disk device, and a control circuit that controls writing of data to the nonvolatile memory. Note that a processor may perform the function of this control circuit, and a processor that performs part or all of the functions of the language processing unit 1, the sound processing unit 4, the search unit 5, the decompression unit 6, the speech unit editing unit 8, the search unit 9, and the speech speed conversion unit 11 may also perform the function of the control circuit of the user word dictionary 3.
The user word dictionary 3 acquires words and the like including ideographic characters and phonograms representing readings of the words and the like from outside according to user operations, and stores them in association with each other. It is sufficient that the user word dictionary 3 stores words and the like that are not stored in the general word dictionary 2 and phonograms representing their readings.
[0024]
The waveform database 7 is composed of a nonvolatile memory such as a PROM or a hard disk device. In the waveform database 7, phonograms and compressed waveform data obtained by entropy-encoding waveform data representing the waveforms of the unit voices represented by those phonograms are stored in association with each other in advance by the manufacturer of the speech synthesis system or the like. A unit voice is a voice short enough to be used in the rule-based synthesis method; specifically, it is a voice delimited in units such as phonemes or VCV (Vowel-Consonant-Vowel) syllables. Note that the waveform data before entropy encoding may be, for example, digital data that has been subjected to pulse code modulation (PCM).
[0025]
The sound piece database 10 is configured by a nonvolatile memory such as a PROM or a hard disk device.
The speech piece database 10 stores, for example, data having the data structure shown in FIG. 2. That is, as shown in the figure, the data stored in the speech piece database 10 is divided into four parts: a header part HDR, an index part IDX, a directory part DIR, and a data part DAT.
[0026]
The storage of the data in the speech unit database 10 is performed in advance by, for example, the manufacturer of the speech synthesis system, and / or performed by the speech unit registration unit R performing an operation described later.
[0027]
The header section HDR stores data for identifying the speech piece database 10 and data indicating the data amount of the index section IDX, the directory section DIR and the data section DAT, the data format, the attribution of copyright, and the like.
[0028]
The data section DAT stores compressed speech piece data obtained by entropy-encoding speech piece data representing the waveform of the speech piece.
Note that a speech unit refers to one continuous section including one or more phonemes in a voice, and usually includes one or a plurality of sections.
The speech piece data before entropy encoding need only be data in the same format as the waveform data before entropy encoding used to generate the above-described compressed waveform data (for example, PCM digital data).
[0029]
In the directory section DIR, for each piece of compressed speech piece data, the following are stored in association with each other:
(A) data representing phonetic characters indicating the reading of the speech unit represented by the compressed speech unit data (speech unit reading data);
(B) data representing the head address of the storage location where the compressed speech piece data is stored;
(C) data representing the data length of the compressed speech piece data;
(D) data representing the utterance speed (time length when reproduced) of the sound piece represented by the compressed sound piece data (speed initial value data);
(E) data representing the temporal change of the frequency of the pitch component of the sound piece (pitch component data).
(Note that addresses are assigned to the storage area of the sound piece database 10.)
[0030]
Note that FIG. 2 illustrates a case where, as data included in the data part DAT, compressed speech piece data with a data amount of 1410h bytes, representing the waveform of a speech piece whose reading is “Saitama”, is stored at a logical position starting at address 001A36A6h. (Note that in this specification and the drawings, numbers suffixed with "h" represent hexadecimal numbers.)
[0031]
In addition, of the data set (A) to (E) above, at least the data (A) (that is, the speech unit reading data) is stored in the storage area of the speech piece database 10 sorted in an order determined by the phonetic characters that the speech unit reading data represents (for example, if the phonetic characters are kana, arranged in descending address order according to the Japanese syllabary order).
[0032]
The index part IDX stores data for specifying the approximate logical position of data in the directory part DIR based on the sound piece reading data. Specifically, for example, assuming that the speech unit reading data represents kana, each kana character is stored in association with data indicating the range of addresses in which the speech unit reading data whose first character is that kana character is located.
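The relationship between the directory part DIR and the index part IDX described above can be illustrated with a minimal Python sketch. This is not the layout used by the actual system: the class `DirectoryEntry`, the functions `build_index` and `find_entries`, and every field value other than the “Saitama” address and length quoted from FIG. 2 are assumptions made for illustration. Entries are sorted by Unicode code point, which only approximates the Japanese syllabary order, and list positions stand in for addresses.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DirectoryEntry:
    reading: str           # (A) speech unit reading data
    address: int           # (B) head address of the compressed data
    length: int            # (C) data length in bytes
    duration_s: float      # (D) speed initial value (hypothetical unit: seconds)
    pitch_hz: List[float]  # (E) pitch component data (hypothetical sampling)


def build_index(directory: List[DirectoryEntry]) -> Dict[str, Tuple[int, int]]:
    """Map each first character to the [start, end) range of directory positions.

    Assumes the directory is already sorted by reading, so that entries
    sharing a first character are contiguous.
    """
    index: Dict[str, Tuple[int, int]] = {}
    for pos, entry in enumerate(directory):
        first = entry.reading[0]
        start, _ = index.get(first, (pos, pos))
        index[first] = (start, pos + 1)
    return index


def find_entries(directory: List[DirectoryEntry],
                 index: Dict[str, Tuple[int, int]],
                 reading: str) -> List[DirectoryEntry]:
    """Return every directory entry whose reading matches exactly."""
    if not reading:
        return []
    start, end = index.get(reading[0], (0, 0))
    return [e for e in directory[start:end] if e.reading == reading]


# The first entry uses the address and length quoted for FIG. 2;
# all other values are made up for the example.
directory = sorted(
    [
        DirectoryEntry("サイタマ", 0x001A36A6, 0x1410, 0.62, [120.0, 135.0, 128.0]),
        DirectoryEntry("サイタマ", 0x001A4AB6, 0x1380, 0.58, [118.0, 122.0, 119.0]),
        DirectoryEntry("トウキョウ", 0x001A5E36, 0x1600, 0.70, [110.0, 140.0, 125.0]),
    ],
    key=lambda e: e.reading,
)
index = build_index(directory)
candidates = find_entries(directory, index, "サイタマ")
```

Returning every matching entry, rather than the first one, mirrors the later search step in which all compressed speech piece data for a reading is retrieved as candidates.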
[0033]
Note that a single non-volatile memory may perform some or all of the functions of the general word dictionary 2, the user word dictionary 3, the waveform database 7, and the speech unit database 10.
[0034]
The speed detection unit 12 is configured by, for example, a speed sensor. The speed detecting unit 12 detects the moving speed of the moving body on which the speech synthesis system is mounted, generates data indicating the detected moving speed, and supplies the data to the sound piece editing unit 8.
[0035]
As shown in FIG. 1, the speech unit registration unit R includes a recorded speech unit data set storage unit 13, a speech unit database creation unit 14, and a compression unit 15. Note that the speech unit registration unit R may be detachably connected to the speech unit database 10. In this case, except when data is newly written to the speech unit database 10, the main unit M may be made to perform the operation described below with the speech unit registration unit R separated from the main unit M.
[0036]
The recorded sound piece data set storage unit 13 is composed of a data rewritable nonvolatile memory such as a hard disk device.
In the recorded voice unit data set storage unit 13, phonograms representing the readings of voice units and voice unit data representing waveforms obtained by recording a person actually uttering those voice units are stored in advance in association with each other by the manufacturer of the speech synthesis system or the like. Note that the sound piece data may be composed of, for example, PCM digital data.
[0037]
The speech unit database creation unit 14 and the compression unit 15 are configured by a processor such as a CPU, a memory for storing a program to be executed by the processor, and the like, and perform processing described later according to the program.
[0038]
Note that a single processor may perform some or all of the functions of the speech unit database creation unit 14 and the compression unit 15, and a processor that performs some or all of the functions of the language processing unit 1, the sound processing unit 4, the search unit 5, the decompression unit 6, the sound piece editing unit 8, the search unit 9, and the speech speed conversion unit 11 may further perform the functions of the sound unit database creation unit 14 and the compression unit 15. Further, a processor that performs the functions of the sound piece database creating unit 14 and the compressing unit 15 may also function as the control circuit of the recorded sound piece data set storage unit 13.
[0039]
The speech unit database creation unit 14 reads out phonograms and speech unit data associated with each other from the recorded speech unit data set storage unit 13, and specifies the time change of the frequency of the pitch component of the voice represented by the speech unit data, as well as its utterance speed.
The utterance speed may be specified, for example, by counting the number of samples of the sound piece data.
[0040]
On the other hand, the time change of the frequency of the pitch component may be specified, for example, by performing cepstrum analysis on the sound piece data. Specifically, for example, the waveform represented by the sound piece data is divided into a number of small portions on the time axis, and the intensity of each resulting small portion is converted to a value substantially equal to the logarithm of its original value (the base of the logarithm is arbitrary). The spectrum of the small portion whose values have been converted in this way (that is, the cepstrum) is then obtained by a fast Fourier transform technique (or any other technique that generates data representing the result of a Fourier transform of a discrete variable). Then, the minimum value among the frequencies giving a peak of the cepstrum is specified as the frequency of the pitch component in this small portion.
[0041]
It should be noted that good results can be expected if the speech piece data is first converted into pitch waveform data in accordance with, for example, the method disclosed in JP-A-2003-108172, and the time change of the frequency of the pitch component is then specified based on this pitch waveform data. Specifically, a pitch signal is extracted by filtering the sound unit data, the waveform represented by the sound unit data is divided into sections of unit pitch length based on the extracted pitch signal, and, for each section, the phase shift is specified based on the correlation with the pitch signal and the phases of the sections are aligned, thereby converting the speech piece data into a pitch waveform signal. Then, the time change of the frequency of the pitch component may be specified by treating the obtained pitch waveform signal as sound piece data and performing cepstrum analysis or the like.
[0042]
On the other hand, the speech unit database creation unit 14 supplies the speech unit data read from the recorded speech unit data set storage unit 13 to the compression unit 15.
The compression section 15 entropy-encodes the speech piece data supplied from the speech piece database creation section 14 to create compressed speech piece data, and returns the compressed speech piece data to the speech piece database creation section 14.
[0043]
When the utterance speed and the time change of the frequency of the pitch component of the speech unit data have been specified, and this speech unit data has been entropy-encoded and returned from the compression unit 15 as compressed speech unit data, the speech unit database creation unit 14 writes the compressed speech piece data to the storage area of the speech piece database 10 as data constituting the data part DAT.
[0044]
Further, the speech unit database creation unit 14 writes the phonogram read out from the recorded speech unit data set storage unit 13, as the indication of the reading of the speech unit represented by the written compressed speech unit data, to the storage area of the speech unit database 10 as the speech unit reading data.
Further, the head address of the written compressed speech piece data in the storage area of the speech piece database 10 is specified, and this address is written in the storage area of the speech piece database 10 as the above-mentioned (B) data.
The data length of the compressed speech piece data is specified, and the specified data length is written to the storage area of the speech piece database 10 as data (C).
In addition, it generates data indicating the specified utterance speed and the specified time change of the frequency of the pitch component of the voice unit represented by the compressed voice unit data, and writes these to the storage area of the voice unit database 10 as the speed initial value data and the pitch component data, respectively.
[0045]
Next, the operation of the speech synthesis system will be described.
First, a description will be given on the assumption that the language processing unit 1 obtains free text data describing a sentence (free text) including an ideographic character prepared by a user as a target for synthesizing a voice in the voice synthesizing system.
[0046]
The language processing unit 1 may acquire free text data by any method. For example, the language processing unit 1 may acquire the free text data from an external device or a network via an interface circuit (not shown), or may read it from a recording medium (for example, a floppy (registered trademark) disk, a CD-ROM, or the like) set in a recording medium drive device (not shown) via that drive device. Further, the processor performing the function of the language processing unit 1 may pass text data used in other processing executed by itself to the processing of the language processing unit 1 as free text data.
[0047]
When the free text data is obtained, the language processing unit 1 specifies the phonogram representing the reading of each ideographic character included in the free text by searching the general word dictionary 2 and the user word dictionary 3, and replaces the ideogram with the specified phonogram. The language processing unit 1 then supplies the sound processing unit 4 with the phonetic character string obtained as a result of replacing all ideographic characters in the free text with phonetic characters.
[0048]
When the sound processing unit 4 is supplied with the phonogram string from the language processing unit 1, it instructs the search unit 5 to search, for each phonogram included in the phonogram string, for the waveform of the unit voice represented by that phonogram.
[0049]
The search unit 5 searches the waveform database 7 in response to this instruction, and searches for compressed waveform data representing the waveform of the unit voice represented by each phonetic character included in the phonetic character string. Then, the retrieved compressed waveform data is supplied to the decompression unit 6.
[0050]
The decompression unit 6 restores the compressed waveform data supplied from the search unit 5 to the waveform data before compression, and returns the waveform data to the search unit 5. The search unit 5 supplies the waveform data returned from the decompression unit 6 to the sound processing unit 4 as a search result.
The sound processing unit 4 supplies the waveform data supplied from the search unit 5 to the sound piece editing unit 8 in the order in which the phonograms are arranged in the phonogram string supplied from the language processing unit 1.
[0051]
When supplied with the waveform data from the acoustic processing unit 4, the sound piece editing unit 8 combines the waveform data with each other in the order in which they are supplied, and outputs the combined data as data representing synthesized speech (synthesized speech data). This synthesized speech synthesized based on the free text data corresponds to a speech synthesized by a rule synthesis method.
[0052]
The method by which the sound piece editing unit 8 outputs the synthesized voice data is arbitrary. For example, the synthesized voice represented by the synthesized voice data may be reproduced via a D/A (Digital-to-Analog) converter and a speaker (not shown). The data may also be transmitted to an external device or a network via an interface circuit (not shown), or may be written to a recording medium set in a recording medium drive (not shown) via the recording medium drive. Further, the processor performing the function of the sound piece editing unit 8 may pass the synthesized voice data to another process executed by itself.
[0053]
Next, it is assumed that the sound processing unit 4 has acquired data that is distributed from the outside and represents a phonogram string (distribution character string data). (The method by which the sound processing unit 4 acquires the distribution character string data is also arbitrary. For example, it may acquire the distribution character string data by the same method as that by which the language processing unit 1 acquires the free text data.)
[0054]
In this case, the sound processing unit 4 treats the phonetic character string represented by the distribution character string data in the same manner as the phonetic character string supplied from the language processing unit 1. As a result, compressed waveform data corresponding to the phonetic characters included in the phonetic character string represented by the distribution character string data is retrieved by the search unit 5, and the waveform data before being compressed is restored by the decompression unit 6. The restored waveform data is supplied to the sound piece editing unit 8 via the sound processing unit 4, and the sound unit editing unit 8 converts the waveform data into each of the phonetic character strings represented by the distribution character string data. The phonograms are combined with each other in the order in which they are arranged, and output as synthesized speech data. This synthesized voice data synthesized based on the distribution character string data also indicates voice synthesized by the rule synthesis method.
[0055]
Next, it is assumed that the sound piece editing unit 8 has acquired the fixed message data. The fixed message data is data representing the fixed message as a phonetic character string.
The method by which the sound piece editing unit 8 acquires the fixed message data is arbitrary. For example, the sound piece editing unit 8 may acquire the fixed message data by the same method as that by which the language processing unit 1 acquires the free text data.
[0056]
When the fixed message data is supplied to the speech unit editing unit 8, the speech unit editing unit 8 instructs the search unit 9 to retrieve all compressed speech unit data associated with phonetic characters that match the phonetic characters representing the reading of a speech unit included in the fixed message.
[0057]
In response to the instruction from the voice unit editing unit 8, the search unit 9 searches the voice unit database 10, retrieves the corresponding compressed voice unit data together with the voice unit reading data, speed initial value data, and pitch component data associated with it, and supplies the retrieved compressed speech piece data to the decompression unit 6. When a plurality of pieces of compressed speech piece data correspond to one speech piece, all of them are retrieved as candidates for the data to be used for speech synthesis. On the other hand, when there is a speech unit for which no compressed speech unit data could be found, the search unit 9 generates data identifying that speech unit (hereinafter referred to as missing portion identification data).
[0058]
The decompression unit 6 restores the compressed speech piece data supplied from the search unit 9 to the speech piece data before being compressed, and returns the data to the search unit 9. The search unit 9 supplies the speech unit data returned from the decompression unit 6 and the retrieved speech unit read data, speed initial value data, and pitch component data to the speech speed conversion unit 11 as search results. When the missing part identification data is generated, the missing part identification data is also supplied to the speech speed conversion unit 11.
[0059]
On the other hand, when the sound piece editing unit 8 is supplied with data representing the moving speed of the moving object from the speed detecting unit 12, it determines the utterance speed of the fixed message (the time length over which the fixed message is uttered) based on the moving speed represented by that data. The speech unit editing unit 8 then instructs the speech speed conversion unit 11 to convert the speech unit data supplied to the speech speed conversion unit 11 so that the time length of the speech unit represented by the speech unit data matches the determined utterance speed (or approximates the determined utterance speed to at least a certain degree).
[0060]
The correspondence between the moving speed of the moving object and the utterance speed is arbitrary. For example, the sound piece editing unit 8 may determine the utterance speed so that it becomes faster as the moving speed of the moving object increases.
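Since the correspondence is left arbitrary, the following Python sketch shows just one possible choice: a clamped linear mapping in which the utterance becomes faster as the moving speed rises. The function names, the use of km/h, and the constants (a 120 km/h ceiling and a maximum speed-up of 1.5 times) are hypothetical values chosen for illustration and do not come from the text.

```python
def utterance_speed_factor(moving_speed_kmh: float,
                           max_speed_kmh: float = 120.0,
                           max_factor: float = 1.5) -> float:
    """Return a speed-up factor: 1.0 at standstill, max_factor at max_speed_kmh."""
    clamped = min(max(moving_speed_kmh, 0.0), max_speed_kmh)
    return 1.0 + (max_factor - 1.0) * clamped / max_speed_kmh


def target_duration_s(original_duration_s: float, moving_speed_kmh: float) -> float:
    """Time length the message should take at the given moving speed."""
    return original_duration_s / utterance_speed_factor(moving_speed_kmh)


# A message that takes 3.0 s at standstill is shortened at 120 km/h.
shortened = target_duration_s(3.0, 120.0)
```

Clamping keeps the speech from becoming unintelligibly fast at very high speeds; a stepped table or a non-linear curve would be equally consistent with the description.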
[0061]
In response to the instruction from the speech unit editing unit 8, the speech speed conversion unit 11 converts the speech unit data supplied from the search unit 9 so as to conform to the instruction, and supplies it to the speech unit editing unit 8. Specifically, for example, it may specify the original time length of the sound piece data supplied from the search unit 9 based on the retrieved speed initial value data, and then resample the sound piece data so that its number of samples corresponds to a time length matching the speed specified by the speech piece editing unit 8.
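The resampling step can be sketched in Python as follows. This is a minimal illustration assuming linear interpolation between neighboring samples; the text does not say which resampling method is used, and the function names `resample_linear` and `convert_to_duration` are hypothetical. Note that plain resampling played back at the original sampling rate shifts the pitch along with the duration.

```python
from typing import List


def resample_linear(samples: List[float], target_len: int) -> List[float]:
    """Resample to target_len samples by linear interpolation."""
    if target_len <= 0 or not samples:
        return []
    if target_len == 1 or len(samples) == 1:
        return [samples[0]] * target_len
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def convert_to_duration(samples: List[float],
                        original_duration_s: float,
                        target_duration_s: float) -> List[float]:
    """Change the number of samples so that playback takes target_duration_s.

    original_duration_s plays the role of the speed initial value data.
    """
    ratio = target_duration_s / original_duration_s
    target_len = max(1, round(len(samples) * ratio))
    return resample_linear(samples, target_len)


stretched = resample_linear([0.0, 1.0, 2.0, 3.0], 7)
halved = convert_to_duration([0.0] * 8, 1.0, 0.5)
```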
[0062]
The speech speed conversion unit 11 also supplies the speech unit reading data and the pitch component data supplied from the retrieval unit 9 to the speech unit editing unit 8, and further, when the missing part identification data is supplied from the retrieval unit 9, This missing part identification data is also supplied to the sound piece editing unit 8.
[0063]
If the utterance speed data is not supplied to the speech unit editing unit 8, the speech unit editing unit 8 need only instruct the speech speed conversion unit 11 to supply the speech unit data it has received to the speech unit editing unit 8 without conversion. In response to this instruction, the speech speed conversion unit 11 supplies the speech unit data supplied from the search unit 9 to the speech unit editing unit 8 as it is.
[0064]
When the speech unit data, the speech unit reading data, and the pitch component data are supplied from the speech speed conversion unit 11, the speech unit editing unit 8 selects, from the supplied speech unit data, the speech unit data representing the waveforms of the speech units constituting the fixed message, one piece for each speech unit.
[0065]
The criterion for selecting the speech piece data is arbitrary. For example, the speech piece editing section 8 may perform prosody prediction on the fixed message and then, for each sound piece in the fixed message, select from the speech piece data supplied from the speech rate conversion section 11 the one piece whose temporal change of the pitch component shows the highest correlation with the result of the prosody prediction.
[0066]
Specifically, the sound piece editing unit 8 first analyzes the fixed message represented by the fixed message data using a prosody prediction method such as the “Fujisaki model” or “ToBI (Tone and Break Indices)”, thereby predicting the time change of the frequency of the pitch component of each sound piece in the fixed message, and specifies a function representing the prediction result. Meanwhile, the speech unit editing unit 8 specifies, based on the pitch component data supplied from the speech speed conversion unit 11, a function representing the time change of the frequency of the pitch component of each piece of speech unit data supplied from the speech speed conversion unit 11.
[0067]
Then, for each speech unit in the fixed message, the speech unit editing unit 8 obtains the correlation coefficient between the function representing the predicted time change of the frequency of the pitch component of that speech unit and the function representing the time change of the frequency of the pitch component of each piece of speech unit data representing the waveform of a speech unit with the same reading, and selects the piece of speech unit data with the highest correlation coefficient.
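The selection by correlation coefficient can be illustrated with a short Python sketch. It assumes that the predicted contour and each candidate's pitch contour have already been sampled at the same number of points, and it uses the Pearson correlation coefficient; the text does not name a specific coefficient, and the function names and example values are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either contour is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def select_best_candidate(predicted: Sequence[float],
                          candidates: List[Sequence[float]]) -> int:
    """Index of the candidate whose pitch contour correlates best with the prediction."""
    scores = [pearson(predicted, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


# Hypothetical contours in Hz: a rising prediction and three candidates.
predicted = [100.0, 110.0, 120.0, 130.0]
candidates = [
    [130.0, 120.0, 110.0, 100.0],  # falling: strongly negative correlation
    [200.0, 220.0, 240.0, 260.0],  # rising at a higher register
    [100.0, 100.0, 120.0, 100.0],  # mostly flat
]
best = select_best_candidate(predicted, candidates)
```

Because the correlation coefficient ignores offset and scale, the second candidate is chosen even though its absolute pitch is higher; only the shape of the contour is compared.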
[0068]
On the other hand, if the missing portion identification data is also supplied from the speech speed conversion unit 11, the speech unit editing unit 8 extracts from the fixed message data a phonetic character string representing the reading of the speech unit indicated by the missing portion identification data, supplies it to the sound processing unit 4, and instructs the sound processing unit 4 to synthesize the waveform of that speech unit.
[0069]
Upon receiving the instruction, the sound processing unit 4 treats the phonetic character string supplied from the sound piece editing unit 8 in the same manner as the phonetic character string represented by the distribution character string data. As a result, compressed waveform data representing the waveform of the voice indicated by the phonogram contained in the phonogram string is retrieved by the search unit 5, and the compressed waveform data is restored to the original waveform data by the decompression unit 6. Is supplied to the sound processing unit 4 via the search unit 5. The sound processing unit 4 supplies the waveform data to the sound piece editing unit 8.
[0070]
When the waveform data is returned from the sound processing unit 4, the sound unit editing unit 8 combines this waveform data and the sound unit data that it has selected from among the sound unit data supplied from the speech speed conversion unit 11 with each other, in the order in which the sound pieces are arranged in the fixed message indicated by the fixed message data, and outputs the result as data representing the synthesized speech.
[0071]
If the data supplied from the speech speed conversion unit 11 does not include the missing part identification data, the speech unit editing unit 8 may, without instructing the sound processing unit 4 to synthesize a waveform, immediately combine the selected speech unit data with each other in the order in which the sound pieces are arranged in the fixed message indicated by the fixed message data, and output the result as data representing a synthesized voice.
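The combination of selected speech unit data with rule-synthesized waveforms for missing speech units can be sketched in Python as below. The function `assemble_message`, the dictionary of selected pieces, and the callback `synthesize_by_rule` are hypothetical stand-ins for the speech unit editing unit 8 and the sound processing unit 4. For simplicity the sketch assumes each reading occurs with a single selected waveform.

```python
from typing import Callable, Dict, List


def assemble_message(readings: List[str],
                     selected: Dict[str, List[float]],
                     synthesize_by_rule: Callable[[str], List[float]]) -> List[float]:
    """Concatenate waveforms in message order.

    readings: readings of the speech units, in the order they appear in the message.
    selected: waveform chosen for each reading that was found in the database.
    synthesize_by_rule: fallback used only for readings missing from `selected`.
    """
    output: List[float] = []
    for reading in readings:
        piece = selected.get(reading)
        if piece is None:
            # Missing portion: fall back to rule-based synthesis.
            piece = synthesize_by_rule(reading)
        output.extend(piece)
    return output


def silent_stub(reading: str) -> List[float]:
    """Placeholder synthesizer: one zero sample per character of the reading."""
    return [0.0] * len(reading)


message = assemble_message(
    ["a", "bc", "a"],
    {"a": [1.0, 2.0]},
    silent_stub,
)
```

When every reading is present in `selected`, the fallback is never called, which corresponds to the case without missing part identification data.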
[0072]
In the speech synthesis system described above, the utterance speed of the synthesized speech changes according to the moving speed of the moving object on which the speech synthesis system is mounted. Therefore, for example, when this speech synthesis system is used to generate navigation speech in a car navigation device, if the speech unit editing unit 8 determines the utterance speed so that it becomes faster as the moving speed of the vehicle increases, navigation speech at a speed appropriate to the running condition of the vehicle can be obtained. For example, when the vehicle is approaching an intersection at high speed, uttering the necessary information at a speaking speed corresponding to the speed of the vehicle allows the occupant to hear that information before entering the intersection. In addition, synthesized speech that is easy to hear can be obtained even when the speed of the moving body changes.
[0073]
The configuration of the speech synthesis system is not limited to the above.
For example, the waveform data and the sound piece data need not be PCM format data, and the data format is arbitrary.
Further, the waveform database 7 and the sound piece database 10 do not necessarily need to store the waveform data and the sound piece data in a state where the data is compressed. When the waveform database 7 or the sound piece database 10 stores the waveform data or the sound piece data in a state where the data is not compressed, the main unit M does not need to include the decompression unit 6.
[0074]
Further, the sound piece database creating unit 14 may read, via a recording medium drive device (not shown), the sound piece data and phonetic character strings that serve as material for new compressed sound piece data to be added to the sound piece database 10 from a recording medium set in that recording medium drive device.
Further, the sound piece registration unit R does not necessarily need to include the recorded sound piece data set storage unit 13.
[0075]
Further, the sound piece database creating unit 14 may include a microphone, an amplifier, a sampling circuit, an A/D (Analog-to-Digital) converter, a PCM encoder, and the like. In this case, instead of acquiring the sound piece data from the recorded sound piece data set storage unit 13, the sound piece database creating unit 14 may create the sound piece data by amplifying the audio signal representing the sound collected by its own microphone, sampling and A/D-converting it, and then applying PCM modulation to the sampled audio signal.
[0076]
Further, the pitch component data may be data representing the temporal change of the pitch length of the sound piece represented by the sound piece data. In this case, the sound piece editing unit 8 may predict the temporal change of the pitch length of a sound piece by performing prosody prediction, and obtain the correlation between the prediction result and the pitch component data representing the temporal change of the pitch length of each sound piece data representing the waveform of a sound piece whose reading matches that of this sound piece.
[0077]
The sound piece editing unit 8 may, for example, acquire the free text data together with the language processing unit 1, select sound piece data representing a waveform close to the waveform of a sound piece included in the free text represented by the free text data by performing substantially the same processing as that for selecting sound piece data representing a waveform close to the waveform of a sound piece included in the fixed message, and use the selected data for speech synthesis.
In this case, the sound processing unit 4 does not need to cause the search unit 5 to search for waveform data representing the waveform of the sound piece represented by the sound piece data selected by the sound piece editing unit 8. The sound piece editing unit 8 may notify the sound processing unit 4 of the sound pieces that the sound processing unit 4 does not need to synthesize, and the sound processing unit 4 may, in response to this notification, stop searching for the waveforms of the unit voices constituting those sound pieces.
[0078]
The sound piece editing unit 8 may also, for example, acquire the distribution character string data together with the sound processing unit 4, select sound piece data representing a waveform close to the waveform of a sound piece included in the distribution character string represented by the distribution character string data by performing substantially the same processing as that for selecting sound piece data representing a waveform close to the waveform of a sound piece included in the fixed message, and use the selected data for speech synthesis. In this case, the sound processing unit 4 does not need to cause the search unit 5 to search for waveform data representing the waveform of the sound piece represented by the sound piece data selected by the sound piece editing unit 8.
[0079]
The sound piece editing unit 8 may supply the waveform data returned from the sound processing unit 4 to the speech speed conversion unit 11 so that the time length of the waveform represented by the waveform data is made to match the speed indicated by the utterance speed data (or a speed close to that speed by a certain degree or more). By doing so, the utterance speed of the voice synthesized by the sound processing unit 4 using the rule synthesis method also changes according to the moving speed of the moving object.
[0080]
The speech synthesis system may include a plurality of sound piece databases 10, and in this case each sound piece database 10 may be associated with a different range of utterance speed, the ranges not overlapping one another. In this case, for example, the set of readings of the compressed sound piece data stored in each sound piece database 10 is common to all of them, while the compressed sound piece data of a given reading in the sound piece database 10 associated with a certain range of utterance speed represents speech read out at a lower (slower) utterance speed than the compressed sound piece data of the same reading in a sound piece database 10 associated with a higher (faster) range of utterance speed.
[0081]
When the speech synthesis system includes a plurality of sound piece databases 10 as described above, the sound piece editing unit 8 may, for example, determine which of the sound piece databases 10 is to be used based on the determined utterance speed, and supply data indicating that sound piece database 10 to the search unit 9. The search unit 9 may then search for the compressed sound piece data in the sound piece database 10 indicated by the data.
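Selecting one of several sound piece databases by non-overlapping utterance-speed range can be sketched as a sorted-boundary lookup. The boundary values and database names below are hypothetical placeholders, not values from the description.

```python
from bisect import bisect_right

# Hypothetical boundaries between adjacent utterance-speed ranges.
# They split the rate axis into: below 1.1, from 1.1 up to 1.3, and 1.3 or above.
RANGE_BOUNDS = [1.1, 1.3]
# Hypothetical database identifiers, one per range, slowest first.
DATABASES = ["db_slow", "db_normal", "db_fast"]


def select_database(rate):
    """Return the identifier of the database whose range contains the rate."""
    # bisect_right counts how many boundaries are <= rate, which is the range index.
    return DATABASES[bisect_right(RANGE_BOUNDS, rate)]
```

Because the ranges are defined by shared boundaries, they cannot overlap, which matches the condition stated above.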
[0082]
Also, instead of resampling the sound piece data, the speech speed conversion unit 11 may identify the portions of the sound piece data that substantially represent silence and adjust the time length of those portions so that the time length of the sound piece represented by the sound piece data matches the speed indicated by the utterance speed data.
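A minimal sketch of the silence-adjustment alternative follows. It assumes the sound piece is a list of integer PCM samples, treats any sample whose magnitude is at or below a hypothetical `threshold` as silence, and replaces each silent run with zeros of the scaled length; none of these details are specified in the description.

```python
from itertools import groupby


def stretch_silence(samples, factor, threshold=0):
    """Scale the length of each silent run by factor, leaving voiced runs untouched."""
    out = []
    # Group consecutive samples by whether they count as silence.
    for silent, run in groupby(samples, key=lambda s: abs(s) <= threshold):
        run = list(run)
        if silent:
            # Keep at least one sample so adjacent voiced runs stay separated.
            new_len = max(1, round(len(run) * factor))
            out.extend([0] * new_len)
        else:
            out.extend(run)
    return out
```

A factor below 1 shortens pauses (faster speech) and a factor above 1 lengthens them, without altering the pitch of the voiced portions.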
[0083]
Further, the physical quantity that the speech synthesis system detects regarding the external situation does not necessarily need to be the speed of the moving object, and may be any other physical quantity.
[0084]
Therefore, the speed detection unit 12 may be configured by, for example, an acceleration sensor or the like, and may detect the acceleration of the moving object on which the speech synthesis system is mounted, generate data indicating the detected acceleration, and supply the data to the sound piece editing unit 8. An integrating circuit for integrating the detected acceleration may also be provided; in this case, data indicating the result of integrating the detected acceleration may be generated and supplied to the sound piece editing unit 8.
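As a software analogue of integrating the detected acceleration, the sketch below applies the trapezoidal rule to evenly spaced acceleration samples. The function name, the sampling interval `dt`, and the initial speed `v0` are illustrative assumptions; the description itself refers to a circuit, not to this code.

```python
def integrate_acceleration(samples, dt, v0=0.0):
    """Return the speed after each sample, using trapezoidal integration."""
    v = v0
    speeds = [v]
    # Each step adds the area of the trapezoid between two neighbouring samples.
    for a_prev, a_next in zip(samples, samples[1:]):
        v += 0.5 * (a_prev + a_next) * dt
        speeds.append(v)
    return speeds
```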
[0085]
The speed detection unit 12 may also detect peaks in the acceleration of the moving object on which the speech synthesis system is mounted and, each time a peak is detected, generate data indicating the value of the latest detected peak and supply it to the sound piece editing unit 8.
Meanwhile, each time data indicating a peak value of the acceleration is supplied from the speed detection unit 12, the sound piece editing unit 8 may classify the peak value indicated by the data into one of a plurality of predetermined ranks, specify how frequently peak values have been classified in the past into the same rank as the rank into which this peak value was classified, and determine the utterance speed according to the specified result.
[0086]
In this case, the speed detection unit 12 may include, for example, an A/D (Analog-to-Digital) converter, a logic circuit, and the like for converting the signal generated by the acceleration sensor into digital data and detecting peaks.
[0087]
In this case, the sound piece editing unit 8 may further include a non-volatile memory such as a PROM that stores in advance a table having, for example, the data structure shown in FIG. 3A, and may determine the utterance speed by referring to this table. As shown in the figure, the table only needs to store the utterance speed in association with the rank to which the detected acceleration peak belongs and the frequency with which peaks belonging to that rank have been detected.
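The rank classification and frequency lookup described above can be sketched as follows. All thresholds, the two-level frequency classes, and the table entries are hypothetical; the actual table contents are given only by the figure.

```python
from bisect import bisect_right
from collections import Counter

RANK_BOUNDS = [1.0, 3.0]   # hypothetical peak thresholds (m/s^2) giving ranks 0, 1, 2
FREQ_BOUND = 3             # hypothetical: 3 or more earlier peaks in a rank is "frequent"
SPEED_TABLE = {            # hypothetical (rank, frequency class) -> rate multiplier
    (0, "rare"): 1.0, (0, "frequent"): 1.0,
    (1, "rare"): 1.1, (1, "frequent"): 1.2,
    (2, "rare"): 1.2, (2, "frequent"): 1.4,
}


class PeakClassifier:
    """Tracks how often acceleration peaks have fallen into each rank."""

    def __init__(self):
        self.history = Counter()

    def update(self, peak):
        """Classify the latest peak and return the utterance speed from the table."""
        rank = bisect_right(RANK_BOUNDS, peak)
        past = self.history[rank]      # number of earlier peaks in the same rank
        self.history[rank] += 1        # store the latest classification
        freq = "frequent" if past >= FREQ_BOUND else "rare"
        return SPEED_TABLE[(rank, freq)]
```

Keeping the table as a plain mapping mirrors the idea of a fixed table held in non-volatile memory: changing the behaviour means changing data, not logic.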
[0088]
When the speech synthesis system is mounted on and used in, for example, an automobile, it may further include a brake sensor that detects peaks in the amount of depression of the automobile's brake and supplies data representing the detection result to the sound piece editing unit 8. It may also further include a steering wheel sensor that detects peaks in the angular velocity of the automobile's steering wheel and supplies data representing the detection result to the sound piece editing unit 8.
[0089]
In this case, each time the sound piece editing unit 8 is supplied with data indicating a peak value of the amount of brake depression or data indicating a peak value of the angular velocity of the steering wheel, it classifies the peak value indicated by the data into one of a plurality of ranks defined respectively for the amount of brake depression and for the angular velocity of the steering wheel, and specifies how frequently peak values have been classified in the past into the same rank as the rank into which the latest peak value was classified.
[0090]
Then, the sound piece editing unit 8 obtains the value α on the right side of Equation 1 from the peak value pa of the vehicle acceleration, the peak value pb of the amount of brake depression, and the peak value pω of the angular velocity of the steering wheel.
[0091]
(Equation 1)
$$\alpha = (W_{A1} \cdot p_a) + (W_{A2} \cdot p_b) + (W_{A3} \cdot p_\omega)$$
(where $W_{A1}$, $W_{A2}$, and $W_{A3}$ are predetermined coefficients, and $p_a$, $p_b$, $p_\omega$ are the peak values pa, pb, pω)
[0092]
The sound piece editing unit 8 further obtains the value β on the right side of Equation 2 from three frequency values: a value fa indicating how frequently peak values have been classified in the past into the same rank as the rank into which the latest peak value of the vehicle acceleration was classified, a value fb indicating the same for the latest peak value of the amount of brake depression, and a value fω indicating the same for the latest peak value of the angular velocity of the steering wheel.
[0093]
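(Equation 2)
$$\beta = (W_{B1} \cdot f_a) + (W_{B2} \cdot f_b) + (W_{B3} \cdot f_\omega)$$
(where $W_{B1}$, $W_{B2}$, and $W_{B3}$ are predetermined coefficients, and $f_a$, $f_b$, $f_\omega$ are the frequency values fa, fb, fω)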
[0094]
In this case, the sound piece editing unit 8 may store in advance a table that holds the utterance speed in association with the values of α and β, having, for example, the data structure shown in FIG. 3B, and may determine the utterance speed by referring to this table.
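The weighted sums of Equations 1 and 2 and the subsequent table lookup can be sketched as below. The coefficient values, the bin boundaries, and the table entries are all hypothetical stand-ins for the predetermined coefficients and the table shown in the figure.

```python
from bisect import bisect_right

W_A = (0.5, 0.3, 0.2)       # hypothetical W_A1, W_A2, W_A3
W_B = (0.5, 0.3, 0.2)       # hypothetical W_B1, W_B2, W_B3
ALPHA_BOUNDS = (1.0, 2.0)   # hypothetical bin boundaries for alpha (3 rows)
BETA_BOUNDS = (5.0,)        # hypothetical bin boundary for beta (2 columns)
SPEED_TABLE = [             # hypothetical utterance-speed multipliers
    [1.0, 1.0],
    [1.1, 1.2],
    [1.3, 1.5],
]


def weighted_sum(weights, values):
    """Compute sum(w_i * v_i), the form shared by Equations 1 and 2."""
    return sum(w * v for w, v in zip(weights, values))


def lookup_speed(peaks, freqs):
    """peaks = (pa, pb, p_omega); freqs = (fa, fb, f_omega)."""
    alpha = weighted_sum(W_A, peaks)
    beta = weighted_sum(W_B, freqs)
    row = bisect_right(ALPHA_BOUNDS, alpha)
    col = bisect_right(BETA_BOUNDS, beta)
    return SPEED_TABLE[row][col]
```

For example, `lookup_speed((2.0, 1.0, 0.5), (4, 2, 10))` gives α of about 1.4 and β of about 4.6, which selects the middle row and first column of the assumed table.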
[0095]
Further, the speech synthesis system may change the utterance speed of the speech based on the current time. In this case, for example, the sound piece editing unit 8 may include a timer composed of a crystal oscillator or the like, continuously acquire data indicating the current date and time from the timer, and determine the utterance speed of the sound piece data based on the acquired data.
[0096]
Also, the target that the voice synthesis system changes according to the detection result of the physical quantity regarding the external situation does not necessarily need to be the voice utterance speed, and may be any other element that characterizes the voice.
[0097]
Therefore, in the speech synthesis system, the sound piece editing unit 8 may, for example, change the amplitude of the sound piece data supplied from the search unit 9 and of the waveform data supplied from the sound processing unit 4.
[0098]
In addition, the speech synthesis system may include, for example, a microphone and a level detection circuit for detecting the level of noise inside or outside the moving object and supplying data representing the detection result to the sound piece editing unit 8. In this case, the sound piece editing unit 8 may determine the amplitude of the synthesized speech based on, for example, the noise level represented by the data, and convert the sound piece data and the waveform data so as to match the determined amplitude. With such a configuration, the speech synthesis system can keep the synthesized speech easy to hear even when the ambient noise is loud, for example by increasing the amplitude of the synthesized speech as the noise level increases.
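A minimal sketch of noise-dependent amplitude control follows. The decibel thresholds, the maximum gain, the linear rule, and the 16-bit clipping limits are assumptions made for illustration only.

```python
def gain_for_noise(noise_db, quiet_db=40.0, loud_db=80.0, max_gain=4.0):
    """Return an amplitude gain that grows linearly with the noise level (assumed rule)."""
    # Position of the noise level within [quiet_db, loud_db], clamped to 0..1.
    t = (min(max(noise_db, quiet_db), loud_db) - quiet_db) / (loud_db - quiet_db)
    return 1.0 + (max_gain - 1.0) * t


def apply_gain(samples, gain, limit=32767):
    """Scale integer PCM samples and clip them to the signed 16-bit range."""
    return [max(-limit - 1, min(limit, round(s * gain))) for s in samples]
```

Clipping is included because raising the amplitude of already loud sound pieces could otherwise exceed the sample range.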
[0099]
The speech synthesis system may also include, for example, a microphone and a Fourier transform device for detecting the band occupied by noise inside or outside the moving object and supplying data representing the detection result to the sound piece editing unit 8. In this case, the sound piece editing unit 8 may, for example, set the band occupied by the noise represented by the data as the attenuation band of the synthesized speech (or otherwise determine the attenuation band of the synthesized speech based on the band occupied by the noise), and remove the spectral components within the determined attenuation band from the sound piece data and the waveform data. With such a configuration, the speech synthesis system avoids overlap between the band occupied by the noise and the band occupied by the synthesized speech, and can keep the synthesized speech easy to hear even when the surrounding noise is loud.
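Removing the spectral components inside an attenuation band can be sketched with a plain discrete Fourier transform. This is an illustrative, unoptimized version using only the standard library; the function names and the choice to zero whole bins are assumptions, and a practical system would likely use an FFT and a smoother filter.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def remove_band(samples, rate_hz, low_hz, high_hz):
    """Zero every frequency bin that lies inside [low_hz, high_hz]."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins k and n - k carry the same physical frequency for a real signal.
        freq = min(k, n - k) * rate_hz / n
        if low_hz <= freq <= high_hz:
            spectrum[k] = 0
    return idft(spectrum)
```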
[0100]
Further, in this speech synthesis system, for example, the sound piece editing unit 8 may change the voice quality of the sound piece data and the waveform data.
Specifically, for example, the sound piece editing unit 8 converts the sound piece data into sub-band data representing the temporal change of the pitch component (fundamental frequency component) and the harmonic components of the sound piece represented by the sound piece data, and then converts the obtained sub-band data back into data representing the waveform of the sound piece. When converting the sub-band data into data representing a waveform, however, the sound piece editing unit 8 interprets each component represented by the sub-band data as representing the temporal change of a component at a frequency different from the one it originally represents (for example, twice the original frequency).
[0101]
With such a configuration, the speech synthesis system can, for example, raise the pitch of synthesized speech for car navigation at night to produce speech that is less likely to cause drowsiness, and can thus obtain synthesized speech suited to the time of day at which the car is driven.
[0102]
The embodiments of the present invention have been described above. However, the speech speed conversion device according to the present invention can be realized using an ordinary computer system without using a dedicated system.
For example, the main unit M that executes the above-described processing can be configured by installing, on a personal computer to which a speed sensor (or any other sensor for detecting an arbitrary physical quantity) is connected, a program for executing the operations of the language processing unit 1, the general word dictionary 2, the user word dictionary 3, the sound processing unit 4, the search unit 5, the decompression unit 6, the waveform database 7, the sound piece editing unit 8, the search unit 9, the sound piece database 10, and the speech speed conversion unit 11, from a medium (CD-ROM, MO, floppy (registered trademark) disk, or the like) that stores the program.
In addition, the sound piece registration unit R that executes the above-described processing can be configured by installing, on a personal computer, a program for executing the operations of the recorded sound piece data set storage unit 13, the sound piece database creating unit 14, and the compression unit 15.
[0103]
A personal computer that executes these programs and functions as the main unit M and the sound piece registration unit R performs, for example, the processing shown in FIGS. 4 to 6 as the processing corresponding to the operation of the speech synthesis system of FIG. 1.
FIG. 4 is a flowchart showing processing when the personal computer acquires free text data.
FIG. 5 is a flowchart showing a process when the personal computer acquires distribution character string data.
FIG. 6 is a flowchart showing a process when the personal computer acquires the fixed message data and the utterance speed data.
[0104]
That is, when the personal computer acquires the above-described free text data from the outside (FIG. 4, step S101), it specifies, for each ideographic character included in the free text represented by the free text data, a phonogram representing its reading by searching the general word dictionary 2 and the user word dictionary 3, and replaces the ideographic character with the specified phonogram (step S102). The method by which the personal computer acquires the free text data is arbitrary.
[0105]
Then, when a phonogram string representing the result of replacing all ideograms in the free text with phonograms is obtained, the personal computer searches the waveform database 7, for each phonogram included in the phonogram string, for the waveform of the unit voice represented by that phonogram, and retrieves compressed waveform data representing the waveform of the unit voice represented by each phonogram in the phonogram string (step S103).
[0106]
Next, the personal computer restores the retrieved compressed waveform data to the waveform data before compression (step S104), combines the restored waveform data in the order of the phonograms in the phonogram string, and outputs the result as synthesized speech data (step S105). The method by which the personal computer outputs the synthesized speech data is arbitrary.
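Steps S102 to S105 can be sketched as a small pipeline. The dictionary and waveform database contents are toy placeholders, the per-character dictionary lookup is a simplification of real word-level lookup, and `zlib` merely stands in for the compression scheme, which the description does not specify.

```python
import zlib

# Hypothetical word dictionary: ideogram -> phonograms representing its reading.
WORD_DICT = {"駅": "エキ"}
# Hypothetical waveform database: phonogram -> compressed unit-voice waveform.
WAVEFORM_DB = {
    "エ": zlib.compress(bytes([10, 11])),
    "キ": zlib.compress(bytes([20, 21, 22])),
}


def synthesize(free_text):
    """Turn free text into concatenated waveform bytes (rule synthesis sketch)."""
    # S102: replace each ideogram with its phonograms; phonograms pass through.
    phonograms = "".join(WORD_DICT.get(ch, ch) for ch in free_text)
    # S103: retrieve the compressed waveform of each phonogram.
    compressed = [WAVEFORM_DB[p] for p in phonograms]
    # S104: restore the waveform data to its state before compression.
    restored = [zlib.decompress(c) for c in compressed]
    # S105: combine the waveforms in phonogram order.
    return b"".join(restored)
```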
[0107]
When the personal computer acquires the above-described distribution character string data from the outside by an arbitrary method (FIG. 5, step S201), it searches the waveform database 7, for each phonogram included in the phonogram string represented by the distribution character string data, for the waveform of the unit voice represented by that phonogram, and retrieves compressed waveform data representing the waveform of the unit voice represented by each phonogram in the phonogram string (step S202).
[0108]
Next, the personal computer restores the retrieved compressed waveform data to the waveform data before compression (step S203), combines the restored waveform data in the order of the phonograms in the phonogram string, and outputs the result as synthesized speech data by the same processing as in step S105 (step S204).
[0109]
Meanwhile, when the personal computer acquires the above-described fixed message data and utterance speed data from the outside by an arbitrary method (FIG. 6, step S301), it first retrieves all compressed sound piece data associated with phonograms that match the phonograms representing the readings of the sound pieces included in the fixed message represented by the fixed message data (step S302).
[0110]
In step S302, the above-described speech piece reading data, speed initial value data, and pitch component data associated with the corresponding compressed speech piece data are also retrieved. If a plurality of compressed speech piece data correspond to one speech piece, all the corresponding compressed speech piece data are retrieved. On the other hand, if there is a voice piece for which compressed voice piece data could not be found, the above-described missing portion identification data is generated. Then, the personal computer restores the retrieved compressed speech piece data to the speech piece data before being compressed (step S303).
[0111]
Meanwhile, when the personal computer is supplied with data representing the moving speed (or any other physical quantity) of the moving object from a speed sensor (for example, a device, not shown, that supplies vehicle speed pulses representing the speed of the vehicle) or the like, it determines the utterance speed of the fixed message based on the moving speed represented by the data (step S304). The correspondence between the moving speed of the moving object and the utterance speed is arbitrary.
[0112]
Next, the personal computer converts the sound piece data restored in step S303 by the same processing as that performed by the sound piece editing unit 8 described above, so that the time length of the sound piece represented by the sound piece data matches the utterance speed determined in step S304 (or a speed close to the determined utterance speed by a certain degree or more) (step S305). If the utterance speed data is not supplied, the restored sound piece data need not be converted.
[0113]
Next, from the sound piece data whose time length has been converted, the personal computer selects, one for each sound piece, the sound piece data representing the waveform closest to the waveform of each sound piece constituting the fixed message, by performing the same processing as that performed by the sound piece editing unit 8 described above (step S306).
[0114]
In step S306, the personal computer may, for example, predict the prosody of the fixed message by analyzing the fixed message represented by the fixed message data using a prosody prediction method, and thereby specify, for each sound piece in the fixed message, a function representing the temporal change of the frequency of its pitch component. It may then obtain the correlation coefficient between this function and the function, specified from the pitch component data, representing the temporal change of the frequency of the pitch component of each sound piece data representing the waveform of a sound piece whose reading matches this sound piece, and select the sound piece data with the highest correlation coefficient.
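The selection by correlation coefficient can be sketched as follows, assuming each pitch contour is sampled at the same time points and stored as an equal-length list of frequencies. The Pearson coefficient is used here as one plausible reading of "correlation coefficient"; the function names and data layout are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A flat contour has no defined correlation; treat it as uncorrelated.
    return cov / (sx * sy) if sx and sy else 0.0


def select_piece(predicted, candidates):
    """Return the key of the candidate contour best correlated with the prediction."""
    return max(candidates, key=lambda name: pearson(predicted, candidates[name]))
```

For instance, with a predicted contour that rises and then dips, a candidate with the same shape is preferred over one that falls steadily, regardless of their absolute pitch.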
[0115]
Meanwhile, when the personal computer has generated the missing part identification data, it extracts from the fixed message data a phonetic character string representing the reading of the sound piece indicated by the missing part identification data, and processes this phonetic character string, phoneme by phoneme, in the same manner as the phonetic character string represented by the distribution character string data by performing the processing of steps S202 to S203 described above, thereby restoring waveform data representing the waveform of the voice indicated by each phonogram in this phonetic character string (step S307).
[0116]
Then, the personal computer combines the restored waveform data and the sound piece data selected in step S306 in the order according to the sequence of the sound pieces in the fixed message indicated by the fixed message data, and outputs the result as data representing the synthesized speech (step S308).
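The combination in step S308, with rule-synthesized waveforms filling in for missing sound pieces, can be sketched as below. The data layout (a list of readings, a mapping of selected pieces, and a fallback callable) is an assumption made for illustration.

```python
def assemble(message_pieces, selected, rule_synthesize):
    """Concatenate samples in message order, synthesizing any missing piece.

    message_pieces:  readings of the sound pieces, in fixed-message order
    selected:        reading -> samples chosen in step S306
    rule_synthesize: fallback callable standing in for step S307
    """
    out = []
    for reading in message_pieces:
        piece = selected.get(reading)
        if piece is None:
            # No recorded sound piece was found, so use the rule-synthesized waveform.
            piece = rule_synthesize(reading)
        out.extend(piece)
    return out
```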
[0117]
The program that causes the personal computer to perform the functions of the main unit M may cause the personal computer to perform the functions of a plurality of sound piece databases 10. In this case, the personal computer may complete the processing of step S304 before starting the processing of step S302 and, in step S302, determine which sound piece database 10 to use based on, for example, the determined utterance speed, and retrieve the compressed sound piece data from the determined sound piece database 10. The processing of step S305 may then be omitted.
[0118]
When the personal computer is caused to perform the functions of a plurality of sound piece databases 10, each sound piece database 10 is assumed to be associated with, for example, a different range of utterance speed, the ranges not overlapping one another. The set of readings of the compressed sound piece data stored in each sound piece database 10 is assumed to be common, while the compressed sound piece data of a given reading in the sound piece database 10 associated with a certain range of utterance speed is assumed to represent speech read out at a lower (slower) utterance speed than the compressed sound piece data of the same reading in a sound piece database 10 associated with a higher (faster) range of utterance speed.
[0119]
Further, a program that causes a personal computer to perform the functions of the main unit M and the sound piece registration unit R may be uploaded to, for example, a bulletin board (BBS) of a communication line and distributed via the communication line. Carrier waves may be modulated by signals representing these programs, the resulting modulated waves may be transmitted, and a device that has received the modulated waves may demodulate the modulated waves and restore these programs.
Then, by starting these programs and executing them in the same manner as other application programs under the control of the OS, the above-described processing can be executed.
[0120]
When the OS performs part of the processing, or when the OS constitutes part of one component of the present invention, a program excluding that part may be stored in the recording medium. In this case too, in the present invention, the recording medium is assumed to store a program for executing each function or step executed by the computer.
[0121]
[Effects of the invention]
As described above, according to the present invention, a speech speed conversion device, a speech speed conversion method, and a program for obtaining a synthesized voice that is easy to hear even when the environment changes are realized.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a speech synthesis system according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing a data structure of a sound piece database.
FIG. 3A is a diagram showing the data structure of a table used to determine the utterance speed based on the vehicle acceleration, and FIG. 3B is a diagram showing the data structure of a table used to determine the utterance speed based on the vehicle acceleration, the amount of brake depression, and the angular velocity of the steering wheel.
FIG. 4 is a flowchart showing processing when a personal computer performing the function of the speech synthesis system according to the embodiment of the present invention acquires free text data.
FIG. 5 is a flowchart showing a process when a personal computer that performs the function of the speech synthesis system according to the embodiment of the present invention acquires distribution character string data.
FIG. 6 is a flowchart showing a process when a personal computer that performs the function of the speech synthesis system according to the embodiment of the present invention acquires fixed message data and utterance speed data.
[Explanation of symbols]
M Main unit
1 Language processing unit
2 General word dictionary
3 User word dictionary
4 Sound processing unit
5 Search unit
6 Decompression unit
7 Waveform database
8 Sound piece editing unit
9 Search unit
10 Sound piece database
11 Speech speed conversion unit
12 Speed detection unit
R Sound piece registration unit
13 Recorded sound piece data set storage unit
14 Sound piece database creating unit
15 Compression unit
HDR header
IDX index section
DIR directory
DAT data section

Claims

A speech speed conversion device comprising:
voice data acquisition means for acquiring voice data representing a waveform of voice;
speech speed setting data generating means for detecting the speed or acceleration of a moving object and, based on the speed or acceleration, generating speech speed setting data that specifies the speed of voice; and
voice data conversion means for converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data.

The speech speed conversion device according to claim 1, wherein the voice data conversion means performs a conversion that samples the acquired voice data so that the speed of the voice represented by the converted voice data becomes the speed represented by the speech speed setting data.

The speech speed conversion device according to claim 1, wherein the voice data conversion means identifies a portion of the waveform represented by the acquired voice data that substantially represents a silent state, and performs a conversion that changes the time length of that portion so that the speed of the voice represented by the voice data becomes a speed based on the speed represented by the speech speed setting data.

The speech speed conversion device according to claim 1, wherein the speech speed setting data generating means detects peaks of the acceleration of the moving object, classifies the latest detected peak into one of a plurality of predetermined ranks and stores it, specifies how frequently peaks detected in the past have been classified into the same rank as the rank into which the latest peak was classified, and generates the speech speed setting data based on the specified result.

A speech speed conversion device comprising:
voice data storage means for storing a plurality of voice data representing waveforms of a plurality of voices that utter a phrase of the same reading at mutually different speeds;
speech speed setting data generating means for detecting the speed or acceleration of a moving object and, based on the speed or acceleration, generating speech speed setting data that specifies the speed of voice; and
voice data selection means for selecting, from among the voice data stored by the voice data conversion means, the voice data whose represented voice speed is closest to the speed represented by the generated speech speed setting data.

A speech speed conversion device comprising:
voice data acquisition means for acquiring voice data representing a waveform of voice;
condition setting data generating means for detecting a physical quantity representing an external situation and, based on the physical quantity, generating condition setting data representing a condition to be satisfied by the voice; and
voice data conversion means for converting the acquired voice data so that the voice represented by the converted voice data satisfies the condition represented by the generated condition setting data.

The speech speed conversion device according to claim 6, wherein the condition setting data generating means detects a physical quantity representing the traveling speed of a vehicle, the movement of the vehicle's brake, and/or the movement of the vehicle's steering wheel.
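
As a rough illustration of how condition setting data might be derived from the vehicle quantities named in this claim, consider the sketch below. The claims do not say which conditions are set or which thresholds apply, so the `Condition` fields, the thresholds (80 km/h, 30 degrees), and the adjustment amounts are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    speech_rate: float  # 1.0 = unmodified speed (assumed convention)
    gain: float         # 1.0 = unmodified volume (assumed convention)


def make_condition(speed_kmh, brake_on, steering_deg):
    """Derive a speech condition from vehicle state (all thresholds are illustrative)."""
    rate = 1.0
    gain = 1.0
    if speed_kmh > 80:
        # Assumed: at high speed, speak more slowly and more loudly.
        rate -= 0.1
        gain += 0.2
    if brake_on:
        # Assumed: braking suggests the driver is busy, so slow down further.
        rate -= 0.1
    if abs(steering_deg) > 30:
        # Assumed: a large steering input also warrants slower speech.
        rate -= 0.1
    return Condition(round(rate, 2), round(gain, 2))
```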

A speech speed conversion method comprising:
acquiring audio data representing a speech waveform;
detecting the speed or acceleration of a moving body and generating, based on that speed or acceleration, speech speed setting data that specifies a speech speed; and
converting the speed of the acquired audio data based on the speed represented by the generated speech speed setting data.
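
The three steps of this method can be sketched end to end as follows. The mapping from vehicle speed to speech rate is not specified in the claim, so the linear, clamped mapping in `speech_rate_for_speed` and its default values are assumptions. The conversion in `convert_speed` is a naive nearest-index resampling chosen for brevity; it changes pitch along with speed and is a stand-in, not the conversion technique of the patent.

```python
def speech_rate_for_speed(speed_kmh, max_speed=100.0, rate_at_rest=1.0, rate_at_max=0.8):
    """Map vehicle speed to a speech-rate factor by clamped linear interpolation.

    The direction (slower speech at higher speed) and all constants are assumed.
    """
    t = min(max(speed_kmh / max_speed, 0.0), 1.0)
    return rate_at_rest + (rate_at_max - rate_at_rest) * t


def convert_speed(samples, rate):
    """Resample so playback is `rate` times as fast (naive; pitch changes too)."""
    if not samples:
        return []
    new_len = max(1, round(len(samples) / rate))
    return [samples[min(len(samples) - 1, int(k * rate))] for k in range(new_len)]


def speak(samples, speed_kmh):
    # Step 1: audio data is given; step 2: derive the setting; step 3: convert.
    return convert_speed(samples, speech_rate_for_speed(speed_kmh))
```

A production system would more likely use a pitch-preserving time-scale modification or the pause-length adjustment described in the dependent claim; the resampler here only demonstrates how the setting data drives the conversion.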

A speech speed conversion method comprising:
storing a plurality of audio data items representing the waveforms of a plurality of utterances in which a phrase with the same reading is spoken at mutually different speeds;
detecting the speed or acceleration of a moving body and generating, based on that speed or acceleration, speech speed setting data that specifies a speech speed; and
selecting, from among the audio data stored by the audio data conversion means, the audio data whose speech speed is closest to the speed represented by the generated speech speed setting data.

A speech speed conversion method comprising:
acquiring audio data representing a speech waveform;
detecting a physical quantity representing an external situation and generating, based on that physical quantity, condition setting data representing a condition that the speech should satisfy; and
converting the acquired audio data so that the speech represented by the converted audio data satisfies the condition represented by the generated condition setting data.

A program for causing a computer equipped with a device for detecting the speed or acceleration of a moving body to function as:
audio data acquisition means for acquiring audio data representing a speech waveform;
speech speed setting data generating means for detecting the speed or acceleration of the moving body and generating, based on that speed or acceleration, speech speed setting data that specifies a speech speed; and
audio data conversion means for converting the speed of the acquired audio data based on the speed represented by the generated speech speed setting data.

A program for causing a computer equipped with a device for detecting the speed or acceleration of a moving body to function as:
audio data storage means for storing a plurality of audio data items representing the waveforms of a plurality of utterances in which a phrase with the same reading is spoken at mutually different speeds;
speech speed setting data generating means for detecting the speed or acceleration of the moving body and generating, based on that speed or acceleration, speech speed setting data that specifies a speech speed; and
audio data selection means for selecting, from among the audio data stored by the audio data conversion means, the audio data whose speech speed matches the speed represented by the generated speech speed setting data.

A program for causing a computer equipped with a device for detecting a physical quantity representing an external situation to function as:
audio data acquisition means for acquiring audio data representing a speech waveform;
condition setting data generating means for detecting a physical quantity representing an external situation and generating, based on that physical quantity, condition setting data representing a condition that the speech should satisfy; and
audio data conversion means for converting the acquired audio data so that the speech represented by the converted audio data satisfies the condition represented by the generated condition setting data.