JP3685648B2

JP3685648B2 - Speech synthesis method, speech synthesizer, and telephone equipped with speech synthesizer

Info

Publication number: JP3685648B2
Application number: JP12044299A
Authority: JP
Inventors: 誠橋本
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1999-04-27
Filing date: 1999-04-27
Publication date: 2005-08-24
Anticipated expiration: 2019-04-27
Also published as: JP2000310995A

Abstract

PROBLEM TO BE SOLVED: To read out character information with the correct accent even when the amount of processing must be reduced, by obtaining fundamental frequency pattern information, as prosodic information, based on position information and pitch information. SOLUTION: A prosody generation part 2 generates fundamental frequency information based on the phoneme information and pitch information determined by a character information analyzing part 1, and also determines phoneme duration information. Here, the fundamental frequency pattern information serving as prosody information is generated by obtaining the fundamental frequency information at each mora, with the mora position as the position information, from the pitch information, and then linearly interpolating between those values. A phoneme piece cutting-out part 4 takes phoneme pieces out of a speech database 3, based on the phoneme information, so that they match the phoneme string to be synthesized. A phoneme piece connecting part 5 connects the phoneme pieces taken out by the phoneme piece cutting-out part 4, processes them based on the prosody information, and outputs the desired synthesized speech data in the form of a speech signal.

Description

【０００１】
【発明の属する技術分野】
本発明は、音声素片を接続することによって入力されたテキストに対する音声を生成する音声合成方法において、音程情報の単純化によって処理量を低減しながら、かつ適切な音程情報に従ってテキスト情報を正確に読み上げることができる韻律生成方法に関するものである。
【０００２】
【従来の技術】
従来、合成音声の基本周波数パターン生成モデルとして、電子情報通信学会論文誌Vol.J72-A,No.1,pp32-40（１９８９年１月）「基本周波数パターン生成過程モデルに基づく文章音声の合成」にも開示されているように、句頭から句末に向かう緩やかな下降のフレーズ成分と局所的な起伏のアクセント成分との和で表現する生成モデルが代表的なものとして知られており、これは下記のような関数で生成される。
【０００３】
【数４】

【０００４】
ここで、Api，Aajは、フレーズ成分、アクセント成分の指令の大きさであり、T0iはフレーズ成分の指令の時点、T1j，T2jはアクセント指令の始点と終点である。また、基本周波数パターン生成過程モデルを用いて基本周波数パターンを合成するためには、前記式（Ａ）〜（Ｃ）のパラメータを与える必要があるが、自然音声の分析結果から、αi＝3.0(rad/sec), βj＝20.0(rad/sec),θ＝0.9で固定し得ることが判明している。
【０００５】
【発明が解決しようとする課題】
然し乍ら、テキストから合成音声を生成するテキスト音声変換において上記の基本周波数パターン生成モデルを用いる場合、フレーズ指令やアクセント指令の時点や大きさを、形態素解析などの言語処理によって推定する必要があった。
【０００６】
従って、単語などの短いテキストのみを読ませるといった、言語処理や基本周波数パターン生成処理を簡素化しても合成音声の自然性劣化を抑えることができるようなテキスト音声変換処理においても、複雑な処理を行う必要があるという問題があった。
【０００７】
【課題を解決するための手段】
本発明の音声合成方法は、予め蓄積されている音声素片より所望の音声素片を取り出し、取り出した音声素片を韻律情報に基づいて接続することによって合成音声を生成する音声合成方法において、前記韻律情報としての基本周波数パターン情報を、位置情報と音程情報に基づいて求める。
【０００８】
また、本発明の音声合成装置は、音声素片が蓄積された音声素片蓄積手段と、文字情報を解析して各文字に対応した音素情報を求める音素情報生成手段と、文字情報を解析して各文字に対応する音程情報を求める音程情報生成手段と、前記音素情報生成手段で求めた音素情報及び前記音程情報生成手段で求めた音程情報とに基づいて韻律情報を求める韻律情報生成手段と、前記音素情報生成手段で求めた音素情報に基づいて前記音声素片蓄積手段より所望の音声素片を取り出す音声素片取り出し手段と、前記音声素片取り出し手段で取り出された音声素片を前記韻律情報に基づいて接続して合成音声情報を生成する音声素片接続手段とを備える。
【０００９】
さらに、本発明の音声合成装置を備えた電話機は、電話番号情報と該電話番号情報と関連付けられた文字情報とが記憶された記憶手段と、音声信号及び電話番号情報を受信する受信手段と、該受信手段で受信した電話番号情報を抽出する電話番号情報抽出手段と、前記記憶手段の中から前記電話番号情報抽出手段で抽出した電話番号情報を検索して前記電話番号情報と関連付けられた文字情報を検索して出力する検索手段と、該検索手段が出力する文字情報を解析して各文字に対応した音素情報を求める音素情報生成手段と、前記検索手段が出力する文字情報を解析して各文字に対応する音程情報を求める音程情報生成手段と、前記音素情報生成手段で求めた音素情報及び前記音程情報生成手段で求めた音程情報とに基づいて韻律情報を求める韻律情報生成手段と、前記恩師情報生成手段で求めた音素情報に基づいて前記音声素片蓄積手段より所望の音声素片を取り出す音声素片取り出し手段と、前記音声素片取り出し手段で取り出された音声素片を前記韻律情報に基づいて接続して合成音声情報を生成する音声素片接続手段と、該音声素片接続手段からの合成音声情報を音声として放音する放音手段とを備えたことを特徴とする音声合成装置を備える。
【００１０】
【発明の実施の形態】
以下、図面を参照しつつ本発明の一実施形態について詳述する。
【００１１】
先ず、図１は本発明の音声合成方法を適用した音声合成装置の構成を示す概略ブロック図である。同図において、６は文字情報を入力する文字情報入力部、１は文字情報入力部６から入力された文字情報を解析して音素情報及び音程情報を出力する文字情報解析部、２は文字情報解析部１からの音素情報及び音程情報に基づいて基本周波数パターン情報及び音素継続時間長情報を生成する韻律生成部、３は音声素片情報が蓄積された音声データベース、４は韻律生成部２で生成された基本周波数パターン情報及び音素継続時間長情報に基づいて合成させるべき音素列に合致するように音声データベース３から音声素片を取り出す音声素片取り出し部、５は音声素片取り出し部４で取り出された音声素片を接続して所望の合成音声データを出力する音声素片接続部である。
【００１２】
続いて、前記図１に示す音声合成装置の動作について詳述する。
【００１３】
先ず、文字情報入力部６より入力された文字情報は、文字情報解析部１により解析され、韻律生成の区切りが検出されると共に、音素情報決定部１aで、入力された文字に対応する音素記号が求められる。具体的には、例えば入力された文字情報が”ひらかた”であれば、/hirakata/という音素情報を求める。
【００１４】
次に、音程情報決定部１ｂでは、入力された文字列の音程情報が決定される。前記と同様に、入力された文字情報が”ひらかた”であれば、各文字に対応して［低高高高］という音程情報が決定される。
【００１５】
本発明では、音程情報は発声すべき文字情報と共に予め与えられている。例えば図４（ｂ）に示すように、文字列中にアクセント記号（図の例では＃や＊が相当する）を付与しておき、該アクセント記号によって音程情報が判定できるようにしているが、詳細は後述する。
【００１６】
然し乍ら、上記のようなアクセント記号をつけずに、文字列とアクセント情報とが蓄積された単語辞書を参照する方法もある。即ち、入力された文字情報が”ひらかた”であれば、単語辞書より”ひらかた”を検索し、アクセント情報を取得する。
【００１７】
このようにして前記音素情報決定部１ａで決定された音素情報、及び音程情報決定部１ｂで決定された音程情報は、後段の韻律生成部２に送られる。
【００１８】
次に韻律生成部２は、前記文字情報解析部１で決定された音素情報と音程情報に基づいて、以下に示す生成式によって、基本周波数情報[F0(M)]を生成すると共に、音素継続時間長情報を決定する。
【００１９】
尚、音素継続時間長情報については、音素や音声素片などの単位毎の時間長をテーブル化しておくことで計算量を減らすことができる。
【００２０】
【数５】

【００２１】
本実施例では、上記の式（１）〜（４）において、Fh=310Hz、Fl=250Hz、w=0.375、Fmin=150Hzとしている。
【００２２】
尚、前記セグメントの単位は、「呼気段落」，「アクセント句」，「フレーズ」，「ポーズ」，「基本周波数生成の区切り」のいずれかで区切られた区間とすることができる。
【００２３】
補足すると、
［呼気段落］
音声を発声する過程で、呼気の切れ目によって生じるひとまとまりの音声区間をいう。
［アクセント句］
ほぼ文節程度の長さで１つのアクセント型を担う単位である。
［フレーズ］
自然な区切りで区切られる旋律のあるまとまりをいう。
［ポーズ］
音声発声中に生じる間。音のない短い区間である。
［基本周波数生成の区切り］
上記以外の区切りで区切られるひとまとまりの音声区間をいう。
【００２４】
また、位置情報Ｍは、着目しているデータの時間的な位置（何番目の音素であるか、何番目のモーラであるか、何フレーム目であるか、など）を表す情報であり、モーラ位置，音節位置，音素位置，音声素片位置，フレーム位置などが利用できる。
【００２５】
補足すると、
［モーラ］
日本語のかな１字（拗音なら２字）に相当する単位である。
［モーラ位置］
文字列中の何番目のモーラであるかを表す情報である。
（例：文字列「ひらかた」の「ら」のモーラ位置は２［＝２モーラ目］である。）
［音節］
前述のモーラとほぼ同じ単位であるが、長音「−」、撥音「ん」、促音「っ」は１モーラとして扱うのに対して、１音節としては扱わない。
（例：「うんどーかい」のモーラ数は６であるが、音節数は４となる。）
［音節位置］
文字列中の何番目の音節であるかを表す情報である。
（例：「うんどーかい」の「どー」の音節位置は２）
［音素］
言葉の意味の区別を表すのに用いられる音の単位である。
（例：「ひらかた」を音素表記すると /hirakata/となる。）
［音素位置］
文字列中の何番目の音素であるかを表す情報である。
（例：/hirakata/の /r/の音素位置は３である。）
［音声素片］
音声合成用データベースに蓄えておく音声データの最小単位のデータをいう。（例：/hirakata/を合成するときは、/hir/, /rak/, /kat/, /ta/ のような素片を接続する。素片の種類は、ＣＶＣ［子音＋母音＋子音］、ＣＶ［子音＋母音］、ＶＣ［母音+子音］、ＶＶ［母音+母音］など様々である。）
［音声素片位置］
文字列中の音声素片単位で数えて何番目であるかを表す情報である。
【００２６】
［フレーム］
音声を分析してスペクトルやピッチ情報を求める際、短時間毎に分析するのが一般的であり、このときの音声区間をフレームという。（より具体的にいうと、元波形に窓関数を乗じて切り出した音声区間をフレームという。）
［フレーム位置］
文字列中の何番目のフレームであるかを表す情報である。
【００２７】
上記の式（１）〜（４）によって求められた基本周波数情報[F0(M)]は、文字列情報が”ひらかた”の場合、例えば図２のようになる。
【００２８】
即ち、与えられた文字列情報（図の例では”ひらかた”）に対する音程情報は［低高高高］であり、そして各モーラについて各基本周波数情報[F0(M)]を求め、その間を直線補間することで基本周波数パターン情報が生成される。この例では、各モーラ位置において、前記式（１）〜（４）及び前記のFh=310Hz、Fl=250Hz、w=0.375、Fmin=150Hzという条件に基づいて、
F0(1)＝(250×1)-(1×1×0.375)=249.625Hz
F0(2)＝(310×1)-(2×2×0.375)=308.5Hz
F0(3)＝(310×1)-(3×3×0.375)=306.625Hz
F0(4)＝(310×1)-(4×4×0.375)=304Hz
という各基本周波数情報が求められ、各点の間を直線補間して基本周波数パターンが得られる。
【００２９】
尚、図の例では、”ひらかた”という１つのセグメントから成る文字情報についての処理なので、前記式（３）及び（４）においてs=1が適用されるが、”ひらかたこうえん”という文字列の場合、”ひらかた”という第１のセグメントと、”こうえん”という第２のセグメントの２つで構成されるので、前者の文字情報”ひらかた”について各基本周波数情報を求めるときはs=1だが、後者の文字情報”こうえん”について各基本周波数情報を求めるときはs=2となる。
【００３０】
このようにして求めた基本周波数パターン情報に基づいて、音声素片取り出し部４にて、合成すべき音素列に合致するように音声素片を音声データベース３から取り出し、音声素片接続部５に送る。
【００３１】
音声素片接続部５では、音声素片取り出し部４で取り出された音声素片を接続し、韻律生成部２で生成された韻律情報に基づいて、音声素片を加工し、所望の合成音声データを音声信号の形態で出力する。音声素片接続部５から出力された合成音声信号は図示されない増幅器を経て同じく図示されないスピーカから合成音声として出力される。
【００３２】
続いて、上記の音声合成装置をナンバーディスプレイ機能付電話機に適用した例について詳述する。
【００３３】
図３は本発明が適用されたナンバーディスプレイ機能付電話機の構成を示す概略ブロック図である。同図において、公衆電話回線に接続された受信部２００は、その回線を介して音声信号及び電話番号情報を取得する。このうち、電話番号情報は送信元番号抽出部２０１で抽出されて表示部２０２に送られて表示される。
【００３４】
前記送信元番号抽出部２０１で抽出された電話番号情報は、登録データ検索部２０３に送られる。登録データベース２０４には、図４（ａ）に示すように、「登録番号」、「電話番号」、「名前情報」及び「アクセント情報」が使用者によって予め登録されている。そして登録データ検索部２０３が、送信元番号抽出部２０１から送信されてきた電話番号情報が登録データベース２０４に登録されているか否かを検索し、登録されている場合、名前情報を文字情報解析部１に送るように構成されている。
【００３５】
ここで、前記図４に示した「アクセント情報」について補足すると、「０型」とはアクセント核が無いものをいい、「１型」とは１モーラ目にアクセント核があるものをいう。また音程情報は、０型に対しては［低高高高高…］、１型に対しては［高低低低低…］となり、アクセント型とモーラ数から音程情報を決定する。
【００３６】
補足すると、例えば文字列情報が”すずき”の場合、図５に示すように０〜３の型が考えられる。即ち、型のバリエーションはモーラ数をｎとすると、０〜ｎ通りあるということである。
【００３７】
尚、文字情報解析部１、音声素片取り出し部４、及び音声素片接続部５の動作は前述した通りであるので、ここでは割愛する。
【００３８】
次に、図６は公衆電話回線を介して受信した番号情報を、予め登録データベース２０４に登録しておいた送信相手側の名前を読み出した後、音声合成によって読み上げる際の動作を説明するためのフローチャートである。
【００３９】
図６において、先ず、電話が着信状態となったか否かを判断し（ステップＳ１０１）、着信状態であれば、送信元番号抽出部２０１によって抽出された電話番号の表示が許可されているかどうかを判断し（ステップＳ１０２）、電話番号の表示が許可されていれば送信元の電話番号情報を表示部２０２に表示する（ステップＳ１０３）。
【００４０】
次に、登録データ検索部２０３によって、送信元番号抽出部２０１から送られた電話番号情報に対応する名前情報が、登録データベース２０４に登録されているかどうかを調べ（ステップＳ１０４）、電話番号情報に対応する名前情報が登録されていれば登録データベース２０４から名前情報の取得を行い（ステップＳ１０５）、取得された名前情報を表示部２０２に表示する（ステップＳ１０６）。これにより、表示部２０２には前記ステップＳ１０３で処理された電話番号情報及びステップＳ１０６で処理された名前情報が表示されることになる。
【００４１】
次にステップ１０７〜ステップ１１１において名前情報を解析する。具体的には、音素情報決定部１ａにより名前情報を音素列に変換し（ステップＳ１０７）、そして音程情報決定部１ｂにより登録データベース２０４から取得したデータに基づいて音程情報を決定する（ステップＳ１０８）。
【００４２】
さらに、取得された名前情報及び音程情報に基づいて、韻律生成部２により、基本周波数パターン情報と音素継続時間長情報を決定し（ステップＳ１０９）、音声素片取り出し部４により、音声データベース３から音声素片を選択し、それらの素片を接続し（ステップＳ１１０）、合成音声情報（例えば、「鈴木さんからお電話です」など）を出力する（ステップ１１１）。
【００４３】
そして使用者が、通常の電話機能処理、即ち受話器を上げる（オフフック）ことにより、送信元と通話をすることができるが（ステップ１１２）、通常の電話機としての動作についてはその詳細な説明を割愛する。
【００４４】
次に、登録データベース２０４への名前情報の登録フォーマットについて詳述する。登録データベース２０４に登録された情報のフォーマットは、前記図４（ａ）に示すように、アクセント型を記述しておく構成以外に、例えば図４（ｂ）に示すように、アクセント記号を直接書き込むように構成してもよい。
【００４５】
図４（ｂ）の場合は、使用者が直接アクセント情報を入力する。即ち、図４（ｂ）中の例において、「ニ＊シダ」の中にある記号「＊」は、音程が［高］から［低］に落ちる位置を示しており、「ス＃ズキ」の中にある記号［＃］は、音程が［低］から［高］に上がる位置を示している。
【００４６】
上記の実施の形態では、基本周波数パターン情報を閾値まで徐々に下降する線で表現しているが、閾値に近づくほど、下降の度合いを緩めるようなパターンにすれば、さほど処理量を増やすことなく、さらに人間の発声に近づけることができる。
【００４７】
尚、ここでいう閾値とは、前記式（２）におけるＦminに相当する。これを設けないと、音程がどんどん下降していく現象が生じるため、自然な発声ができなくなる。本発明では、基本周波数パターン情報が閾値に達したあとは閾値を保持するように構成されている。ただ、このままでは閾値のところで折れ線的なパターンになりかねないため、閾値に近づくにつれて下降の度合いを緩めることで回避するように成されている。
【００４８】
また、前記音声データベース３に記憶しておくデータは、単音節、音素、モーラなどの単位のうちいずれでもよいし、ＣＶ（子音＋母音）、ＶＣ（母音＋子音）、ＶＶ（母音＋母音）、ＣＶＣ（子音＋母音＋子音）などのように音素環境を考慮したものにしてもよいし、複数の文章をそのまま記憶させておいてもよい。
【００４９】
さらに、音声データベース３に各音声素片毎に複数種類の音声素片が蓄積されている場合には、音声データベースから音声素片を取り出すときに、韻律情報を利用して適切な音声素片を取り出すようにしてもよいことは言うまでもない。
【００５０】
そして、上記のナンバーディスプレー機能付電話機に適用した例では、カナ情報を登録情報として記憶しておくことを想定しているが、登録された時点で音素列に変換し、音素列を記憶しておくようにしてもよい。そして、登録する情報を名前ではなく会社名等にしてもよいことは言うまでもない。
【００５１】
【発明の効果】
以上の説明から明らかなように、本発明によれば、音声合成処理の際の処理量の低減が必要な場合でも、正しいアクセントで文字情報を読み上げることができる効果を奏する。
【図面の簡単な説明】
【図１】本発明の音声合成装置の構成を示すブロック図である。
【図２】基本周波数パターン情報と文字列情報との関係を示す図である。
【図３】本発明の電話機の構成を示すブロック図である。
【図４】登録データベース２０４に登録されている登録情報の一例を示す図である。
【図５】アクセント型を示す図である。
【図６】本発明の電話機の動作を説明するためのフローチャートである。
【符号の説明】
１文字情報解析部
１ａ音素情報決定部
１ｂ音程情報決定部
２韻律生成部
３音声データベース
４音声素片取り出し部
５音声素片接続部
６文字情報入力部
２００受信部
２０１送信元番号抽出部
２０２表示部
２０３登録データ検索部
２０４登録データベース
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a prosody generation method for use in a speech synthesis method that generates speech for input text by connecting speech units. The prosody generation method reduces the amount of processing by simplifying the pitch information, while still allowing the text information to be read out accurately according to appropriate pitch information.
[0002]
[Prior art]
Conventionally, a representative fundamental frequency pattern generation model for synthesized speech, as also disclosed in "Synthesis of sentence speech based on a fundamental frequency pattern generation process model", Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J72-A, No. 1, pp. 32-40 (January 1989), expresses the pattern as the sum of a phrase component that declines gently from the beginning of a phrase to its end and an accent component of local rises and falls. The pattern is generated by the following functions.
[0003]
[Expression 4]

[0004]
Here, Api and Aaj are the magnitudes of the phrase commands and accent commands, respectively, T0i is the time of the phrase command, and T1j and T2j are the start and end points of the accent command. To synthesize a fundamental frequency pattern using the fundamental frequency pattern generation process model, the parameters of formulas (A) to (C) above must be supplied; analysis of natural speech has shown that αi = 3.0 (rad/sec), βj = 20.0 (rad/sec), and θ = 0.9 can be treated as fixed.
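The equation image for formulas (A) to (C) is not reproduced in this text. For orientation only, the commonly cited form of this phrase-plus-accent generation model is sketched below. It is consistent with the symbols described above, but the baseline symbol $F_{min}$, the assignment of the labels (A) to (C), and the exact notation are assumptions rather than the patent's own typesetting.

```latex
\begin{align}
\ln F_0(t) &= \ln F_{min}
  + \sum_{i} A_{pi}\, G_{pi}(t - T_{0i})
  + \sum_{j} A_{aj}\,\bigl\{ G_{aj}(t - T_{1j}) - G_{aj}(t - T_{2j}) \bigr\} \tag{A} \\
G_{pi}(t) &=
  \begin{cases}
    \alpha_i^{2}\, t\, e^{-\alpha_i t}, & t \ge 0 \\
    0, & t < 0
  \end{cases} \tag{B} \\
G_{aj}(t) &=
  \begin{cases}
    \min\bigl[\, 1 - (1 + \beta_j t)\, e^{-\beta_j t},\ \theta \,\bigr], & t \ge 0 \\
    0, & t < 0
  \end{cases} \tag{C}
\end{align}
```

In this form every phrase command and accent command needs its own timing and magnitude, which is the estimation burden the following paragraphs describe.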
[0005]
[Problems to be solved by the invention]
However, when the above fundamental frequency pattern generation model is used in text-to-speech conversion, which generates synthesized speech from text, the timing and magnitude of the phrase commands and accent commands have had to be estimated by language processing such as morphological analysis.
[0006]
Consequently, there has been a problem that complex processing is required even in text-to-speech conversion where the language processing and the fundamental frequency pattern generation could be simplified with little loss of naturalness in the synthesized speech, for example when only short texts such as single words are read out.
[0007]
[Means for Solving the Problems]
The speech synthesis method of the present invention generates synthesized speech by extracting desired speech units from speech units stored in advance and connecting the extracted speech units based on prosodic information, wherein fundamental frequency pattern information serving as the prosodic information is obtained based on position information and pitch information.
[0008]
The speech synthesizer of the present invention comprises: speech unit storage means in which speech units are stored; phoneme information generating means for analyzing character information to obtain phoneme information corresponding to each character; pitch information generating means for analyzing the character information to obtain pitch information corresponding to each character; prosody information generating means for obtaining prosody information based on the phoneme information obtained by the phoneme information generating means and the pitch information obtained by the pitch information generating means; speech unit extraction means for extracting desired speech units from the speech unit storage means based on the phoneme information obtained by the phoneme information generating means; and speech unit connection means for connecting the speech units extracted by the speech unit extraction means based on the prosody information to generate synthesized speech information.
[0009]
Further, a telephone equipped with the speech synthesizer of the present invention comprises: storage means in which telephone number information and character information associated with the telephone number information are stored; receiving means for receiving a voice signal and telephone number information; telephone number information extracting means for extracting the telephone number information received by the receiving means; search means for searching the storage means for the telephone number information extracted by the telephone number information extracting means, and retrieving and outputting the character information associated with that telephone number information; phoneme information generating means for analyzing the character information output by the search means to obtain phoneme information corresponding to each character; pitch information generating means for analyzing the character information output by the search means to obtain pitch information corresponding to each character; prosody information generating means for obtaining prosody information based on the phoneme information obtained by the phoneme information generating means and the pitch information obtained by the pitch information generating means; speech unit extraction means for extracting desired speech units from the speech unit storage means based on the phoneme information obtained by the phoneme information generating means; speech unit connection means for connecting the speech units extracted by the speech unit extraction means based on the prosody information to generate synthesized speech information; and sound emitting means for emitting the synthesized speech information from the speech unit connection means as sound.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.
[0011]
FIG. 1 is a schematic block diagram showing the configuration of a speech synthesizer to which the speech synthesis method of the present invention is applied. In the figure, 6 is a character information input unit for inputting character information; 1 is a character information analysis unit that analyzes the character information input from the character information input unit 6 and outputs phoneme information and pitch information; 2 is a prosody generation unit that generates fundamental frequency pattern information and phoneme duration information based on the phoneme information and pitch information from the character information analysis unit 1; 3 is a speech database in which speech unit information is stored; 4 is a speech unit extraction unit that, based on the fundamental frequency pattern information and phoneme duration information generated by the prosody generation unit 2, extracts speech units from the speech database 3 so as to match the phoneme sequence to be synthesized; and 5 is a speech unit connection unit that connects the speech units extracted by the speech unit extraction unit 4 and outputs the desired synthesized speech data.
[0012]
Next, the operation of the speech synthesizer shown in FIG. 1 will be described in detail.
[0013]
First, the character information input from the character information input unit 6 is analyzed by the character information analysis unit 1, which detects the breaks for prosody generation, and the phoneme information determination unit 1a obtains the phoneme symbols corresponding to the input characters. Specifically, if the input character information is “Hirakata”, for example, the phoneme information /hirakata/ is obtained.
[0014]
Next, the pitch information determination unit 1b determines the pitch information of the input character string. As above, if the input character information is “Hirakata”, the pitch information [low high high high] is determined, one value for each character.
[0015]
In the present invention, the pitch information is given in advance together with the character information to be uttered. For example, as shown in FIG. 4B, an accent symbol (equivalent to # and * in the example of the figure) is given in the character string so that the pitch information can be determined by the accent symbol. Details will be described later.
[0016]
However, there is also a method of referring to a word dictionary in which character strings and accent information are stored without adding an accent symbol as described above. That is, if the input character information is “Hirakata”, the word dictionary is searched for “Hirakata” to obtain accent information.
[0017]
The phoneme information determined by the phoneme information determination unit 1a and the pitch information determined by the pitch information determination unit 1b are sent to the prosody generation unit 2 at the subsequent stage.
[0018]
Next, based on the phoneme information and pitch information determined by the character information analysis unit 1, the prosody generation unit 2 generates fundamental frequency information [F0(M)] using the generation formulas shown below, and also determines the phoneme duration information.
[0019]
For the phoneme duration information, the amount of calculation can be reduced by holding the duration of each unit, such as a phoneme or a speech unit, in a table.
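As a minimal sketch of the table-based approach just described, the durations can be looked up rather than computed. The table contents, the fallback value, and the function name are all hypothetical placeholders; the patent does not give concrete durations.

```python
# Hypothetical per-phoneme durations in milliseconds. These values are
# placeholders for illustration only; they do not come from the patent.
DURATION_MS = {"h": 60, "i": 90, "r": 40, "a": 100, "k": 70, "t": 65}
DEFAULT_MS = 80  # assumed fallback for phonemes missing from the table


def phoneme_durations(phonemes):
    """Return one duration per phoneme by table lookup (no calculation)."""
    return [DURATION_MS.get(p, DEFAULT_MS) for p in phonemes]


durations = phoneme_durations(list("hirakata"))
total_ms = sum(durations)  # length of the whole word under this table
```

The same lookup would work with speech units as keys instead of phonemes.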
[0020]
[Equation 5]

[0021]
In the present embodiment, in the above formulas (1) to (4), Fh = 310 Hz, Fl = 250 Hz, w = 0.375, and Fmin = 150 Hz.
[0022]
The segment can be a section delimited by any of the following: a “breath group”, an “accent phrase”, a “phrase”, a “pause”, or a “fundamental frequency generation boundary”.
[0023]
As a supplementary note:
[Breath group]
A continuous section of speech bounded by breaks in exhalation during the process of producing speech.
[Accent phrase]
A unit roughly the length of a bunsetsu (a short phrase) that carries one accent type.
[Phrase]
A melodic unit delimited by natural breaks.
[Pause]
A gap that occurs during speech; a short section without sound.
[Fundamental frequency generation boundary]
A continuous section of speech delimited by a boundary other than those above.
[0024]
The position information M indicates the temporal position of the data of interest (which phoneme, which mora, which frame, and so on); the mora position, syllable position, phoneme position, speech unit position, frame position, or the like can be used.
[0025]
As a supplementary note:
[Mora]
A unit corresponding to one Japanese kana character (two characters in the case of a contracted sound, yōon).
[Mora position]
Information indicating which mora in the character string is meant.
(Example: The mora position of “ら” (ra) in the character string “ひらかた” (Hirakata) is 2, that is, the second mora.)
[Syllable]
Almost the same unit as the mora described above, except that the long vowel mark “ー”, the moraic nasal “ん”, and the geminate consonant “っ” are each treated as one mora but not as one syllable.
(Example: “うんどーかい” (undōkai) has 6 morae but 4 syllables.)
[Syllable position]
Information indicating which syllable in the character string is meant.
(Example: The syllable position of “どー” (dō) in “うんどーかい” is 2.)
[Phoneme]
The unit of sound used to distinguish the meanings of words.
(Example: “ひらかた” written in phonemic notation is /hirakata/.)
[Phoneme position]
Information indicating which phoneme in the character string is meant.
(Example: The phoneme position of /r/ in /hirakata/ is 3.)
[Speech unit]
The smallest unit of speech data stored in the speech synthesis database. (Example: To synthesize /hirakata/, units such as /hir/, /rak/, /kat/, /ta/ are connected. There are various unit types, such as CVC [consonant + vowel + consonant], CV [consonant + vowel], VC [vowel + consonant], and VV [vowel + vowel].)
[Speech unit position]
Information indicating the position in the character string counted in speech units.
[0026]
[Frame]
When speech is analyzed to obtain spectrum and pitch information, it is generally analyzed in short time intervals, and each such speech section is called a frame. (More specifically, a speech section cut out by multiplying the original waveform by a window function is called a frame.)
[Frame position]
Information indicating which frame in the character string is meant.
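The mora and syllable definitions above can be checked with a short sketch. It assumes kana input, treats small kana such as “ゃ” as part of the preceding mora, and treats “ー”, “ん”, and “っ” as morae that do not count as syllables, following the definitions in the text. The function names are hypothetical.

```python
# Small kana that combine with the preceding kana into one mora (yōon etc.).
SMALL_KANA = set("ゃゅょぁぃぅぇぉャュョァィゥェォ")
# Morae that, per the definitions above, are not counted as syllables.
SPECIAL_MORAE = set("ーんっンッ")


def split_morae(kana):
    """Split a kana string into a list of morae."""
    morae = []
    for ch in kana:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch  # attach small kana to the previous mora
        else:
            morae.append(ch)
    return morae


def count_syllables(kana):
    """Count syllables as morae minus the long vowel, moraic nasal, geminate."""
    return sum(1 for m in split_morae(kana) if m not in SPECIAL_MORAE)


def mora_position(kana, mora):
    """Return the 1-based position of the first occurrence of a mora."""
    return split_morae(kana).index(mora) + 1


print(len(split_morae("うんどーかい")), count_syllables("うんどーかい"))  # 6 4
print(mora_position("ひらかた", "ら"))  # 2
```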
[0027]
When the character string information is “Hirakata”, the fundamental frequency information [F0(M)] obtained by formulas (1) to (4) above is, for example, as shown in FIG. 2.
[0028]
That is, the pitch information for the given character string information (“Hirakata” in the example in the figure) is [low high high high]; the fundamental frequency information [F0(M)] is obtained for each mora, and the fundamental frequency pattern information is generated by linearly interpolating between those values. In this example, at each mora position, based on formulas (1) to (4) above and the conditions Fh = 310 Hz, Fl = 250 Hz, w = 0.375, Fmin = 150 Hz given above,
F0 (1) = (250 × 1)-(1 × 1 × 0.375) = 249.625Hz
F0 (2) = (310 × 1)-(2 × 2 × 0.375) = 308.5Hz
F0 (3) = (310 × 1)-(3 × 3 × 0.375) = 306.625Hz
F0 (4) = (310 × 1)-(4 × 4 × 0.375) = 304Hz
are obtained as the fundamental frequency information at each mora, and the fundamental frequency pattern is obtained by linear interpolation between these points.
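The arithmetic of this worked example can be reproduced with a short sketch. Formulas (1) to (4) themselves are not reproduced in this text, so the code follows only the numbers shown above for the single-segment case (s = 1): the level value (Fh or Fl) times a factor of 1, minus M × M × w, held at Fmin. The constant `S_FACTOR`, the function names, and `points_per_mora` are hypothetical, and the dependence on s for later segments is not modeled.

```python
FH, FL, W, FMIN = 310.0, 250.0, 0.375, 150.0  # values given in the embodiment
S_FACTOR = 1.0  # the "x 1" factor in the worked example (s = 1); assumed fixed here


def f0_at_mora(m, level):
    """F0 in Hz at 1-based mora position m for a 'low' or 'high' mora."""
    base = FH if level == "high" else FL
    return max(base * S_FACTOR - m * m * W, FMIN)  # hold at FMIN once reached


def f0_pattern(levels, points_per_mora=4):
    """Return per-mora anchor values and a linearly interpolated pattern.

    `levels` must contain at least one entry.
    """
    anchors = [f0_at_mora(m, lv) for m, lv in enumerate(levels, start=1)]
    pattern = []
    for a, b in zip(anchors, anchors[1:]):
        for k in range(points_per_mora):
            pattern.append(a + (b - a) * k / points_per_mora)
    pattern.append(anchors[-1])
    return anchors, pattern


anchors, pattern = f0_pattern(["low", "high", "high", "high"])
print(anchors)  # [249.625, 308.5, 306.625, 304.0]
```

The interpolated `pattern` passes through each anchor value, so the contour is a polyline with one vertex per mora.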
[0029]
In the example in the figure, the processing is for character information consisting of the single segment “Hirakata”, so s = 1 applies in formulas (3) and (4) above. A character string such as “Hirakata Kouen”, however, consists of two segments, a first segment “Hirakata” and a second segment “Kouen”. Accordingly, s = 1 when obtaining the fundamental frequency information for the former, “Hirakata”, but s = 2 when obtaining it for the latter, “Kouen”.
[0030]
Based on the fundamental frequency pattern information obtained in this way, the speech unit extraction unit 4 extracts speech units from the speech database 3 so as to match the phoneme sequence to be synthesized, and sends them to the speech unit connection unit 5.
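The speech unit example given earlier, in which /hirakata/ is covered by /hir/, /rak/, /kat/, /ta/, can be sketched as follows. The sketch assumes a phoneme string that strictly alternates consonant and vowel, one character per phoneme; the function name is hypothetical, and the lookup of the units in the speech database is not shown.

```python
def split_into_units(phonemes):
    """Split a strictly alternating C-V-C-V... string into overlapping units.

    Each unit starts at a consonant and extends to the next consonant, so
    neighboring units share that consonant (CVC); the last unit is CV.
    """
    units = []
    for i in range(0, len(phonemes), 2):
        units.append(phonemes[i:i + 3])
    return units


print(split_into_units("hirakata"))  # ['hir', 'rak', 'kat', 'ta']
```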
[0031]
The speech unit connection unit 5 connects the speech units extracted by the speech unit extraction unit 4, processes them based on the prosodic information generated by the prosody generation unit 2, and outputs the desired synthesized speech data in the form of a speech signal. The synthesized speech signal output from the speech unit connection unit 5 passes through an amplifier (not shown) and is output as synthesized speech from a speaker (also not shown).
[0032]
Next, an example in which the above speech synthesizer is applied to a telephone with a number display function will be described in detail.
[0033]
FIG. 3 is a schematic block diagram showing the configuration of a telephone with a number display function to which the present invention is applied. In the figure, a receiving unit 200 connected to a public telephone line acquires a voice signal and telephone number information via the line. Among these, the telephone number information is extracted by the transmission source number extraction unit 201 and sent to the display unit 202 for display.
[0034]
The telephone number information extracted by the transmission source number extraction unit 201 is sent to the registration data search unit 203. In the registration database 204, as shown in FIG. 4A, a “registration number”, “telephone number”, “name information”, and “accent information” are registered in advance by the user. The registration data search unit 203 searches whether the telephone number information sent from the transmission source number extraction unit 201 is registered in the registration database 204 and, if it is, sends the name information to the character information analysis unit 1.
[0035]
To supplement the “accent information” shown in FIG. 4: “type 0” means that there is no accent nucleus, and “type 1” means that the accent nucleus is on the first mora. The pitch information is [low high high high high ...] for type 0 and [high low low low low ...] for type 1; the pitch information is thus determined from the accent type and the number of morae.
[0036]
As a further note, when the character string information is “Suzuki”, for example, types 0 to 3 are possible, as shown in FIG. 5. That is, where n is the number of morae, the accent type can range from 0 to n.
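A minimal sketch of deriving the pitch information from the accent type and the number of morae follows. The patterns for type 0 and type 1 are the ones stated above. The pattern used for types 2 and higher (low on the first mora, high up to the accent nucleus, low afterwards) is an assumption based on the usual description of Japanese pitch accent, since FIG. 5 is not reproduced here. The function name is hypothetical.

```python
def pitch_from_accent_type(n_mora, accent_type):
    """Return a list of 'low'/'high' values, one per mora."""
    if n_mora < 1 or not 0 <= accent_type <= n_mora:
        raise ValueError("accent type must be between 0 and the mora count")
    if accent_type == 0:  # no accent nucleus: [low high high ...]
        return ["low"] + ["high"] * (n_mora - 1)
    if accent_type == 1:  # nucleus on the first mora: [high low low ...]
        return ["high"] + ["low"] * (n_mora - 1)
    # Assumed rule for types >= 2: rise after the first mora, fall after the nucleus.
    return ["low"] + ["high"] * (accent_type - 1) + ["low"] * (n_mora - accent_type)


print(pitch_from_accent_type(4, 0))  # ['low', 'high', 'high', 'high']
print(pitch_from_accent_type(3, 1))  # ['high', 'low', 'low']
```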
[0037]
The operations of the character information analysis unit 1, the speech unit extraction unit 4, and the speech unit connection unit 5 are as described above, and are omitted here.
[0038]
Next, FIG. 6 is a flowchart for explaining the operation in which, based on the number information received via the public telephone line, the name of the calling party registered in advance in the registration database 204 is retrieved and then read out by speech synthesis.
[0039]
In FIG. 6, it is first determined whether the telephone has entered the incoming call state (step S101). If it has, it is determined whether display of the telephone number extracted by the transmission source number extraction unit 201 is permitted (step S102), and if display is permitted, the telephone number information of the transmission source is displayed on the display unit 202 (step S103).
[0040]
Next, the registration data search unit 203 checks whether the name information corresponding to the telephone number information sent from the transmission source number extraction unit 201 is registered in the registration database 204 (step S104). If the corresponding name information is registered, the name information is acquired from the registration database 204 (step S105), and the acquired name information is displayed on the display unit 202 (step S106). As a result, the telephone number information processed in step S103 and the name information processed in step S106 are displayed on the display unit 202.
[0041]
Next, in steps S107 to S111, the name information is analyzed. Specifically, the phoneme information determination unit 1a converts the name information into a phoneme string (step S107), and the pitch information determination unit 1b determines the pitch information based on the data acquired from the registration database 204 (step S108).
[0042]
Further, based on the acquired name information and pitch information, the prosody generation unit 2 determines the fundamental frequency pattern information and the phoneme duration information (step S109), the speech unit extraction unit 4 selects speech units from the speech database 3 and the units are connected (step S110), and synthesized speech information (for example, “You have a call from Mr. Suzuki”) is output (step S111).
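The lookup part of this flow (steps S102 to S106) can be sketched as below. The database contents are made-up sample data, the function and field names are hypothetical, and returning early when number display is not permitted is an assumption, since the text does not describe that branch.

```python
# Hypothetical registration database keyed by telephone number (cf. FIG. 4A).
# The entry below is made-up sample data.
REGISTRATION_DB = {
    "0312345678": {"name": "スズキ", "accent_type": 0},
}


def handle_incoming_call(number, display_permitted, db=REGISTRATION_DB):
    """Return (lines to display, entry to pass to the synthesizer or None)."""
    display = []
    if not display_permitted:      # S102: number display not permitted (assumed: stop)
        return display, None
    display.append(number)         # S103: show the caller's number
    entry = db.get(number)         # S104: is the number registered?
    if entry is None:
        return display, None
    display.append(entry["name"])  # S105, S106: fetch and show the name
    return display, entry          # S107 onward would synthesize from this entry


shown, entry = handle_incoming_call("0312345678", True)
print(shown)  # ['0312345678', 'スズキ']
```

The returned entry carries both the name and the accent information, which is what steps S107 and S108 need.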
[0043]
The user can then talk with the calling party through normal telephone function processing, that is, by lifting the handset (going off hook) (step S112); a detailed description of the operation as an ordinary telephone is omitted.
[0044]
Next, the format in which name information is registered in the registration database 204 will be described in detail. Besides the configuration in which the accent type is recorded, as shown in FIG. 4A, the information registered in the registration database 204 may instead be formatted, as shown in FIG. 4B for example, so that accent symbols are written in directly.
[0045]
In the case of FIG. 4B, the user inputs the accent information directly. That is, in the example in FIG. 4B, the symbol “*” in “ニ＊シダ” (Ni*shida) indicates the position where the pitch falls from [high] to [low], and the symbol “#” in “ス＃ズキ” (Su#zuki) indicates the position where the pitch rises from [low] to [high].
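A minimal sketch of reading such accent symbols follows, using the two names from FIG. 4B, “ニ＊シダ” and “ス＃ズキ”. It assumes a single symbol per name and one kana per mora, and it simply marks every mora before the symbol with one level and every mora after it with the other; names that need both a rise and a fall are outside this sketch. The function name is hypothetical.

```python
def pitch_from_marked_name(marked):
    """Return (name without symbols, list of 'low'/'high' per mora)."""
    # Accept the full-width symbols used in the registration data as well.
    marked = marked.replace("＊", "*").replace("＃", "#")
    # "*" marks a fall (high -> low); "#" marks a rise (low -> high).
    for mark, before, after in (("*", "high", "low"), ("#", "low", "high")):
        if mark in marked:
            head, tail = marked.split(mark, 1)
            return head + tail, [before] * len(head) + [after] * len(tail)
    raise ValueError("no accent symbol found")


print(pitch_from_marked_name("ニ＊シダ"))  # ('ニシダ', ['high', 'low', 'low'])
print(pitch_from_marked_name("ス＃ズキ"))  # ('スズキ', ['low', 'high', 'high'])
```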
[0046]
In the above embodiment, the fundamental frequency pattern information is expressed by a line that descends gradually to the threshold. If the pattern is instead made to descend more gently as it approaches the threshold, the result can be brought closer to human speech without much increase in the amount of processing.
[0047]
The threshold here corresponds to Fmin in formula (2) above. Without it, the pitch would keep falling, and natural-sounding speech could not be produced. In the present invention, once the fundamental frequency pattern information reaches the threshold, it is held at the threshold. Left as is, however, the pattern could bend sharply at the threshold, so this is avoided by easing the rate of descent as the threshold is approached.
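The difference between holding the value at the threshold and easing the descent toward it can be sketched as follows. The text does not give a formula for the easing, so the exponential curve and the `KNEE` value below are purely illustrative assumptions; any curve that flattens smoothly toward Fmin would serve the same purpose.

```python
import math

FMIN = 150.0  # threshold from the embodiment
KNEE = 170.0  # assumed frequency below which the descent starts to ease


def hard_floor(f0):
    """Hold the value at FMIN once it is reached (sharp bend at FMIN)."""
    return max(f0, FMIN)


def soft_floor(f0):
    """Ease the descent below KNEE so the value approaches FMIN smoothly."""
    if f0 >= KNEE:
        return f0
    span = KNEE - FMIN
    # Matches the identity line in value and slope at KNEE, then flattens.
    return FMIN + span * math.exp((f0 - KNEE) / span)


for raw in (200.0, 170.0, 160.0, 150.0, 130.0):
    print(raw, hard_floor(raw), round(soft_floor(raw), 2))
```

The eased version costs one exponential per point below the knee, which is in keeping with the remark that the processing amount need not increase much.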
[0048]
The data stored in the speech database 3 may be in any unit such as single syllables, phonemes, or morae; it may take the phonemic environment into account, as with CV (consonant + vowel), VC (vowel + consonant), VV (vowel + vowel), or CVC (consonant + vowel + consonant) units; or a plurality of sentences may be stored as they are.
[0049]
Further, when the speech database 3 holds several variants of each speech unit, it goes without saying that the prosodic information may be used to select the appropriate variant when a speech unit is extracted from the speech database.
[0050]
In the example applied to the telephone with the number display function, it is assumed that kana information is stored as the registration information; however, the kana may instead be converted into a phoneme string at the time of registration and the phoneme string stored. Needless to say, the registered information may be a company name or the like instead of a personal name.
[0051]
[Effect of the invention]
As is clear from the above description, according to the present invention, even when it is necessary to reduce the amount of processing during speech synthesis processing, there is an effect that character information can be read out with correct accents.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a speech synthesizer according to the present invention.
FIG. 2 is a diagram illustrating the relationship between fundamental frequency pattern information and character string information.
FIG. 3 is a block diagram showing a configuration of a telephone according to the present invention.
FIG. 4 is a diagram showing an example of registration information registered in the registration database 204.
FIG. 5 is a diagram showing an accent type.
FIG. 6 is a flowchart for explaining the operation of the telephone of the present invention.
[Explanation of symbols]
1 Character information analysis part
1a Phoneme information determination part
1b Pitch information determination part
2 Prosody generation part
3 Speech database
4 Speech element extraction part
5 Speech element connection part
6 Character information input part
200 Reception part
201 Transmission number extraction part
202 Display section
203 Registration data search section
204 Registration database

Claims

A speech synthesis method for generating synthesized speech by extracting desired speech units from speech unit storage means in which speech units are stored and connecting the extracted speech units based on prosodic information, the method comprising:
a phoneme information generation step of analyzing character information to obtain phoneme information corresponding to each character;
a pitch information generation step of analyzing the character information to obtain pitch information corresponding to each character;
a prosodic information generation step of obtaining prosodic information based on the phoneme information obtained in the phoneme information generation step and the pitch information obtained in the pitch information generation step;
a speech unit extraction step of extracting desired speech units from the speech unit storage means based on the phoneme information obtained in the phoneme information generation step; and
a speech unit connection step of generating synthesized speech information by connecting the speech units extracted in the speech unit extraction step based on the prosodic information,
wherein the prosodic information generation step calculates fundamental frequency information [F0(M)] by the following equations (1) to (4) based on position information based on the phoneme information and on the pitch information, and obtains fundamental frequency pattern information, which is the prosodic information, based on the fundamental frequency information [F0(M)].

The speech synthesis method according to claim 1, wherein the position information is represented by a temporal position from a segment start point.

The speech synthesis method according to claim 1, wherein the position information is represented by a mora position from a segment start point.

The speech synthesis method according to claim 1, wherein the position information is represented by a syllable position from a segment start point.

The speech synthesis method according to claim 1, wherein the position information is represented by a phoneme position, expressed in phoneme notation, from a segment start point.

The speech synthesis method according to claim 1, wherein the position information is represented by a speech unit position from a segment start point.

The speech synthesis method according to claim 1, wherein the position information is represented by the number of frames from a segment start point.

The speech synthesis method according to any one of claims 1 to 7, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the prosodic information generation step calculates the fundamental frequency information [F0(M)] for each mora position from frequency values based on the information indicating high or low pitch.

The speech synthesis method according to any one of claims 1 to 7, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information M, and the prosodic information generation step calculates the fundamental frequency information [F0(M)] for each syllable position from frequency values based on the information indicating high or low pitch.

The speech synthesis method according to any one of claims 1 to 7, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the prosodic information generation step calculates the fundamental frequency information [F0(M)] for each speech unit position from frequency values based on the information indicating high or low pitch.

The speech synthesis method according to any one of claims 1 to 7, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the prosodic information generation step calculates the fundamental frequency information [F0(M)] for each position represented by the position information from frequency values based on the information indicating high or low pitch.

The speech synthesis method according to any one of claims 1 to 11, wherein the unit of the segment is a section delimited by breath groups.

The speech synthesis method according to any one of claims 1 to 11, wherein the unit of the segment is a section delimited by accent phrases.

The speech synthesis method according to any one of claims 1 to 11, wherein the unit of the segment is a section delimited by phrases.

The speech synthesis method according to any one of claims 1 to 11, wherein the unit of the segment is a section delimited by pauses.

The speech synthesis method according to any one of claims 1 to 11, wherein the unit of the segment is a section delimited by fundamental frequency generation boundaries.

A speech synthesizer comprising:
speech unit storage means in which speech units are stored;
phoneme information generation means for analyzing character information to obtain phoneme information corresponding to each character;
pitch information generation means for analyzing the character information to obtain pitch information corresponding to each character;
prosodic information generation means for obtaining prosodic information based on the phoneme information obtained by the phoneme information generation means and the pitch information obtained by the pitch information generation means;
speech unit extraction means for extracting desired speech units from the speech unit storage means based on the phoneme information obtained by the phoneme information generation means; and
speech unit connection means for generating synthesized speech information by connecting the speech units extracted by the speech unit extraction means based on the prosodic information,
wherein the prosodic information generation means calculates fundamental frequency information [F0(M)] by the following equations (1) to (4) based on position information based on the phoneme information and on the pitch information, and obtains fundamental frequency pattern information, which is the prosodic information, based on the fundamental frequency information [F0(M)].

The speech synthesizer according to claim 17, wherein the position information is represented by a temporal position from a segment start point.

The speech synthesizer according to claim 17, wherein the position information is represented by a mora position from a segment start point.

The speech synthesizer according to claim 17, wherein the position information is represented by a syllable position from a segment start point.

The speech synthesizer according to claim 17, wherein the position information is represented by a phoneme position, expressed in phoneme notation, from a segment start point.

The speech synthesizer according to claim 17, wherein the position information is represented by a speech unit position from a segment start point.

The speech synthesizer according to claim 17, wherein the position information is represented by the number of frames from a segment start point.

The speech synthesizer according to any one of claims 17 to 23, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the prosodic information generation means calculates the fundamental frequency information [F0(M)] for each mora position from frequency values based on the information indicating high or low pitch.

The speech synthesizer according to any one of claims 17 to 23, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the prosodic information generation means calculates the fundamental frequency information [F0(M)] for each syllable position from frequency values based on the information indicating high or low pitch.

The speech synthesizer according to any one of claims 17 to 23, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the prosodic information generation means calculates the fundamental frequency information [F0(M)] for each speech unit position from frequency values based on the information indicating high or low pitch.

The speech synthesizer according to any one of claims 17 to 23, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the prosodic information generation means calculates the fundamental frequency information [F0(M)] for each position represented by the position information from frequency values based on the information indicating high or low pitch.

The speech synthesizer according to any one of claims 17 to 27, wherein the unit of the segment is a section delimited by breath groups.

The speech synthesizer according to any one of claims 17 to 27, wherein the unit of the segment is a section delimited by accent phrases.

The speech synthesizer according to any one of claims 17 to 27, wherein the unit of the segment is a section delimited by phrases.

The speech synthesizer according to any one of claims 17 to 27, wherein the unit of the segment is a section delimited by pauses.

The speech synthesizer according to any one of claims 17 to 27, wherein the unit of the segment is a section delimited by fundamental frequency generation boundaries.

A telephone equipped with a speech synthesizer, the speech synthesizer comprising:
storage means in which telephone number information and character information associated with the telephone number information are stored;
receiving means for receiving a voice signal and telephone number information;
telephone number information extraction means for extracting the telephone number information received by the receiving means;
search means for searching the storage means for the telephone number information extracted by the telephone number information extraction means, and for retrieving and outputting the character information associated with that telephone number information;
phoneme information generation means for analyzing the character information output by the search means to obtain phoneme information corresponding to each character;
pitch information generation means for analyzing the character information output by the search means to obtain pitch information corresponding to each character;
prosodic information generation means for obtaining prosodic information based on the phoneme information obtained by the phoneme information generation means and the pitch information obtained by the pitch information generation means;
speech unit extraction means for extracting desired speech units from the speech unit storage means based on the phoneme information obtained by the phoneme information generation means;
speech unit connection means for generating synthesized speech information by connecting the speech units extracted by the speech unit extraction means based on the prosodic information; and
sound emitting means for emitting the synthesized speech information from the speech unit connection means as sound,
wherein the prosodic information generation means calculates fundamental frequency information [F0(M)] by the following equations (1) to (4) based on position information based on the phoneme information and on the pitch information, and obtains fundamental frequency pattern information, which is the prosodic information, based on the fundamental frequency information [F0(M)].

The telephone equipped with a speech synthesizer according to claim 33, wherein the position information is represented by a temporal position from a segment start point.

The telephone equipped with a speech synthesizer according to claim 33, wherein the position information is represented by a mora position from a segment start point.

The telephone equipped with a speech synthesizer according to claim 33, wherein the position information is represented by a syllable position from a segment start point.

The telephone equipped with a speech synthesizer according to claim 33, wherein the position information is represented by a phoneme position, expressed in phoneme notation, from a segment start point.

The telephone equipped with a speech synthesizer according to claim 33, wherein the position information is represented by a speech unit position from a segment start point.

The telephone equipped with a speech synthesizer according to claim 33, wherein the position information is represented by the number of frames from a segment start point.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 39, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the fundamental frequency information [F0(M)] for each mora position is calculated from frequency values based on the information indicating high or low pitch.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 39, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the fundamental frequency information [F0(M)] for each syllable position is calculated from frequency values based on the information indicating high or low pitch.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 39, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the fundamental frequency information [F0(M)] for each speech unit position is calculated from frequency values based on the information indicating high or low pitch.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 39, wherein the pitch information consists of information indicating high or low pitch for each position represented by the position information, and the fundamental frequency information [F0(M)] for each position represented by the position information is calculated from frequency values based on the information indicating high or low pitch.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 43, wherein the unit of the segment is a section delimited by breath groups.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 43, wherein the unit of the segment is a section delimited by accent phrases.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 43, wherein the unit of the segment is a section delimited by phrases.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 43, wherein the unit of the segment is a section delimited by pauses.

The telephone equipped with a speech synthesizer according to any one of claims 33 to 43, wherein the unit of the segment is a section delimited by fundamental frequency generation boundaries.