JP3892302B2

JP3892302B2 - Voice dialogue method and apparatus

Info

Publication number: JP3892302B2
Application number: JP2002004552A
Authority: JP
Inventors: 利光蓑輪
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2002-01-11
Filing date: 2002-01-11
Publication date: 2007-03-14
Anticipated expiration: 2022-01-11
Also published as: JP2003208196A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識や音声合成を使用した音声対話の方法およびその装置に関する。
【０００２】
【従来の技術】
従来、ユーザの発話中に相づちを打ってユーザとの対話を円滑に進める方法は特開平１１−７５０９３号公報に記載されたものがある。
【０００３】
図１９は、従来の音声対話方法の動作例を示す。
【０００４】
この従来の音声対話方法では、例えば、会議室の予約で「月曜日のですね午後２時なんですが」の入力発話１３２１に対し、音声認識の処理により「えっと」１３４１、「どよう」１３４２、「げつよう」１３３１、「び」１３３２、、、の認識の中間結果を応答生成部（図示せず）に入力して、途中応答が必要かの判断をし、必要なものに対し、ここでは予約の条件となる「月曜」、「午後二時」の各音声認識に対し途中応答信号を生成し、途中応答発話「ハイ」１３５５、１３５６を行っている。
【０００５】
【発明が解決しようとする課題】
しかしながら、従来の音声対話方法においては、途中の音声認識の信頼度が低い場合でも相づちをうつだけで、ユーザの発話が終了するまで音声認識の結果がわからない場合があり、ユーザとの対話の効率に問題を有していた。
【０００６】
本発明は、このような従来の問題を解決するためになされたもので、ユーザの途中発話の認識の信頼度が低い場合にはユーザの発話途中であってもユーザに即座に訂正発話を要求するようにしたり、逆にユーザの発話が終わった後の確認応答で、誤認識の疑いの高い部分については確認のための合成音声の話速を遅くし、かつ語尾を伸長してユーザの訂正発話を誘発しやすくしたりして、ユーザとの対話の効率を高めることができる音声対話方法を提供するものである。
【０００７】
【課題を解決するための手段】
本発明の第１の局面は、音声対話方法であって、ユーザの音声の認識結果に基づく前記ユーザへの返答の中で、前記認識結果に自信が持てない部分を自信が持てる部分よりゆっくりと復唱し、かつ語尾を伸ばす。
【０００８】
また、音声対話方法は、ユーザの訂正発声を誘発する言葉を該復唱にさらに入れても構わない。
【０００９】
本発明の第２の局面は、音声対話装置であって、ユーザの音声を認識する手段と、前記音声の認識結果に基づき前記ユーザへの返答文を生成する手段と、前記返答文を音声化するときに前記認識結果が低かった単語部分の話速を他より遅くする話速設定手段と、前記単語部分の語尾を伸長する語尾伸長手段と、前記単語部分につき話速設定されかつ語尾が伸長された返答文を音声合成する音声合成手段とを備える。
【００１０】
また、音声対話装置は、訂正発声誘発のための音声を挿入する手段をさらに備えていても構わない。
【００１１】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を用いて説明する。
【００１２】
図１は、本発明の第１の実施形態の音声対話方法のフローチャートを示す。
【００１３】
図１に示すように、この第１の実施形態の音声対話方法は、まず、ユーザへのレスポンスの音声出力を開始し１０１、ユーザからの何らかの発声に対する認識結果のレスポンス（復唱）を装置がテキスト音声合成や音声編集合成によって行っている間に、誤認識を発見したユーザがレスポンス音声の出力中にユーザから新たな音声入力があったとき１０２、訂正発声を行うと装置は即座にレスポンス音声出力を中断し１０３、この新たな訂正発声に対し認識処理を行いキーワードを抽出する１０４。連続音声認識によってユーザが発声したと推定される単語列が抽出され、ユーザの訂正発声の前に行っていたレスポンス音声のもとになるレスポンスの文の単語列とキーワードが比較される。不一致のキーワードが見つかると、このキーワードから前の単語列は削除され、「ああ」とか「えっ」などの間投詞付与し１０６、付与した間投詞を先頭にした新たなレスポンス文が作成させる。次に、このレスポンスをテキスト音声合成や音声編集合成によって読み上げられる。以後、このような動作が続けられる。ユーザからの訂正発声がないと、この一連の対話処理は終了する。
【００１４】
この対話の様子は、例えば図１３に示す本発明の第１の実施形態の音声対話方法の対話例のようになる。すなわち、ユーザからの「大崎駅東口まで行って」という音声を認識処理し、「大阪駅東口に行って」と誤認識した装置は、行き先を告げる文パターンの行き先部分を「大阪駅東口」に設定し、このレスポンス文の音声合成出力を始める。しかし、「大阪駅」と聞いたユーザが即座に「違うよ、大崎駅だよ」と訂正すると、装置は先ほどの訂正発声を即座に中止し、ユーザからの訂正発声を認識処理し、その単語列からキーワードの行き先になりうる「大崎駅」を抽出する。そして、先ほどのレスポンス文の「大阪駅」を「大崎駅」に置き換え、その前の「行き先は」を削除し、驚きを表現する間投詞「ああ」を先頭に挿入して効果的なレスポンス文を作成する。次に、このレスポンス文を音声出力し、ユーザに認識結果を確認させる。この例では、誤認識がなくなったためユーザが「そう」と、肯定的な発声をして終わっている。
【００１５】
以上のように本発明の第１の実施形態によれば、装置がユーザの音声を認識し、その認識結果に基づいてユーザへのレスポンスをしている最中にユーザからの訂正発声を受付け、この訂正発声の認識結果に基づき返答内容を一部変更した上で返答を再開するようにしたものであり、誤認識の訂正発声が即座に可能になり、その新たな認識結果もユーザには即座にレスポンスから判断できるようにしたため、ユーザとの対話の効率を高めることができる。
【００１６】
図２は、本発明の第２の実施形態の音声対話方法のフローチャートを示す。
【００１７】
図２に示すように、この第２の実施形態の音声対話方法は、まず、ユーザの音声の認識を開始し２０１、ユーザからの何らかの発声に対する認識処理の結果、連続音声認識によって最終的に文（単語列）が推定されるが、連続音声認識の最中には、数１０ｍｓｅｃ毎に入力音声と単語仮説（候補となりうる単語）との照合がビタビアルゴリズムなどを利用して行われ、入力音声の時間軸と単語仮説の時間軸によって形成される２次元空間上のノード毎に最上位単語がスコア（尤度）とともに残されてゆく。一般的には、そのノードのその時点までの累積スコアが残されるが、この発明では、単語毎の尤度が必要なため、推定単語候補とともに、その尤度も記憶する２０２。ユーザ発話が終了すると、累積スコアが最小となる単語列パスがバックトラック処理によって抽出される２０３。このように、ユーザ発声に対する推定単語列が、各単語の尤度とともに明らかになる。装置は、レスポンスのための音声出力を開始し２０５、この尤度と予め設定してあった閾値を比較し、閾値より低いものは誤認識している可能性が高いと判断する。
【００１８】
このような単語は、ユーザに確認させ、訂正発声をさせたいが、わざわざ「○○でよろしいですか」と確認のレスポンスを行っていては効率の良い対話にならない。そこで、推定した単語列をレスポンスする際に、誤認識している可能性が高いと判断された単語は、わざとゆっくりした話速（３モーラ／秒程度）で音声合成出力を行わせ、読み上げ速度を遅くし２０６、ユーザの注意を喚起するとともに、ユーザが訂正発声を即座にしやすいようにする。そして、ユーザからの訂正発声があったら２０７、この訂正発声に対しても同様の処理を行う。もちろん、誤認識がなければユーザからの訂正発声がないので対話は終了する。
【００１９】
この対話の様子は、図１４に示す本発明の第２の実施形態の音声対話方法の対話例のようになる。すなわち、ユーザからの「大崎駅東口まで行って」という発声を認識処理し、「西口」と誤認識した装置は、この尤度が低いため、誤認識の可能性が高いと判断し、レスポンス文を合成する際に「西口」をわざとゆっくりした話速で出力する。急に話速が変わって注意を喚起されたユーザは、「西口」と聞き、即座に「違うよ、東口だ」と訂正すると、装置はレスポンスの音声合成出力を即座に中止し、ユーザからの訂正発声を認識処理し、その単語列からキーワードの行き先になりうる「東口」を抽出する。そして、レスポンス文「ああ、東口ですね」を作成する。次に、このレスポンス文を音声出力し、ユーザに認識結果を確認させる。この例では、誤認識がなくなったためユーザが「そう」と、肯定的な発声をしたあと、全体のレスポンス文「行き先を大崎駅東口にします」を合成する。
【００２０】
以上のように本発明の第２の実施形態によれば、ユーザの音声を認識し、その認識結果に基づくユーザへの返答の中で、認識結果に自信が持てない部分はゆっくりと復唱し、ユーザの訂正発声を誘発するようにしたため、ユーザが訂正発話をしやすくなり、ユーザとの対話の効率を高めることができる。
【００２１】
図３は、本発明の第３の実施形態の音声対話方法のフローチャートを示す。
【００２２】
図３に示すように、この第３の実施形態の音声対話方法は、まず、ユーザの音声の認識を開始し２０１、ユーザからの何らかの発声に対する認識処理の結果、連続音声認識によって最終的に文（単語列）が推定されるが、連続音声認識の最中には、数１０ｍｓｅｃ毎に入力音声と単語仮説（候補となりうる単語）との照合がビタビアルゴリズムなどを利用して行われ、入力音声の時間軸と単語仮説の時間軸によって形成される２次元空間上のノード毎に最上位単語がスコア（尤度）とともに残されてゆく。一般的には、そのノードのその時点までの累積スコアが残されるが、この発明では、単語毎の尤度が必要なため、推定単語候補とともに、その尤度も記憶する２０２。ユーザ発話が終了すると、累積スコアが最小となる単語列パスがバックトラック処理によって抽出される２０３。このように、ユーザ発声に対する推定単語列が、各単語の尤度とともに明らかに。装置は、レスポンスのための音声出力を開始し２０５、この尤度と予め設定してあった閾値を比較し、閾値より低いものは誤認識している可能性が高いと判断する。
【００２３】
このような単語は、ユーザに確認させ、訂正発声をさせたいが、わざわざ「○○でよろしいですか」と確認のレスポンスを行っていては効率の良い対話にならない。そこで、推定した単語列をレスポンスする際に、誤認識している可能性が高いと判断された単語は、わざとゆっくりした話速で音声合成出力を行わせ、さらにユーザの注意を喚起するために語尾を延ばしたり２０６ａ、語尾にポーズを挿入したり、語尾でわざと「えーと」などの言いよどみを入れて２０６ｂ、自然に時間をかせぎ、ユーザが訂正発声をしやすいようにする。そして、ユーザからの訂正発声があったら２０７、この訂正発声に対しても同様の処理を行う。もちろん、誤認識がなければユーザからの訂正発声がないので対話は終了する。
【００２４】
図１５は、本発明の第３の実施形態の音声対話方法の対話例（ａ）、（ｂ）を示す。
【００２５】
この対話の様子は、例えば図１５の対話例（ａ）ようになり、すなわち、ユーザからの「大崎駅東口まで行って」という発声を認識処理し、「西口」と誤認識した装置は、この尤度が低いため、誤認識の可能性が高いと判断し、レスポンス文を合成する際に「ニシグチ」をわざとゆっくりした話速で出力するとともに最終音節のチを延ばして合成する。急に話速が変わって注意を喚起されたユーザは、「西口ー」と聞き、この合成音声が終了する前に「違うよ、東口だ」と訂正できる。すると、装置はレスポンスの音声出力を即座に中止し、ユーザからの訂正発声を認識処理し、その単語列からキーワードである行先になりうる「東口」を抽出する。そして、レスポンス文「ああ、東口ですね」を作成する。次に、このレスポンス文を音声出力し、ユーザに認識結果を確認させる。
【００２６】
図１５の対話例（ｂ）では、西口の語尾に「えーと」という言いよどみを入れ、図１５の対話例（ａ）と同様の効果を出している。
【００２７】
以上のように本発明の第３の実施形態によれば、ユーザの音声を認識し、その認識結果に基づくユーザへの返答の中で、認識結果に自信が持てない部分はゆっくりと復唱し、語尾の最終音節伸長などで時間を稼ぐためユーザが訂正発声をしやすくなり、ユーザとの対話の効率を高めることができる。
【００２８】
図４は、本発明の第４の実施形態の音声対話方法のフローチャートを示す。
【００２９】
図４に示すように、この第４の実施形態の音声対話方法は、まず、ユーザの発話を１０ｍｓｅｃ〜３０ｍｓｅｃ毎のフレームバッファに順次格納しつつ、そのフレームデータの特徴量抽出を行う。認識辞書には第1番目になりうる単語の候補が入っており、これらの音声のフレーム毎の特徴量と入力音声のフレーム特徴量間の距離（スコア）が計算され、ビタビアルゴリズムなどで最適なフレーム対応が明らかにされる。フレーム番号が進むたびに累積した累積スコアにもとづく足切りが実施され、候補単語が絞られていくのが一般的である。例えば上位数単語との照合が終了した段階で、最上位単語のスコアが予め定められた閾値より低いと、どの単語をも最終候補とすることはできず、ユーザがまだ発声している最中でも途中レスポンス文を選択し、音声合成でユーザに訂正発声を要求する。この途中レスポンス文は、最上位単語のスコアによって変えることが効果的である。
【００３０】
表1に途中レスポンス文の例を示す。
【００３１】
【表１】

【００３２】
表１に示すように、例えば、最上位候補単語のスコアが低いときは、ユーザに丁寧な再発声を促すため、丁寧に「すみません。もう一度おっしゃって下さい。」と途中レスポンスをするが、スコアがやや低いときは、「はあ」と簡単に再発声を促す。また、スコアが普通の場合は認識できている可能性が高いので何もレスポンスせず、明らかにスコアが高い場合は確信を持てるため、「はい」と相づちをうち、ユーザとの対話の自然性を上げるようにする。ある単語との照合が終わると、想定されている単語列規則(文法)にしたがって、認識辞書は、次に来るべき単語の入った認識辞書に更新され、入力音声の認識処理が継続される。単語照合に失敗し訂正発話を要求した場合には、単語辞書更新はせず、再入力された音声の認識処理を行う。このようにユーザの音声入力が終わるまで単語照合が行われ、最終的には各ステップで最上位となった単語の時系列が文として出力される。
【００３３】
例えば、図１６に示す本発明の第４の実施形態の音声対話方法の対話例のように「あのね」に対してはスコアが高く、「あのね」の後にポーズがあるため「はい」と相づちを打つだけであるが、「待ち合わせ場所は」に対してはスコアが低いため「はあ」と訂正発話を要求している。このようにして「待ち合わせ場所は渋谷」という認識結果を得る。
【００３４】
以上のように本発明の第４の実施形態によれば、ユーザが発声している最中に逐次、音声認識処理を行い、認識結果に自信が持てないときにはユーザの発声の最中でも即座にユーザに再発声を要請するようにしたことにより、誤認識した部分に対しユーザが即座に訂正発声をしやすくすることができる。
【００３５】
図５は、本発明の第５の実施形態の音声対話方法のフローチャートを示す。
【００３６】
図５に示すように、この第５の実施形態の音声対話方法は、まず、ユーザの発話を１０ｍｓｅｃ〜３０ｍｓｅｃ毎のフレームバッファに順次格納しつつ、そのフレームデータの特徴量抽出を行う。認識辞書には第1番目になりうる単語の候補が入っており、これら音声のフレーム毎の特徴量と入力音声のフレーム特徴量間の距離（スコア）が計算され、ビタビアルゴリズムなどで最適なフレーム対応が明らかにされる。フレーム番号が進むたびに累積した累積スコアにもとづく足切りが実施され、候補単語が絞られていくのが一般的である。例えば上位数単語との照合が終了した段階で、最上位単語のスコアが予め定められた閾値より低いと、どの単語をも最終候補とすることはできず、ユーザがまだ発声している最中でも途中レスポンス文を作成し、音声合成でユーザに訂正発声を要求する。この際、訂正要求文に装置が推定した認識結果を入れるようにする。このようにすることにより、ユーザは自分の発声のし方がどのような問題を持つかを知ることができ、訂正発声をより的確にすることができる。ある単語との照合が終わると、想定されている単語列規則(文法)にしたがって、認識辞書は、次に来るべき単語の入った認識辞書に更新され、入力音声の認識処理が継続される。単語照合に失敗し訂正発話を要求した場合には、単語辞書更新はせず、再入力された音声の認識処理を行う。このようにユーザの音声入力が終わるまで単語照合が行われ、最終的には各ステップで最上位となった単語の時系列が文として出力される。
【００３７】
例えば、図１７に示す本発明の第５の実施形態の音声対話方法の対話例のように「あのね」に対してはスコアが高く、「あのね」の後にポーズがあるため「はい」と相づちを打つだけであるが、「待ち合わせ場所は」に対しては誤認識して「打ち合わせ場所」と認識しているが、スコアが低いため「打ち合わせ場所ですか」と訂正発話を要求している。このようにしてユーザに「待ち合わせだよ」という訂正発声を促している。
【００３８】
以上のように本発明の第５の実施形態によれば、ユーザが発声している最中に逐次、音声認識処理を行い、認識結果に自信が持てないときにはユーザの発声の最中でもスコアの低い単語を挿入した訂正要求を発して、即座にユーザに再発声を促すようにしたことにより、誤認識した部分に対しユーザが即座に訂正発声をしやすくすることができる。
【００３９】
図６は、本発明の第６の実施形態の音声対話方法のフローチャートを示す。
【００４０】
図６に示すように、この第６の実施形態の音声対話方法は、まず、ユーザの発話を１０ｍｓｅｃ〜３０ｍｓｅｃ毎のフレームバッファに順次格納しつつ、そのフレームデータの特徴量を抽出し、フレームデータの音響分析を行う６０１。毎回、当該フレームの数フレーム前までのデータの音声のある部分とない部分の平均エネルギー比を計算してＳＮ比を算出する。次にＳＮ比が十分に高い場合は６０２、そのまま音声認識処理に移るが６０５、ＳＮ比が十分に高くない場合は６０２、予め保持してある騒音データと入力データとを比較し６０３、類似性を算出し、最も近い騒音を推定する。次に推定された騒音の種類をユーザに告げて再発声を要求する６０４。このようにすることにより、ユーザに騒音源を止めたり、騒音源がなくなってから再発声をさせることで、より認識しやすい状況を作り出すことができる。
【００４１】
例えば、図１８に示す本発明の第６の実施形態の音声対話方法の対話例のように「あのね」と「月曜の」に対してはスコアが高く、「あのね」については、その後にポーズがあるため「はい」と相づちを打つが、「待ち合わせ場所は」に対しては騒音が混入し、ＳＮ比が低くなるため、入力騒音と保持した複数の騒音データを比較し、航空機騒音と推定している。したがって、「うわ、飛行機みたいな音がうるさい」と言ってから訂正発話「もう一度言ってよ」を要求している。このようにしてユーザに音声認識の妨げとなる騒音を指摘してから訂正発声を促している。
【００４２】
以上のように本発明の第６の実施の形態によれば、ユーザの音声以外の周囲騒音がユーザの音声に混入し、このためにユーザ音声の認識結果に自信が持てなくなった場合には、その騒音の種類を推定し、ユーザの発話に割り込み、周囲騒音の種類をユーザに伝え、この騒音が原因で認識が困難になったことを伝えるようにしたものであり、誤認識の原因をユーザが取り除けるようにすることができる。
【００４３】
図７は、本発明の第７の実施形態の音声対話装置のブロック図を示す。
【００４４】
図７に示すように、この第７の実施形態の音声対話装置は、まず、ユーザからの何らかの発声を音声認識手段１１で認識し、その結果に対する認識結果のレスポンス（復唱）を行っている最中に、新たにユーザから訂正発声が入ると、音声認識手段１１はこれを即座に認識するとともに、現在出力していたレスポンスを音声合成出力中止手段１２によって即座に中止させる。次に、訂正発声を認識した結果として単語列が推定されるが、その中からキーワードがレスポンス文生成手段１３によって抽出され、レスポンス文選択手段１４によって、レスポンス用文パターンデータベース１５から選択されていたレスポンス文パターンに候補単語列を埋め込んでレスポンス文を作成する。多くの場合、ユーザには前の発声に対する復唱のための文パターンと一致し、復唱の文の一部を変更しているように見える。このレスポンス文は音声合成手段１６に渡され、合成音声となるが、その出力タイミングは、ユーザ心理モデル計算手段１８が音声合成出力制御手段１７に指令を出すことによって決められる。すなわち、ユーザ心理モデル計算手段１８は、当該訂正発声入力までに訂正発声が続いているようだとユーザが苛々している可能性が高いと判断し、ユーザ発声から0.3秒以内に「えーと」など、兎に角なんらかの発声をするが、まだ対話をし始めたばかりの段階では、レスポンス文が生成されるまで1秒を最長として待ち時間を設けるようにする。このようにしてユーザはいつでも訂正発声をすることができ、かつ、その認識結果をすぐに確認することができる。
【００４５】
以上のように本発明の第７の実施形態によれば、ユーザの音声を認識する音声認識手段の認識結果に基づいてユーザへの返答文を選定する手段と、この返答文を音声化する音声合成手段と、前記返答の最中であってもユーザの訂正発声を認識する音声認識手段と、ユーザの訂正発声が検知された場合に音声合成を中止する手段と、この訂正発声の認識結果に基づき返答内容を修正する手段と、この修正した返答の合成音声をユーザ心理モデルに基づく適切なタイミングで出力する手段を備えるようにしたものであり、ユーザが心理的な負荷なしに訂正発話をし、その結果をすぐに確認することができる。
【００４６】
図８は、本発明の第８の実施形態の音声対話装置のブロック図を示す。
【００４７】
図８に示すように、この第８の実施形態の音声対話装置は、まず、ユーザからの何らかの発声を音声認識手段２１で認識し、その結果に対する認識結果のレスポンス（復唱）を行う際に、話速設定手段２９は、スコアの低かった単語だけ、故意に遅い話速（３モーラ／秒程度）で合成するよう音声合成手段２６に指令を出す。このようにしてユーザに誤認識している可能性の高い部分を判りやすく提示する。これに対し、ユーザから訂正発声が入ると、音声認識手段２１はこれを即座に認識するとともに、現在出力していたレスポンスを音声合成出力中止手段２２によって即座に中止させる。次に、訂正発声を認識した結果として単語列が推定されるが、その中からキーワードがレスポンス文生成手段２３によって抽出され、レスポンス文選択手段２４によって、レスポンス用文パターンデータベース２５から選択されていたレスポンス文パターンに候補単語列を埋め込んでレスポンス文を作成する。このレスポンス文は音声合成手段２６に渡され、合成音声となるが、その出力タイミングは、ユーザ心理モデル計算手段２８が音声合成出力制御手段２７に指令を出すことによって決められる。このようにしてユーザは誤認識している可能性の高い部分を知って、すぐに訂正発声をすることができ、かつ、その認識結果をすぐに確認することができる。
【００４８】
以上のように本発明の第８の実施形態によれば、ユーザの音声を認識する音声認識手段の認識結果に基づいてユーザへの返答文を選定する手段と、この返答文を音声化する音声合成手段と、前記返答の最中であってもユーザの訂正発声を認識する音声認識手段と、ユーザの訂正発声が検知された場合に音声合成を中止する手段と、この訂正発声の認識結果に基づき返答内容を修正する手段と、この修正した返答の合成音声をユーザ心理モデルに基づく適切なタイミングで出力する手段を備えるようにしたため、ユーザが心理的な負荷なしに訂正発話をし、その結果をすぐに確認することができる。
【００４９】
図９は、本発明の第９の実施形態の音声対話装置のブロック図を示す。
【００５０】
図９に示すように、この第９の実施形態の音声対話装置は、まず、ユーザからの何らかの発声を音声認識手段３１で認識し、その結果に対する認識結果のレスポンス（復唱）を行う際に、話速設定手段３９は、スコアの低かった単語だけ、故意に遅い話速（３モーラ／秒程度）で合成するよう音声合成手段３６に指令を出す。さらに、語尾伸長手段３０は、当該単語の語尾を故意に伸長する（この部分は、ポーズ挿入手段３０ａとして当該単語の直後にポーズ（スコアが低いほど長くなる）を挿入したり、訂正発話誘発手段３０ｂとして「えーと」などの迷いを表現して、訂正発話を誘発する語を挿入してもよい）このようにしてユーザに誤認識している可能性の高い部分を判りやすく提示する。これに対し、ユーザから訂正発声が入ると、音声認識手段３１はこれを即座に認識するとともに、現在出力していたレスポンスを音声合成出力中止手段３２によって即座に中止させる。
【００５１】
次に、訂正発声を認識した結果として単語列が推定されるが、その中からキーワードがレスポンス文生成手段３３によって抽出され、レスポンス文選択手段３４によって、レスポンス用文パターンデータベース３５から選択されていたレスポンス文パターンに候補単語列を埋め込んでレスポンス文を作成する。このレスポンス文は音声合成手段３６に渡され、合成音声となるが、その出力タイミングは、ユーザ心理モデル計算手段３８が音声合成出力制御手段３７に指令を出すことによって決められる。このようにしてユーザは誤認識している可能性の高い部分を知って、すぐに訂正発声をすることができ、かつ、その認識結果をすぐに確認することができる。
【００５２】
以上のように本発明の第９の実施形態によれば、認識の信頼度の低かった単語は他より発話を遅くすることに加え、この単語の語尾を伸長するか、または認識結果の信頼度に応じたポーズ長を挿入するか、または「えーと」などの訂正発声誘発のための音声を挿入する手段を備えるようにしたため、さらにユーザの訂正発声をしやすくすることができる。
【００５３】
図１０は、本発明の第１０の実施形態の音声対話装置のブロック図を示す。
【００５４】
図１０に示すように、この第１０の実施形態の音声対話装置は、まず、ユーザの発話が音声認識手段４１によって１０ｍｓｅｃ〜３０ｍｓｅｃ毎にフレームバッファに順次格納されつつ、特徴量抽出が行われる。認識辞書４９には第１番目になりうる単語の候補が入っており、音声認識手段４１によって、逐次これらの音声のフレーム毎に特徴量と入力音声のフレーム特徴量間の距離（スコア）が計算され、ビタビアルゴリズムなどで最適なフレーム対応が明らかにされる。フレーム番号が進むたびに累積した累積スコアにもとづく足切りが実施され、候補単語が絞られていくのが一般的である。例えば上位数単語との照合が終了した段階で、最上位単語のスコアが予め定められた閾値より低いと、どの単語をも最終候補とすることはできず、ユーザがまだ発声している最中でも再発声文選択手段４４によって再発要求文パターンデータベース４５から適切な再発声要求文が選択される。再発声文生成手段４３は、この選択された再発声要求文を音声合成手段４６に渡し音声合成し、ユーザに訂正発声を要求する。このため、ユーザは装置がどの単語を認識できなかったかを即座に知ることができる。
【００５５】
以上のように本発明の第１０の実施形態によれば、ユーザが発声している最中に逐次、音声認識処理を行う手段と、この部分的な認識結果の信頼度を判断する手段と、この信頼度を使ってユーザに再発声を要請するか否かを判断する手段と、ユーザに再発声をうながすための文を選定する手段と、この文を音声化する音声合成手段とを備えるようにしたため、誤認識した部分に対しユーザは即座に訂正発声を行うことができる。
【００５６】
図１１は、本発明の第１１の実施形態の音声対話装置のブロック図を示す。
【００５７】
図１１に示すように、この第１１の実施形態の音声対話装置は、まず、ユーザの発話が音声認識手段５１によって１０ｍｓｅｃ〜３０ｍｓｅｃ毎にフレームバッファに順次格納されつつ、特徴量抽出が行われる。認識辞書５９には第１番目になりうる単語の候補が入っており、音声認識手段５１によって、逐次これらの音声のフレーム毎に特徴量と入力音声のフレーム特徴量間の距離（スコア）が計算され、ビタビアルゴリズムなどで最適なフレーム対応が明らかにされる。フレーム番号が進むたびに累積した累積スコアにもとづく足切りが実施され、候補単語が絞られていくのが一般的である。例えば上位数単語との照合が終了した段階で、最上位単語のスコアが予め定められた閾値より低いと、どの単語をも最終候補とすることはできず、ユーザがまだ発声している最中でも再発声文選択手段５４によって再発要求文パターンデータベース５５から適切な再発声要求文が選択される。再発声文生成手段５３は、この選択された再発声要求文に音声認識手段５１から得た単語候補を埋め込み、これを音声合成手段５６に渡し、音声合成し、ユーザに訂正発声を要求する。このため、ユーザは装置が認識できたのか、またはどのように誤認識したかを即座に知ることができる。
【００５８】
以上のように本発明の第１１の実施形態によれば、ユーザが発声している最中に逐次、音声認識処理を行う手段と、この部分的な認識結果の信頼度を判断する手段と、この信頼度を使ってユーザに再発声を要請するか否かを判断する手段と、認識結果を利用してユーザに再発声を誘発するための文を生成する手段と、この文を音声化する音声合成手段と、この合成音声を適切なタイミングで出力する手段を備えるようにしたため、ユーザの発声終了以前に誤認識を修正しやすくすることができる。
【００５９】
図１２は、本発明の第１２の実施形態の音声対話装置のブロック図を示す。
【００６０】
図１２に示すように、この第１２の実施形態の音声対話装置は、まず、音響分析手段６１によってユーザの発話を１０ｍｓｅｃ〜３０ｍｓｅｃ毎にフレームバッファに順次格納しつつ、そのフレームデータの特徴量抽出を行う。騒音判別手段６８は毎回、当該フレームの数フレーム前までのデータの音声のある部分とない部分の平均エネルギー比を計算してＳＮ比を算出する。
【００６１】
次にＳＮ比が十分に高い場合はそのまま認識処理に移るが、ＳＮ比が低く、音声認識手段６２の出力するスコアも低いときは、騒音判別手段６８は予め保持してある騒音データベース６９と入力データの類似性を算出し、最も近い騒音を推定する。推定された騒音の種類を音声合成手段６６によってユーザに告げて再発声を要求する。このようにすることにより、ユーザに騒音源を止めたり、騒音源がなくなってから再発声をさせることで、より認識しやすい状況を作り出すことができる。
【００６２】
以上のように本発明の第１２の実施形態によれば、ユーザの音声とそれ以外の音源からの入力を識別する手段と、予め定められた種類の音源と入力を比較する手段と、ユーザが発声している最中に逐次、音声認識を行う手段と、常時その認識結果の信頼度を監視する手段と、この信頼度が低くなったときに原因を説明する文を生成する手段と、生成された文を音声化する音声合成手段を備えるようにしたため、ユーザが誤認識の原因を取り除きやすくすることができる。
【００６３】
【発明の効果】
以上、本発明は、ユーザの発話の認識結果の信頼度が低い場合にはユーザの発話途中であっても装置側からユーザに即座に訂正発話を要求するようにしたり、逆にユーザの発話が終わった後の確認応答で、誤認識の疑いの高い部分については確認のための合成音声の話速を遅くし、かつ語尾を伸長してユーザの訂正発話を誘発しやすくしたりして、ユーザとの対話の効率を高めることができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態の音声対話方法のフローチャートを示す図
【図２】本発明の第２の実施形態の音声対話方法のフローチャートを示す図
【図３】本発明の第３の実施形態の音声対話方法のフローチャートを示す図
【図４】本発明の第４の実施形態の音声対話方法のフローチャートを示す図
【図５】本発明の第５の実施形態の音声対話方法のフローチャートを示す図
【図６】本発明の第６の実施形態の音声対話方法のフローチャートを示す図
【図７】本発明の第７の実施形態の音声対話装置のブロック図
【図８】本発明の第８の実施形態の音声対話装置のブロック図
【図９】本発明の第９の実施形態の音声対話装置のブロック図
【図１０】本発明の第１０の実施形態の音声対話装置のブロック図
【図１１】本発明の第１１の実施形態の音声対話装置のブロック図
【図１２】本発明の第１２の実施形態の音声対話装置のブロック図
【図１３】本発明の第１の実施形態の音声対話方法の対話例を示す図
【図１４】本発明の第２の実施形態の音声対話方法の対話例を示す図
【図１５】本発明の第３の実施形態の音声対話方法の対話例を示す図
【図１６】本発明の第４の実施形態の音声対話方法の対話例を示す図
【図１７】本発明の第５の実施形態の音声対話方法の対話例を示す図
【図１８】本発明の第６の実施形態の音声対話方法の対話例を示す図
【図１９】従来の音声対話方法の動作例を示す図
【符号の説明】
１１、２１、３１、４１、５１、６２音声認識手段
１２、２２、３２音声合成出力中止手段
１３、２３、３３レスポンス文生成手段
１４、２４、３４レスポンス文選択手段
１５、２５、３５レスポンス用文パターンデータベース
１６、２６、３６、４６、５６、６６音声合成手段
１７、２７、３７音声合成出力制御手段
１８、２８、３８ユーザ心理モデル計算手段
２９、３９話速設定手段
３０語尾伸長手段
３０ａポーズ挿入手段
３０ｂ訂正発話誘発手段
４９、５９認識辞書
４４、５４再発声文選択手段
４３、５３再発声文生成手段
４５、５５再発要求文パターンデータベース
６１音響分析手段
６８騒音判別手段
６９騒音データベース

[0001]
BACKGROUND OF THE INVENTION
  The present invention relates to a method and apparatus for voice interaction using voice recognition and voice synthesis.
[0002]
[Prior art]
  Conventionally, a method of giving back-channel responses during the user's utterance so that the dialogue with the user proceeds smoothly is described in Japanese Unexamined Patent Application Publication No. H11-75093.
[0003]
  FIG. 19 shows an operation example of a conventional voice interaction method.
[0004]
  In this conventional voice dialogue method, for example, for the input utterance 1321 “It's for Monday, you see, at 2 p.m.” in a conference room reservation, the intermediate recognition results produced by speech recognition processing, such as “etto” 1341, “doyō” 1342, “getsuyō” 1331, “bi” 1332, and so on, are input to a response generation unit (not shown), which determines whether an intermediate response is needed. Here, an intermediate response signal is generated for each recognition of “Monday” and “2 p.m.”, which are the reservation conditions, and the intermediate response utterances “hai” (“yes”) 1355 and 1356 are made.
[0005]
[Problems to be solved by the invention]
  However, in the conventional voice dialogue method, even when the reliability of the intermediate speech recognition is low, the device merely gives a back-channel response, so the user may not learn the recognition result until the utterance has finished. The efficiency of the dialogue with the user was therefore a problem.
[0006]
  The present invention has been made to solve this conventional problem. It provides a voice dialogue method that can raise the efficiency of the dialogue with the user in two ways. When the reliability of recognizing the user's partial utterance is low, the user is immediately asked for a corrective utterance, even in the middle of the utterance. Conversely, in the confirmation response given after the user has finished speaking, a part strongly suspected of misrecognition is spoken at a slower synthesized speech rate and with a lengthened word ending, which makes a corrective utterance from the user easier to elicit.
[0007]
[Means for Solving the Problems]
  A first aspect of the present invention is a voice dialogue method in which, in a reply to the user based on the recognition result of the user's speech, a part of the recognition result in which confidence is low is repeated back more slowly than a part in which confidence is high, and its word ending is lengthened.
[0008]
  In addition, the voice dialogue method may further include, in the read-back, words that induce a corrective utterance from the user.
[0009]
  A second aspect of the present invention is a voice dialogue device comprising: means for recognizing the user's speech; means for generating a reply sentence to the user based on the recognition result of the speech; speech rate setting means for making the speech rate of a word part whose recognition result was low slower than the rest when the reply sentence is converted to speech; word-ending lengthening means for lengthening the ending of the word part; and speech synthesis means for synthesizing the reply sentence in which the speech rate has been set and the ending lengthened for the word part.
[0010]
  In addition, the voice interactive apparatus may further include means for inserting a voice for inducing correction utterance.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
  Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0012]
  FIG. 1 shows a flowchart of a voice interaction method according to the first embodiment of the present invention.
[0013]
  As shown in FIG. 1, in the voice dialogue method of the first embodiment, the device first starts voice output of a response to the user (101). While the device is giving the response (a read-back) of the recognition result for some utterance from the user by text-to-speech synthesis or speech-editing synthesis, a user who notices a misrecognition may make a corrective utterance. When such new voice input arrives from the user during output of the response speech (102), the device immediately interrupts the response speech output (103), performs recognition processing on the new corrective utterance, and extracts keywords (104). Continuous speech recognition extracts the word string the user is presumed to have uttered, and the keywords are compared with the word string of the response sentence underlying the response speech that was being output before the user's corrective utterance. When a mismatched keyword is found, the word string preceding this keyword is deleted, an interjection such as “Oh” or “Huh?” is added (106), and a new response sentence beginning with the added interjection is created. This response is then read out by text-to-speech synthesis or speech-editing synthesis. The operation continues in this manner thereafter. If there is no corrective utterance from the user, this series of dialogue processing ends.
[0014]
  This dialogue proceeds, for example, as in the dialogue example of the voice dialogue method of the first embodiment shown in FIG. 13. The device performs recognition on the user's speech “Go to Osaki Station East Exit”, misrecognizes it as “Go to Osaka Station East Exit”, sets the destination part of the sentence pattern that announces the destination to “Osaka Station East Exit”, and starts speech synthesis output of this response sentence. However, when the user, hearing “Osaka Station”, immediately corrects it with “No, it's Osaki Station”, the device immediately stops the response it was outputting, performs recognition on the user's corrective utterance, and extracts “Osaki Station” from the word string as a keyword that can be a destination. It then replaces “Osaka Station” in the earlier response sentence with “Osaki Station”, deletes the preceding “The destination is”, and inserts the interjection “Oh”, which expresses surprise, at the beginning to create an effective response sentence. Next, this response sentence is output as speech so that the user can confirm the recognition result. In this example, the misrecognition has been resolved, so the dialogue ends with the user's affirmative utterance “Yes”.
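The response rewriting described for the first embodiment can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `revise_response`, the representation of the response as a list of word chunks, and the `slot_index` parameter are hypothetical and do not appear in the patent text.

```python
def revise_response(response_words, corrected_keyword, slot_index, interjection="Oh,"):
    """Rebuild a response after a corrective utterance (illustrative sketch).

    response_words: the response that was being spoken, as a list of chunks.
    slot_index: position of the keyword slot in that list (an assumption;
        the text only says the keyword is compared with the word string).
    """
    if response_words[slot_index] == corrected_keyword:
        # The keyword already matches, so nothing needs to change.
        return list(response_words)
    # Replace the mismatched keyword, drop the words before it,
    # and put the interjection at the head of the new response.
    tail = [corrected_keyword] + response_words[slot_index + 1:]
    return [interjection] + tail


original = ["The destination is", "Osaka Station", "East Exit"]
revised = revise_response(original, "Osaki Station", slot_index=1)
print(" ".join(revised))
```

Keeping the words after the corrected keyword lets the read-back resume where the interruption happened, which matches the "Osaki Station East Exit" example above.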
[0015]
  As described above, according to the first embodiment of the present invention, the device recognizes the user's speech, accepts a corrective utterance from the user while it is responding on the basis of the recognition result, partially changes the reply content based on the recognition result of that corrective utterance, and then resumes the reply. A misrecognition can thus be corrected immediately by a corrective utterance, and the user can immediately judge the new recognition result from the response, so the efficiency of the dialogue with the user can be improved.
[0016]
  FIG. 2 shows a flowchart of the voice interaction method according to the second embodiment of the present invention.
[0017]
  As shown in FIG. 2, in the voice dialogue method of the second embodiment, recognition of the user's speech is first started (201). As a result of recognition processing on some utterance from the user, a sentence (word string) is ultimately estimated by continuous speech recognition. During continuous speech recognition, the input speech is matched against word hypotheses (candidate words) every few tens of milliseconds using the Viterbi algorithm or the like, and the top-ranked word is retained together with its score (likelihood) at each node of the two-dimensional space formed by the time axis of the input speech and the time axis of the word hypotheses. In general, only the cumulative score of the node up to that point is retained, but because this invention needs a likelihood for each word, the likelihood is stored along with each estimated word candidate (202). When the user's utterance ends, the word-string path with the minimum cumulative score is extracted by backtracking (203). In this way, the estimated word string for the user's utterance is obtained together with the likelihood of each word. The device starts voice output for the response (205), compares each likelihood with a preset threshold, and judges that any word whose likelihood is below the threshold is likely to have been misrecognized.
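The idea of keeping a per-word score alongside the cumulative score and recovering both by backtracking can be illustrated with a small dynamic-programming sketch. It is a simplification, not the recognizer described in the patent: word hypotheses are assumed to be given already as `(start_frame, end_frame, word, cost)` tuples, the score is treated as a cost where lower is better, and the name `best_word_path` is hypothetical.

```python
def best_word_path(hypotheses, num_frames):
    """Find the minimum-cumulative-cost word sequence over a toy lattice.

    hypotheses: list of (start_frame, end_frame, word, cost), end > start.
    Returns [(word, word_cost), ...] covering frames 0..num_frames,
    or [] if no sequence of hypotheses covers the whole span.
    """
    inf = float("inf")
    best_cost = [inf] * (num_frames + 1)
    back = [None] * (num_frames + 1)  # (previous_frame, word, word_cost)
    best_cost[0] = 0.0
    for end in range(1, num_frames + 1):
        for start, stop, word, cost in hypotheses:
            if stop == end and best_cost[start] + cost < best_cost[end]:
                best_cost[end] = best_cost[start] + cost
                # Keep the word's own cost, not only the running total.
                back[end] = (start, word, cost)
    if back[num_frames] is None:
        return []
    # Backtrack from the last frame to recover words and per-word costs.
    path, frame = [], num_frames
    while frame > 0:
        start, word, cost = back[frame]
        path.append((word, cost))
        frame = start
    return list(reversed(path))


lattice = [
    (0, 3, "Osaki Station", 1.0),
    (0, 3, "Osaka Station", 1.5),
    (3, 5, "East Exit", 2.5),
    (3, 5, "West Exit", 2.0),
]
print(best_word_path(lattice, num_frames=5))
```

Because each backpointer stores the word's own cost, the per-word values needed for the later threshold comparison are available without recomputing anything after backtracking.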
[0018]
  It is desirable to have the user confirm such words and make a corrective utterance, but explicitly asking “Is ○○ correct?” as a confirmation response each time does not make for an efficient dialogue. Therefore, when the estimated word string is spoken back, a word judged likely to have been misrecognized is deliberately synthesized at a slow speech rate (about 3 morae per second). The read-out speed is lowered (206) to draw the user's attention and to make it easy for the user to make a corrective utterance at once. If a corrective utterance comes from the user (207), the same processing is applied to that corrective utterance. Of course, if there is no misrecognition, no corrective utterance comes from the user, and the dialogue ends.
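A minimal sketch of the threshold comparison and slow read-back might look like the following. The slow rate of 3 morae per second comes from the text; the normal rate, the threshold value, the convention that a higher likelihood means more confidence, and the name `plan_speech_rates` are assumptions made for illustration.

```python
SLOW_RATE = 3.0    # morae per second, the approximate figure given in the text
NORMAL_RATE = 7.0  # morae per second, an assumed default for confident words


def plan_speech_rates(scored_words, threshold):
    """Assign a speech rate to each word of the read-back.

    scored_words: list of (word, likelihood); higher is assumed to mean
    more confident. Words below the threshold are spoken slowly.
    """
    plan = []
    for word, likelihood in scored_words:
        rate = SLOW_RATE if likelihood < threshold else NORMAL_RATE
        plan.append((word, rate))
    return plan


plan = plan_speech_rates([("Osaki Station", 0.9), ("West Exit", 0.4)], threshold=0.6)
print(plan)
```

A synthesizer front end could then apply each rate to the corresponding word, so only the doubtful word is slowed down rather than the whole sentence.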
[0019]
  This dialogue proceeds as in the dialogue example of the voice dialogue method of the second embodiment shown in FIG. 14. The device performs recognition on the user's utterance “Go to Osaki Station East Exit” and misrecognizes part of it as “West Exit”. Because the likelihood of this word is low, the device judges that a misrecognition is likely and, when synthesizing the response sentence, deliberately outputs “West Exit” at a slow speech rate. The user, alerted by the sudden change in speech rate, hears “West Exit” and immediately corrects it with “No, it's the East Exit”. The device then immediately stops the speech synthesis output of the response, performs recognition on the user's corrective utterance, and extracts “East Exit” from the word string as a keyword that can be a destination. It then creates the response sentence “Oh, the East Exit, right?”. Next, this response sentence is output as speech so that the user can confirm the recognition result. In this example, the misrecognition has been resolved, so after the user gives the affirmative utterance “Yes”, the device synthesizes the complete response sentence “Setting the destination to Osaki Station East Exit”.
[0020]
  As described above, according to the second embodiment of the present invention, the user's speech is recognized and, in the reply to the user based on the recognition result, the part of the recognition result in which confidence is low is repeated back slowly to elicit a corrective utterance. This makes it easier for the user to make a corrective utterance and improves the efficiency of the dialogue with the user.
[0021]
  FIG. 3 shows a flowchart of a voice interaction method according to the third embodiment of the present invention.
[0022]
  As shown in FIG. 3, in the voice dialogue method of the third embodiment, recognition of the user's speech is first started (201). As a result of recognition processing on some utterance from the user, a sentence (word string) is ultimately estimated by continuous speech recognition. During continuous speech recognition, the input speech is matched against word hypotheses (candidate words) every few tens of milliseconds using the Viterbi algorithm or the like, and the top-ranked word is retained together with its score (likelihood) at each node of the two-dimensional space formed by the time axis of the input speech and the time axis of the word hypotheses. In general, only the cumulative score of the node up to that point is retained, but because this invention needs a likelihood for each word, the likelihood is stored along with each estimated word candidate (202). When the user's utterance ends, the word-string path with the minimum cumulative score is extracted by backtracking (203). In this way, the estimated word string for the user's utterance is obtained together with the likelihood of each word. The device starts voice output for the response (205), compares each likelihood with a preset threshold, and judges that any word whose likelihood is below the threshold is likely to have been misrecognized.
[0023]
  It is desirable to have the user confirm such words and make a corrective utterance, but explicitly asking “Is ○○ correct?” as a confirmation response each time does not make for an efficient dialogue. Therefore, when the estimated word string is spoken back, a word judged likely to have been misrecognized is deliberately synthesized at a slow speech rate. In addition, to draw the user's attention, the word ending is lengthened (206a), a pause is inserted at the word ending, or a hesitation such as “eeto” (“um”) is deliberately inserted at the word ending (206b). This buys time in a natural way and makes it easy for the user to make a corrective utterance. If a corrective utterance comes from the user (207), the same processing is applied to that corrective utterance. Of course, if there is no misrecognition, no corrective utterance comes from the user, and the dialogue ends.
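The three ways of buying time at the word ending (lengthening, a pause, or a hesitation word) can be sketched as a small planning function. Everything about the data layout here is an assumption for illustration: the segment dictionaries, the strategy names, the 0.5 second pause, and the function name `render_uncertain_word` are hypothetical and not taken from the patent.

```python
def render_uncertain_word(word, strategy):
    """Plan how to speak one low-confidence word (illustrative sketch).

    strategy: "lengthen" (stretch the final syllable), "pause" (silence
    after the word), or "filler" (append a hesitation word).
    Returns a list of segment dictionaries for a hypothetical synthesizer.
    """
    segment = {
        "text": word,
        "rate": "slow",            # the uncertain word is always slowed
        "lengthen_final": False,
        "pause_after_sec": 0.0,
    }
    segments = [segment]
    if strategy == "lengthen":
        segment["lengthen_final"] = True
    elif strategy == "pause":
        segment["pause_after_sec"] = 0.5  # assumed pause length
    elif strategy == "filler":
        segments.append({
            "text": "um",
            "rate": "normal",
            "lengthen_final": False,
            "pause_after_sec": 0.0,
        })
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return segments


print(render_uncertain_word("West Exit", "lengthen"))
print(render_uncertain_word("West Exit", "filler"))
```

Separating the plan from the synthesizer keeps the choice of strategy independent of how a particular engine implements lengthening or pauses.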
[0024]
  FIG. 15 shows dialogue examples (a) and (b) of the voice dialogue method according to the third embodiment of the present invention.
[0025]
  This dialogue proceeds, for example, as in dialogue example (a) of FIG. 15. The device performs recognition on the user's utterance “Go to Osaki Station East Exit” and misrecognizes part of it as “West Exit”. Because the likelihood of this word is low, the device judges that a misrecognition is likely and, when synthesizing the response sentence, deliberately outputs “nishiguchi” (“West Exit”) at a slow speech rate and lengthens its final syllable “chi”. The user, alerted by the sudden change in speech rate, hears the drawn-out “nishiguchi” and can correct it with “No, it's the East Exit” before this synthesized speech ends. The device then immediately stops the response speech output, performs recognition on the user's corrective utterance, and extracts “East Exit” from the word string as a keyword that can be a destination. It then creates the response sentence “Oh, the East Exit, right?”. Next, this response sentence is output as speech so that the user can confirm the recognition result.
[0026]
  In dialogue example (b) of FIG. 15, the hesitation “eeto” (“um”) is inserted at the ending of “West Exit”, producing the same effect as dialogue example (a) of FIG. 15.
[0027]
  As described above, according to the third embodiment of the present invention, the user's speech is recognized and, in the reply to the user based on the recognition result, the part of the recognition result in which confidence is low is repeated back slowly, and time is bought by lengthening the final syllable of the word ending or by similar means. This makes it easier for the user to make a corrective utterance and improves the efficiency of the dialogue with the user.
[0028]
  FIG. 4 shows a flowchart of the voice interaction method of the fourth embodiment of the present invention.
[0029]
  As shown in FIG. 4, in the voice dialogue method of the fourth embodiment, the user's utterance is first stored sequentially in a frame buffer in frames of 10 msec to 30 msec while features are extracted from the frame data. The recognition dictionary contains the candidate words that can come first. The distance (score) between the per-frame features of these words' speech and the frame features of the input speech is calculated, and the optimal frame correspondence is found using the Viterbi algorithm or the like. In general, each time the frame number advances, pruning based on the cumulative score is performed and the candidate words are narrowed down. For example, if the score of the top-ranked word is lower than a predetermined threshold at the stage when matching against the top few words has finished, no word can be taken as the final candidate. In that case an intermediate response sentence is selected, even while the user is still speaking, and a corrective utterance is requested from the user by speech synthesis. It is effective to vary this intermediate response sentence according to the score of the top-ranked word.
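The frame-by-frame pruning of candidate words on their cumulative score can be illustrated with a simplified sketch. It assumes that a per-frame distance for every candidate word is already available (the feature extraction and Viterbi alignment are omitted), that a lower cumulative distance is better, and that pruning uses a fixed beam width. The names `match_with_pruning` and `beam` are hypothetical.

```python
def match_with_pruning(frame_distances, beam):
    """Accumulate per-frame distances and prune weak candidates.

    frame_distances: one dict per frame mapping word -> distance at that
        frame (lower is assumed to be better).
    beam: candidates whose cumulative distance exceeds the current best
        by more than this amount are dropped.
    Returns surviving (word, cumulative_distance) pairs, best first.
    """
    cumulative = {}
    active = None
    for distances in frame_distances:
        if active is None:
            active = set(distances)  # start with every dictionary word
        for word in active:
            cumulative[word] = cumulative.get(word, 0.0) + distances[word]
        best = min(cumulative[word] for word in active)
        # Pruning step: keep only words close enough to the current best.
        active = {w for w in active if cumulative[w] - best <= beam}
    if active is None:
        return []
    ranked = sorted(active, key=lambda w: cumulative[w])
    return [(w, cumulative[w]) for w in ranked]


frames = [
    {"a": 1, "b": 2, "c": 10},
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 3, "c": 0},
]
print(match_with_pruning(frames, beam=5))
```

In this toy input the word "c" is dropped after the first frame, so its later good frames are never considered, which is the usual trade-off of pruning: less work at the risk of discarding a late bloomer.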
[0030]
  Table 1 shows examples of intermediate response sentences.
[0031]
[Table 1]

[0032]
  As shown in Table 1, for example, when the score of the top candidate word is low, the device gives the polite intermediate response “I'm sorry. Please say that again.” to prompt the user politely to repeat the utterance, whereas when the score is only somewhat low, it simply prompts a repeat with “Haa?” (“Huh?”). When the score is ordinary, the word has probably been recognized, so no response is given. When the score is clearly high, the device can be confident, so it gives the back-channel response “Hai” (“Yes”) to make the dialogue with the user more natural. When matching against a word has finished, the recognition dictionary is updated, in accordance with the assumed word-string rules (grammar), to a recognition dictionary containing the words that can come next, and recognition of the input speech continues. If word matching fails and a corrective utterance is requested, the word dictionary is not updated, and the re-input speech is recognized. Word matching is performed in this way until the user's voice input ends, and finally the time series of the top-ranked words at each step is output as a sentence.
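The choice of intermediate response by score band can be sketched as below. The four bands and the example phrases follow the description of Table 1 above; the numeric thresholds, the convention that a higher score means more confidence, and the function name `interim_response` are assumptions made for illustration.

```python
def interim_response(score, low, somewhat_low, high):
    """Pick an intermediate response from the top word's score.

    Thresholds must satisfy low < somewhat_low < high; a higher score is
    assumed to mean more confidence. Returns None when nothing is said.
    """
    if score < low:
        return "I'm sorry. Please say that again."  # polite re-utterance request
    if score < somewhat_low:
        return "Haa?"                               # brief re-utterance prompt
    if score < high:
        return None                                 # ordinary score: stay silent
    return "Hai"                                    # clearly high: back-channel


for s in (0.1, 0.35, 0.6, 0.95):
    print(s, interim_response(s, low=0.2, somewhat_low=0.5, high=0.9))
```

Returning `None` for the ordinary band keeps silence explicit, so the caller can distinguish "say nothing" from an empty response string.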
[0033]
  For example, as in the dialogue example of the voice dialogue method of the fourth embodiment shown in FIG. 16, the score for “Ano ne” (“You know”) is high and a pause follows it, so the device merely gives the back-channel response “Hai”. The score for “The meeting place is” is low, however, so the device requests a corrective utterance with “Haa?”. In this way the recognition result “The meeting place is Shibuya” is obtained.
[0034]
  As described above, according to the fourth embodiment of the present invention, speech recognition is performed sequentially while the user is speaking, and when the recognition result is not reliable, the apparatus immediately requests a repeated utterance from the user even in mid-utterance. This makes it easy for the user to correct the misrecognized part at once.
[0035]
  FIG. 5 shows a flowchart of the voice interaction method of the fifth embodiment of the present invention.
[0036]
  As shown in FIG. 5, in the voice interaction method of the fifth embodiment, the user's utterance is first stored sequentially in a frame buffer in frames of 10 msec to 30 msec, and a feature amount is extracted from each frame. The recognition dictionary holds the word candidates that can occur first, and for each frame the distance (score) between the feature amounts of these candidates and the frame feature amount of the input speech is calculated, so that the frame correspondence is determined. In general, each time the frame number advances, candidates are cut off on the basis of the accumulated score and the candidate words are narrowed down. If, for example, the score of the top-ranked word is below a predetermined threshold at the stage where matching against the top few words has finished, no word can become a final candidate, so a response sentence is created while the user is still speaking and a corrective utterance is requested from the user by speech synthesis. At this time, the recognition result estimated by the apparatus is included in the correction request sentence. In this way, the user can tell what was wrong with his or her utterance and can make a more accurate corrective utterance. When matching against a given word is completed, the recognition dictionary is replaced with one containing the words that can come next according to the assumed word-sequence rules (grammar), and recognition of the input speech continues. If word matching fails and a corrective utterance is requested, the dictionary is not updated and the re-input speech is recognized. Word matching proceeds in this way until the user's voice input ends, and finally the time series of the top-ranked word at each step is output as a sentence.
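The dictionary handling in this paragraph (advance to the next word set on success, keep the same set and ask again on failure, with the estimated word embedded in the request) can be sketched as below. The grammar table `GRAMMAR`, the threshold value, the function names, and the wording of the request are all hypothetical and serve only to illustrate the control flow. Scores are assumed to be normalized so that higher means more reliable.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical word-sequence grammar: each state maps to the words that may follow.
GRAMMAR: Dict[str, List[str]] = {
    "<start>": ["um", "meeting place"],
    "um": ["meeting place"],
    "meeting place": ["is"],
    "is": ["Shibuya", "Shinjuku"],
}


def active_dictionary(state: str) -> List[str]:
    """Words the recognizer should match next in the given grammar state."""
    return GRAMMAR.get(state, [])


def step(state: str, candidate: str, score: float,
         threshold: float = 0.5) -> Tuple[str, Optional[str]]:
    """Process one matched word.

    Returns (next_state, request). When the word is accepted, the state
    advances and request is None. Otherwise the state (and therefore the
    active dictionary) is unchanged and a correction request embedding the
    estimated word is returned.
    """
    if candidate in active_dictionary(state) and score >= threshold:
        return candidate, None
    return state, f'Did you say "{candidate}"?'
```

For example, `step("<start>", "meeting place", 0.9)` advances the state so that `active_dictionary` then yields `["is"]`, whereas a low-scoring word leaves the state where it was and produces a request to be synthesized.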
[0037]
  For example, in the dialogue example of the voice dialogue method according to the fifth embodiment of the present invention shown in FIG. 17, the score for "Ane" is high and there is a pause after "Ane", so the apparatus merely gives a back-channel response. The apparatus then misrecognizes the phrase rendered here as "meeting place", but because the score is low, it requests a corrective utterance by asking "Is it a meeting place?", with the recognized phrase embedded in the question. In this way, the user is prompted to make the corrective utterance "I'm waiting".
[0038]
  As described above, according to the fifth embodiment of the present invention, speech recognition is performed sequentially while the user is speaking, and when the recognition result is not reliable, a correction request with the low-scoring word embedded in it is issued even in mid-utterance. Prompting the user to repeat immediately in this way makes it easy for the user to correct the misrecognized part at once.
[0039]
  FIG. 6 shows a flowchart of the voice interaction method of the sixth embodiment of the present invention.
[0040]
  As shown in FIG. 6, in the voice interaction method according to the sixth embodiment, the user's utterance is first stored sequentially in a frame buffer in frames of 10 msec to 30 msec, a feature amount is extracted from each frame, and acoustic analysis of the frame data is performed (601). Each time, the S/N ratio is obtained by calculating the ratio of the average energy of the portions with voice to that of the portions without voice in the data up to several frames before the current frame. When the S/N ratio is sufficiently high (602), processing proceeds directly to speech recognition (605). When the S/N ratio is not sufficiently high (602), the input data is compared with noise data stored in advance (603), the similarity is calculated, and the closest noise is estimated. The type of the estimated noise is then reported to the user and a repeated utterance is requested (604). This allows the user to stop the noise source, or to speak again after the noise source has gone, which creates conditions in which recognition is easier.
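The S/N check and noise-type estimation in steps 601 to 605 can be sketched as follows. This is a minimal illustration under several assumptions that are not in the specification: per-frame energies and voiced/unvoiced flags are taken as given, the S/N threshold of 10 dB is arbitrary, noise types are represented by small feature vectors, and cosine similarity stands in for the unspecified similarity measure. All function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def snr_db(energies: Sequence[float], voiced: Sequence[bool]) -> float:
    """S/N ratio in dB: mean energy of voiced frames over mean of unvoiced frames."""
    speech = [e for e, v in zip(energies, voiced) if v]
    noise = [e for e, v in zip(energies, voiced) if not v]
    if not speech or not noise:
        raise ValueError("need at least one voiced and one unvoiced frame")
    mean_speech = max(sum(speech) / len(speech), 1e-12)  # avoid log of zero
    mean_noise = max(sum(noise) / len(noise), 1e-12)     # avoid division by zero
    return 10.0 * math.log10(mean_speech / mean_noise)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_noise(features: Sequence[float],
                  noise_db: Dict[str, Sequence[float]]) -> str:
    """Name of the stored noise type most similar to the input features (step 603)."""
    return max(noise_db, key=lambda name: cosine(features, noise_db[name]))


def check_noise(energies: Sequence[float], voiced: Sequence[bool],
                features: Sequence[float],
                noise_db: Dict[str, Sequence[float]],
                min_snr_db: float = 10.0) -> Optional[str]:
    """Return None to proceed to recognition (605), or a request sentence (604)."""
    if snr_db(energies, voiced) >= min_snr_db:
        return None
    kind = nearest_noise(features, noise_db)
    return f"There is noise that sounds like {kind}. Please say that again."
```

In use, `check_noise` would be called on the most recent window of frames; a returned sentence is handed to speech synthesis, and `None` means the frames go on to recognition unchanged.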
[0041]
  For example, in the dialogue example of the voice dialogue method according to the sixth embodiment of the present invention shown in FIG. 18, the scores for "Ane" and "Monday" are high, and because there is a pause after "Ane", the apparatus responds "yes" there. Noise is then mixed into "meeting place" and the S/N ratio falls, so the input noise is compared with the stored noise data and is estimated to be aircraft noise. The apparatus therefore says "Wow, that airplane-like sound is noisy" and then requests a corrective utterance with "Tell me again." In this way, the user is prompted to make a corrective utterance after the noise that hinders speech recognition has been pointed out.
[0042]
  As described above, according to the sixth embodiment of the present invention, when ambient noise other than the user's voice is mixed into the user's voice and the recognition result is therefore not reliable, the type of noise is estimated, the user's utterance is interrupted, and the user is told the type of ambient noise and that this noise has made recognition difficult. The user can thus easily remove the cause of the misrecognition.
[0043]
  FIG. 7 shows a block diagram of a voice interactive apparatus according to the seventh embodiment of the present invention.
[0044]
  As shown in FIG. 7, in the voice interactive apparatus according to the seventh embodiment, the voice recognition means 11 first recognizes an utterance from the user, and a response (read-back) of the recognition result is given. When a corrective utterance is newly input by the user during this response, the voice recognition means 11 detects it at once, and the voice synthesis output stop means 12 immediately stops the response being output. Next, a word string is estimated as the result of recognizing the corrective utterance. The response sentence generation means 13 extracts keywords from this word string and creates a response sentence by embedding the candidate word string in a response sentence pattern selected from the response sentence pattern database 15 by the response sentence selection means 14. In many cases the sentence pattern is the same as the one used for the previous utterance, so to the user it appears that only part of the repeated sentence has changed. The response sentence is passed to the speech synthesis means 16 and becomes synthesized speech. Its output timing is determined by the user psychological model calculation means 18, which issues a command to the voice synthesis output control means 17. Specifically, when corrective utterances have continued up to the input of the present one, the user psychological model calculation means 18 judges that the user is very likely to be irritated and has the response uttered immediately, whereas at a stage where the dialogue has only just begun, up to 1 second is allowed before the response sentence is produced. In this way, the user can make a corrective utterance at any time and can immediately confirm the recognition result.
[0045]
  As described above, the seventh embodiment of the present invention provides means for selecting a reply sentence to the user based on the recognition result of the voice recognition means that recognizes the user's voice, speech synthesis means for voicing the reply sentence, voice recognition means that recognizes the user's corrective utterance even during the reply, means for stopping speech synthesis when the user's corrective utterance is detected, means for correcting the content of the response based on the recognition result of the corrective utterance, and means for outputting the synthesized speech of the corrected response at an appropriate timing based on the user psychological model. The user can therefore make a corrective utterance without psychological burden and can immediately check the result.
[0046]
  FIG. 8 shows a block diagram of a voice interactive apparatus according to the eighth embodiment of the present invention.
[0047]
  As shown in FIG. 8, in the voice interactive apparatus of the eighth embodiment, the voice recognition means 21 first recognizes an utterance from the user, and a response (read-back) of the recognition result is given. The speech speed setting means 29 commands the speech synthesis means 26 to synthesize only the words with low scores at a deliberately slow speech speed (about 3 mora/second). In this way, the portions likely to have been misrecognized are presented to the user in an easily noticeable manner. When a corrective utterance is input by the user, the voice recognition means 21 detects it at once, and the voice synthesis output stop means 22 immediately stops the response being output. Next, a word string is estimated as the result of recognizing the corrective utterance. The response sentence generation means 23 extracts keywords from this word string and creates a response sentence by embedding the candidate word string in a response sentence pattern selected from the response sentence pattern database 25 by the response sentence selection means 24. The response sentence is passed to the speech synthesis means 26 and becomes synthesized speech, and its output timing is determined by the user psychological model calculation means 28, which issues a command to the voice synthesis output control means 27. In this way, the user can tell which portion is likely to have been misrecognized, can immediately make a corrective utterance, and can immediately confirm the recognition result.
[0048]
  As described above, the eighth embodiment of the present invention provides means for selecting a response sentence to the user based on the recognition result of the voice recognition means that recognizes the user's voice, speech synthesis means for voicing the response sentence, voice recognition means that recognizes the user's corrective utterance even during the reply, means for stopping speech synthesis when the user's corrective utterance is detected, means for correcting the content of the response based on the recognition result of the corrective utterance, and means for outputting the synthesized speech of the corrected response at an appropriate timing based on the user psychological model. The recognition result can therefore be confirmed immediately.
[0049]
  FIG. 9 shows a block diagram of a voice interactive apparatus according to the ninth embodiment of the present invention.
[0050]
  As shown in FIG. 9, in the voice interactive apparatus according to the ninth embodiment, the voice recognition means 31 first recognizes an utterance from the user, and a response (read-back) of the recognition result is given. The speech speed setting means 39 commands the speech synthesis means 36 to synthesize only the words with low scores at a deliberately slow speech speed (about 3 mora/second). In addition, the ending extension means 30 deliberately extends the ending of such a word. (Alternatively, this part may act as pause insertion means 30a, inserting a pause immediately after the word that is longer the lower the score, or as correction utterance induction means 30b, inserting a word such as "um" that expresses uncertainty and induces a corrective utterance.) In this way, the portions likely to have been misrecognized are presented to the user in an easily noticeable manner. When a corrective utterance is input by the user, the voice recognition means 31 detects it at once and causes the voice synthesis output stop means 32 to stop the response being output immediately.
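The prosody control described here (slow rate for low-scoring words, extended ending, score-dependent pause, optional filler) can be sketched as a plan that a synthesizer could consume. Only the slow rate of about 3 mora/second comes from the text; the normal rate, the low-score threshold, the maximum pause, the linear pause rule, and all names are assumptions. For brevity the sketch applies ending extension and the pause together and adds the filler optionally, whereas the text presents the pause and the filler as alternatives to ending extension. Scores are assumed to lie in the range 0 to 1, higher meaning more reliable.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SLOW_RATE = 3.0     # mora/second, from the specification ("about 3 mora/second")
NORMAL_RATE = 8.0   # mora/second, assumed value for confident words
LOW_SCORE = 0.5     # assumed threshold below which a word counts as low-scoring
MAX_PAUSE_S = 0.8   # assumed longest pause, used when the score is 0


@dataclass
class Segment:
    text: str
    rate_mora_per_s: float
    lengthen_ending: bool
    pause_after_s: float


def plan_readback(words: List[Tuple[str, float]],
                  filler: Optional[str] = None) -> List[Segment]:
    """Build a synthesis plan from (word, score) pairs."""
    plan: List[Segment] = []
    for text, score in words:
        if score < LOW_SCORE:
            # The lower the score, the longer the pause (linear rule, assumed).
            pause = MAX_PAUSE_S * (1.0 - max(score, 0.0) / LOW_SCORE)
            plan.append(Segment(text, SLOW_RATE, True, pause))
            if filler is not None:
                # Word that signals uncertainty and invites a correction.
                plan.append(Segment(filler, NORMAL_RATE, False, 0.0))
        else:
            plan.append(Segment(text, NORMAL_RATE, False, 0.0))
    return plan
```

Calling `plan_readback([("meeting place", 0.25), ("Shibuya", 0.9)], filler="um")` yields a slowed, lengthened first segment followed by the filler and then the confidently spoken second word.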
[0051]
  Next, a word string is estimated as the result of recognizing the corrective utterance. The response sentence generation means 33 extracts keywords from this word string and creates a response sentence by embedding the candidate word string in a response sentence pattern selected from the response sentence pattern database 35 by the response sentence selection means 34. The response sentence is passed to the speech synthesis means 36 and becomes synthesized speech. Its output timing is determined by the user psychological model calculation means 38, which issues a command to the voice synthesis output control means 37. In this way, the user can tell which portion is likely to have been misrecognized, can immediately make a corrective utterance, and can immediately confirm the recognition result.
[0052]
  As described above, according to the ninth embodiment of the present invention, for a word whose recognition reliability is low, means are provided that, in addition to uttering the word more slowly than the others, extend the ending of the word, insert a pause whose length depends on the reliability of the recognition result, or insert a voice such as "Uto" that induces a corrective utterance. The user's corrective utterance can therefore be made even easier.
[0053]
  FIG. 10 is a block diagram of a voice interactive apparatus according to the tenth embodiment of the present invention.
[0054]
  As shown in FIG. 10, in the voice interactive apparatus according to the tenth embodiment, the voice recognition means 41 first extracts feature amounts while storing the user's utterance sequentially in a frame buffer in frames of 10 msec to 30 msec. The recognition dictionary 49 holds the word candidates that can occur first, and for each frame the voice recognition means 41 calculates the distance (score) between the feature amounts of these candidates and the frame feature amount of the input speech. The optimum frame correspondence is determined by the Viterbi algorithm. In general, each time the frame number advances, candidates are cut off on the basis of the accumulated score and the candidate words are narrowed down. If, for example, the score of the top-ranked word is below a predetermined threshold at the stage where matching against the top few words has finished, no word can become a final candidate, so while the user is still speaking the re-utterance sentence selection means 44 selects an appropriate re-utterance request sentence from the re-utterance request sentence pattern database 45. The re-utterance generation means 43 passes the selected request sentence to the speech synthesis means 46, which synthesizes it, and a corrective utterance is requested from the user. The user can therefore immediately tell which word the apparatus could not recognize.
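The frame-by-frame narrowing of candidates and the final threshold check can be illustrated with the simplified sketch below. It is not the Viterbi alignment named in the text: per-frame scores for each candidate word are assumed to be given already, scores are treated as log-likelihood-like values where higher is better, and the beam width, acceptance threshold, and function names are hypothetical.

```python
from typing import Dict, List, Optional


def prune(accumulated: Dict[str, float], beam_width: float) -> Dict[str, float]:
    """Keep only candidates whose accumulated score is within beam_width of the best."""
    best = max(accumulated.values())
    return {w: s for w, s in accumulated.items() if s >= best - beam_width}


def match_word(frame_scores: List[Dict[str, float]],
               beam_width: float,
               accept_threshold: float) -> Optional[str]:
    """Return the best word, or None if a re-utterance should be requested.

    frame_scores[t][w] is the score of candidate word w at frame t. Every
    frame is assumed to contain a score for every surviving candidate.
    """
    accumulated: Dict[str, float] = {}
    for t, scores in enumerate(frame_scores):
        if t == 0:
            accumulated = dict(scores)
        else:
            accumulated = {w: acc + scores[w] for w, acc in accumulated.items()}
        accumulated = prune(accumulated, beam_width)  # cut-off at each frame
    if not accumulated:
        return None
    best_word = max(accumulated, key=lambda w: accumulated[w])
    mean_score = accumulated[best_word] / len(frame_scores)
    # Below the threshold no word can be a final candidate: ask the user again.
    return best_word if mean_score >= accept_threshold else None
```

A return value of `None` corresponds to the branch in which a re-utterance request sentence is selected and synthesized while the user is still speaking.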
[0055]
  As described above, the tenth embodiment of the present invention provides means for performing speech recognition sequentially while the user is speaking, means for determining the reliability of the partial recognition result, means for deciding from this reliability whether to request a re-utterance from the user, means for selecting a sentence that prompts the user to repeat, and speech synthesis means for voicing that sentence. The user can therefore immediately make a corrective utterance for the misrecognized part.
[0056]
  FIG. 11 shows a block diagram of a voice interactive apparatus according to the eleventh embodiment of the present invention.
[0057]
  As shown in FIG. 11, in the voice interactive apparatus according to the eleventh embodiment, the voice recognition means 51 extracts feature amounts while storing the user's utterance sequentially in a frame buffer in frames of 10 msec to 30 msec. The recognition dictionary 59 holds the word candidates that can occur first, and for each frame the voice recognition means 51 calculates the distance (score) between the feature amounts of these candidates and the frame feature amount of the input speech. The optimum frame correspondence is determined by the Viterbi algorithm. In general, each time the frame number advances, candidates are cut off on the basis of the accumulated score and the candidate words are narrowed down. If, for example, the score of the top-ranked word is below a predetermined threshold at the stage where matching against the top few words has finished, no word can become a final candidate, so while the user is still speaking the re-utterance sentence selection means 54 selects an appropriate re-utterance request sentence from the re-utterance request sentence pattern database 55. The re-utterance generation means 53 embeds the word candidate obtained from the voice recognition means 51 in the selected request sentence and passes it to the speech synthesis means 56, which synthesizes it, and a corrective utterance is requested from the user. The user can therefore immediately tell whether the apparatus recognized the utterance and, if not, how it was misrecognized.
[0058]
  As described above, the eleventh embodiment of the present invention provides means for performing speech recognition sequentially while the user is speaking, means for determining the reliability of the partial recognition result, means for deciding from this reliability whether to request a re-utterance from the user, means for generating, from the recognition result, a sentence that induces the user to repeat, speech synthesis means for voicing that sentence, and means for outputting the synthesized speech at an appropriate timing. Misrecognition can therefore be corrected easily before the user's utterance ends.
[0059]
  FIG. 12 shows a block diagram of a voice interactive apparatus according to the twelfth embodiment of the present invention.
[0060]
  As shown in FIG. 12, in the voice interactive apparatus of the twelfth embodiment, the acoustic analysis means 61 first stores the user's utterance sequentially in a frame buffer in frames of 10 msec to 30 msec and extracts a feature amount from each frame. The noise discrimination means 68 obtains the S/N ratio by calculating the ratio of the average energy of the portions with voice to that of the portions without voice in the data up to several frames before the current frame.
[0061]
  Next, when the S/N ratio is sufficiently high, processing proceeds directly to recognition. When the S/N ratio is low and the score output by the voice recognition means 62 is also low, the noise discrimination means 68 calculates the similarity between the input data and the noise database 69 held in advance and estimates the closest noise. The type of the estimated noise is reported to the user by the speech synthesis means 66, and a repeated utterance is requested. This allows the user to stop the noise source, or to speak again after the noise source has gone, which creates conditions in which recognition is easier.
[0062]
  As described above, the twelfth embodiment of the present invention provides means for distinguishing the user's voice from input from other sound sources, means for comparing that input with predetermined types of sound sources, means for performing speech recognition sequentially while the user is speaking, means for constantly monitoring the reliability of the recognition result, means for generating a sentence explaining the cause when the reliability falls, and speech synthesis means for voicing the generated sentence. The user can therefore easily remove the cause of the misrecognition.
[0063]
[Effects of the Invention]
  As described above, according to the present invention, when the reliability of the recognition result of the user's utterance is low, the apparatus promptly requests a corrective utterance from the user even while the user is still speaking. Conversely, in the confirmation response given after the user has finished speaking, the portions strongly suspected of misrecognition are spoken at a slower synthesized speech rate with their endings extended, which makes it easier to elicit a corrective utterance from the user. The efficiency of the dialogue with the user can thereby be increased.
[Brief description of the drawings]
FIG. 1 is a flowchart of the voice interaction method according to the first embodiment of the present invention.
FIG. 2 is a flowchart of the voice interaction method according to the second embodiment of the present invention.
FIG. 3 is a flowchart of the voice interaction method according to the third embodiment of the present invention.
FIG. 4 is a flowchart of the voice interaction method according to the fourth embodiment of the present invention.
FIG. 5 is a flowchart of the voice interaction method according to the fifth embodiment of the present invention.
FIG. 6 is a flowchart of the voice interaction method according to the sixth embodiment of the present invention.
FIG. 7 is a block diagram of the voice interactive apparatus according to the seventh embodiment of the present invention.
FIG. 8 is a block diagram of the voice interactive apparatus according to the eighth embodiment of the present invention.
FIG. 9 is a block diagram of the voice interactive apparatus according to the ninth embodiment of the present invention.
FIG. 10 is a block diagram of the voice interactive apparatus according to the tenth embodiment of the present invention.
FIG. 11 is a block diagram of the voice interactive apparatus according to the eleventh embodiment of the present invention.
FIG. 12 is a block diagram of the voice interactive apparatus according to the twelfth embodiment of the present invention.
FIG. 13 is a diagram showing a dialogue example of the voice dialogue method according to the first embodiment of the present invention.
FIG. 14 is a diagram showing a dialogue example of the voice dialogue method according to the second embodiment of the present invention.
FIG. 15 is a diagram showing a dialogue example of the voice dialogue method according to the third embodiment of the present invention.
FIG. 16 is a diagram showing a dialogue example of the voice dialogue method according to the fourth embodiment of the present invention.
FIG. 17 is a diagram showing a dialogue example of the voice dialogue method according to the fifth embodiment of the present invention.
FIG. 18 is a diagram showing a dialogue example of the voice dialogue method according to the sixth embodiment of the present invention.
FIG. 19 is a diagram showing an operation example of a conventional voice dialogue method.
[Explanation of symbols]
11, 21, 31, 41, 51, 62 Voice recognition means
12, 22, 32 Voice synthesis output stop means
13, 23, 33 Response sentence generation means
14, 24, 34 Response sentence selection means
15, 25, 35 Response sentence pattern database
16, 26, 36, 46, 56, 66 Speech synthesis means
17, 27, 37 Voice synthesis output control means
18, 28, 38 User psychological model calculation means
29, 39 Speech speed setting means
30 Ending extension means
30a Pause insertion means
30b Correction utterance induction means
49, 59 Recognition dictionary
44, 54 Re-utterance sentence selection means
43, 53 Re-utterance generation means
45, 55 Re-utterance request sentence pattern database
61 Acoustic analysis means
68 Noise discrimination means
69 Noise database

Claims

A voice dialogue method in which, in a reply to a user based on a recognition result of the user's speech, a portion of the recognition result in which confidence is low is repeated back more slowly than a portion in which confidence is high, and its word ending is extended.

The voice dialogue method according to claim 1, wherein a word that induces a corrective utterance by the user is inserted.

A voice dialogue apparatus comprising: means for recognizing a user's speech; means for generating a reply sentence to the user based on a recognition result of the speech; speech speed setting means for making the speech speed of a word portion for which the recognition result was low slower than that of the other portions when the reply sentence is voiced; ending extension means for extending the ending of the word portion; and speech synthesis means for synthesizing speech of the reply sentence in which, for the word portion, the speech speed has been set and the ending has been extended.

The voice dialogue apparatus according to claim 3, further comprising means for inserting a voice for inducing a corrective utterance.