JP3844874B2 - Multimodal interface device and multimodal interface method

Info

Publication number: JP3844874B2
Application number: JP04836498A
Authority: JP
Inventors: 哲朗知野; 克己田中
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1998-02-27
Filing date: 1998-02-27
Publication date: 2006-11-15
Anticipated expiration: 2018-02-27
Also published as: JPH11249773A

Description

【０００１】
【発明の属する技術分野】
本発明は、利用者と対話するマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法に関する。
【０００２】
【従来の技術】
近年、パーソナルコンピュータをはじめとする各種計算機システムにおいては、従来のキーボードやマウスなどによる入力と、ディスプレイなどによる文字や画像情報の出力に加えて、音声情報や画像情報などのマルチメディア情報を入出力することが可能になって来ている。
【０００３】
こういった状況に加え、自然言語解析や自然言語生成、あるいは音声認識や音声合成技術あるいは対話処理技術の進歩などによって、利用者と音声入出力を用いて対話する音声対話システムへの要求が高まっており、自由発話による音声入力を利用可能な対話システムである“ＴＯＳＢＵＲＧ−ＩＩ”（電気情報通信学会論文誌、Ｖｏｌ．Ｊ７７−Ｄ−ＩＩ、Ｎｏ．８，ｐｐ１４１７−１４２８，１９９４）など、様々な音声対話システムの開発がなされている。
【０００４】
また、さらに、こう言った音声入出力に加え、例えばカメラを使った視覚情報入力を利用したり、あるいは、タッチパネル、ペン、タブレット、データグローブ、フットスイッチ、対人センサ、ヘッドマウンドディスプレイ、フォースディスプレイ（提力装置）など、様々な入出力デバイスを通じて利用者と授受できる情報を利用して、利用者とインタラクションを行なうマルチモーダル対話システムへの要求が高まっている。
【０００５】
このマルチモーダルインタフェースは、人間同士の対話においても、例えば音声など一つのメディア（チャネル）のみを用いてコミュニケーションを行なっている訳ではなく、身振りや手ぶりあるいは表情といった様々なメディアを通じて授受される非言語メッセージを駆使して対話することによって、自然で円滑なインタラクションを行っている（“ＩｎｔｅｌｌｉｇｅｎｔＭｕｌｔｉｍｅｄｉａＩｎｔｅｒｆａｃｅｓ”，ＭａｙｂｕｒｙＭ．Ｔ，Ｅｄｓ．，ＴｈｅＡＡＡＩＰｒｅｓｓ／ＴｈｅＭＩＴＰｒｅｓｓ，１９９３）ことから考えても、自然で使いやすいヒューマンインタフェースを実現するための一つの有力な方法として期待が高まっている。
【０００６】
従来、たとえば利用者から音声入力がなされた場合には、入力された音声波形信号を例えばアナログ／デジタル変換し、単位時間当たりのパワー計算を行なうことなどによって、音声区間を検出し、例えばＦＦＴ（高速フーリエ変換）などの方法によって分析し、例えば、ＨＭＭ（隠れマルコフモデル）などの方法を用いて、あらかじめ用意した標準パターンである音声認識辞書と照合処理を行なうことなどによって、発声内容を推定し、その結果に応じた処理を行なう。
【０００７】
あるいは、例えばタッチセンサなどの接触式の入力装置を通じて、利用者からの指し示しジェスチャの入力がなされた場合には、タッチセンサの出力情報である、座標情報、あるいはその時系列情報、あるいは入力圧力情報、あるいは入力時間間隔などを用いて、指し示し先を同定する処理を行なう。
【０００８】
あるいは、例えば、“ＵｎｃａｌｉｂｒａｔｅｄＳｔｅｒｅｏＶｉｓｉｏｎｗｉｔｈＰｏｉｎｔｉｎｇｆｏｒａＭａｎ−ＭａｃｈｉｎｅＩｎｔｅｒｆａｃｅ”（Ｒ．Ｃｉｐｏｌｌａ，ｅｔ．ａｌ．，ＰｒｏｃｅｅｄｉｎｇｓｏｆＭＶＡ’９４，ＩＡＰＲＷｏｒｋｓｈｏｐｏｎＭａｃｈｉｎｅＶｉｓｉｏｎＡｐｐｌｉｃａｔｉｏｎ，ｐｐ．１６３−１６６，１９９４．）などに示された方法を用いて、単数あるいは複数のカメラを用いて、利用者の手などを撮影し、観察された、形状、あるいは動作などを解析することによって、利用者の指し示した、実世界中の指示対象、あるいは表示画面上の指示対象などを入力することが出来るようにしている。
【０００９】
また、同様に、例えば赤外線などを用いた距離センサなどを用いて、利用者の手の、位置、形、あるいは動きなどを認識することで、利用者の指し示した、実世界中の指示対象、あるいは表示画面上の指示対象などへの指し示しジェスチャを入力することが出来るようにしている。
【００１０】
あるいは、利用者の手に、例えば磁気センサや加速度センサなどを装着することによって、手の空間的位置や、動き、あるいは形状を入力したり、仮想現実（ＶＲ＝ＶｉｒｔｕａｌＲｅａｌｉｔｙ）技術のために開発された、データグローブやデータスーツを利用者が装着することで、利用者の手や体の、動き、位置、あるいは形状を解析することなどによって、利用者の指し示した、実世界中の指示対象、あるいは表示画面上の指示対象などを入力することが出来るようにしている。
【００１１】
ところで、利用者からの入力に対応して利用者への適切な出力を行なったり、あるいは利用者からの入力と利用者への出力のタイミングを適切に制御したり、あるいは、利用者からの入力の認識に失敗したりあるいは利用者への情報の出力に失敗をした場合など、利用者との間のコミュニケーションに関する何らかの障害が発生した場合などには、その障害の発生を検知し、かつその障害を解決するための、例えば確認のための情報の再提示や、あるいは利用者への問い返し質問対話や、あるいは対話の論議の流れを適切に管理するための対話管理処理が必要となる。
【００１２】
従来、こういった対話管理処理には、あらかじめ用意した対話の流れであるスクリプトを利用した方法や、あるいは例えば質問／回答、挨拶／挨拶といった互いに対となる発話の組である発話対や発話交換構造といった情報を利用した方法や、あるいは、対話の流れ全体を対話の参加者の各個人の計画（プラン）あるいは参加者間の共同の計画（プラン）として形式化し記述、生成あるいは認識するプランニングによる方法などが用いられている。
【００１３】
【発明が解決しようとする課題】
しかし、従来、それぞれのメディアからの入力の解析精度の低さや、それぞれの入出力メディアの性質が明らかとなっていないため、新たに利用可能となった各入出力メディアあるいは、複数の入出力メディアを効率的に利用し、高能率で、効果的で、利用者の負担を軽減する、マルチモーダルインタフェースは実現されていないという問題がある。具体的には、次の通りである。
【００１４】
つまり、各メディアからの入力の解析精度が不十分であるため、たとえば、音声入力における周囲雑音などに起因する誤認識の発生や、あるいはジェスチャ入力の認識処理において、入力デバイスから刻々得られる信号のなかから、利用者が入力メッセージとして意図した信号部分の切りだしに失敗することなどによって、誤動作が起こり、利用者への負担となっているという問題がある。
【００１５】
また、音声入力やジェスチャ入力など、利用者が現在の操作対象である計算機などへの入力として用いるだけでなく、例えば周囲の他の人間へ話しかけたりする場合にも利用されるメディアを用いたインタフェース装置では、利用者が、インタフェース装置ではなく、たとえば自分の横にいる他人に対して話しかけたり、ジェスチャを示したりした場合にも、インタフェース装置が自分への入力であると誤って判断をして、認識処理などを行なって、誤動作を起こり、その誤動作の取消や、誤動作の影響の復旧や、誤動作を避けるために利用者が絶えず注意を払わなくてはいけなくなるなどの負荷を含め、利用者への負担となっているという問題がある。
【００１６】
また、本来不要な場面においても、入力信号の処理が継続的にして行われるため、その処理負荷によって、利用している装置に関与する他のサービスの実行速度や利用効率が低下するという問題がある。
【００１７】
また、この問題を解決するために、音声やジェスチャなどの入力を行なう際に、たとえば、ボタンを押したり、メニュー選択などによって、特別な操作によってモードを変更するなどという方法が用いられているが、このような特別な操作は、人間同士の会話では不要な操作であるために不自然なインタフェースとなるだけでなく、利用者にとって繁雑であったり、操作の種類によっては、習得のための訓練が必要となったりすることによって、利用者の負担を増加するという問題がある。
【００１８】
また、例えば、音声入力の可否をボタン操作によって切替える場合などでは、音声メディアによる入力は、本来、口だけを使ってコミュニケーションが出来るため、例えば手で行っている作業を妨害することがなく、双方を同時に利用することが可能であると言う、音声メディア本来の利点を活かすことが出来ないという問題がある。
【００１９】
また、従来、指し示しジェスチャの入力に於いて、例えばタッチセンサを用いて実現されたインタフェース方法では、離れた位置からや、機器に接触せずに、指し示しジェスチャを行なうことが出来ないという問題がある。
【００２０】
さらに、例えばデータグローブや、磁気センサや、加速度センサなどを利用者が装着することで実現されたインタフェース方法では、機器を装着しなければ利用できないという問題点がある。
【００２１】
一方、カメラなどを用いて、利用者の手などの形状、位置、あるいは動きを検出することで実現されているインタフェース方法では、十分な精度が得られないために、利用者が入力を意図したジェスチャだけを、適切に抽出することが困難であり、結果として、利用者がジェスチャとしての入力を意図していない手の動きや、形やなどを、誤ってジェスチャ入力であると誤認識してしまったり、あるいは利用者が入力を意図したジェスチャを、ジェスチャ入力であると正しく抽出することが出来ない場合が多発し、結果として、例えば誤認識のために引き起こされる誤動作の影響の訂正が必要になったり、あるいは利用者が入力を意図して行なったジェスチャ入力が実際にはシステムに正しく入力されず、利用者が再度入力を行なう必要が生じ、利用者の負担を増加させてしまうという問題がある。
【００２２】
また、従来のマルチモーダルインタフェースでは、人間同士のコミュニケーションにおいては重要な役割を演じていると言われる、視線一致（アイコンタクト）、注視位置、身振り、手振りなどのジェスチャ、顔表情など非言語メッセージを、効果的に利用することが出来ないという問題がある。
【００２３】
また、利用者からの入力に対応して利用者への適切な出力を行なったり、あるいは利用者からの入力と利用者への出力のタイミングを適切に制御するためには、利用者の発話が開始されるタイミングや、あるいは利用者の発話が終了するタイミングなどを、事前に予測する必要があるが、スクリプトを利用した方法や、あるいは発話対や発話交換構造といった情報を利用した方法や、プランニングによる方法などを用いた従来の対話管理処理だけではそれを行なうことが困難であるという問題がある。
【００２４】
また、利用者からの入力の認識に失敗したり、あるいは利用者への情報の出力に失敗をした場合など、利用者との間のコミュニケーションに関する何らかの障害が発生した場合などには、その障害の発生を検知する必要があるが、スクリプトを利用した方法や、あるいは発話対や発話交換構造といった情報を利用した方法や、プランニングによる方法などを用いた従来の対話管理処理だけではそれを行なうことが困難であるという問題がある。
【００２５】
また、検知した障害を解決するための、例えば確認のための情報の再提示や、あるいは利用者への問い返し質問対話や、あるいは対話の論議の流れを適切に管理するための対話管理処理が必要であるが、スクリプトを利用した方法や、あるいは発話対や発話交換構造といった情報を利用した方法や、プランニングによる方法などを用いた従来の対話管理処理だけではそれを行なうことが困難であるという問題がある。
【００２６】
本発明はこのような事情を考慮してなされたもので、非言語メッセージを用いて利用者との対話のためのインタフェース動作を制御できるようにすることにより、新たに利用可能となった各入出力メディアあるいは、複数の入出力メディアを効率的に利用し、高能率で、効果的で、利用者の負担を軽減することが出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することを目的とする。
【００２７】
また、本発明の具体的な目的の一つは、各メディアからの入力の解析精度が不十分さに起因する誤認識や、利用者が入力メッセージとして意図した信号部分の切りだしの失敗に起因する誤動作を起こさず、利用者への余分な負担を生じないマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００２８】
また、他の具体的な目的は、音声入力やジェスチャ入力など、利用者が現在の操作対象である計算機などへの入力として用いるだけでなく、例えば周囲の他の人間へ話しかけたりする場合にも利用されるメディアを用いたインタフェース装置では、利用者が、インタフェース装置ではなく、たとえば自分の横にいる他人に対して話しかけたり、ジェスチャを示したりした場合に、インタフェース装置が自分への入力であると誤って判断することがないマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００２９】
また、別の具体的な目的は、上述のような計算機への入力を利用者が意図していないメッセージを誤って自己への入力であると誤認識したことによる誤動作や、その影響の復旧や、誤動作を避けるために利用者が絶えず注意を払わなくてはいけなくなるなどの負荷を含めた利用者への負担を生じないマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３０】
また、さらにもう一つの具体的な目的は、本来不要な場面においても、入力信号の処理が継続的にして行われるため、その処理負荷によって、利用している装置に関与する他のサービスの実行速度や利用効率が低下してしまうことのないマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３１】
また、さらにもう一つの具体的な目的は、音声やジェスチャなどの入力を行なう際に、たとえば、ボタンを押したり、メニュー選択などといった特別な操作によるモード変更が必要なく、自然で、利用者にとって繁雑でなく、習得のための訓練が不要であり、利用者の負担を増加しないマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３２】
また、さらにもう一つの具体的な目的は、例えば、口だけを使ってコミュニケーションが出来、例えば手で行なっている作業を妨害することがなく、双方を同時に利用することが可能であると言う、音声メディア本来の利点を活かすことが出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３３】
また、さらにもう一つの具体的な目的は、離れた位置からや、機器に接触せずに、ジェスチャの入力を行なう際に、利用者が入力を意図したジェスチャだけを、適切に抽出できるマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３４】
また、さらにもう一つの具体的な目的は、人間同士のコミュニケーションにおいては重要な役割を演じていると言われる、視線一致（アイコンタクト）、注視位置、身振り、手振りなどのジェスチャ、顔表情など非言語メッセージを、効果的に利用することが出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３５】
また、さらにもう一つの具体的な目的は、利用者からの入力に対応して利用者への適切な出力を行なったり、あるいは利用者からの入力と利用者への出力のタイミングを適切に制御するために、利用者の発話が開始されるタイミングや、あるいは利用者の発話が終了するタイミングなどを、事前に予測することの出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３６】
また、さらにもう一つの具体的な目的は、利用者からの入力の認識に失敗したり、あるいは利用者への情報の出力に失敗をした場合など、利用者との間のコミュニケーションに関する何らかの障害が発生した場合などには、その障害の発生を適切に検知することの出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３７】
また、さらにもう一つの具体的な目的は、検知した障害を解決するための、例えば確認のための情報の再提示や、あるいは利用者への問い返し質問対話や、あるいは対話の論議の流れの適切な管理を行なうことの出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法を提供することである。
【００３８】
【課題を解決するための手段】
本発明のマルチモーダルインタフェース装置は、利用者の呼吸の状況を観察し利用者の呼吸の状態が定常状態での吸気または排気である定常呼吸であるか深呼吸または息継ぎによる非定常状態での吸気である非定常吸気であるかを示す呼吸状況情報を出力する呼吸状況認識手段と、利用者の発する音声の、取り込み、あるいは録音、あるいは加工、あるいは分析、あるいは認識の少なくとも一つの処理を行なう入力音声処理手段と、前記呼吸状況情報に基づいて前記利用者の非定常吸気が検出された場合、前記入力音声処理手段を制御して、利用者からの音声入力を非受け付け状態から受け付け状態に切り替える受け付け可否制御処理を実行する制御手段とを具備したことを特徴とする。
【００４２】
このように利用者から認識した呼吸状況情報に基づいて入力音声処理手段の動作を制御することにより、音声入力の解析精度が不十分さに起因する誤認識や、利用者が入力音声として意図した信号部分の切りだしの失敗に起因する誤動作を起こさず、利用者への余分な負担を生じないマルチモーダルインタフェース装置を提供すること等が可能となる。
【００５０】
【発明の実施の形態】
（ｉ）第１の実施形態
以下、図面を参照して、本発明の第１実施形態に係るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法について説明する。
【００５１】
図１は、本発明の第１実施形態に係るマルチモーダルインタフェース装置の構成例であり、１０１は呼吸検出部、１０２は音声入力部、１０３は制御部、１０４はアプリケーションである。このマルチモーダルインタフェース装置はコンピュータなどを用いて、音声情報による利用者との対話を支援するためのシステムである。
【００５２】
図１に於いて、１０１は呼吸検出部を表しており、例えば、「ビジュアルセンシングによる呼吸監視システムの関心領域（ＲＯＩ）の設定の自動化」（三宅他、第１７回医療情報学連合大会予稿、１−Ｃ−１−３、ｐｐ．１６８−１６９、１９９７）に示された方法などの様に、例えばカメラから得られる利用者の画像から、例えば利用者の胸部を観察し呼吸に付随する動作を検出することなどによって、利用者の呼吸の状態を検知し、呼吸状況情報として随時出力するようにしている。また、利用者の身体に装着あるいは近接して配置したセンサからの情報を処理することによって、利用者の呼吸の状況を観察することもできる。
【００５３】
図２は、呼吸検出部１０１が出力する呼吸状況情報の例を表している。
【００５４】
図２に於いて、ＩＤの欄は各呼吸状況情報の識別記号を表しており、時間情報Ａは対応する呼吸の状況が観察された時刻が記録されており、また状況情報Ｂには観察された呼吸の状況を表す記号が記録されるようにしている。
【００５５】
各呼吸状況情報の状況情報Ｂの欄に於いて、「定常呼吸（吸気）」および「定常呼吸（排気）」は、利用者が定常状態で、それぞれ吸気および排気を行なっていることが観察されたことを表している。
【００５６】
また、「非定常呼吸（吸気）」および「非定常呼吸（排気）」は、利用者が、例えば深呼吸や息継ぎなど非定常状態で、それぞれ吸気および排気を行なっていることが観察されたことを表している。
【００５７】
また、図１に於いて、１０２は音声入力部を表しており、例えばマイクなどによって利用者の発した音声信号を電気信号に変換するなどして本装置への入力信号として取り込んだり、あるいはさらに例えばＡ／Ｄ（アナログディジタル）変換を施すことによって本装置で処理可能な表現への変換を行なったり、あるいはさらに、例えばＦＦＴ（高速フーリエ変換）などを用いて分析処理や加工処理を行なったり、あるいはさらに例えば複合類似度法やＨＭＭ（隠れマルコフモデル）やＤＰ（ダイナミックプログラミング）やニューラルネットワークなどといった方法を用いてあらかじめ用意した標準パターンと入力信号との間での照合処理を行なうことなどによって認識処理を行なったりするようにしている。
【００５８】
本音声入力部１０２による利用者の発する音声の、取り込み、あるいは録音、あるいは加工、あるいは分析、あるいは認識といった動作は制御部１０３によって制御されるようになっており、また音声入力部１０２によって得られる音声入力の処理結果も制御部１０３の制御に従って、アプリケーション１０４へと渡されるようにしている。
【００５９】
図１に於いて、１０３は制御部である。
【００６０】
制御部１０３は、呼吸検出部１０１から逐次得られる呼吸状況情報を参照し、音声入力部１０２およびアプリケーション１０４の内少なくとも一方を適宜制御し、利用者からの音声入力信号の受け付け可否制御、音声区間の推定処理、雑音低減処理、音声信号変換処理などを制御する。
【００６１】
なお、本制御部１０３の動作が、本装置の効果の実現において本質的な役割を演ずるものであるためその詳細は後述することとする。
【００６２】
図１に於いて、１０４はアプリケーションであり、制御部１０３の制御に応じて音声入力部１０２の出力を受けとり、例えばデータベースシステムでは、入力された検索要求に対応する検索結果を出力したり、あるいは音声録音システムでは、入力された音声信号を適切に保存するなどといったサービスを行なうものであり、コンピュータのアプリケーションプログラムに相当する。
【００６３】
つづいて、制御部１０３について詳説する。
【００６４】
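The control unit 103 operates according to processing procedure A below. FIG. 3 is a flowchart explaining processing procedure A.
【００６５】
<Processing procedure A>
A1: Control the voice input unit 102 to put voice input in the “non-accepting state”.
【００６６】
A2: Continuously monitor the respiration status information obtained from the respiration detection unit 101. If “unsteady respiration (inhalation)” is detected, go to step A3; otherwise remain at step A2.
【００６７】
A3: Control the voice input unit 102 to put voice input in the “accepting state”.
【００６８】
A4: Set timer T to 0 and (re)start it.
【００６９】
A5: If the predetermined time tA has elapsed on timer T, go to step A1; otherwise go to step A6.
【００７０】
A6: If a voice input I from the user is present at this point, go to step A8; otherwise go to step A7.
【００７１】
A7: If “unsteady respiration (inhalation)” is detected at this point in the respiration status information obtained from the respiration detection unit 101, go to step A4; otherwise go to step A5.
【００７２】
A8: Pass the result of processing voice input I by the voice input unit 102 to the application 104, and go to step A4.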
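Processing procedure A can be read as a two-state machine driven by periodic observations. The following Python sketch is one possible rendering under stated assumptions, not the implementation described in the specification: the class name `BreathGatedInput`, the `Tick` record, the symbol `UNSTEADY_INHALE`, and the polling model (one observation per call to `step`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical symbol standing for "unsteady respiration (inhalation)".
UNSTEADY_INHALE = "unsteady_inhale"


@dataclass
class Tick:
    """One polled observation (an assumed data shape, not from the source)."""
    t: float                      # observation time in seconds
    breath: str                   # status symbol from respiration detection unit 101
    speech: Optional[str] = None  # processed result from voice input unit 102, if any


class BreathGatedInput:
    """Sketch of procedure A: voice input is accepted only after an unsteady inhalation."""

    def __init__(self, t_a: float, deliver: Callable[[str], None]) -> None:
        self.t_a = t_a            # predetermined time tA
        self.deliver = deliver    # stands in for application 104
        self.accepting = False    # A1: start in the non-accepting state
        self.timer_start = 0.0    # start time of timer T

    def step(self, tick: Tick) -> None:
        # A5: the waiting time has run out, so return to A1.
        if self.accepting and tick.t - self.timer_start >= self.t_a:
            self.accepting = False
        if not self.accepting:
            # A2: stay here until an unsteady inhalation is observed.
            if tick.breath == UNSTEADY_INHALE:
                self.accepting = True        # A3
                self.timer_start = tick.t    # A4
            return
        if tick.speech is not None:          # A6
            self.deliver(tick.speech)        # A8
            self.timer_start = tick.t        # A4: restart timer T
        elif tick.breath == UNSTEADY_INHALE: # A7: a breath before speaking
            self.timer_start = tick.t        # A4: extend the waiting time
```

In this sketch, sound that arrives while the gate is closed never reaches the application, which corresponds to the ambient-noise case in the worked example, and every accepted input or further unsteady inhalation restarts timer T.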
【００７３】
以上が本発明に係る第１実施形態の構成とその機能である。
【００７４】
ここで先ず上述した処理について、具体例を用いて詳しく説明する。
【００７５】
（１）まず、ステップＡ１の処理によって、本装置の音声入力が非受け付け状態になる。
【００７６】
（２）ここで、利用者の周囲で雑音が発生したとする。
【００７７】
（３）ここでは音声入力は非受け付け状態にあるので、この雑音に起因する音声認識の誤認識は発生しない。
【００７８】
（４）つづいて、利用者が本装置への音声入力を行なうために、発声のために大きく息を吸ったものとする。
【００７９】
（５）この行動が、呼吸検出部１０１によって検知され、図２のｐ１０４のエントリに示した通りの呼吸状況情報が出力される。
【００８０】
（６）さらに、ステップＡ２〜Ａ４の処理によって、音声入力が受け付け状態に変更され、タイマＴがスタートされる。
【００８１】
（７）ここで利用者が音声入力を行なったとする。
【００８２】
（８）ここまでの処理によって音声入力は受け付け状態であるため、利用者の音声入力が受け付けられ、ステップＡ８によって、その処理結果がアプリケーション１０４へと送られ、所望のサービスが利用者に提供される。
【００８３】
以上の処理によって、利用者は明示的あるいは恣意的な操作をすることなく自然に音声入力を行なうことが可能となり、また周囲雑音による誤動作の発生も解消することが出来ている。
【００８４】
（９）その後、ステップＡ４の処理によってタイマＴがリスタートされる。
【００８５】
（１０ａ）もしこの段階で利用者が行なうべき音声入力がない場合には、利用者は、黙っていることとなり、タイマＴがｔＡを経過した段階でステップＡ５の処理によって、ステップＡ１へ進み、音声入力が非受け付け状態に戻る。
【００８６】
（１０ｂ）あるいは、もしこの利用者が次に行なうべき音声入力があり、次の音声入力を行なった場合には、ステップＡ６の処理によって、再度音声が受け付けられ、ステップＡ８によって、その処理結果がアプリケーション１０４へと送られ、所望のサービスが利用者に提供されたのち、ステップＡ４へ進み、タイマＴがリスタートされ、利用者からの音声入力の待ち受け時間が延長される。
【００８７】
（１０ｃ）あるいは、もしこの利用者が次に行なうべき音声入力があるが、まだ発声を行わず、発声準備のために息継ぎを行なった場合には、ステップＡ７の処理によって、ステップＡ４へ進み、タイマＴがリスタートされ、利用者からの音声入力の待ち受け時間が延長される。
【００８８】
（１１）以上の音声入力処理あるいは音声入力の待ち受け時間の延長処理は、利用者の行動に応じて任意回必要なだけ繰り返されたのち、ステップＡ５の分岐によって、ステップＡ１に進み、初期状態に戻る。
【００８９】
かくしてこのように構成された本装置の第１の実施形態によれば、音声入力の解析精度が不十分さに起因する誤認識や、利用者が入力音声として意図した信号部分の切りだしの失敗に起因する誤動作を起こさず、利用者への余分な負担を生じないマルチモーダルインタフェース装置およびマルチモーダルインタフェース方式を提供することが可能となる。
【００９０】
また、本来不要な場面での、入力音声信号の処理負荷を軽減し、利用している装置に関与する他のサービスの実行速度や利用効率が低下しない、マルチモーダルインタフェース装置およびマルチモーダルインタフェース方式を提供することが出来る。
【００９１】
また、音声入力を行なう際に、たとえば、ボタンを押したり、メニュー選択などといった特別な操作によるモード変更が必要なく、自然で、利用者にとって繁雑でなく、習得のための訓練が不要であり、利用者の負担を増加しないマルチモーダルインタフェース装置およびマルチモーダルインタフェース方式を提供することが出来る。
【００９２】
また、例えば、口だけを使ってコミュニケーションが出来、例えば手で行なっている作業を妨害することがなく、双方を同時に利用することが可能であると言う、音声メディア本来の利点を活かすことが出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方式を提供することが出来る。
【００９３】
また、人間同士のコミュニケーションにおいては重要な役割を演じていると言われる、非言語メッセージを、効率的に利用することが出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方式を提供することが出来るなど、多大な効果が奏せられる。
【００９４】
（ｉｉ）第２の実施形態
続いて、図面を参照して本発明の第２実施形態に係るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法について説明する。
【００９５】
図４は、本発明の第２実施形態に係るマルチモーダルインタフェース装置の構成例を表しており、注視対象検出部２０１、ジェスチャ認識部２０２、制御部２０３、およびアプリケーションプログラム２０４から構成されている。
【００９６】
図４に於いて、２０１は注視対象検出部を表しており、例えば、特願平０９−６２６８１号の「オブジェクト操作装置およびオブジェクト操作方法」と同様の方法によって、例えば利用者の姿を観察した画像情報の解析などによって、利用者が注視している対象を検出し、注視対象情報として随時出力するようにしている。
【００９７】
図５は、注視対象検出部２０１の出力する注視対象情報の例を表している。
【００９８】
図５の各エントリに於いて、ＩＤの欄には、各注視対象情報の識別信号が記録されており、時間情報Ａの欄には対応する注視が検出された時刻に関する情報が記録されるようにしている。
【００９９】
また、対象情報Ｂの欄には、対応する注視の対象となった物体あるいは領域を表す記号が記録されるようにしている。
【０１００】
（なお、図５のエントリｑ２５１およびｑ２５２の対象情報Ｂの欄に記載された記号「マインズアイ」については後述する。）
図４に於いて、２０２はジェスチャ認識部を表しており、これは、単数または複数のカメラなどによって得られる利用者の画像情報の処理、あるいは赤外線センサなどの遠隔センサ、装着センサなどによって得られる信号の処理などによって、利用者の手など体の部分あるいは体の全体の動作を解析し利用者からのジェスチャ入力を認識するものであり、ジェスチャ入力の解析、認識は、例えば、“ＵｎｃａｌｉｂｒａｔｅｄＳｔｅｒｅｏＶｉｓｉｏｎｗｉｔｈＰｏｉｎｔｉｎｇｆｏｒａＭａｎ−ＭａｃｈｉｎｅＩｎｔｅｒｆａｃｅ”（Ｒ．Ｃｉｐｏｌｌａ，ｅｔ．ａｌ．，ＰｒｏｃｅｅｄｉｎｇｓｏｆＭＶＡ’９４，ＩＡＰＲＷｏｒｋｓｈｏｐｏｎＭａｃｈｉｎｅＶｉｓｉｏｎＡｐｐｌｉｃａｔｉｏｎ，ｐｐ．１６３−１６６，１９９４．）などに示された方法を用いることができる。
【０１０１】
図６は、ジェスチャ認識部２０２が出力するジェスチャ認識情報の例を表している。図６の各エントリに於いて、ＩＤは各ジェスチャ認識情報の識別記号を表しており、開始時間情報Ａおよび終了時間情報Ｂの欄は、それぞれ対応するジェスチャの開始および終了時刻が記録されるようにしている。
【０１０２】
また、ジェスチャ種別情報Ｃの欄にはジェスチャ認識部２０２における処理によって得られたジェスチャの種別が記号で記録されるようにしている。
【０１０３】
図４に於いて、２０３は制御部を表しており、注視対象検出部２０１、およびジェスチャ認識部２０２、およびアプリケーション２０４を制御する。この制御部２０３が、視線検出情報に基づいて、ジェスチャ入力の受け付け可否、あるいはジェスチャ入力の検出あるいは認識に用いられるパラメータ情報の調整などの制御を行うことにより、本装置の効果が実現される。
【０１０４】
なお、本制御部２０３は、本装置の効果を実現する上で重要な役割を担うものであるため、その動作の詳細については後述することとする。
【０１０５】
図４に於いて、２０４はアプリケーションを表しており、本部品の役割は、前述第１実施形態におけるアプリケーション１０４と同様である。
【０１０６】
続いて制御部２０３について説明する。
【０１０７】
図７は、制御部２０３の内部構成の例を表しており、制御部２０３が、制御処理部２０３ａ、および注視解釈規則記憶部２０３ｂ、および注視状況記録部２０３ｃから構成されていることを示している。
【０１０８】
図８は注視解釈規則記憶部２０３ｂの内容の例を表しており、注視解釈規則の各エントリが、ＩＤ、および注視対象情報Ａ、および可能ジェスチャ種別リスト情報Ｂなどと分類され記録される様にしている。
【０１０９】
注視解釈規則記憶部２０３ｂの各エントリにおいて、ＩＤの欄は対応する規則の識別記号が記録される。
【０１１０】
また、注視対象情報Ａの欄には解釈すべき注視対象情報の注視対象の種類が記録されており、また、可能ジェスチャ種別リスト情報Ｂの欄には、注視対象情報Ａの欄に記録されてた注視対象を注視している状態で、提示されうるジェスチャの種別のリストが記録されるようにしている。
【０１１１】
図９は注視状況記憶部２０３ｃの内容の例を表しており、注視状況記憶部２０３ｃの各エントリが、ＩＤおよび、時間情報Ａ、および種別リスト情報Ｂなどと分類され記録される様にしている。
【０１１２】
注視状況記憶部２０３ｃの各エントリに於いて、ＩＤは対応する注視状況情報の識別記号である。
【０１１３】
また、時間情報Ａの欄には対応する注視情報の表す注視が行なわれた時間が記録されるようにしており、また種別リスト情報Ｂの欄には、対応する注視が行なわれたことによって規定されるその時点で可能なジェスチャの種別のリストが記録されるようにしている。
【０１１４】
以上が本発明の第２実施形態に係るマルチモーダルインタフェース装置の構成の説明である。
【０１１５】
つづいて、制御部２０３の動作について説明する。
【０１１６】
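The control unit 203 operates according to processing procedures B and C below, which run in parallel or alternately.
【０１１７】
FIG. 10 is a flowchart explaining processing procedure B, and FIG. 11 is a flowchart explaining processing procedure C.
【０１１８】
<Processing procedure B>
B1: When gaze target information Ei is received from the gaze target detection unit 201, go to step B2; otherwise remain at step B1.
【０１１９】
B2: Refer to the gaze interpretation rule storage unit 203b and find the entry Si whose gaze target information A has the same content as the target information B of gaze target information Ei.
【０１２０】
B3: Create a new entry Ui in the gaze status storage unit 203c, copy the content of the time information A of gaze target information Ei into the time information A field of entry Ui, and copy the content of the possible gesture type list information B of the entry Si found in step B2 into the type list information B field of entry Ui.
【０１２１】
B4: Go to step B1.
【０１２２】
<Processing procedure C>
C1: When gesture recognition information Gj is received from the gesture recognition unit 202, go to step C2; otherwise remain at step C1.
【０１２３】
C2: Refer to gesture recognition information Gj and obtain the content Tjs of its start time information A and the content Tje of its end time information B.
【０１２４】
C3: Refer to the gaze status storage unit 203c and obtain the set Su of its entries whose time information A is at or after Tjs and at or before Tje.
【０１２５】
C4: If the set Su is empty, go to step C7.
【０１２６】
C5: If the gesture type information C of gesture recognition information Gj is contained in the type list information B of every element of the set Su, go to step C6; otherwise go to step C7.
【０１２７】
C6: Accept gesture recognition information Gj as a gesture input, send it to the application 204, and go to step C1.
【０１２８】
C7: Discard gesture recognition information Gj without accepting it as a gesture input, and go to step C1.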
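Processing procedures B and C amount to logging, for each gaze observation, the gesture types permitted at that moment, and then accepting a gesture candidate only if every logged observation within its time span permits it. The Python sketch below illustrates this under stated assumptions: the class and field names are hypothetical, an unknown gaze target is assumed to permit no gestures (the specification does not say what happens when step B2 finds no rule), and the rule table in the usage note is invented for illustration because the contents of FIG. 8 are not reproduced in the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class GazeEvent:          # gaze target information (cf. FIG. 5)
    time: float           # time information A
    target: str           # target information B


@dataclass(frozen=True)
class GazeStatus:         # gaze status entry (cf. FIG. 9)
    time: float           # time information A
    allowed: FrozenSet[str]  # type list information B


@dataclass(frozen=True)
class GestureCandidate:   # gesture recognition information (cf. FIG. 6)
    start: float          # start time information A (Tjs)
    end: float            # end time information B (Tje)
    kind: str             # gesture type information C


class GazeGatedGestures:
    """Sketch of control unit 203: procedure B logs gaze, procedure C filters gestures."""

    def __init__(self, rules: Dict[str, Iterable[str]]) -> None:
        self.rules = rules                   # stands in for rule storage 203b
        self.status: List[GazeStatus] = []   # stands in for status storage 203c

    def on_gaze(self, ev: GazeEvent) -> None:
        # B2: look up the rule for the gazed target.
        # Assumption: a target with no rule permits no gestures.
        allowed = frozenset(self.rules.get(ev.target, ()))
        # B3: record the time together with the permitted gesture types.
        self.status.append(GazeStatus(ev.time, allowed))

    def on_gesture(self, g: GestureCandidate) -> bool:
        # C3: collect status entries whose time lies within [Tjs, Tje].
        su = [u for u in self.status if g.start <= u.time <= g.end]
        if not su:                # C4: empty set, so discard (C7)
            return False
        # C5: accept (C6) only if every entry permits this gesture type.
        return all(g.kind in u.allowed for u in su)
```

With a hypothetical rule table such as `{"camera": {"nod"}, "other_person": set()}`, a nod made while every logged gaze is on the camera is accepted, whereas a nod made while looking at another person, or during a span with no gaze record at all, is discarded.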
【０１２９】
続いて、本発明の第２実施形態の処理について、具体例を用いて説明する。
【０１３０】
（１）まず、時点ｔ１０の時点で、本装置の利用者が、他の人物の方向を向いていたものとする。
【０１３１】
（２）これに対する注視対象検出部２０１での処理によって、図５のＩＤがｑ２０１に示すような注視対象情報が生成され、制御部２０３へ伝えられる。
【０１３２】
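(3) Because this gaze target information q201 has been received, the procedure branches from step B1 to step B2, and step B2 retrieves, as Si, the entry s401 of the gaze interpretation rule storage unit 203b whose gaze target information A field holds a value of the same kind as “other person 1”, the content of target information B of gaze target information q201.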
【０１３３】
（４）ステップＢ３での処理によって、注視状況情報記憶部２０３ｃに新たなエントリｕ５０１が生成され、その時間情報Ａの欄に、注視対象情報ｑ２０１の時間情報Ａの内容が複写され、かつ、エントリｕ５０１の種別リスト情報Ｂの欄に、エントリｓ４０１の可能ジェスチャ種別リスト情報Ｂの内容が複写された後、ステップＢ４によりステップＢ１へ戻る。
【０１３４】
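(5) The same processing is then applied to gaze target information q202 to q204 of FIG. 5, obtained in turn from the gaze target detection unit 201, which produces the gaze status entries u502 to u504 of the gaze status storage unit 203c shown in FIG. 9.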
【０１３５】
（６）ここで、ジェスチャ認識部２０２から図６ジェスチャ認識情報の例のｒ３０１のエントリに示したジェスチャ認識情報が得られたとする。
【０１３６】
（７）このジェスチャ認識情報ｒ３０１に対して、ステップＣ１の処理により、ステップＣ２への分岐が起こる。
【０１３７】
（８）ステップＣ２によって、ｒ３０１の開始時間情報Ａの値＝ｔ１１と終了時間情報Ｂの値＝ｔ１２が得られる。
【０１３８】
（９）続いて、ステップＣ３の処理によって、注視状況記憶部２０３ｃから、ｔ１１〜ｔ１２の間の注視状況情報が検索され、結果として、エントリｕ５０２とエントリｕ５０３とを要素とする集合Ｓｕが得られる。
【０１３９】
（１０）Ｓｕは空集合でないのでステップＣ４からＣ５へと進む。
【０１４０】
（１１）ステップＣ５の処理によって、エントリｕ５０２とエントリｕ５０３の双方の種別リスト情報Ｂに、ジェスチャ認識情報ｒ３０１のジェスチャ種別情報Ｃの値「うなづき」が含まれるかどうかが調べられるが、ここでは、条件が成立しないため、Ｃ７へ進む。
【０１４１】
（１２）ステップＣ７によって、ジェスチャ認識情報ｒ３０１が示唆した「うなづき」がジェスチャとして受理されずに破棄されステップＣ１へ進み初期状態へ戻る。
【０１４２】
これは、時点ｔ１１〜ｔ１２に於いて、利用者が他の人物を注視している状態に於いて検出されたうなづきジェスチャの候補は、本装置への入力を意図したジェスチャではないと、本装置が判断したことに相当する。
【０１４３】
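Similarly, the “nod” gesture recognition information spanning t20 to t24, shown as r302 in FIG. 6, is discarded because not all of the type list information B fields of entries u511 to u516 of the gaze status storage unit 203c shown in FIG. 9 contain “nod”. In this example a signal that could have been a nod gesture input was detected between t20 and t24, but because the user's gaze target moved from “screen” to “user's hands” and back to “screen” during that time, the candidate is judged to have been extracted in error and is discarded.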
【０１４４】
一方、時点ｔ３１〜ｔ３３に渡って検出された図６のｒ３０３のエントリに対応するジェスチャ入力候補に関しては、本装置によって「うなづき」のジェスチャ入力として受理され、アプリケーション２０４へと送られることになる。
【０１４５】
その手順を順を追って説明する。
【０１４６】
（１）まず注視対象検出部２０１での処理によって、図５のＩＤがｑ２２１に示すような注視対象情報が生成され、制御部２０３へ伝えられる。
【０１４７】
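(2) Because this gaze target information q221 has been received, the procedure branches from step B1 to step B2, and step B2 retrieves, as Sk, the entry s404 of the gaze interpretation rule storage unit 203b whose gaze target information A field holds a value of the same kind as “camera 1”, the content of target information B of gaze target information q221.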
【０１４８】
（３）ステップＢ３での処理によって、注視状況情報記憶部２０３ｃに新たなエントリｕ５２１が生成され、その時間情報Ａの欄に、注視対象情報ｑ２２１の時間情報Ａの内容が複写され、かつ、エントリｕ５２１の種別リスト情報Ｂの欄に、エントリｓ４０４の可能ジェスチャ種別リスト情報Ｂの内容が複写された後、ステップＢ４によりステップＢ１へ戻る。
【０１４９】
（４）以後上記と同様の処理が、注視対象検出部２０１から順次得られる図５に示した注視対象情報ｑ２３２〜ｑ２３４に対して施され、結果として図９に示した注視状況記憶部２０３ｃのｕ５２２〜ｕ５２４のエントリが生成される。
【０１５０】
（５）ここで、ジェスチャ認識部２０２から図６ジェスチャ認識情報の例のｒ３０３のエントリに示したジェスチャ認識情報が得られたとする。
【０１５１】
（６）このジェスチャ認識情報ｒ３０３に対して、ステップＣ１の処理により、ステップＣ２の分岐が起こる。
【０１５２】
（７）ステップＣ２によって、ｒ３０３の開始時間情報Ａの値＝ｔ３０と終了時間情報Ｂの値＝ｔ３３が得られる。
【０１５３】
（８）続いて、ステップＣ３の処理によって、注視状況記憶部２０３ｃから、ｔ３０〜ｔ３３の間の注視状況情報が検索され、結果として、エントリｕ５２１、エントリｕ５２２、エントリｕ５２３、およびエントリｕ５２４を含む集合Ｓｖが得られる。
【０１５４】
（９）Ｓｖは空集合でないのでステップＣ４からＣ５へと進む。
【０１５５】
（１０）ステップＣ５の処理によって、エントリｕ５２１〜エントリｕ５２４の全種別リスト情報Ｂに、ジェスチャ認識情報ｒ３０３のジェスチャ種別情報Ｃの値「うなづき」が含まれるかどうかが調べられ、ここでは、条件が成立し、Ｃ６へ進む。
【０１５６】
（１１）ステップＣ６によって、ジェスチャ認識情報ｒ３０３が示唆した「うなづき」がジェスチャとして受理され、アプリケーション２０４へ送られた上で、ステップＣ１へ進み初期状態へ戻る。
【０１５７】
これは、利用者がカメラをずっと注視したままの状態において、提示された「うなづき」ジェスチャの候補は、利用者からシステムへの入力を意図したジェスチャ入力として信頼できるという判断を行ない受理されたことに相当するものである。
【０１５８】
かくしてこのように構成された本装置の第２実施形態によれば、ジェスチャ入力の解析精度が不十分であるため、たとえば、ジェスチャ入力の認識処理において、入力デバイスから刻々得られる信号のなかから、利用者が入力メッセージとして意図した信号部分の切りだしに失敗するという問題を回避することが出来、その結果、誤動作などによる利用者への負担を起こさないインタフェースを実現することが可能となる。
【０１５９】
また、利用者が現在の操作対象である計算機などへの入力として用いるだけでなく、例えば周囲の他の人間とのコミュニケーションを行なう場合にも利用されるメディアを用いたインタフェース装置において、利用者がインタフェース装置ではなく、たとえば自分の横にいる他人に対してジェスチャを示したりした場合にも、インタフェース装置が自分への入力であると誤って判断しないインタフェース装置を実現するものである。
【０１６０】
さらに、たとえば、ボタンを押したり、メニュー選択などによって、特別な操作によって入力モードの変更を行なう必要がないため、自然なインタフェース装置を実現することが出来る。
【０１６１】
また、本発明によって、人間同士のコミュニケーションにおいては重要な役割を演じていると言われる非言語メッセージを、効果的に利用することが可能となる。
【０１６２】
（ｉｉｉ）第３の実施形態
続いて、図面を参照して本発明の第３実施形態に係るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方法について説明する。
【０１６３】
図１２は、本発明の第３実施形態に係るマルチモーダルインタフェース装置の構成例を示しており、本装置が、注視対象検出部３０１、および入力部３０２、および出力部３０３、および対話管理部３０４、およびアプリケーション３０５から構成されていることを表している。
【０１６４】
図１２において、３０１は注視対象検出部であり、利用者の注視対象を検出するが、本注視対象検出部３０１に関しては、前述の第２実施形態における注視対象検出部２０１と同様の構成によって実現し、同様の注視対象情報を出力するものとする。
【０１６５】
図１２において、３０２は入力部であり、利用者からの音声入力、あるいは画像入力、あるいはキーボード、マウス、ジョイスティック、トラックボール、タッチセンサー、ボタンなどといった機器の操作入力などの、入力を受け付ける様にしている。
【０１６６】
図１２において、３０３は出力部であり、利用者への音声出力、画像出力、あるいは提力装置を通じた力出力など、出力を提示する様にしている。
【０１６７】
図１２において、３０４は対話管理部であり、入力部３０２および出力部３０３を、例えばスクリプトや、あるいは発話対や、あるいは発話交換構造や、あるいはプランニング手法などを用いた従来手法によって制御し、例えば、利用者からの入力信号の受付と利用者への出力信号の提示、および該入力信号と出力信号の時間調整、あるいは利用者への確認や問い返しのための対話などを含む、利用者と本装置との間での対話（＝インタラクション）を実現するようにしている。
【０１６８】
図１２において、３０５はアプリケーションであり、対話管理部３０４から提供される利用者からの要求などに対して、例えばデータベースの検索や、推論処理や、あるいは算術処理などによって応答の内容を決定し、対話管理部３０４に返すようにしている。
【０１６９】
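The dialogue management unit 304 refers to the gaze target information supplied as needed by the gaze target detection unit 301 and operates according to <processing procedure D> below, thereby realizing the effects of this device.
【０１７０】
FIG. 13 is a flowchart explaining processing procedure D.
【０１７１】
<Processing procedure D>
D1: To receive an input I from the user through the input unit 302, go to step D2; to present an output O to the user through the output unit 303, go to step D9.
【０１７２】
D2: Reset and start timer Q.
【０１７３】
D3: If timer Q exceeds the predetermined value H, go to step D1.
【０１７４】
D4: Refer to the target information B of the gaze target information W obtained from the gaze target detection unit 301; if the user is found to be gazing at gaze target X, a predetermined specific object or region, go to step D2.
【０１７５】
D5: If an input I from the user is detected by the input unit 302, go to step D7.
【０１７６】
D6: Go to step D3.
【０１７７】
D7: The result of processing input I by the input unit 302 is passed to the application 305 through the dialogue management unit 304.
【０１７８】
D8: The application 305 determines the output O with which to respond to the user and passes it to the dialogue management unit 304.
【０１７９】
D9: Start presenting output O to the user through the output unit 303.
【０１８０】
D10: When the presentation of output O through the output unit 303 has finished, go to step D1.
【０１８１】
D11: Refer to the target information B of the gaze target information V obtained from the gaze target detection unit 301; if the user's gaze at gaze target Y, a predetermined specific object or region, is detected, go to step D13.
【０１８２】
D12: Go to step D10.
【０１８３】
D13: Interrupt the current presentation of output O, then present output O to the user again.
【０１８４】
D14: If the user makes an explicit input indicating that output O was not conveyed correctly, for example a non-verbal utterance such as “Eh?”, go to step D16.
【０１８５】
D15: Go to step D1.
【０１８６】
D16: Start a confirmation dialogue to check whether the user has understood output O.
【０１８７】
D17: Go to step D1.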
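The two loops of processing procedure D, the input wait of steps D2 to D6 and the output presentation of steps D9 to D13, can be sketched separately. The Python below is an illustration under assumptions rather than the method of the specification: the function names, the sampled-observation model, the chunked output, and the symbols `MINDS_EYE` and `SPEAKER` (standing for gaze targets X and Y) are hypothetical, and the confirmation dialogue of steps D14 to D17 is omitted.

```python
from typing import Dict, Iterable, List, Optional, Tuple

MINDS_EYE = "minds_eye"  # hypothetical symbol for gaze target X
SPEAKER = "speaker"      # hypothetical symbol for gaze target Y

# One polled observation: (time, gaze target or None, user input or None).
Sample = Tuple[float, Optional[str], Optional[str]]


def wait_for_input(samples: Iterable[Sample], limit_h: float,
                   target_x: str = MINDS_EYE) -> Optional[str]:
    """Steps D2 to D6: wait up to H for input, extending the wait on gaze at X."""
    timer_start: Optional[float] = None
    for t, gaze, user_input in samples:
        if timer_start is None:
            timer_start = t             # D2: reset and start timer Q
        if t - timer_start > limit_h:   # D3: timer Q exceeded H, give up
            return None
        if gaze == target_x:            # D4: user is recalling something
            timer_start = t             # back to D2: reset timer Q
            continue
        if user_input is not None:      # D5: input detected, hand over to D7
            return user_input
    return None


def present_output(chunks: List[str], gaze_during: Dict[int, str],
                   target_y: str = SPEAKER) -> List[str]:
    """Steps D9 to D13: present output O, restarting it once on gaze at Y.

    gaze_during maps a chunk index to the gaze target observed while that
    chunk was being presented (an assumed representation).
    """
    emitted: List[str] = []
    for i, chunk in enumerate(chunks):
        emitted.append(chunk)                # D9/D10: presentation continues
        if gaze_during.get(i) == target_y:   # D11: user looks at the speaker
            emitted.append("<interrupt>")    # D13: interrupt ...
            emitted.extend(chunks)           # ... and present O again
            return emitted
    return emitted
```

In this sketch a gaze at target X during the wait resets timer Q, so an answer that would otherwise arrive too late is still accepted, and a gaze at target Y during output causes the whole output to be presented again from the start.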
【０１８８】
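Next, the operation of the third embodiment is described using a concrete example.
【０１８９】
First, assume a multimodal interface device whose input unit 302 accepts voice input and whose output unit 303 provides voice output from a speaker and image output on a display.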
【０１９０】
また、処理手順ＤのステップＤ４に現れる特定の注視対象Ｘとしては、マインズアイ（後述）が設定されているものとし、また処理手順ＤのステップＤ１１に現れる特定の注視対象Ｙとして、スピーカ部分が設定されているものとする。
【０１９１】
まずはじめ、本装置から利用者に向かって、例えば「宛先を教えて下さい」という音声出力がなされ、この質問に対する利用者からの回答を本装置が受けとるという状況であるものとする。
【０１９２】
この質問に対する利用者からの音声入力を受けとるため、ステップＤ１からＤ２への分岐が行われる。
【０１９３】
続いて、タイマＱによって時間Ｈの間、ステップＤ２〜ステップＤ６の処理ループが繰り返されるが、今回はその時間の間に利用者から例えば「神戸市です」という音声入力Ｉ１がなされたとする。
【０１９４】
ここまでに行なわれた処理は、従来のマルチモーダルインタフェース処理あるいは対話装置における処理と同様のものである。
【０１９５】
次に、上述と同じ状況に対して、利用者が入力すべき情報（例えば宛先）を即座に答えることが出来ず、入力すべき情報を思い出すために、マインズアイと呼ばれる行動をとった場合を考えてみる。
【０１９６】
このマインズアイとは、人間が何らかの情報を思い出したり、あるいは考えをまとめようとする場合に、ある特定の方向を向く傾向があることを指すものであり、典型的には、斜め上方向を向く場合が多い。
【０１９７】
本装置では、利用者があらかじめ定めた特定の注視対象（この場合は斜め上方）を注視した場合に、注視対象検出部３０１が、対象情報Ｂの値として記号「マインズアイ」を含む注視対象情報Ｗ１を出力するようにしている。
【０１９８】
そのため、処理手順ＤのステップＤ２〜Ｄ６の利用者からの入力を待つ処理ループの中を処理している間に、利用者がマインズアイと呼ばれる行動（具体的には、この場合は斜め上方向を注視する行動）を行なうと、注視対象検出部３０１によってそれが検知され、例えば図５のエントリｑ２５１あるいはｑ２５２の対象情報Ｂの欄に示した記号「マインズアイ」を含む注視対象情報が出力されることとなる。
【０１９９】
これにより、ステップＤ４からステップＤ２へと進み、タイマＱがリセットされ、結果として利用者の入力を待つ時間が延長されることとなる。
【０２００】
以上の処理によって、本装置では利用者が入力すべき情報を想起するなどのために、マインズアイと呼ばれる行動を行なった際に、自動的に入力待ち受け時間が延長され、結果としてユーザフレンドリーなマルチモーダルインタフェースが実現されることとなる。
【０２０１】
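Next, this voice input I1 causes the procedure to go from step D5 to steps D7 and D8. Suppose that the application 305 determines an output O1 to the user, corresponding for example to the voice output “Please tell me the new postal code”, and passes it to the dialogue management unit 304.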
【０２０２】
続いて、ステップＤ９へと進み、出力Ｏ１に関する音声出力「新しい郵便番号…」が利用者へと提示され始めたものとする。
【０２０３】
ここから、ステップＤ１０〜Ｄ１２の処理ループによって、利用者への出力Ｏ１の提示が続けられるが、今回は、その出力の途中で、利用者が現在提示されつつある出力の一部分、例えば「新しい郵便番号」の部分が、聞きとれなかったため、スピーカを注視したものとする。
【０２０４】
この利用者のスピーカへの注視は、注視対象検出部３０１により検知され、注視対象情報Ｖ１として対話管理部３０４に渡される。
【０２０５】
この注視対象情報Ｖ１により、ステップＤ１１からステップＤ１３へと分岐する。
【０２０６】
In step D13 the output in progress is interrupted and presented to the user again through the output unit 303.
【０２０７】
ここで、利用者が再提示出力を受け取れた場合には、ステップＤ１５からステップＤ１へとすすみ初期状態へと戻る。
【０２０８】
以上の処理によって、本装置では、利用者が出力情報の受け取りに失敗した場合にも、あらかじめ定めた特定の注視対象を注視するだけで再提示が行われるため、出力情報を正しく受け取ることが出来る。
【０２０９】
なおこれは、人間が人間同士の対話に於いて、例えば理解できなかったりあるいは聞き取りに失敗した場合などに、無意識に対話相手を見ることによって、その障害の発生を対話相手にフィードバックするという行動と同様の行動を、本装置に対して行なう利用者に対して適切に対応するための機能を実現するものである。
【０２１０】
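Alternatively, even if the user still fails to receive the output correctly after the re-presentation of step D13, the user can explicitly indicate that the failure has occurred, and a confirmation dialogue is started by steps D14, D16, and D17.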
【０２１１】
かくしてこのように構成された本第３実施形態によれば、利用者が情報入力の待ち受け時間を延長するために、例えばボタンを押すなどといった恣意的な操作を行なうことが不要で、自然で、利用者にとって繁雑でなく、習得のための訓練が不要であり、利用者の負担を増加しないマルチモーダルインタフェース装置およびマルチモーダルインタフェース方式を提供することが可能となる。
【０２１２】
また、人間同士のコミュニケーションにおいては重要な役割を演じていると言われる、非言語メッセージを、効果的に利用することが出来るマルチモーダルインタフェース装置およびマルチモーダルインタフェース方式を提供することが可能となる。
【０２１３】
また、利用者からの入力に対応して利用者への適切な出力を行なったり、あるいは利用者からの入力と利用者への出力のタイミングを適切に制御するために、利用者の発話が開始されるタイミングや、あるいは利用者の発話が終了するタイミングなどを、事前に予測することが可能となる。
【０２１４】
また、利用者からの入力の認識に失敗したり、あるいは利用者への情報の出力に失敗をした場合など、利用者との間のコミュニケーションに関する何らかの障害が発生した場合などには、その障害の発生を適切に検知することが可能となる。
【０２１５】
また、検知した障害を解決するための、例えば確認のための情報の再提示や、あるいは利用者への問い返し質問対話や、あるいは対話の論議の流れの適切な管理を行なうことが可能となる。
【０２１６】
尚、本発明は、以上の各実施形態に限定されるものではない。
【０２１７】
まず、上述の第１実施形態では、利用者の呼吸状態の検出にカメラから得られる画像情報の解析による方法を示したが、例えば利用者の身体や衣服などに装着あるいは近接して設置するセンサーなどを用いた方法によっても、同様の効果を得ることが可能である。
【０２１８】
また、上述の第１実施形態では、検知した利用者の呼吸の状態に関する情報を、利用者の発話の開始時間の予測や、あるいはある発話に継続して行なわれる発話の検出などに利用する例を示したが、例えば利用者の呼吸の深さまでを検出し、呼吸の深さと、後続する発話の全体、あるいは次の息継ぎまでのフレーズの長さとの関係を、あらかじめ用意しておいたり、あるいはその時点までの実際の利用履歴から抽出した学習データなどから推測した値などを参照することで、利用者の呼吸の深さに応じて、続く発話の長さを予測し、該発声の取り込み処理や、あるいは音響分析や、あるいは言語的解析処理や、あるいは対話における発話交替タイミング管理処理などに於いて、利用するように構成することも可能である。
【０２１９】
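In the second embodiment, steps C2 to C5 of <processing procedure C> decide whether to accept a gesture input candidate by checking the type list information B of the entries of the gaze status storage unit 203c over the whole time interval between the start and end of the candidate. The device can instead apply the same check only to specific portions of the interval during which the candidate is presented, for example the first portion by time ratio, the last portion, or both.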
【０２２０】
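The temporal position and number of the portions used for this check can also be adjusted in advance for each use, or adjusted automatically and adaptively. This makes it possible to realize a multimodal interface device and method that accept gesture input appropriately even from a user who habitually begins a gesture while gazing in a particular direction and then looks away before finishing it.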
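The partial-interval variant described above can be illustrated with a small function that checks only the leading and trailing fractions of a gesture's time span. This is a sketch under assumptions: the function name `accept_partial`, the `head` and `tail` ratio parameters, and the representation of gaze status entries as `(time, allowed_types)` pairs are hypothetical, and the adaptive tuning of the ratios mentioned in the text is not shown.

```python
from typing import Iterable, List, Set, Tuple


def accept_partial(status: Iterable[Tuple[float, Set[str]]],
                   start: float, end: float, kind: str,
                   head: float = 0.25, tail: float = 0.25) -> bool:
    """Accept a gesture if every gaze entry in the checked portions permits it.

    head and tail are fractions (0.0 to 1.0) of the gesture's duration,
    measured from its start and from its end respectively. A fraction of
    0.0 disables that portion.
    """
    span = end - start
    head_end = start + head * span     # last instant of the leading portion
    tail_start = end - tail * span     # first instant of the trailing portion

    checked: List[Set[str]] = []
    for t, allowed in status:
        if not (start <= t <= end):
            continue                   # outside the gesture altogether
        in_head = head > 0 and t <= head_end
        in_tail = tail > 0 and t >= tail_start
        if in_head or in_tail:
            checked.append(allowed)

    # As in step C4, a candidate with no gaze record to check is rejected.
    return bool(checked) and all(kind in allowed for allowed in checked)
```

Setting `head=1.0` reproduces the whole-interval check of procedure C, while `head=0.2, tail=0.0` corresponds to the case of a user who gazes at the device only when beginning a gesture.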
【０２２１】
また、上述の第１乃至第３実施形態では、装置として本発明を実現する場合のみを示したが、上述の具体例の中で示した処理手順、フローチャートをプログラムとして記述し、実装し、汎用の計算機システムで実行することによっても同様の機能と効果を得ることが可能である。
【０２２２】
すなわち、この場合図１４の汎用コンピュータの構成の例に示したように、ＣＰＵ４０１、メモリ４０２、大容量記憶装置４０３、通信インタフェース４０４からなる汎用コンピュータに、入力インタフェース４０４ａ〜４０４ｎと、入力デバイス４０５ａ〜４０５ｎ、そして、出力インタフェース４０７ａ〜４０７ｍ、出力デバイス４０８ａ〜４０８ｍを設け、入力デバイス４０６ａ〜４０６ｎに、マイクやキーボード、ペンタブレット、ＯＣＲ、マウス、スイッチ、タッチパネル、カメラ、データグローブ、データスーツといった部品を使用し、出力デバイス４０８ａ〜４０８ｍとして、ディスプレイ、スピーカ、フォースディスプレイ、等を用いて、ＣＰＵ４０１によるソフトウェア制御により、上述のごとき動作を実現することが出来る。
【０２２３】
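That is, the methods described in the first to third embodiments can be provided as a computer-executable program on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory; loading the program into a computer from the medium and executing it on the CPU 401 realizes the multimodal interface device of the present invention.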
【０２２４】
【発明の効果】
以上説明したように、本発明によれば、新たに利用可能となった各入出力メディアあるいは、複数の入出力メディアを効率的に利用し、高能率で、効果的で、利用者の負担を軽減することが出来るマルチモーダルインタフェースを実現することができる。
【図面の簡単な説明】
【図１】本発明の第１実施形態に係るマルチモーダルインタフェース装置の構成を示すブロック図。
【図２】同第１実施形態のマルチモーダルインタフェース装置で用いられる呼吸状況情報の例を示す図。
【図３】同第１実施形態のマルチモーダルインタフェース装置の処理手順（Ａ）の内容を示すフローチャート。
【図４】本発明の第２実施形態に係るマルチモーダルインタフェース装置の構成を示すブロック図。
【図５】同第２実施形態のマルチモーダルインタフェース装置で用いられる注視対象情報の例を示す図。
【図６】同第２実施形態のマルチモーダルインタフェース装置で用いられるジェスチャ認識情報の例を示す図。
【図７】同第２実施形態のマルチモーダルインタフェース装置に設けられた制御部の内部構成の例を示すブロック図。
【図８】同第２実施形態のマルチモーダルインタフェース装置で用いられる注視解釈規則記憶部の内容の例を示す図。
【図９】同第２実施形態のマルチモーダルインタフェース装置で用いられる注視状況記憶部の内容の例を示す図。
【図１０】同第２実施形態のマルチモーダルインタフェース装置の処理手順（Ｂ）の内容を示すフローチャート。
【図１１】同第２実施形態のマルチモーダルインタフェース装置の処理手順（Ｃ）の内容を示すフローチャート。
【図１２】本発明の第３実施形態に係るマルチモーダルインタフェース装置の構成を示すブロック図。
【図１３】同第３実施形態のマルチモーダルインタフェース装置の処理手順（Ｄ）の内容を示すフローチャート。
【図１４】本発明の各実施形態に係るマルチモーダルインタフェース装置を実現するコンピュータの構成例を示すブロック図。
【符号の説明】
１０１…呼吸検出部
１０２…音声入力部
１０３…制御部
１０４…アプリケーション
２０１…注視対象検出部
２０２…ジェスチャ認識部
２０３…制御部
２０４…アプリケーション
２０３ａ…制御処理部
２０３ｂ…注視解釈規則記憶部
２０３ｃ…注視状況記憶部
３０１…注視対象検出部
３０２…入力部
３０３…出力部
３０４…対話管理部
３０５…アプリケーション
４０１…ＣＰＵ
４０２…メモリ
４０３…大容量記憶装置
４０４…通信インタフェース
４０５ａ〜ｎ…入力デバイス
４０６ａ〜ｎ…入力インタフェース
４０７ａ〜ｍ…出力デバイス
４０８ａ〜ｍ…出力インタフェース

[0001]
[Technical field of the invention]
The present invention relates to a multimodal interface device and a multimodal interface method for interacting with a user.
[0002]
[Prior art]
In recent years, computer systems of all kinds, including personal computers, have become able to input and output multimedia information such as voice and images, in addition to conventional input from a keyboard or mouse and output of text and images on a display.
[0003]
In addition, advances in natural language analysis and generation, speech recognition and synthesis, and dialogue processing have raised demand for spoken dialogue systems that interact with users through voice input and output. Various such systems have been developed, including “TOSBURG-II” (Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J77-D-II, No. 8, pp. 1417-1428, 1994), a dialogue system that accepts freely spoken voice input.
[0004]
Furthermore, in addition to such voice input and output, demand is growing for multimodal dialogue systems that interact with users using information exchanged through a variety of input/output devices, such as visual input from a camera, or a touch panel, pen, tablet, data glove, foot switch, human presence sensor, head-mounted display, or force display (force feedback device).
[0005]
Human-to-human dialogue does not rely on a single medium (channel) such as voice; people interact naturally and smoothly by making full use of non-verbal messages exchanged through various media such as gestures, hand movements, and facial expressions (“Intelligent Multimedia Interfaces”, Maybury M. T., Eds., The AAAI Press / The MIT Press, 1993). For this reason too, the multimodal interface is increasingly expected to be a promising way of realizing a natural, easy-to-use human interface.
[0006]
Conventionally, when a user provides voice input, the input voice waveform is, for example, converted from analog to digital, and the speech interval is detected by, for example, computing the power per unit time. The signal is then analyzed by a method such as the FFT (fast Fourier transform) and matched, using a method such as an HMM (hidden Markov model), against a speech recognition dictionary of standard patterns prepared in advance, to estimate what was said; processing is then carried out according to the result.
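Speech-interval detection from power per unit time can be illustrated with a short Python sketch. It assumes digitized samples, fixed-length non-overlapping frames, and a single fixed power threshold; the function names and the threshold value are hypothetical, and practical detectors add smoothing and hangover logic that the text does not discuss.

```python
from typing import List, Sequence, Tuple


def frame_power(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_intervals(samples: Sequence[float], frame_len: int,
                     threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of frames at or above threshold."""
    powers = frame_power(samples, frame_len)
    intervals: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the current run, if any
    for idx, p in enumerate(powers):
        if p >= threshold and start is None:
            start = idx                                   # a run begins
        elif p < threshold and start is not None:
            intervals.append((start * frame_len, idx * frame_len))
            start = None                                  # the run ends
    if start is not None:                                 # run reaches the end
        intervals.append((start * frame_len, len(powers) * frame_len))
    return intervals
```

A detector of this kind responds to any sufficiently loud sound, including ambient noise, which is the weakness the problems section of this document describes.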
[0007]
Alternatively, when a user inputs a pointing gesture through a contact-type input device such as a touch sensor, the pointed-to target is identified using the output of the touch sensor: coordinate information, its time series, input pressure, input time intervals, and so on.
[0008]
Alternatively, using a method such as that shown in “Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface” (R. Cipolla, et al., Proceedings of MVA '94, IAPR Workshop on Machine Vision Application, pp. 163-166, 1994), one or more cameras photograph the user's hand or the like, and the observed shape or motion is analyzed, so that the user can input a referent in the real world, or on the display screen, by pointing at it.
[0009]
Similarly, a distance sensor using infrared light or the like can recognize the position, shape, or movement of the user's hand, so that the user can input a pointing gesture toward a referent in the real world or on the display screen.
[0010]
Alternatively, a magnetic sensor, acceleration sensor, or the like attached to the user's hand can capture the spatial position, movement, or shape of the hand, or the user can wear a data glove or data suit developed for virtual reality (VR) technology so that the movement, position, or shape of the user's hand or body is analyzed; either way, the user can input a referent in the real world or on the display screen by pointing at it.
[0011]
Dialogue management processing is required in order to produce appropriate output in response to user input, to control appropriately the timing of input from and output to the user, and, when some communication failure with the user occurs (for example, recognition of an input fails or output of information to the user fails), to detect the failure and resolve it, for example by re-presenting information for confirmation, by asking the user a clarifying question, or by appropriately managing the flow of the dialogue.
[0012]
Conventionally, such dialogue management has used methods based on scripts, which are dialogue flows prepared in advance; methods based on information such as utterance pairs (paired utterances such as question/answer or greeting/greeting) and utterance exchange structures; and planning-based methods, which formalize the overall flow of a dialogue as the individual plans of each participant or as joint plans among participants, and describe, generate, or recognize those plans.
[0013]
[Problems to be solved by the invention]
However, because the accuracy of analyzing input from each medium is low and the properties of each input/output medium have not been clarified, no multimodal interface has yet been realized that makes efficient use of each newly available input/output medium, or of several of them together, and that is efficient, effective, and reduces the burden on the user. Specifically, the problems are as follows.
[0014]
That is, because the analysis accuracy of input from each medium is insufficient, malfunctions occur, for example through misrecognition caused by ambient noise in voice input, or through failure, in gesture recognition, to extract from the signal obtained from the input device the portion that the user intended as an input message. These malfunctions are a burden on the user.
[0015]
Furthermore, media such as voice input and gesture input are used not only for input to the computer that the user is currently operating but also, for example, for talking to other people nearby. With an interface device that uses such media, even when the user speaks or gestures to another person nearby rather than to the device, the device wrongly judges this to be input addressed to itself and performs recognition processing, which causes a malfunction. The user must then cancel the malfunction, undo its effects, and remain constantly careful to avoid such malfunctions, all of which is a burden on the user.
[0016]
In addition, because input signal processing runs continuously even in situations where it is not needed, the processing load reduces the execution speed and utilization efficiency of other services on the device being used.
[0017]
To address this problem, a method is used in which the user changes modes by a special operation, such as pressing a button or selecting a menu, when inputting voice or gestures. However, such special operations are not needed in conversation between humans, so the interface becomes unnatural. They are also cumbersome for the user and, depending on the type of operation, require training to learn, which increases the burden on the user.
[0018]
Moreover, when voice input is enabled and disabled by operating a button, for example, the inherent advantage of voice media is lost: voice input needs only the mouth and can therefore be used concurrently with manual work without disturbing it.
[0019]
In the input of pointing gestures, a conventional interface method realized with, for example, a touch sensor has the problem that pointing gestures cannot be made from a distance or without touching the device.
[0020]
Furthermore, an interface method in which the user wears, for example, a data glove, a magnetic sensor, or an acceleration sensor has the problem that it cannot be used unless the user is wearing the device.
[0021]
On the other hand, an interface method that detects the shape, position, or movement of the user's hand with a camera or the like does not achieve sufficient accuracy, so it is difficult to extract only the gestures that the user intended as input. As a result, hand movements or shapes that the user did not intend as gestures are often mistakenly recognized as gesture input, and gestures that the user did intend as input often fail to be extracted. The user must then undo the effects of malfunctions caused by misrecognition, or repeat a gesture that was not actually input to the system, which increases the burden on the user.
[0022]
In addition, conventional multimodal interfaces cannot make effective use of non-verbal messages, such as eye contact, gaze position, gestures and hand movements, and facial expressions, which are said to play an important role in communication between humans.
[0023]
In addition, in order to produce appropriate output in response to input from the user, or to appropriately control the timing of input from the user and output to the user, it is necessary to predict in advance when the user's utterance will start and when it will end. However, this is difficult with conventional dialogue management processing alone, whether it uses scripts, information such as utterance pairs or utterance exchange structures, or planning.
[0024]
In addition, when a failure in communication with the user occurs, such as when recognition of input from the user fails or when information output to the user fails, the occurrence of that failure must be detected. However, this is difficult with conventional dialogue management processing alone, whether it uses scripts, information such as utterance pairs or utterance exchange structures, or planning.
[0025]
In addition, to resolve a detected failure it is necessary, for example, to re-present information for confirmation, to conduct a clarifying question-and-answer dialogue with the user, or to appropriately manage the flow of the dialogue. However, this is difficult with conventional dialogue management processing alone, whether it uses scripts, information such as utterance pairs or utterance exchange structures, or planning.
[0026]
The present invention has been made in view of these circumstances. Its object is to provide a multimodal interface device and a multimodal interface method that, by making it possible to control the interface operation for dialogue with the user using non-verbal messages, can use each newly available input/output medium, or a plurality of input/output media, efficiently and effectively and can reduce the burden on the user.
[0027]
One specific object of the present invention is to provide a multimodal interface device and a multimodal interface method that do not malfunction because of misrecognition caused by insufficient analysis accuracy of input from each medium, or because of failure to extract the signal portion that the user intended as an input message, and that therefore place no extra burden on the user.
[0028]
Another specific object is to provide a multimodal interface device and a multimodal interface method in which, when the interface device uses media such as voice input and gesture input that serve not only for input to the computer being operated but also, for example, for talking to other people nearby, the device does not wrongly judge speech or gestures that the user directs at another person nearby, rather than at the device, to be input addressed to itself.
[0029]
Another specific object is to provide a multimodal interface device and a multimodal interface method that place no burden on the user arising from malfunctions caused by wrongly recognizing, as input to the device, a message that the user did not intend for the computer, from the need to undo the effects of such malfunctions, or from the need to remain constantly careful to avoid them.
[0030]
Another specific object is to provide a multimodal interface device and a multimodal interface method in which the execution speed and utilization efficiency of other services on the device being used are not reduced by the processing load of input signal processing that runs continuously even in situations where it is not needed.
[0031]
Another specific object is to provide a multimodal interface device and a multimodal interface method that do not require the user to change modes by a special operation, such as pressing a button or selecting a menu, when inputting voice or gestures, and that are therefore natural, not cumbersome for the user, require no training to learn, and do not increase the burden on the user.
[0032]
Another specific object is to provide a multimodal interface device and a multimodal interface method that can exploit the inherent advantage of voice media, namely that communication needs only the mouth and can therefore take place concurrently with manual work without disturbing it.
[0033]
Another specific object is to provide a multimodal interface device and a multimodal interface method that allow gestures to be input from a distance or without touching the device, and that can appropriately extract only the gestures that the user intended as input.
[0034]
Another specific object is to provide a multimodal interface device and a multimodal interface method that can make effective use of non-verbal messages, such as eye contact, gaze position, gestures and hand movements, and facial expressions, which are said to play an important role in communication between humans.
[0035]
Another specific object is to provide a multimodal interface device and a multimodal interface method that can predict in advance when the user's utterance will start and when it will end, so as to produce appropriate output in response to input from the user and to appropriately control the timing of input from the user and output to the user.
[0036]
Another specific object is to provide a multimodal interface device and a multimodal interface method that can appropriately detect the occurrence of a failure in communication with the user, such as when recognition of input from the user fails or when information output to the user fails.
[0037]
Another specific object is to provide a multimodal interface device and a multimodal interface method that, in order to resolve a detected failure, can appropriately re-present information for confirmation, conduct a clarifying question-and-answer dialogue with the user, or manage the flow of the dialogue.
[0038]
[Means for Solving the Problems]
The multimodal interface device of the present invention comprises: breathing state recognition means that observes the user's breathing and outputs breathing status information indicating whether the user's breathing is steady breathing, that is, inhalation or exhalation in a steady state, or unsteady inhalation, that is, inhalation in an unsteady state such as a deep breath or a catch of breath; input voice processing means that performs at least one of capturing, recording, processing, analyzing, and recognizing voice uttered by the user; and control means that, based on the breathing status information, controls the input voice processing means when unsteady inhalation is detected and executes acceptance control processing that switches voice input from the user from the non-acceptance state to the acceptance state.
[0042]
By controlling the operation of the input voice processing means based on the recognized breathing status information of the user in this way, it is possible to provide a multimodal interface device that does not malfunction because of misrecognition caused by insufficient analysis accuracy of voice input, or because of failure to extract the signal portion that the user intended as input voice, and that places no extra burden on the user.
[0050]
DETAILED DESCRIPTION OF THE INVENTION
(I) First embodiment
Hereinafter, a multimodal interface device and a multimodal interface method according to a first embodiment of the present invention will be described with reference to the drawings.
[0051]
FIG. 1 is a configuration example of a multimodal interface device according to the first embodiment of the present invention, in which 101 is a respiration detection unit, 102 is a voice input unit, 103 is a control unit, and 104 is an application. This multi-modal interface device is a system for supporting a dialogue with a user by voice information using a computer or the like.
[0052]
In FIG. 1, reference numeral 101 denotes a respiration detection unit. Using, for example, the method described in “Automatic setting of a region of interest (ROI) of a respiration monitoring system by visual sensing” (Miyake et al., Proceedings of the 17th Medical Informatics Conference, 1-C-1-3, pp. 168-169, 1997), it observes, for example, the user's chest in an image of the user obtained from a camera and detects the movement that accompanies breathing, thereby detecting the user's breathing state and outputting it as breathing status information as needed. The user's breathing can also be observed by processing information from a sensor worn on, or placed close to, the user's body.
[0053]
FIG. 2 shows an example of respiratory condition information output by the respiratory detection unit 101.
[0054]
In FIG. 2, the ID column holds the identification symbol of each piece of breathing status information, the time information A column records the time at which the corresponding breathing state was observed, and the status information B column records a symbol representing the observed breathing state.
[0055]
In the status information B column of each piece of breathing status information, “steady breathing (inhalation)” and “steady breathing (exhalation)” indicate that the user was observed inhaling and exhaling, respectively, in a steady state.
[0056]
“Unsteady breathing (inhalation)” and “unsteady breathing (exhalation)” indicate that the user was observed inhaling and exhaling, respectively, in an unsteady state, such as when taking a deep breath or catching their breath.
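The breathing status information described for FIG. 2 can be pictured as a small record type. The following is a minimal sketch in Python, not the format used by the apparatus: the class and function names, the numeric timestamps, and the entry "p103" are assumptions made for illustration, and only the three columns (ID, time information A, status information B), the four state labels, and the entry name p104 come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class BreathState(Enum):
    """The four symbols recorded in the status information B column."""
    STEADY_INHALE = "steady breathing (inhalation)"
    STEADY_EXHALE = "steady breathing (exhalation)"
    UNSTEADY_INHALE = "unsteady breathing (inhalation)"
    UNSTEADY_EXHALE = "unsteady breathing (exhalation)"


@dataclass(frozen=True)
class BreathingStatus:
    """One entry of breathing status information (hypothetical layout)."""
    entry_id: str        # ID column, e.g. "p104"
    time: float          # time information A (numeric timestamp is an assumption)
    state: BreathState   # status information B


def signals_speech_preparation(status: BreathingStatus) -> bool:
    """True when the entry reports unsteady inhalation, the cue used for voice input."""
    return status.state is BreathState.UNSTEADY_INHALE


# Hypothetical entries; the timestamps are illustrative only.
p103 = BreathingStatus("p103", 3.0, BreathState.STEADY_EXHALE)
p104 = BreathingStatus("p104", 4.0, BreathState.UNSTEADY_INHALE)
```

Keeping the state as an enumeration rather than free text makes it harder for a misspelled label to silently disable the cue.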
[0057]
In FIG. 1, reference numeral 102 denotes a voice input unit. It converts the voice uttered by the user into an electrical signal with, for example, a microphone and takes it in as an input signal to the apparatus; converts it, for example by A/D (analog/digital) conversion, into a representation that the apparatus can process; analyzes or processes it using, for example, an FFT (fast Fourier transform); or recognizes it by matching the input signal against standard patterns prepared in advance, using a method such as the composite similarity method, an HMM (hidden Markov model), DP (dynamic programming), or a neural network.
[0058]
Operations of the voice input unit 102 such as capturing, recording, processing, analyzing, or recognizing the voice uttered by the user are controlled by the control unit 103, and the results of voice input processing obtained by the voice input unit 102 are likewise passed to the application 104 under the control of the control unit 103.
[0059]
In FIG. 1, reference numeral 103 denotes a control unit.
[0060]
The control unit 103 refers to the breathing status information obtained successively from the respiration detection unit 101 and appropriately controls at least one of the voice input unit 102 and the application 104, thereby performing processing such as controlling whether voice input from the user is accepted, noise reduction, and conversion of the voice signal.
[0061]
Since the operation of the control unit 103 plays an essential role in realizing the effects of the apparatus, the details will be described later.
[0062]
In FIG. 1, reference numeral 104 denotes an application, which receives the output of the voice input unit 102 under the control of the control unit 103 and provides a service: for example, a database system that outputs search results for an input search request, or a voice recording system that appropriately stores input voice signals. It corresponds to an application program on a computer.
[0063]
Next, the control unit 103 will be described in detail.
[0064]
The control unit 103 is configured to operate according to the following processing procedure A. FIG. 3 is a flowchart for explaining the processing contents of the processing procedure A.
[0065]
<Processing procedure A>
A1: The voice input unit 102 is controlled to set the voice input to the “non-acceptance state”.
[0066]
A2: The content of the breathing status information obtained from the respiration detection unit 101 is constantly monitored. If “unsteady breathing (inhalation)” is detected, the process proceeds to step A3; otherwise, the process remains in step A2.
[0067]
A3: The voice input unit 102 is controlled to set the voice input to the “acceptance state”.
[0068]
A4: The value of timer T is set to 0, and timer T is (re) started.
[0069]
A5: Regarding the timer T, if a predetermined time tA has elapsed, the process proceeds to step A1, and if not, the process proceeds to step A6.
[0070]
A6: If the voice input I from the user has been made at the present time, the process proceeds to step A8; otherwise, the process proceeds to step A7.
[0071]
A7: If “unsteady breathing (inhalation)” is detected at the present time in the breathing status information obtained from the respiration detection unit 101, the process proceeds to step A4; otherwise, the process proceeds to step A5.
[0072]
A8: The processing result of the voice input unit 102 for the voice input I is passed to the application 104, and the process proceeds to step A4.
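Steps A1 to A8 can be read as a two-state gate with a restartable timer. The sketch below is one possible polled rendering in Python, under stated assumptions: the class name VoiceGate, the step method, the string used for the unsteady-inhalation label, and the numeric time values are all hypothetical, and the apparatus is not claimed to be implemented this way. Each call handles one observation (current time, latest breathing state, and a voice processing result if one is available) and returns the result that would be passed to the application 104, or None.

```python
UNSTEADY_INHALE = "unsteady breathing (inhalation)"


class VoiceGate:
    """Polled sketch of processing procedure A (names are hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout     # the predetermined time tA
        self.accepting = False     # A1: voice input starts in the non-acceptance state
        self.timer_start = None    # start time of timer T

    def step(self, now, breath, voice=None):
        # A5: once tA has elapsed on timer T, go back to A1 (non-acceptance).
        if self.accepting and now - self.timer_start >= self.timeout:
            self.accepting = False
        if not self.accepting:
            # A2: stay here until unsteady inhalation is observed.
            if breath != UNSTEADY_INHALE:
                return None
            self.accepting = True      # A3: switch to the acceptance state
            self.timer_start = now     # A4: (re)start timer T
        # A6 and A8: a voice input is forwarded, then timer T restarts (A4).
        if voice is not None:
            self.timer_start = now
            return voice
        # A7: another preparatory breath extends the waiting time (A4).
        if breath == UNSTEADY_INHALE:
            self.timer_start = now
        return None
```

For instance, with a timeout of 5 time units, sound arriving before any deep breath is ignored, a deep breath opens the gate, an utterance shortly afterwards is forwarded, and an utterance arriving more than 5 units after the last activity is ignored again.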
[0073]
The above is the configuration and function of the first embodiment according to the present invention.
[0074]
Here, the process described above will be described in detail using a specific example.
[0075]
(1) First, the processing of step A1 puts the voice input of the apparatus in the non-acceptance state.
[0076]
(2) Here, it is assumed that noise is generated around the user.
[0077]
(3) Because voice input is in the non-acceptance state at this point, the noise does not cause speech misrecognition.
[0078]
(4) Next, it is assumed that the user takes a deep breath in preparation for speaking a voice input to the apparatus.
[0079]
(5) This behavior is detected by the respiration detecting unit 101, and respiration status information as shown in the entry of p104 in FIG. 2 is output.
[0080]
(6) Furthermore, the voice input is changed to the acceptance state by the processing of steps A2 to A4, and the timer T is started.
[0081]
(7) Here, it is assumed that the user performs voice input.
[0082]
(8) Because voice input has been put in the acceptance state by the processing so far, the user's voice input is accepted, and in step A8 the processing result is sent to the application 104 and the desired service is provided to the user.
[0083]
Through the above processing, the user can input voice naturally, without performing any explicit or deliberate operation, and malfunctions caused by ambient noise are eliminated.
[0084]
(9) Thereafter, the timer T is restarted by the process of step A4.
[0085]
(10a) If the user has no further voice input to make at this stage, the user remains silent. When the time tA has elapsed on timer T, the process proceeds from step A5 to step A1, and voice input returns to the non-acceptance state.
[0086]
(10b) Alternatively, if the user has a further voice input to make and makes it, the voice is accepted again by the processing of step A6, the processing result is sent to the application 104 in step A8, and the desired service is provided to the user. The process then proceeds to step A4, where timer T is restarted and the waiting time for voice input from the user is extended.
[0087]
(10c) Alternatively, if the user has a further voice input to make but has not yet spoken and instead takes a breath in preparation for speaking, the process proceeds from step A7 to step A4, timer T is restarted, and the waiting time for voice input from the user is extended.
[0088]
(11) The voice input processing and the extension of the waiting time described above are repeated as often as needed according to the user's behavior. The process then proceeds to step A1 through the branch at step A5 and returns to the initial state.
[0089]
Thus, according to the first embodiment of the present apparatus configured as described above, it is possible to provide a multimodal interface device and a multimodal interface method that do not malfunction because of misrecognition caused by insufficient analysis accuracy of voice input, or because of failure to extract the signal portion that the user intended as input voice, and that place no extra burden on the user.
[0090]
It is also possible to provide a multimodal interface device and a multimodal interface method that reduce the processing load of input voice signals in situations where such processing is not needed, and that therefore do not reduce the execution speed and utilization efficiency of other services on the device being used.
[0091]
Furthermore, when performing voice input, the user does not need to change modes by a special operation such as pressing a button or selecting a menu. It is therefore possible to provide a multimodal interface device and a multimodal interface method that are natural, not cumbersome for the user, require no training to learn, and do not increase the burden on the user.
[0092]
It is also possible to provide a multimodal interface device and a multimodal interface method that can exploit the inherent advantage of voice media, namely that communication needs only the mouth and can therefore take place concurrently with manual work without disturbing it.
[0093]
It is also possible to provide a multimodal interface device and a multimodal interface method that can make efficient use of non-verbal messages, which are said to play an important role in communication between humans.
[0094]
(II) Second embodiment
Subsequently, a multimodal interface device and a multimodal interface method according to a second embodiment of the present invention will be described with reference to the drawings.
[0095]
FIG. 4 shows a configuration example of a multimodal interface device according to the second embodiment of the present invention, which includes a gaze target detection unit 201, a gesture recognition unit 202, a control unit 203, and an application program 204.
[0096]
In FIG. 4, reference numeral 201 denotes a gaze target detection unit. Using a method similar to, for example, that of “Object operation device and object operation method” of Japanese Patent Application No. 09-62681, it analyzes image information or the like obtained by observing the user, detects the object that the user is gazing at, and outputs it as gaze target information as needed.
[0097]
FIG. 5 illustrates an example of gaze target information output from the gaze target detection unit 201.
[0098]
In each entry of FIG. 5, the ID column records the identification symbol of the gaze target information, and the time information A column records the time at which the corresponding gaze was detected.
[0099]
In the target information B column, a symbol representing an object or region that is the target of the corresponding gaze is recorded.
[0100]
(The symbol “mines eye” described in the target information B column of entries q251 and q252 in FIG. 5 will be described later.)
In FIG. 4, reference numeral 202 denotes a gesture recognition unit. It recognizes gesture input from the user by analyzing the movement of a body part such as the user's hand, or of the whole body, for example by processing images of the user obtained by one or more cameras, or by processing signals obtained from a remote sensor such as an infrared sensor or from a worn sensor. Gesture input can be analyzed and recognized using, for example, the method shown in “Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface” (R. Cipolla, et al., Proceedings of MVA '94, IAPR Workshop on Machine Vision Application, pp. 163-166, 1994).
[0101]
FIG. 6 illustrates an example of gesture recognition information output from the gesture recognition unit 202. In each entry of FIG. 6, the ID column holds the identification symbol of the gesture recognition information, and the start time information A and end time information B columns record the start and end times of the corresponding gesture, respectively.
[0102]
In the column of gesture type information C, the type of gesture obtained by the processing in the gesture recognition unit 202 is recorded as a symbol.
[0103]
In FIG. 4, reference numeral 203 denotes a control unit that controls the gaze target detection unit 201, the gesture recognition unit 202, and the application 204. Based on the line-of-sight detection information, the control unit 203 controls whether or not to accept gesture input, or adjusts parameter information used for detecting or recognizing the gesture input.
[0104]
The control unit 203 plays an important role in realizing the effects of the apparatus, and the details of the operation will be described later.
[0105]
In FIG. 4, reference numeral 204 denotes an application, and the role of this component is the same as that of the application 104 in the first embodiment.
[0106]
Next, the control unit 203 will be described.
[0107]
FIG. 7 illustrates an example of the internal configuration of the control unit 203, which comprises a control processing unit 203a, a gaze interpretation rule storage unit 203b, and a gaze status storage unit 203c.
[0108]
FIG. 8 shows an example of the contents of the gaze interpretation rule storage unit 203b. Each gaze interpretation rule entry is recorded under the fields ID, gaze target information A, possible gesture type list information B, and so on.
[0109]
In each entry in the gaze interpretation rule storage unit 203b, the ID column records the identification symbol of the corresponding rule.
[0110]
The gaze target information A column records the type of gaze target to which the rule applies, and the possible gesture type list information B column records the list of gesture types that the user may present while gazing at the gaze target recorded in the gaze target information A column.
[0111]
FIG. 9 shows an example of the contents of the gaze status storage unit 203c. Each entry of the gaze status storage unit 203c is recorded under the fields ID, time information A, type list information B, and so on.
[0112]
In each entry of the gaze status storage unit 203c, ID is an identification symbol of the corresponding gaze status information.
[0113]
The time information A column records the time at which the gaze represented by the corresponding gaze status information took place, and the type list information B column records the list of gesture types that are possible at that time, as determined by that gaze.
[0114]
The above is the description of the configuration of the multimodal interface device according to the second embodiment of the present invention.
[0115]
Next, the operation of the control unit 203 will be described.
[0116]
The control unit 203 operates according to the following processing procedure B and processing procedure C that operate in parallel or alternately.
[0117]
FIG. 10 is a flowchart for explaining the processing procedure B, and FIG. 11 is a flowchart for explaining the processing procedure C.
[0118]
<Processing procedure B>
B1: When the gaze target information Ei is received from the gaze target detection unit 201, the process proceeds to step B2, and otherwise the process proceeds to step B1.
[0119]
B2: The gaze interpretation rule storage unit 203b is searched for the entry Si whose gaze target information A has the same content as the target information B of the gaze target information Ei.
[0120]
B3: A new entry Ui is created in the gaze status storage unit 203c. The content of the time information A of the gaze target information Ei is copied to the time information A column of the entry Ui, and the content of the possible gesture type list information B of the entry Si found in step B2 is copied to the type list information B column of the entry Ui.
[0121]
B4: Proceed to step B1.
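Steps B1 to B4 amount to a table lookup followed by an append. The sketch below illustrates this in Python under stated assumptions: the names GazeTarget, GazeStatus, RULES, and record_gaze are hypothetical, the rule contents stand in for FIG. 8 (whose actual entries are not reproduced in this text), the timestamps are numeric for convenience, and treating an unknown gaze target as permitting no gestures is a choice made here, not something the description specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GazeTarget:
    """Gaze target information from unit 201 (FIG. 5-style entry)."""
    entry_id: str   # ID, e.g. "q201"
    time: float     # time information A
    target: str     # target information B


@dataclass(frozen=True)
class GazeStatus:
    """Entry of the gaze status storage unit 203c (FIG. 9-style entry)."""
    entry_id: str                 # ID, e.g. "u501"
    time: float                   # time information A
    possible_gestures: frozenset  # type list information B


# Hypothetical stand-in for the gaze interpretation rules of FIG. 8:
# gaze target information A -> possible gesture type list information B.
RULES = {
    "screen": frozenset({"nodding", "pointing"}),
    "other person 1": frozenset(),
    "user's hand": frozenset(),
}


def record_gaze(gaze, rules, store):
    """Steps B2 and B3: look up the matching rule and append a new status entry."""
    # B2: find the rule whose gaze target matches (assumption: no match -> empty list).
    possible = rules.get(gaze.target, frozenset())
    # B3: create entry Ui, copying the time and the possible gesture list.
    entry = GazeStatus(f"u{501 + len(store)}", gaze.time, possible)
    store.append(entry)
    return entry
```

Calling record_gaze once per received gaze target, in arrival order, plays the role of the B1 to B4 loop.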
[0122]
<Processing procedure C>
C1: When the gesture recognition information Gj is received from the gesture recognition unit 202, the process proceeds to Step C2, and if not, the process proceeds to Step C1.
[0123]
C2: The gesture recognition information Gj is referred to, and the content Tjs of its start time information A and the content Tje of its end time information B are obtained.
[0124]
C3: Referring to the contents of the gaze status storage unit 203c, the set Su of gaze status information entries whose time information A is a value after Tjs and before Tje is obtained.
[0125]
C4: If the set Su is an empty set, the process proceeds to step C7; otherwise, the process proceeds to step C5.
[0126]
C5: If the content of the gesture type information C of the gesture recognition information Gj is included in the column of the type list information B of all the elements of the entry set Su, the process proceeds to step C6. Otherwise, the process proceeds to step C7.
[0127]
C6: Accept the gesture recognition information Gj as a gesture input, send it to the application 204, and proceed to Step C1.
[0128]
C7: Discard the gesture recognition information Gj as a gesture input without accepting it, and proceed to Step C1.
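Steps C2 to C7 reduce to a filter over the gaze status entries followed by an all-of check. The following Python sketch shows that decision under stated assumptions: the type and function names are hypothetical, timestamps are numeric, the interval test treats both endpoints as inclusive (the description says only "after Tjs and before Tje"), and the sample data is invented for illustration rather than taken from FIG. 6 or FIG. 9.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GazeStatus:
    """Entry of the gaze status storage unit 203c."""
    time: float                   # time information A
    possible_gestures: frozenset  # type list information B


@dataclass(frozen=True)
class GestureCandidate:
    """Gesture recognition information Gj from unit 202."""
    entry_id: str       # ID, e.g. "r301"
    start: float        # start time information A (Tjs)
    end: float          # end time information B (Tje)
    gesture_type: str   # gesture type information C


def accept_gesture(candidate, gaze_log):
    """Return True to accept the candidate (C6) or False to discard it (C7)."""
    # C3: collect gaze status entries observed during the gesture
    # (inclusive bounds are an assumption).
    during = [g for g in gaze_log if candidate.start <= g.time <= candidate.end]
    # C4: no gaze information for the interval means the candidate is discarded.
    if not during:
        return False
    # C5: the gesture type must be permitted by every entry in the set.
    return all(candidate.gesture_type in g.possible_gestures for g in during)


# Hypothetical gaze log: screen, then the user's hand, then the screen again.
SCREEN = frozenset({"nodding"})
HAND = frozenset()
gaze_log = [GazeStatus(20.0, SCREEN), GazeStatus(22.0, HAND), GazeStatus(24.0, SCREEN)]
```

Requiring every overlapping entry to permit the gesture is what makes a brief glance away from the screen enough to reject a candidate.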
[0129]
Next, the process of the second embodiment of the present invention will be described using a specific example.
[0130]
(1) First, it is assumed that the user of this apparatus is facing the direction of another person at time t10.
[0131]
(2) In response, the gaze target detection unit 201 generates the gaze target information whose ID is q201 in FIG. 5 and sends it to the control unit 203.
[0132]
(3) Because the gaze target information q201 has been received, the process branches from step B1 to step B2. In step B2, the entry s401 of the gaze interpretation rule storage unit 203b, whose gaze target information A column holds “other person 1”, the same content as the target information B of the gaze target information q201, is retrieved as Si.
[0133]
(4) In step B3, a new entry u501 is generated in the gaze status storage unit 203c, the content of the time information A of the gaze target information q201 is copied to its time information A column, and the content of the possible gesture type list information B of the entry s401 is copied to its type list information B column. Step B4 then returns the process to step B1.
[0134]
(5) Thereafter, the same processing is applied to the gaze target information q202 to q204 shown in FIG. 5, which is obtained successively from the gaze target detection unit 201. As a result, the gaze status information entries u502 to u504 are generated in the gaze status storage unit 203c shown in FIG. 9.
[0135]
(6) Here, it is assumed that the gesture recognition information shown in the entry r301 in the example of the gesture recognition information in FIG. 6 is obtained from the gesture recognition unit 202.
[0136]
(7) For this gesture recognition information r301, a branch to step C2 occurs by the process of step C1.
[0137]
(8) By step C2, the value of start time information A of r301 = t11 and the value of end time information B = t12 are obtained.
[0138]
(9) Subsequently, in step C3, the gaze status information between t11 and t12 is retrieved from the gaze status storage unit 203c, and a set Su having the entries u502 and u503 as elements is obtained.
[0139]
(10) Since Su is not an empty set, the process proceeds from step C4 to C5.
[0140]
(11) By the process of step C5, it is checked whether or not the value “nodding” of the gesture type information C of the gesture recognition information r301 is included in the type list information B of both the entry u502 and the entry u503. Since the condition is not satisfied, the process proceeds to C7.
[0141]
(12) In step C7, the “nodding” suggested by the gesture recognition information r301 is discarded without being accepted as a gesture, and the process proceeds to step C1 to return to the initial state.
[0142]
This is equivalent to judging that the nodding gesture candidate detected while the user was gazing at another person from time t11 to t12 is not a gesture intended as input to the apparatus.
[0143]
Similarly, the “nodding” gesture recognition information spanning t20 to t24, shown as r302 in FIG. 6, is discarded, because “nodding” is not contained in the type list information B of all of the entries u511 to u516 in the gaze status storage unit 203c. In this example, a signal that could be a nodding gesture by the user is detected between t20 and t24, but because the user's gaze target changed from “screen” to “user's hand” and back to “screen” during that period, the gesture input candidate is judged to have been extracted in error and is discarded.
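As an illustration only, the acceptance decision of steps C3 to C5 traced above could be sketched in Python as follows. The class and function names, the numeric time stamps, and the rejection of a candidate when no gaze entries fall inside its interval are assumptions made for this sketch and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class GazeStatus:
    """One gaze status entry: time information A and type list information B."""
    time: float
    acceptable_gestures: frozenset


@dataclass
class GestureCandidate:
    """One gesture recognition entry: start time A, end time B, gesture type C."""
    start: float
    end: float
    gesture_type: str


def accept_gesture(candidate, gaze_log):
    # Step C3: collect the gaze status entries inside the candidate's interval.
    overlapping = [g for g in gaze_log
                   if candidate.start <= g.time <= candidate.end]
    # Step C4: with no gaze evidence, this sketch rejects the candidate
    # (assumption; the behaviour for an empty set is not shown here).
    if not overlapping:
        return False
    # Step C5: accept only if every entry lists the candidate's gesture type.
    return all(candidate.gesture_type in g.acceptable_gestures
               for g in overlapping)
```

With entries whose lists lack “nodding” inside the interval, as for r301 and r302, the function returns False; with entries that all list “nodding”, it returns True.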
[0144]
On the other hand, the gesture input candidate corresponding to the entry of r303 in FIG. 6 detected from time t31 to time t33 is accepted as a gesture input of “nodding” by this apparatus and sent to the application 204.
[0145]
The procedure will be explained step by step.
[0146]
(1) First, the gaze target detection unit 201 generates the gaze target information whose ID is q221 in FIG. 5 and transmits it to the control unit 203.
[0147]
(2) Because the gaze target information q221 has been received, the process branches from step B1 to step B2. In step B2, the gaze interpretation rule storage unit 203b is searched for an entry whose gaze target type information A holds a value of the same type as “camera 1”, the content of the target information B of q221, and the entry s404 is retrieved as Sk.
[0148]
(3) In step B3, a new entry u521 is generated in the gaze status storage unit 203c. The content of the time information A of the gaze target information q221 is copied to the time information A column of u521, and the content of the possible gesture type list information B of the entry s404 is copied to the type list information B column of u521. The process then returns to step B1 by step B4.
[0149]
(4) Thereafter, the same processing is applied to the gaze target information q232 to q234 shown in FIG. 5, which is obtained sequentially from the gaze target detection unit 201. As a result, the entries u522 to u524 are generated in the gaze status storage unit 203c.
[0150]
(5) Here, it is assumed that the gesture recognition information shown in the entry r303 in the example of the gesture recognition information in FIG. 6 is obtained from the gesture recognition unit 202.
[0151]
(6) For this gesture recognition information r303, a branch to step C2 occurs by the process of step C1.
[0152]
(7) By step C2, the value of start time information A of r303 = t30 and the value of end time information B = t33 are obtained.
[0153]
(8) Subsequently, in step C3, the gaze status information between t30 and t33 is retrieved from the gaze status storage unit 203c, and a set Sv having the entries u521, u522, u523, and u524 as elements is obtained.
[0154]
(9) Since Sv is not an empty set, the process proceeds from step C4 to C5.
[0155]
(10) In step C5, it is checked whether the value “nodding” of the gesture type information C of the gesture recognition information r303 is included in the type list information B of all of the entries u521 to u524. Since this condition is satisfied, the process proceeds to C6.
[0156]
(11) In step C6, the “nodding” suggested by the gesture recognition information r303 is accepted as a gesture and sent to the application 204. Then, the process proceeds to step C1 and returns to the initial state.
[0157]
This is equivalent to judging that the “nodding” gesture candidate presented while the user kept gazing at the camera was intended by the user as input to the system, and accepting it as a gesture input.
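The bookkeeping of steps B2 and B3, which prepares the entries consulted in step C5, can be sketched in the same spirit. The rule table below is hypothetical: the text above only implies that the rule for another person does not list “nodding” and that the rule for the camera does. The function names, the dictionary layout, and the treatment of an unmatched target are likewise assumptions.

```python
# Hypothetical gaze interpretation rules:
# gaze target type (column A) -> possible gesture type list (column B).
GAZE_RULES = {
    "other person": frozenset(),
    "camera": frozenset({"nodding"}),
}


def target_type(target):
    """Strip a trailing instance number: "other person 1" -> "other person"."""
    return target.rstrip("0123456789").rstrip()


def record_gaze(gaze_log, time, target, rules=GAZE_RULES):
    # Step B2: find the rule whose target type matches the observed target.
    allowed = rules.get(target_type(target))
    if allowed is None:
        return None  # no matching rule; nothing is recorded in this sketch
    # Step B3: create a gaze status entry holding the time information
    # and a copy of the rule's gesture type list.
    entry = {"time": time, "gestures": allowed}
    gaze_log.append(entry)
    return entry
```

Keeping the gesture list inside each entry means the later check in step C5 needs only the stored entries and never has to consult the rule table again.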
[0158]
Thus, according to the second embodiment configured as described above, it is possible to avoid the problem that, because the analysis accuracy of gesture input is insufficient, the gesture recognition process fails to cut out, from the signals obtained moment by moment from the input device, the signal portion that the user intended as an input message. As a result, an interface that does not burden the user with malfunctions and the like can be realized.
[0159]
In addition, for an interface device using a medium that the user employs not only for input to the computer currently being operated but also when communicating with other people nearby, it is possible to realize an interface device that does not mistakenly treat, for example, a gesture shown to another person beside the user as input addressed to itself.
[0160]
Further, since it is not necessary to change the input mode by a special operation, for example, by pressing a button or selecting a menu, a natural interface device can be realized.
[0161]
In addition, according to the present invention, it is possible to effectively use a non-language message that is said to play an important role in communication between humans.
[0162]
(Iii) Third embodiment
Subsequently, a multimodal interface device and a multimodal interface method according to a third embodiment of the present invention will be described with reference to the drawings.
[0163]
FIG. 12 shows a configuration example of a multimodal interface device according to the third embodiment of the present invention. This device includes a gaze target detection unit 301, an input unit 302, an output unit 303, a dialogue management unit 304, and an application 305.
[0164]
In FIG. 12, reference numeral 301 denotes a gaze target detection unit that detects the user's gaze target. It has the same configuration as the gaze target detection unit 201 of the second embodiment described above and outputs the same kind of gaze target information.
[0165]
In FIG. 12, reference numeral 302 denotes an input unit that accepts input from the user, such as voice input, image input, or operation input from a device such as a keyboard, mouse, joystick, trackball, touch sensor, or button.
[0166]
In FIG. 12, reference numeral 303 denotes an output unit that presents output to the user, such as sound output, image output, or force output through a force device.
[0167]
In FIG. 12, reference numeral 304 denotes a dialogue management unit. It controls the input unit 302 and the output unit 303 by a conventional method using, for example, scripts, utterance pairs, utterance exchange structures, or planning. This control covers the reception of input signals from the user, the presentation of output signals to the user, the timing adjustment between input signals and output signals, and dialogues for confirmation or inquiry to the user, and thereby realizes the dialogue (interaction) between the user and the device.
[0168]
In FIG. 12, reference numeral 305 denotes an application, which determines the content of the response to a user request supplied from the dialogue management unit 304 by, for example, database search, inference processing, or arithmetic processing, and returns it to the dialogue management unit 304.
[0169]
The dialogue management unit 304 refers to the gaze target information provided from the gaze target detection unit 301 as needed and operates according to <Processing Procedure D> shown below, thereby realizing the effect of the present apparatus.
[0170]
FIG. 13 is a flowchart for explaining the processing procedure D.
[0171]
<Processing procedure D>
D1: When receiving input I from the user through the input unit 302, the process proceeds to step D2, and when outputting output O to the user through the output unit 303, the process proceeds to step D9.
[0172]
D2: Reset and start timer Q.
[0173]
D3: If the timer Q exceeds a predetermined value H, the process proceeds to step D1.
[0174]
D4: With reference to the content of the target information B of the gaze target information W obtained from the gaze target detection unit 301, if it is determined that the user is gazing at the gaze target X, which is a predetermined specific object or region, the process proceeds to step D2.
[0175]
D5: If the input I from the user is detected by the input unit 302, the process proceeds to step D7.
[0176]
D6: Proceed to step D3.
[0177]
D7: The processing result of the input I by the input unit 302 is passed to the application 305 through the dialogue management unit 304.
[0178]
D8: The application 305 determines the output O with which to answer the user and passes it to the dialogue management unit 304.
[0179]
D9: Output of the output O to the user is started through the output unit 303.
[0180]
D10: When the output O through the output unit 303 is completed, the process proceeds to step D1.
[0181]
D11: With reference to the content of the target information B of the gaze target information V obtained from the gaze target detection unit 301, if the user's gaze on the gaze target Y, which is a predetermined specific object or area, is detected, the process proceeds to step D13.
[0182]
D12: Proceed to step D10.
[0183]
D13: After the presentation of the current output O is interrupted, the presentation of the output O to the user is performed again.
[0184]
D14: If an explicit input indicating that the output O was not correctly conveyed from the apparatus to the user, for example a non-verbal utterance such as “U”, is received from the user, the process proceeds to step D16.
[0185]
D15: Proceed to step D1.
[0186]
D16: A dialogue process for confirming whether or not the user understands the output O is activated.
[0187]
D17: Proceed to step D1.
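The input wait loop of steps D2 to D6 can be illustrated with a small simulation. This is only a sketch under stated assumptions: time is modelled as discrete ticks, each tick carries a (gaze target, user input) pair, and the function name, parameter names, and the default label "X" for the gaze target are invented for illustration.

```python
def wait_for_input(events, timeout_h, gaze_target_x="X"):
    """Simulate steps D2 to D6 over a list of per-tick events.

    Each event is a (gaze_target, user_input) pair; user_input is None
    while the user is silent. Returns the input that was received, or
    None if timer Q exceeded H before any input arrived.
    """
    timer_q = 0                       # D2: reset and start timer Q
    for gaze, user_input in events:
        if timer_q > timeout_h:       # D3: timer Q exceeded H, stop waiting
            return None
        if gaze == gaze_target_x:     # D4: gaze on target X, back to D2
            timer_q = 0
            continue
        if user_input is not None:    # D5: input detected, continue at D7
            return user_input
        timer_q += 1                  # D6: keep waiting
    return None
```

For example, with timeout_h set to 2, four silent ticks followed by an answer return None, whereas the same sequence with the third tick spent gazing at target X returns the answer, because the gaze resets the timer.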
[0188]
Subsequently, the operation of the third embodiment will be described using a specific example.
[0189]
First, as a premise, the description takes as an example a multimodal interface device that has voice input as the input unit 302 and, as the output unit 303, voice output from a speaker and image information output on a display.
[0190]
Further, it is assumed that the mind's eye (described later) is set as the specific gaze target X appearing in step D4 of the processing procedure D, and that the speaker portion is set as the specific gaze target Y appearing in step D11 of the processing procedure D.
[0191]
First, it is assumed that a voice output such as “Tell me the destination” is made from the apparatus to the user, and the apparatus receives an answer from the user to this question.
[0192]
In order to receive voice input from the user for this question, a branch from step D1 to D2 is performed.
[0193]
Subsequently, the processing loop of steps D2 to D6 is repeated until the timer Q reaches the time H. Assume that during this time the user makes a voice input I1 such as “It is Kobe City”.
[0194]
The processing performed so far is the same as conventional multimodal interface processing or processing in an interactive apparatus.
[0195]
Next, consider the case where, in the same situation, the user cannot immediately answer with the information to be input (for example, the destination) and exhibits a behavior called the mind's eye while trying to recall it.
[0196]
The mind's eye refers to the tendency of a person to turn the gaze in a specific direction when recalling some information or trying to organize thoughts; typically the gaze is directed diagonally upward.
[0197]
In this apparatus, when the user gazes at the predetermined gaze target (in this case, diagonally upward), the gaze target detection unit 301 outputs gaze target information W1 that includes the symbol “mind's eye” as the value of the target information B.
[0198]
Therefore, if the user exhibits the mind's eye behavior (specifically, in this case, gazes diagonally upward) while the processing loop of steps D2 to D6 of the processing procedure D is waiting for input, the gaze target detection unit 301 detects it and outputs gaze target information containing the symbol “mind's eye”, as shown, for example, in the target information B column of the entry q251 or q252 in FIG. 5.
[0199]
As a result, the process proceeds from step D4 to step D2 and the timer Q is reset, so that the time spent waiting for the user's input is extended.
[0200]
With the above processing, a multimodal interface is realized that automatically extends the waiting time for input when the user exhibits the mind's eye behavior in order to recall the information to be input.
[0201]
Subsequently, for the voice input I1, the process proceeds from step D5 to steps D7 and D8. Assume that the application 305 determines, as the output to the user, an output O1 corresponding to the voice output “Tell me a new zip code” and passes it to the dialogue management unit 304.
[0202]
Subsequently, the process proceeds to step D9, where it is assumed that the voice output “new zip code...
[0203]
From here, the presentation of the output O1 to the user continues through the processing loop of steps D10 to D12. Assume that, partway through the output, the user could not hear part of the output being presented, for example the “new zip code” portion, and therefore gazed at the speaker.
[0204]
The user's gaze on the speaker is detected by the gaze target detection unit 301 and passed to the dialogue management unit 304 as gaze target information V1.
[0205]
On receipt of this gaze target information V1, the process branches from step D11 to step D13.
[0206]
In step D13, the output currently in progress is interrupted and is presented to the user again through the output unit 303.
[0207]
If the user has received the re-presented output, the process proceeds from step D15 to step D1 and returns to the initial state.
[0208]
With the above processing, even when the user fails to receive the output information, re-presentation is performed simply because the user gazes at the predetermined specific gaze target, so the user can receive the output information correctly.
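The output loop of steps D9 to D13 can be sketched in the same way. The sketch assumes that the output O is delivered as a sequence of chunks, that one gaze observation is available per chunk, and that re-presentation restarts the output from the beginning; none of these details is fixed by the text, and all names are illustrative.

```python
def present_output(chunks, gaze_per_chunk, gaze_target_y="Y"):
    """Simulate steps D9 to D13 and return the chunks actually emitted."""
    emitted = []
    gaze = iter(gaze_per_chunk)
    i = 0
    while i < len(chunks):             # D10: loop until the output O completes
        emitted.append(chunks[i])      # D9: present the next part of O
        i += 1
        # D11: check whether the user is gazing at target Y.
        if next(gaze, None) == gaze_target_y:
            # D13: interrupt the current presentation and present O again.
            emitted.extend(chunks)
            break
    return emitted
```

A gaze at target Y during the second chunk therefore yields the first two chunks followed by the whole output once more, while an uninterrupted run yields each chunk exactly once.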
[0209]
This corresponds to behavior seen in dialogue between humans: when a person cannot understand or fails to hear something, the person unconsciously looks at the conversation partner and thereby feeds the occurrence of the failure back to the partner. The apparatus thus realizes a function for responding appropriately to a user who shows the same behavior toward it.
[0210]
Alternatively, if the user still cannot correctly receive the output information after the re-presentation in step D13, the user can explicitly signal the failure by an utterance, and a confirmation dialogue can be started by the processing in steps D14 and D16 to D17.
[0211]
Thus, according to the third embodiment configured as described above, the user does not need to perform a deliberate operation such as pressing a button in order to extend the waiting time for information input. It is therefore possible to provide a multimodal interface device and a multimodal interface method that are natural, not cumbersome for the user, require no training to learn, and do not increase the burden on the user.
[0212]
In addition, it is possible to provide a multimodal interface device and a multimodal interface method that can effectively use non-language messages, which are said to play an important role in communication between humans.
[0213]
In addition, in order to produce output appropriate to the input from the user, or to control appropriately the timing of input from the user and output to the user, it is possible to predict in advance when the user's utterance will start or when it will end.
[0214]
In addition, when some failure occurs in communication with the user, such as a failure to recognize input from the user or a failure to convey output information to the user, the occurrence of the failure can be detected appropriately.
[0215]
In addition, to resolve the detected failure, it is possible, for example, to present the information again for confirmation, to conduct a question dialogue with the user, or to manage the flow of the dialogue appropriately.
[0216]
The present invention is not limited to the above embodiments.
[0217]
First, the first embodiment described a method of analyzing image information obtained from a camera in order to detect the breathing state of the user. The same effect can be obtained, for example, by a method using a sensor mounted on or placed close to the user's body or clothes.
[0218]
Further, the first embodiment showed an example in which the information on the detected breathing state of the user is used to predict the start time of the user's utterance, or to detect an utterance that follows on from a preceding utterance. Alternatively, the depth of the user's breathing may be detected, and the length of the subsequent utterance may be predicted from the breathing depth by referring either to a relationship prepared in advance between the depth of breathing and the length of the subsequent utterance (or the length of the phrase until the next breath), or to a value estimated from learning data extracted from the actual usage history up to that point. The prediction may then be used in utterance capture processing, acoustic analysis, linguistic analysis processing, or turn-taking timing management in dialogue.
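As one possible reading of "a value estimated from learning data", the relationship between breathing depth and utterance length could be fitted from the usage history. The sketch below uses an ordinary least-squares line purely as an assumption; the specification does not name an estimation method, and the function names, the units, and the sample values in the note are hypothetical.

```python
def fit_depth_to_length(depths, lengths):
    """Fit length = slope * depth + intercept by ordinary least squares."""
    n = len(depths)
    mean_x = sum(depths) / n
    mean_y = sum(lengths) / n
    sxx = sum((x - mean_x) ** 2 for x in depths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(depths, lengths))
    slope = sxy / sxx                  # requires at least two distinct depths
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_utterance_length(depth, model):
    """Predict the length of the utterance that follows a breath of this depth."""
    slope, intercept = model
    return slope * depth + intercept
```

A history of (breathing depth, utterance length) pairs is passed to fit_depth_to_length once, and the returned model is then reused for each newly detected breath.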
[0219]
In the second embodiment described above, in steps C2 to C5 of <Processing Procedure C>, the condition is judged by referring to the type list information B of the entries in the gaze status storage unit 203c that correspond to the entire time interval between the start time and the end time of each gesture input candidate, in order to determine whether the gesture input candidate should be accepted. Alternatively, the same condition judgment may be made only for a specific part of the time during which the gesture input candidate is presented, such as a given proportion at the beginning, at the end, or both at the beginning and at the end, to determine whether to accept the gesture input candidate.
[0220]
Furthermore, the temporal position and the number of the parts used for the condition judgment can be adjusted in advance for each use, or can be adjusted adaptively and automatically. In this way, it is possible to realize a multimodal interface device and a multimodal interface method that appropriately accept gesture input even when, for example, a user habitually starts a gesture input while gazing in a specific direction and finishes it after looking away.
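The partial-interval variation described above can be sketched as follows. The head and tail ratios, the representation of the gaze log as (time, allowed gesture set) pairs, and the rejection of a candidate when no entry falls inside the selected windows are assumptions of this sketch.

```python
def accept_gesture_partial(start, end, gesture, gaze_log, head=0.3, tail=0.0):
    """Check the gaze condition only in the first and/or last part of a gesture.

    gaze_log is a list of (time, allowed_gestures) pairs; head and tail are
    the fractions of the gesture interval examined at its beginning and end.
    """
    span = end - start
    windows = []
    if head > 0:
        windows.append((start, start + head * span))
    if tail > 0:
        windows.append((end - tail * span, end))
    # Keep only the gaze entries that fall inside one of the selected windows.
    selected = [allowed for t, allowed in gaze_log
                if any(lo <= t <= hi for lo, hi in windows)]
    # Accept only if there is gaze evidence and every selected entry allows it.
    return bool(selected) and all(gesture in allowed for allowed in selected)
```

With head=0.3 and tail=0.0, a user who looks at the camera at the start of a nod and then looks away is still accepted, which matches the habit described above; setting tail to a positive value makes the check stricter.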
[0221]
The first to third embodiments described above show only the case where the present invention is realized as an apparatus. However, the same functions and effects can be obtained by describing the processing procedures and flowcharts shown in the above specific examples as a program, installing it, and executing it on a general-purpose computer system.
[0222]
That is, in this case, as shown in the configuration example of a general-purpose computer in FIG. 14, a general-purpose computer including the CPU 401, the memory 402, the mass storage device 403, and the communication interface 404 is provided with input interfaces 406a to 406n, input devices 405a to 405n, output interfaces 407a to 407m, and output devices 408a to 408m. Devices such as a microphone, keyboard, pen tablet, OCR, mouse, switch, touch panel, camera, data glove, and data suit are used as the input devices 405a to 405n, and a display, a speaker, a force display, or the like is used as the output devices 408a to 408m, so that the operation described above can be realized by software control by the CPU 401.
[0223]
In other words, the methods described in the first to third embodiments can be recorded, as programs executable by a computer, on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory. The multimodal interface device of the present invention can be realized by reading the program from the medium into a computer and executing it on the CPU 401.
[0224]
[Effects of the invention]
As described above, according to the present invention, newly available input/output media, individually or in combination, can be used efficiently, so that a multimodal interface that is highly efficient and effective and that reduces the burden on the user can be realized.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a multimodal interface device according to a first embodiment of the present invention.
FIG. 2 is a view showing an example of respiratory status information used in the multimodal interface device of the first embodiment.
FIG. 3 is a flowchart showing the contents of a processing procedure (A) of the multimodal interface device of the first embodiment.
FIG. 4 is a block diagram showing a configuration of a multimodal interface device according to a second embodiment of the present invention.
FIG. 5 is a view showing an example of gaze target information used in the multimodal interface device of the second embodiment.
FIG. 6 is a view showing an example of gesture recognition information used in the multimodal interface device of the second embodiment.
FIG. 7 is a block diagram showing an example of an internal configuration of a control unit provided in the multimodal interface device of the second embodiment.
FIG. 8 is a view showing an example of contents of a gaze interpretation rule storage unit used in the multimodal interface device of the second embodiment.
FIG. 9 is a view showing an example of contents of a gaze situation storage unit used in the multimodal interface device of the second embodiment.
FIG. 10 is a flowchart showing the contents of a processing procedure (B) of the multimodal interface device of the second embodiment.
FIG. 11 is a flowchart showing the contents of a processing procedure (C) of the multimodal interface device of the second embodiment.
FIG. 12 is a block diagram showing a configuration of a multimodal interface device according to a third embodiment of the present invention.
FIG. 13 is a flowchart showing the contents of a processing procedure (D) of the multimodal interface device of the third embodiment.
FIG. 14 is a block diagram illustrating a configuration example of a computer that realizes a multimodal interface device according to each embodiment of the invention.
[Explanation of symbols]
101 ... Respiration detector
102 ... Voice input unit
103 ... Control unit
104 ... Application
201 ... Gaze target detection unit
202 ... Gesture recognition unit
203 ... Control unit
204 ... Application
203a ... Control processing unit
203b ... Gaze interpretation rule storage unit
203c ... Gaze status storage unit
301 ... Gaze target detection unit
302 ... Input unit
303 ... Output unit
304 ... Dialogue management unit
305 ... Application
401 ... CPU
402 ... Memory
403 ... Mass storage device
404 ... Communication interface
405a-n ... Input device
406a-n ... Input interface
407a-m ... Output device
408a-m ... Output interface

Claims

A multimodal interface device comprising:
breathing state recognition means for observing the breathing of a user and outputting breathing state information indicating whether the user's breathing is steady breathing, which is inhalation or exhalation in a steady state, or non-steady inhalation, which is inhalation in a non-steady state due to a deep breath or a breath taken between phrases;
input voice processing means for performing at least one of capturing, recording, processing, analyzing, and recognizing voice uttered by the user; and
control means for executing, when non-steady inhalation of the user is detected on the basis of the breathing state information, acceptance control processing that controls the input voice processing means to switch voice input from the user from a non-accepting state to an accepting state.

The multimodal interface device according to claim 1, wherein the breathing state recognition means observes the user's breathing by processing image information obtained by photographing the user, or by processing sensor information obtained from a sensor worn on or placed close to the user's body.

A multimodal interface method comprising:
a breathing state recognition step of observing the breathing of a user and outputting breathing state information indicating whether the user's breathing is steady breathing, which is inhalation or exhalation in a steady state, or non-steady inhalation, which is inhalation in a non-steady state due to a deep breath or a breath taken between phrases;
an input voice processing step of performing at least one of capturing, recording, processing, analyzing, and recognizing voice uttered by the user; and
a control step of executing, when non-steady inhalation of the user is detected on the basis of the breathing state information, acceptance control processing that controls the input voice processing step to switch voice input from the user from a non-accepting state to an accepting state.

The multimodal interface method according to claim 3, wherein the breathing state recognition step observes the user's breathing by processing image information obtained by photographing the user, or by processing sensor information obtained from a sensor worn on or placed close to the user's body.