JPH11249773A

JPH11249773A - Device and method for multimodal interface

Info

Publication number: JPH11249773A
Application number: JP4836498A
Authority: JP
Inventors: Tetsuro Chino; 哲朗知野; Katsumi Tanaka; 克己田中
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1998-02-27
Filing date: 1998-02-27
Publication date: 1999-09-17
Anticipated expiration: 2018-02-27
Also published as: JP3844874B2

Abstract

PROBLEM TO BE SOLVED: To provide natural interaction with a user by controlling interface operation using non-verbal messages. SOLUTION: A non-verbal message such as the user's gaze or respiration is observed, and the interaction is managed based on that observation. Specifically, based on information on the user's respiration detected by a respiration detecting part 101, a control part 103 controls the start, continuation, and stop of voice input by a voice input part 102. Smooth and efficient interactive processing with the user can thus be provided without any special operation; newly available input/output media, singly or in combination, can be used efficiently; and an interaction system that efficiently and effectively reduces the burden on the user can be provided.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

Field of the Invention: The present invention relates to a multimodal interface device and a multimodal interface method for interacting with a user.

[0002]

Description of the Related Art: In recent years, various computer systems, including personal computers, have become able to input and output multimedia information such as audio and image information, in addition to conventional input by keyboard or mouse and output of text and image information on a display.

[0003] In addition, advances in natural language analysis, natural language generation, speech recognition, speech synthesis, and dialogue processing have increased the demand for spoken dialogue systems that interact with users through voice input and output. Various spoken dialogue systems have been developed, such as "TOSBURG-II" (Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J77-D-II, No. 8, pp. 1417-1428, 1994), a dialogue system that accepts voice input in the form of spontaneous speech.

[0004] Furthermore, there is growing demand for multimodal dialogue systems that interact with the user by using, in addition to such voice input and output, visual information input from a camera, for example, or information that can be exchanged with the user through various input/output devices such as touch panels, pens, tablets, data gloves, foot switches, presence sensors, head-mounted displays, and force displays (force-feedback devices).

[0005] Even in conversation between humans, communication does not rely on a single medium (channel) such as speech. People achieve natural, smooth interaction by making full use of non-verbal messages exchanged through various media such as body gestures, hand gestures, and facial expressions ("Intelligent Multimedia Interfaces", Maybury M. T., Eds., The AAAI Press / The MIT Press, 1993). For this reason, expectations are rising for the multimodal interface as a promising way to realize a natural, easy-to-use human interface.

[0006] Conventionally, when a user provides voice input, for example, the input speech waveform signal is analog-to-digital converted, a speech segment is detected by, for example, computing the power per unit time, and the segment is analyzed by a method such as the FFT (fast Fourier transform). The content of the utterance is then estimated by matching it, using a method such as an HMM (hidden Markov model), against a speech recognition dictionary of standard patterns prepared in advance, and processing is performed according to the result.
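The power-based speech segment detection mentioned in this paragraph can be illustrated with a minimal sketch. It is not the conventional system's actual implementation: the frame length, the threshold, and the function names `frame_powers` and `detect_speech_segments` are assumptions chosen for illustration, and the input is assumed to be already digitized samples in the range -1.0 to 1.0.

```python
def frame_powers(samples, frame_len):
    """Mean squared amplitude of each consecutive, non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def detect_speech_segments(samples, frame_len=160, threshold=0.01):
    """Return (start_frame, end_frame) pairs, end exclusive, for runs of
    frames whose power is at or above the (assumed) threshold."""
    powers = frame_powers(samples, frame_len)
    segments = []
    start = None
    for idx, power in enumerate(powers):
        if power >= threshold and start is None:
            start = idx                      # rising edge: a segment begins
        elif power < threshold and start is not None:
            segments.append((start, idx))    # falling edge: the segment ends
            start = None
    if start is not None:                    # signal ended while still loud
        segments.append((start, len(powers)))
    return segments
```

A fixed threshold like this is exactly what makes such detection sensitive to ambient noise, which is one of the problems the description goes on to raise.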

[0007] Alternatively, when the user inputs a pointing gesture through a contact-type input device such as a touch sensor, the pointed-to target is identified using the output information of the touch sensor: coordinate information, its time series, input pressure information, input time intervals, and the like.
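As a rough illustration of identifying the pointed-to target from touch coordinates, the sketch below performs a simple rectangle hit test. The `Target` type, its fields, and `identify_target` are hypothetical names introduced here; the source does not specify how targets are represented, and a real system would also use the time series, pressure, and timing information listed above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Target:
    """A hypothetical on-screen object occupying an axis-aligned rectangle."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def identify_target(targets: Sequence[Target], px: float, py: float) -> Optional[Target]:
    """Return the first target whose rectangle contains the touch point,
    or None when the touch lands on no target."""
    for target in targets:
        if target.contains(px, py):
            return target
    return None
```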

[0008] Alternatively, using a method such as that shown in "Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface" (R. Cipolla, et al., Proceedings of MVA '94, IAPR Workshop on Machine Vision Application, pp. 163-166, 1994), the user's hand or the like is photographed with one or more cameras, and the observed shape or motion is analyzed, so that the real-world object or on-screen object that the user pointed at can be input.

[0009] Similarly, by recognizing the position, shape, or movement of the user's hand with, for example, a range sensor using infrared light, a pointing gesture toward a real-world object or an on-screen object indicated by the user can be input.

[0010] Alternatively, the spatial position, movement, or shape of the hand is input by attaching, for example, a magnetic sensor or an acceleration sensor to the user's hand, or the user wears a data glove or data suit developed for virtual reality (VR) technology and the movement, position, or shape of the user's hand or body is analyzed, so that the real-world object or on-screen object that the user pointed at can be input.

[0011] Dialogue management processing is required for several purposes: to produce appropriate output to the user in response to the user's input; to control appropriately the timing of input from the user and output to the user; when some failure occurs in communication with the user (for example, when recognition of the user's input fails or output of information to the user fails), to detect the failure and resolve it, for example by re-presenting information for confirmation or by conducting a clarification dialogue that asks the user a question; and to manage the flow of the dialogue appropriately.

[0012] Conventionally, such dialogue management has used methods based on scripts, which are dialogue flows prepared in advance; methods based on information such as utterance pairs (sets of mutually paired utterances such as question/answer or greeting/greeting) and utterance exchange structures; and planning-based methods, in which the overall flow of the dialogue is formalized as the individual plans of each participant or as joint plans among the participants, and is described, generated, or recognized.

[0013]

Problems to Be Solved by the Invention: However, because the accuracy of analyzing input from each medium has been low and the properties of each input/output medium have not been clarified, no multimodal interface has been realized that makes efficient use of each newly available input/output medium, or of multiple input/output media, and that is efficient, effective, and reduces the burden on the user. The specific problems are as follows.

[0014] First, because the accuracy of analyzing input from each medium is insufficient, malfunctions occur and burden the user. Examples include misrecognition of voice input caused by ambient noise and, in gesture input recognition, failure to extract from the signal continuously obtained from the input device the portion that the user intended as an input message.

[0015] Media such as voice and gesture are used not only as input to the computer or other device that the user is currently operating, but also, for example, for talking to other people nearby. With an interface device that uses such media, when the user speaks or gestures not to the interface device but to, for example, another person beside the user, the interface device wrongly judges that the input is addressed to itself, performs recognition processing, and malfunctions. This burdens the user, who must cancel the erroneous operation, undo its effects, and remain constantly attentive in order to avoid such malfunctions.

[0016] In addition, because input signals are processed continuously even in situations where processing is unnecessary, the processing load lowers the execution speed and utilization efficiency of other services running on the device in use.

[0017] To address this problem, methods are used in which the mode is changed by a special operation, such as pressing a button or selecting from a menu, when voice or gesture input is to be made. However, such special operations are unnecessary in conversation between humans, so the interface becomes unnatural. They are also cumbersome for the user and, depending on the type of operation, require training to learn, which increases the burden on the user.

[0018] Also, when voice input is enabled and disabled by a button operation, for example, the inherent advantage of the voice medium is lost: namely, that communication is possible using the mouth alone, so that speech does not interfere with work being done by hand and both can be carried out at the same time.

[0019] Also, in conventional pointing gesture input, an interface method realized with, for example, a touch sensor does not allow a pointing gesture to be made from a distance or without touching the device.

[0020] Furthermore, an interface method realized by having the user wear, for example, a data glove, a magnetic sensor, or an acceleration sensor cannot be used unless the user is wearing the device.

[0021] On the other hand, an interface method that detects the shape, position, or movement of the user's hand or the like with a camera does not achieve sufficient accuracy, which makes it difficult to extract only the gestures that the user intended as input. As a result, hand movements or shapes that the user did not intend as gesture input are often misrecognized as gesture input, and gestures that the user did intend as input often fail to be extracted as gesture input. The user must then correct the effects of malfunctions caused by misrecognition, or repeat a gesture that was not correctly input to the system, which increases the burden on the user.

[0022] Also, conventional multimodal interfaces cannot make effective use of non-verbal messages that are said to play an important role in communication between humans, such as eye contact, gaze position, gestures such as body and hand movements, and facial expressions.

[0023] Also, in order to produce appropriate output in response to the user's input, or to control appropriately the timing of input from the user and output to the user, it is necessary to predict in advance when the user's utterance will start and when it will end. This is difficult to do with conventional dialogue management processing alone, whether it is based on scripts, on information such as utterance pairs and utterance exchange structures, or on planning.

[0024] Also, when some failure occurs in communication with the user, for example when recognition of the user's input fails or output of information to the user fails, the failure must be detected. This is difficult to do with conventional dialogue management processing alone, whether it is based on scripts, on information such as utterance pairs and utterance exchange structures, or on planning.

[0025] Also, dialogue management processing is required to resolve a detected failure, for example by re-presenting information for confirmation or by asking the user a clarifying question, and to manage the flow of the dialogue appropriately. This too is difficult with conventional dialogue management processing alone, whether it is based on scripts, on information such as utterance pairs and utterance exchange structures, or on planning.

[0026] The present invention has been made in view of these circumstances. Its object is to provide a multimodal interface device and a multimodal interface method that, by making it possible to control the interface operation for dialogue with the user by means of non-verbal messages, make efficient use of each newly available input/output medium, or of multiple input/output media, and are efficient, effective, and reduce the burden on the user.

[0027] One specific object of the present invention is to provide a multimodal interface device and a multimodal interface method that do not malfunction because of misrecognition caused by insufficient accuracy in analyzing input from each medium, or because of failure to extract the signal portion that the user intended as an input message, and that place no extra burden on the user.

[0028] Another specific object concerns interface devices using media such as voice and gesture, which the user employs not only as input to the computer or other device currently being operated but also, for example, for talking to other people nearby. The object is to provide a multimodal interface device and a multimodal interface method that do not wrongly judge an input to be addressed to the device when the user speaks or gestures not to the interface device but to, for example, another person beside the user.

[0029] Another specific object is to provide a multimodal interface device and a multimodal interface method that do not burden the user with malfunctions caused by mistaking a message that the user did not intend as input to the computer for input addressed to the device, with undoing the effects of such malfunctions, or with the need to remain constantly attentive in order to avoid them.

[0030] Yet another specific object is to provide a multimodal interface device and a multimodal interface method in which the execution speed and utilization efficiency of other services running on the device in use are not lowered by the processing load of continuously processing input signals in situations where processing is unnecessary.

[0031] Yet another specific object is to provide a multimodal interface device and a multimodal interface method that require no mode change by a special operation such as pressing a button or selecting from a menu when voice or gesture input is made, and that are natural, not cumbersome for the user, require no training to learn, and do not increase the burden on the user.

[0032] Yet another specific object is to provide a multimodal interface device and a multimodal interface method that preserve the inherent advantage of the voice medium: that communication is possible using the mouth alone, so that speech does not interfere with, for example, work being done by hand, and both can be carried out at the same time.

[0033] Yet another specific object is to provide a multimodal interface device and a multimodal interface method that can appropriately extract only the gestures that the user intended as input when gestures are input from a distance or without touching the device.

[0034] Yet another specific object is to provide a multimodal interface device and a multimodal interface method that can make effective use of non-verbal messages said to play an important role in communication between humans, such as eye contact, gaze position, gestures such as body and hand movements, and facial expressions.

[0035] Yet another specific object is to provide a multimodal interface device and a multimodal interface method that can predict in advance when the user's utterance will start and when it will end, in order to produce appropriate output in response to the user's input or to control appropriately the timing of input from the user and output to the user.

[0036] Yet another specific object is to provide a multimodal interface device and a multimodal interface method that can appropriately detect a failure when some failure occurs in communication with the user, for example when recognition of the user's input fails or output of information to the user fails.

[0037] Yet another specific object is to provide a multimodal interface device and a multimodal interface method that can resolve a detected failure, for example by re-presenting information for confirmation or by asking the user a clarifying question, and can manage the flow of the dialogue appropriately.

[0038]

Means for Solving the Problems: The present invention is a multimodal interface device having, as a user interface, at least one of input means for accepting input of information from a user, output means for outputting information to the user, and dialogue management means for managing dialogue with the user. The device is characterized by comprising: non-verbal message recognition means for recognizing a non-verbal message consisting of at least one of the user's facial expression, vocalization, gaze, gesture, posture, or body movement, and outputting it as non-verbal message information; and control means for controlling, based on the non-verbal message information, the operation of at least one of the input means, the output means, or the dialogue management means performed for the interface with the user.

[0039] In this multimodal interface device, a non-verbal message consisting of at least one of the user's facial expression, vocalization, gaze, gesture, posture, or body movement, all of which are important in communication between humans, is recognized, and the interface operation with the user, performed using voice information or gesture information for example, is controlled according to the result. Specifically, according to the recognized non-verbal message, the device controls the start, continuation, and interruption of voice input, re-presents information, starts a confirmation dialogue, judges the credibility of gesture input recognition candidates, adjusts dialogue timing, and so on. This makes smooth and efficient dialogue processing with the user possible without any special operation, and realizes a dialogue system that makes efficient use of each newly available input/output medium, or of multiple input/output media, and that is efficient, effective, and reduces the burden on the user.

[0040] The non-verbal message recognition means recognizes a non-verbal message consisting of at least one of the user's facial expression, vocalization, gaze, gesture, posture, or body movement by at least one of the following processes: capturing voice input from the user; observing the user's movements or facial expression; detecting the movement of the user's eyes; detecting the movement of the user's head; detecting the movement or posture of part or all of the user's body, such as the hands or feet; capturing the user's motion; or detecting the user's approach, departure, or seating. This can be realized using at least one of, for example, a microphone that captures voice input from the user, a camera that observes the user's movements and facial expressions, an eye tracker that detects the movement of the user's eyes, a head tracker that detects the movement of the user's head, a body sensor that detects the movement or posture of part or all of the user's body such as the hands or feet, a data glove or data suit worn by the user to capture the user's motion, or a presence sensor that detects the user's approach, departure, seating, and the like.

[0041] The multimodal interface device of the present invention is also characterized by comprising: respiratory state recognition means for observing the state of the user's respiration, by at least one method such as processing image information obtained from a camera or the like that photographs the user or processing information from a sensor worn on or placed near the user's body, and outputting it as respiratory state information; input speech processing means, serving as one of the input means, for performing at least one of capturing, recording, processing, analyzing, or recognizing the speech uttered by the user; and control means for controlling the input speech processing means, based on the respiratory state information, so as to control the operation of at least one of accepting or rejecting speech input signals from the user, speech segment estimation, noise reduction, or speech signal conversion.
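The following sketch shows one way a control means might gate voice input on respiratory state information. It is only an illustration under stated assumptions: the three-valued `BreathState`, the rule that a deep inhalation opens the gate and a return to resting breathing closes it, and the names `VoiceInputController` and `gate_frames` are all hypothetical and are not taken from the source.

```python
from enum import Enum


class BreathState(Enum):
    RESTING = "resting"      # quiet breathing, no sign of speech (assumed)
    INHALING = "inhaling"    # deep intake, assumed to precede an utterance
    EXHALING = "exhaling"    # sustained outflow, assumed to accompany speech


class VoiceInputController:
    """Decides whether speech input is accepted, based on breath state."""

    def __init__(self):
        self.accepting = False

    def update(self, state: BreathState) -> bool:
        if state is BreathState.INHALING:
            self.accepting = True       # start accepting voice input
        elif state is BreathState.RESTING:
            self.accepting = False      # stop accepting voice input
        # EXHALING: continue in the current state (open only if already started)
        return self.accepting


def gate_frames(observations):
    """Given (BreathState, audio_frame) pairs in time order, return only the
    frames observed while the controller is accepting input."""
    controller = VoiceInputController()
    return [frame for state, frame in observations if controller.update(state)]
```

Because frames outside an inhale-exhale cycle are dropped before any recognition is attempted, a gate of this kind would also avoid spending processing effort on audio that was not intended as input.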

[0042] By controlling the operation of the input speech processing means based on respiratory state information recognized from the user in this way, it becomes possible, for example, to provide a multimodal interface device that does not malfunction because of misrecognition caused by insufficient accuracy in analyzing voice input, or because of failure to extract the signal portion that the user intended as voice input, and that places no extra burden on the user.

[0043] The multimodal interface device of the present invention is also characterized by comprising: respiratory state recognition means for observing the state of the user's respiration, by at least one method such as processing image information obtained from a camera or the like that photographs the user or processing information from a sensor worn on or placed near the user's body, and outputting it as respiratory state information; dialogue management means for managing the timing of input from the user and output from the device; and control means for controlling the dialogue management means, based on the respiratory state information, so as to adjust at least one of the timing at which input signals from the user are accepted and the timing at which output signals from the device are output to the user.

【００４４】The multimodal interface device of the present invention also comprises: gaze detection means that detects the user's gaze direction, either by analyzing image information of the user captured by a camera or the like or by analyzing information from a sensor worn on or installed close to the user's head, eyes, or body, and outputs the result as gaze target information; gesture recognition means that analyzes the motion of part of the user's body, such as a hand, or of the whole body, by processing image information of the user obtained by a camera or the like or signals obtained by a remote sensor such as an infrared sensor or by a worn sensor, and thereby recognizes gesture input from the user; and control means that, based on the gaze detection information, controls the operation of the gesture recognition means, for example by deciding whether gesture input is accepted or by adjusting parameter information used to detect or recognize gesture input.

【００４５】By controlling the operation of the gesture recognition means with the gaze target information obtained by observing the user's line of sight in this way, the device avoids a problem caused by insufficient accuracy in gesture input analysis: for example, failing, during gesture recognition, to extract from the continuous stream of signals from the input device the portion that the user intended as an input message. As a result, an interface that does not burden the user with malfunctions can be realized. Moreover, gestures are a medium that the user employs not only for input to the computer or other device currently being operated but also for communicating with other people nearby. For an interface device using such a medium, even when the user directs a gesture not at the interface device but, for example, at another person standing beside the user, the device does not mistake that gesture for input addressed to itself.

【００４６】The multimodal interface device of the present invention also comprises: gaze detection means that detects the user's gaze direction, either by analyzing image information of the user captured by a camera or the like or by analyzing information from a sensor worn on or installed close to the user's head, eyes, or body, and outputs the result as gaze target information; output means that presents at least one of audio information, image information, and force information to the user as output through at least one device such as an image display, a speaker, or a force presentation device (force display); and control means that controls the operation of the output means based on the gaze target information, for example so that the output information is output again when the user gazes at the output means during or immediately after the presentation of information to the user from the output means.

【００４７】The multimodal interface device of the present invention also comprises: gaze detection means that detects the user's gaze direction, either by analyzing image information of the user captured by a camera or the like or by analyzing information from a sensor worn on or installed close to the user's head, eyes, or body, and outputs the result as gaze target information; output means that presents at least one of audio information, image information, and force information to the user as output through at least one device such as an image display, a speaker, or a force presentation device (force display); dialogue management means that manages the dialogue with the user and performs at least one of re-presenting information whose delivery to the user through the output means failed and conducting a confirmation dialogue to check whether the information was delivered; and control means that, based on the gaze target information, controls the dialogue management means so that, for example, when the user gazes at the output means during or immediately after the presentation of information from the output means, the information is presented again, a confirmation dialogue is conducted to check whether the information was delivered, or input of information from the user is accepted.

【００４８】The multimodal interface device of the present invention also comprises: gaze detection means that detects the user's gaze direction, either by analyzing image information of the user captured by a camera or the like or by analyzing information from a sensor worn on or installed close to the user's head, eyes, or body, and outputs the result as gaze target information; dialogue management means that manages the dialogue with the user and manages at least one of the timing of input from the user to the system and the timing of output from the system to the user; and control means that controls the dialogue management means according to the gaze target information and, when the user's gaze target during the dialogue lies in a specific direction or region, adjusts at least one of the start, interruption, and end timing of input from the user or output to the user, for example by adjusting control parameters.

【００４９】By controlling the dialogue management means with the gaze target information obtained by observing the user in this way, the user no longer needs to perform a deliberate operation, such as pressing a button, to extend the time the device waits for input. A multimodal interface device can thus be realized that is natural, is not cumbersome for the user, requires no training to learn, and does not increase the user's burden.

【００５０】[0050]

【発明の実施の形態】（ｉ）第１の実施形態以下、図面を参照して、本発明の第１実施形態に係るマ
ルチモーダルインタフェース装置およびマルチモーダル
インタフェース方法について説明する。DESCRIPTION OF THE PREFERRED EMBODIMENTS (i) First Embodiment A multimodal interface device and a multimodal interface method according to a first embodiment of the present invention will be described below with reference to the drawings.

【００５１】FIG. 1 shows a configuration example of the multimodal interface device according to the first embodiment of the present invention, in which 101 is a respiration detection unit, 102 is a voice input unit, 103 is a control unit, and 104 is an application. This multimodal interface device is a system that uses a computer or the like to support spoken dialogue with a user.

【００５２】In FIG. 1, 101 denotes the respiration detection unit. It detects the user's breathing state, for example by observing the user's chest in images of the user obtained from a camera and detecting the movements that accompany breathing, as in the method described in "Automation of Region of Interest (ROI) Setting for a Respiration Monitoring System Using Visual Sensing" (Miyake et al., Proceedings of the 17th Joint Conference on Medical Informatics, 1-C-1-3, pp. 168-169, 1997), and outputs the result as respiration status information as needed. The user's breathing can also be observed by processing information from a sensor worn on or placed close to the user's body.

【００５３】FIG. 2 shows an example of the respiration status information output by the respiration detection unit 101.

【００５４】In FIG. 2, the ID column holds the identifier of each item of respiration status information, the time information A column records the time at which the corresponding breathing state was observed, and the status information B column records a symbol representing the observed breathing state.

【００５５】In the status information B column of each item of respiration status information, "steady breathing (inhalation)" and "steady breathing (exhalation)" indicate that the user was observed inhaling and exhaling, respectively, in a steady state.

【００５６】Likewise, "unsteady breathing (inhalation)" and "unsteady breathing (exhalation)" indicate that the user was observed inhaling and exhaling, respectively, in an unsteady state such as a deep breath or a breath taken between utterances.

【００５７】In FIG. 1, 102 denotes the voice input unit. It captures the voice uttered by the user as an input signal to the device, for example by converting it to an electrical signal with a microphone. It may further convert the signal into a representation the device can process, for example by A/D (analog-to-digital) conversion; analyze or process the signal, for example using an FFT (fast Fourier transform); and perform recognition, for example by matching the input signal against standard patterns prepared in advance, using a method such as the compound similarity method, an HMM (hidden Markov model), DP (dynamic programming), or a neural network.

【００５８】The operations of the voice input unit 102 on the voice uttered by the user (capturing, recording, processing, analysis, and recognition) are controlled by the control unit 103, and the processing results obtained by the voice input unit 102 are likewise passed to the application 104 under the control of the control unit 103.

【００５９】図１に於いて、１０３は制御部である。In FIG. 1, reference numeral 103 denotes a control unit.

【００６０】The control unit 103 refers to the respiration status information obtained successively from the respiration detection unit 101 and controls at least one of the voice input unit 102 and the application 104 as appropriate, thereby controlling whether voice input signals from the user are accepted, voice section estimation, noise reduction, voice signal conversion, and the like.

【００６１】なお、本制御部１０３の動作が、本装置の
効果の実現において本質的な役割を演ずるものであるた
めその詳細は後述することとする。Since the operation of the control unit 103 plays an essential role in realizing the effect of the present apparatus, the details will be described later.

【００６２】In FIG. 1, 104 denotes the application. Under the control of the control unit 103 it receives the output of the voice input unit 102 and provides a service: a database system, for example, outputs the search results for an input search request, and a voice recording system stores the input voice signal appropriately. It corresponds to a computer application program.

【００６３】つづいて、制御部１０３について詳説す
る。Next, the control section 103 will be described in detail.

【００６４】制御部１０３は以下の処理手順Ａに従って
動作するようにしている。なお、図３は処理手順Ａの処
理内容を説明するフローチャートである。The control unit 103 operates according to the following processing procedure A. FIG. 3 is a flowchart for explaining the processing contents of the processing procedure A.

【００６５】＜処理手順Ａ＞Ａ１：音声入力部１０２を制御し、音声入力を「非受け
付け状態」とする。<Processing Procedure A> A1: The voice input unit 102 is controlled to make the voice input “non-accepting state”.

【００６６】A2: Continuously monitor the respiration status information obtained from the respiration detection unit 101. If "unsteady breathing (inhalation)" is detected, proceed to step A3; otherwise remain at step A2.

【００６７】Ａ３：音声入力部１０２を制御し、音声入
力を受け付け状態とする。A3: The voice input unit 102 is controlled so that voice input is accepted.

【００６８】Ａ４：タイマＴの値を０とした上で、タイ
マＴを（再）スタートする。A4: After the value of the timer T is set to 0, the timer T is (re) started.

【００６９】Ａ５：タイマＴに関して、あらかじめ定め
た時間ｔＡが経過していたら、ステップＡ１へ進み、そ
うでなければステップＡ６へ進む。A5: Regarding the timer T, if the predetermined time tA has elapsed, the process proceeds to step A1, otherwise, the process proceeds to step A6.

【００７０】Ａ６：現時点において、利用者からの音声
入力Ｉがなされていたら、ステップＡ８へ進み、そうで
なければステップＡ７へ進む。A6: If a voice input I has been made by the user at the present time, the process proceeds to step A8, and if not, the process proceeds to step A7.

【００７１】A7: If "unsteady breathing (inhalation)" is detected at this point in the respiration status information obtained from the respiration detection unit 101, proceed to step A4; otherwise proceed to step A5.

【００７２】A8: Pass the processing result of the voice input unit 102 for the voice input I to the application 104, and proceed to step A4.
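Processing procedure A (steps A1 to A8) can be read as a small two-state machine driven by respiration events, voice input, and the timer T. The following Python sketch is one possible illustration of that reading, not the implementation of the embodiment. The class name `ProcedureA`, the `step` method, the event label `UNSTEADY_INHALE`, the use of seconds for tA, and the `delivered` list standing in for the application 104 are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical label for the "unsteady breathing (inhalation)" status symbol.
UNSTEADY_INHALE = "unsteady_inhale"


@dataclass
class ProcedureA:
    """Sketch of control unit 103 following steps A1-A8."""
    t_a: float                           # predetermined time tA (unit assumed: seconds)
    accepting: bool = False              # A1: voice input starts as non-accepting
    timer_start: Optional[float] = None  # start time of timer T
    delivered: List[str] = field(default_factory=list)  # stands in for application 104

    def step(self, now: float, breath: Optional[str] = None,
             voice: Optional[str] = None) -> None:
        """Process one observation taken at time `now`."""
        # A5: once tA has elapsed on timer T, return to A1 (non-accepting).
        if self.accepting and now - self.timer_start >= self.t_a:
            self.accepting = False
            self.timer_start = None
        if not self.accepting:
            # A2: wait for an unsteady inhalation; everything else is ignored.
            if breath == UNSTEADY_INHALE:
                self.accepting = True    # A3: switch to the accepting state
                self.timer_start = now   # A4: reset and start timer T
            return
        if voice is not None:
            # A6 -> A8: hand the result to the application, then A4.
            self.delivered.append(voice)
            self.timer_start = now
        elif breath == UNSTEADY_INHALE:
            # A7 -> A4: a preparatory breath extends the waiting time.
            self.timer_start = now


# Usage: noise before the breath is ignored; speech after it is delivered.
controller = ProcedureA(t_a=3.0)
controller.step(0.0, voice="ambient noise")
controller.step(1.0, breath=UNSTEADY_INHALE)
controller.step(2.0, voice="search request")
```

In this sketch, anything that arrives while the device is in the non-accepting state is simply dropped, which corresponds to the behaviour described for ambient noise in the walkthrough that follows.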

【００７３】以上が本発明に係る第１実施形態の構成と
その機能である。The above is the configuration and functions of the first embodiment according to the present invention.

【００７４】ここで先ず上述した処理について、具体例
を用いて詳しく説明する。Here, first, the above-mentioned processing will be described in detail using a specific example.

【００７５】（１）まず、ステップＡ１の処理によっ
て、本装置の音声入力が非受け付け状態になる。(1) First, by the processing of step A1, the voice input of the present apparatus is set in a non-accepting state.

【００７６】（２）ここで、利用者の周囲で雑音が発生
したとする。(2) Here, it is assumed that noise occurs around the user.

【００７７】（３）ここでは音声入力は非受け付け状態
にあるので、この雑音に起因する音声認識の誤認識は発
生しない。(3) Here, since voice input is in a non-accepting state, no erroneous voice recognition occurs due to this noise.

【００７８】(4) Next, suppose the user takes a deep breath in preparation for speaking, in order to give voice input to the device.

【００７９】（５）この行動が、呼吸検出部１０１によ
って検知され、図２のｐ１０４のエントリに示した通り
の呼吸状況情報が出力される。(5) This action is detected by the respiration detection unit 101, and the respiration status information as shown in the entry p104 in FIG. 2 is output.

【００８０】（６）さらに、ステップＡ２〜Ａ４の処理
によって、音声入力が受け付け状態に変更され、タイマ
Ｔがスタートされる。(6) Further, by the processing of steps A2 to A4, the voice input is changed to the receiving state, and the timer T is started.

【００８１】（７）ここで利用者が音声入力を行なった
とする。(7) Here, it is assumed that the user performs voice input.

【００８２】（８）ここまでの処理によって音声入力は
受け付け状態であるため、利用者の音声入力が受け付け
られ、ステップＡ８によって、その処理結果がアプリケ
ーション１０４へと送られ、所望のサービスが利用者に
提供される。(8) Since the voice input has been accepted by the processing up to this point, the user's voice input is accepted, and in step A8, the processing result is sent to the application 104, and the desired service is transmitted to the user. Provided to

【００８３】Through the above processing, the user can give voice input naturally, without any explicit or deliberate operation, and malfunctions caused by ambient noise are also eliminated.

【００８４】（９）その後、ステップＡ４の処理によっ
てタイマＴがリスタートされる。(9) Thereafter, the timer T is restarted by the processing in step A4.

【００８５】(10a) If the user has no further voice input to make at this stage, the user stays silent. Once the timer T has passed tA, the process of step A5 leads to step A1 and voice input returns to the non-accepting state.

【００８６】(10b) Alternatively, if the user has further voice input to make and makes the next utterance, the voice is accepted again through the process of step A6, its processing result is sent to the application 104 in step A8, and the desired service is provided to the user. The process then proceeds to step A4, the timer T is restarted, and the waiting time for voice input from the user is extended.

【００８７】(10c) Alternatively, if the user has further voice input to make but, instead of speaking yet, takes a breath in preparation for speaking, the process of step A7 leads to step A4, the timer T is restarted, and the waiting time for voice input from the user is extended.

【００８８】(11) The voice input processing and the waiting-time extension described above are repeated as many times as the user's behavior requires. The branch at step A5 then leads to step A1, and the device returns to its initial state.

【００８９】Thus, the first embodiment of the device configured in this way makes it possible to provide a multimodal interface device and a multimodal interface method that avoid misrecognition caused by insufficient accuracy in voice input analysis, avoid malfunctions caused by failing to extract the signal portion the user intended as input voice, and place no extra burden on the user.

【００９０】また、本来不要な場面での、入力音声信号
の処理負荷を軽減し、利用している装置に関与する他の
サービスの実行速度や利用効率が低下しない、マルチモ
ーダルインタフェース装置およびマルチモーダルインタ
フェース方式を提供することが出来る。A multi-modal interface device and a multi-modal interface device capable of reducing the processing load of an input audio signal in an originally unnecessary scene and preventing the execution speed and utilization efficiency of other services related to the device being used from being reduced. An interface method can be provided.

【００９１】また、音声入力を行なう際に、たとえば、
ボタンを押したり、メニュー選択などといった特別な操
作によるモード変更が必要なく、自然で、利用者にとっ
て繁雑でなく、習得のための訓練が不要であり、利用者
の負担を増加しないマルチモーダルインタフェース装置
およびマルチモーダルインタフェース方式を提供するこ
とが出来る。Further, when performing voice input, for example,
A multi-modal interface device that does not require a mode change by a special operation such as pressing a button or selecting a menu, is natural, is not complicated for the user, does not require training for learning, and does not increase the burden on the user. And a multi-modal interface system can be provided.

【００９２】また、例えば、口だけを使ってコミュニケ
ーションが出来、例えば手で行なっている作業を妨害す
ることがなく、双方を同時に利用することが可能である
と言う、音声メディア本来の利点を活かすことが出来る
マルチモーダルインタフェース装置およびマルチモーダ
ルインタフェース方式を提供することが出来る。Further, for example, it is possible to communicate using only the mouth, and for example, it is possible to use both of them at the same time without interfering with the work being performed by hand, and to take advantage of the inherent advantage of the audio media. A multimodal interface device and a multimodal interface system capable of performing the above can be provided.

【００９３】The embodiment thus yields substantial benefits. For example, it can provide a multimodal interface device and a multimodal interface method that make efficient use of non-verbal messages, which are said to play an important role in communication between people.

【００９４】（ｉｉ）第２の実施形態続いて、図面を参照して本発明の第２実施形態に係るマ
ルチモーダルインタフェース装置およびマルチモーダル
インタフェース方法について説明する。(Ii) Second Embodiment Next, a multimodal interface device and a multimodal interface method according to a second embodiment of the present invention will be described with reference to the drawings.

【００９５】図４は、本発明の第２実施形態に係るマル
チモーダルインタフェース装置の構成例を表しており、
注視対象検出部２０１、ジェスチャ認識部２０２、制御
部２０３、およびアプリケーションプログラム２０４か
ら構成されている。FIG. 4 shows a configuration example of a multimodal interface device according to the second embodiment of the present invention.
It comprises a gaze target detection unit 201, a gesture recognition unit 202, a control unit 203, and an application program 204.

【００９６】図４に於いて、２０１は注視対象検出部を
表しており、例えば、特願平０９−６２６８１号の「オ
ブジェクト操作装置およびオブジェクト操作方法」と同
様の方法によって、例えば利用者の姿を観察した画像情
報の解析などによって、利用者が注視している対象を検
出し、注視対象情報として随時出力するようにしてい
る。In FIG. 4, reference numeral 201 denotes a gaze target detecting unit. For example, by using a method similar to the "object operating device and object operating method" of Japanese Patent Application No. 09-62681, the user's appearance is detected. An object that the user is gazing at is detected by analyzing the image information obtained by observing, and is output as gazing target information at any time.

【００９７】図５は、注視対象検出部２０１の出力する
注視対象情報の例を表している。FIG. 5 shows an example of the gaze target information output from the gaze target detection unit 201.

【００９８】In each entry of FIG. 5, the ID column records the identifier of the item of gaze target information, and the time information A column records information on the time at which the corresponding gaze was detected.

【００９９】また、対象情報Ｂの欄には、対応する注視
の対象となった物体あるいは領域を表す記号が記録され
るようにしている。Further, in the column of the target information B, a symbol indicating the corresponding object or area to be watched is recorded.

【０１００】(The symbol "mind's eye" recorded in the target information B column of entries q251 and q252 in FIG. 5 is explained later.) In FIG. 4, 202 denotes the gesture recognition unit. It analyzes the motion of part of the user's body, such as a hand, or of the whole body, by processing image information of the user obtained by one or more cameras or signals obtained by a remote sensor such as an infrared sensor or by a worn sensor, and thereby recognizes gesture input from the user. Gesture input can be analyzed and recognized using, for example, the method described in "Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface" (R. Cipolla et al., Proceedings of MVA '94, IAPR Workshop on Machine Vision Applications, pp. 163-166, 1994).

【０１０１】FIG. 6 shows an example of the gesture recognition information output by the gesture recognition unit 202. In each entry of FIG. 6, the ID is the identifier of the item of gesture recognition information, and the start time information A and end time information B columns record the start and end times of the corresponding gesture, respectively.

【０１０２】また、ジェスチャ種別情報Ｃの欄にはジェ
スチャ認識部２０２における処理によって得られたジェ
スチャの種別が記号で記録されるようにしている。In the column of the gesture type information C, the type of the gesture obtained by the processing in the gesture recognition unit 202 is recorded as a symbol.

【０１０３】In FIG. 4, 203 denotes the control unit, which controls the gaze target detection unit 201, the gesture recognition unit 202, and the application 204. The effect of the device is realized by the control unit 203 performing control based on the gaze detection information, such as deciding whether gesture input is accepted or adjusting the parameter information used to detect or recognize gesture input.

【０１０４】なお、本制御部２０３は、本装置の効果を
実現する上で重要な役割を担うものであるため、その動
作の詳細については後述することとする。Since the control section 203 plays an important role in realizing the effects of the present apparatus, the details of its operation will be described later.

【０１０５】図４に於いて、２０４はアプリケーション
を表しており、本部品の役割は、前述第１実施形態にお
けるアプリケーション１０４と同様である。In FIG. 4, reference numeral 204 denotes an application, and the role of this component is the same as that of the application 104 in the first embodiment.

【０１０６】続いて制御部２０３について説明する。Next, the control section 203 will be described.

【０１０７】図７は、制御部２０３の内部構成の例を表
しており、制御部２０３が、制御処理部２０３ａ、およ
び注視解釈規則記憶部２０３ｂ、および注視状況記録部
２０３ｃから構成されていることを示している。FIG. 7 shows an example of the internal configuration of the control section 203. The control section 203 is composed of a control processing section 203a, a gaze interpretation rule storage section 203b, and a gaze state recording section 203c. Is shown.

【０１０８】FIG. 8 shows an example of the contents of the gaze interpretation rule storage unit 203b. Each gaze interpretation rule entry is recorded under the fields ID, gaze target information A, possible gesture type list information B, and so on.

【０１０９】注視解釈規則記憶部２０３ｂの各エントリ
において、ＩＤの欄は対応する規則の識別記号が記録さ
れる。In each entry of the gaze interpretation rule storage unit 203b, the ID column records the identification code of the corresponding rule.

【０１１０】The gaze target information A column records the type of gaze target to be interpreted, and the possible gesture type list information B column records the list of gesture types that may be presented while the user is gazing at the target recorded in the gaze target information A column.

【０１１１】FIG. 9 shows an example of the contents of the gaze state storage unit 203c. Each entry of the gaze state storage unit 203c is recorded under the fields ID, time information A, type list information B, and so on.

【０１１２】注視状況記憶部２０３ｃの各エントリに於
いて、ＩＤは対応する注視状況情報の識別記号である。In each entry of the gaze state storage unit 203c, ID is an identification symbol of the corresponding gaze state information.

【０１１３】The time information A column records the time at which the gaze represented by the corresponding gaze information took place, and the type list information B column records the list of gesture types that are possible at that time, as determined by that gaze.

【０１１４】以上が本発明の第２実施形態に係るマルチ
モーダルインタフェース装置の構成の説明である。The configuration of the multimodal interface device according to the second embodiment of the present invention has been described.

【０１１５】つづいて、制御部２０３の動作について説
明する。Subsequently, the operation of the control unit 203 will be described.

【０１１６】制御部２０３は、並列あるいは交互に動作
する以下の処理手順Ｂおよび処理手順Ｃに従って動作す
る。The control unit 203 operates in accordance with the following processing procedures B and C operating in parallel or alternately.

【０１１７】なお、図１０は処理手順Ｂを説明するフロ
ーチャートであり、図１１は処理手順Ｃを説明するフロ
ーチャートである。FIG. 10 is a flowchart illustrating the processing procedure B, and FIG. 11 is a flowchart illustrating the processing procedure C.

【０１１８】<Processing Procedure B> B1: When gaze target information Ei is received from the gaze target detection unit 201, proceed to step B2; otherwise remain at step B1.

【０１１９】B2: Refer to the gaze interpretation rule storage unit 203b and find the entry Si whose gaze target information A has the same content as the target information B of the gaze target information Ei.

【０１２０】B3: Create a new entry Ui in the gaze state storage unit 203c. Copy the content of the time information A of the gaze target information Ei into the time information A column of entry Ui, and copy the content of the possible gesture type list information B of entry Si, found in the gaze interpretation rule storage unit 203b in step B2, into the type list information B column of entry Ui.

【０１２１】Ｂ４：ステップＢ１へ進む。B4: Go to step B1.
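Steps B2 and B3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names, the contents of the rule table, and the choice to store an empty gesture set when no rule matches are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass
class GazeTargetInfo:
    """Gaze target information Ei from the gaze target detection unit."""
    time: float   # time information A
    target: str   # target information B


@dataclass
class GazeStateEntry:
    """Entry Ui of the gaze state information storage unit."""
    time: float                  # time information A
    allowed_gestures: frozenset  # type list information B


# Hypothetical gaze interpretation rules: gaze target -> acceptable gesture types.
GAZE_RULES = {
    "other_person": frozenset(),
    "screen": frozenset({"point"}),
    "camera": frozenset({"nod", "head_shake"}),
}


def record_gaze(event, rules, gaze_log):
    """Steps B2-B3: look up the rule for the gazed target and log a new entry."""
    # Assumption: a target without a rule allows no gestures.
    allowed = rules.get(event.target, frozenset())
    entry = GazeStateEntry(time=event.time, allowed_gestures=allowed)
    gaze_log.append(entry)
    return entry
```

Calling `record_gaze` once per received gaze event plays the role of the B1 to B4 loop; the list passed as `gaze_log` stands in for the gaze state information storage unit 203c.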

[0122] <Processing Procedure C>
C1: When gesture recognition information Gj is received from the gesture recognition unit 202, proceed to step C2; otherwise, remain at step C1.

[0123] C2: Refer to the gesture recognition information Gj and obtain the content Tjs of its start time information A and the content Tje of its end time information B.

[0124] C3: Refer to the gaze state information storage unit 203c and obtain the set Su of gaze state information entries whose time information A is at or after Tjs and at or before Tje.

[0125] C4: If the set Su is empty, proceed to step C7.

[0126] C5: If the type list information B of every element of the set Su contains the gesture type information C of the gesture recognition information Gj, proceed to step C6; otherwise, proceed to step C7.

[0127] C6: Accept the gesture recognition information Gj as gesture input, send it to the application 204, and proceed to step C1.

[0128] C7: Discard the gesture recognition information Gj without accepting it as gesture input, and proceed to step C1.
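The acceptance test of steps C2 to C5 can be sketched as a single function. The class and field names are hypothetical and chosen for illustration; only the logic (select the entries inside the gesture interval, reject if there are none, accept only if every entry allows the gesture type) follows the procedure above.

```python
from dataclasses import dataclass


@dataclass
class GazeStateEntry:
    """Entry of the gaze state information storage unit."""
    time: float                  # time information A
    allowed_gestures: frozenset  # type list information B


@dataclass
class GestureCandidate:
    """Gesture recognition information Gj."""
    start: float  # start time information A (Tjs)
    end: float    # end time information B (Tje)
    kind: str     # gesture type information C


def accept_gesture(candidate, gaze_log):
    """Return True if the candidate should be passed to the application."""
    # C3: entries recorded between the start and end of the gesture (inclusive).
    window = [e for e in gaze_log if candidate.start <= e.time <= candidate.end]
    # C4: no gaze information for the interval means the candidate is discarded.
    if not window:
        return False
    # C5: every entry in the interval must allow this gesture type.
    return all(candidate.kind in e.allowed_gestures for e in window)
```

A return value of `True` corresponds to step C6 (accept and forward to the application) and `False` to step C7 (discard).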

[0129] Next, the processing of the second embodiment of the present invention is described using a specific example.

[0130] (1) First, assume that at time t10 the user of the device is facing another person.

[0131] (2) In response, the gaze target detection unit 201 generates the gaze target information shown under ID q201 in FIG. 5 and passes it to the control unit 203.

[0132] (3) Because gaze target information q201 has been received, processing branches from step B1 to step B2. In step B2, entry s401 of the gaze interpretation rule storage unit 203b, whose gaze target information A column holds a value of the same kind as "other person 1" (the content of the target information B of q201), is retrieved as Si.

[0133] (4) In step B3, a new entry u501 is created in the gaze state information storage unit 203c. The time information A of gaze target information q201 is copied into its time information A column, and the possible gesture type list information B of entry s401 is copied into its type list information B column. Step B4 then returns processing to step B1.

[0134] (5) The same processing is then applied to gaze target information q202 to q204 shown in FIG. 5, obtained in sequence from the gaze target detection unit 201. As a result, the gaze state information entries u502 to u504 of the gaze state information storage unit 203c shown in FIG. 9 are generated.

[0135] (6) Now assume that the gesture recognition information shown in entry r301 of the example in FIG. 6 is obtained from the gesture recognition unit 202.

[0136] (7) For gesture recognition information r301, step C1 branches to step C2.

[0137] (8) Step C2 yields t11 as the value of the start time information A of r301 and t12 as the value of its end time information B.

[0138] (9) Next, step C3 searches the gaze state information storage unit 203c for gaze state information between t11 and t12, yielding the set Su whose elements are entries u502 and u503.

[0139] (10) Since Su is not empty, processing proceeds from step C4 to step C5.

[0140] (11) Step C5 checks whether the type list information B of both entry u502 and entry u503 contains "nod", the value of the gesture type information C of gesture recognition information r301. The condition is not satisfied here, so processing proceeds to step C7.

[0141] (12) In step C7, the "nod" suggested by gesture recognition information r301 is discarded without being accepted as a gesture, and processing proceeds to step C1, returning to the initial state.

[0142] This corresponds to the device judging that the nod gesture candidate detected between times t11 and t12, while the user was gazing at another person, was not a gesture intended as input to the device.

[0143] By the same processing, the "nod" gesture recognition information spanning t20 to t24, shown as r302 in FIG. 6, is also discarded, because not all of the type list information B of entries u511 to u516 of the gaze state information storage unit 203c shown in FIG. 9 contain "nod". In this example, a signal that could have been a nod gesture input was detected between times t20 and t24, but because the user's gaze target moved from "screen" to "user's hands" and back to "screen" during that interval, the gesture input candidate is judged to have been extracted in error and is discarded.

[0144] In contrast, the gesture input candidate corresponding to entry r303 in FIG. 6, detected over times t31 to t33, is accepted by the device as a "nod" gesture input and sent to the application 204.

[0145] The procedure is described step by step below.

[0146] (1) First, the gaze target detection unit 201 generates the gaze target information shown under ID q221 in FIG. 5 and passes it to the control unit 203.

[0147] (2) Because gaze target information q221 has been received, processing branches from step B1 to step B2. In step B2, entry s404 of the gaze interpretation rule storage unit 203b, whose gaze target information A column holds a value of the same kind as "camera 1" (the content of the target information B of q221), is retrieved as Sk.

[0148] (3) In step B3, a new entry u521 is created in the gaze state information storage unit 203c. The time information A of gaze target information q221 is copied into its time information A column, and the possible gesture type list information B of entry s404 is copied into its type list information B column. Step B4 then returns processing to step B1.

[0149] (4) The same processing is then applied to gaze target information q232 to q234 shown in FIG. 5, obtained in sequence from the gaze target detection unit 201. As a result, entries u522 to u524 of the gaze state information storage unit 203c shown in FIG. 9 are generated.

[0150] (5) Now assume that the gesture recognition information shown in entry r303 of the example in FIG. 6 is obtained from the gesture recognition unit 202.

[0151] (6) For gesture recognition information r303, step C1 branches to step C2.

[0152] (7) Step C2 yields t30 as the value of the start time information A of r303 and t33 as the value of its end time information B.

[0153] (8) Next, step C3 searches the gaze state information storage unit 203c for gaze state information between t30 and t33, yielding the set Sv containing entries u521, u522, u523, and u524.

[0154] (9) Since Sv is not empty, processing proceeds from step C4 to step C5.

[0155] (10) Step C5 checks whether the type list information B of every entry from u521 to u524 contains "nod", the value of the gesture type information C of gesture recognition information r303. The condition is satisfied here, so processing proceeds to step C6.

[0156] (11) In step C6, the "nod" suggested by gesture recognition information r303 is accepted as a gesture and sent to the application 204, after which processing proceeds to step C1 and returns to the initial state.

[0157] This corresponds to the device judging that the "nod" gesture candidate presented while the user kept gazing at the camera is reliable as a gesture input intended for the system, and accepting it.

[0158] According to the second embodiment configured as described above, the device avoids the problem that, because gesture input analysis is not sufficiently accurate, gesture recognition fails to cut out, from the signal continuously obtained from the input device, the portion that the user intended as an input message. As a result, an interface can be realized that does not burden the user with malfunctions and the like.

[0159] In addition, for an interface device that uses media which the user employs not only for input to the computer currently being operated but also for communication with other people nearby, the embodiment realizes a device that does not mistake a gesture for input to itself when the user makes the gesture not toward the interface device but, for example, toward another person beside the user.

[0160] Furthermore, because the input mode need not be changed by a special operation such as pressing a button or selecting a menu item, a natural interface device can be realized.

[0161] The present invention also makes it possible to make effective use of nonverbal messages, which are said to play an important role in communication between people.

[0162] (iii) Third Embodiment
Next, a multimodal interface device and a multimodal interface method according to the third embodiment of the present invention are described with reference to the drawings.

[0163] FIG. 12 shows a configuration example of the multimodal interface device according to the third embodiment of the present invention. The device comprises a gaze target detection unit 301, an input unit 302, an output unit 303, a dialogue management unit 304, and an application 305.

[0164] In FIG. 12, 301 is the gaze target detection unit, which detects the user's gaze target. It is realized with the same configuration as the gaze target detection unit 201 of the second embodiment and outputs the same kind of gaze target information.

[0165] In FIG. 12, 302 is the input unit, which accepts input from the user, such as voice input, image input, or operation input from devices such as a keyboard, mouse, joystick, trackball, touch sensor, or buttons.

[0166] In FIG. 12, 303 is the output unit, which presents output to the user, such as audio output, image output, or force output through a force-presentation device.

[0167] In FIG. 12, 304 is the dialogue management unit, which controls the input unit 302 and the output unit 303 by conventional techniques using, for example, scripts, utterance pairs, utterance exchange structures, or planning. It thereby realizes the dialogue (interaction) between the user and the device, including accepting input signals from the user, presenting output signals to the user, adjusting the timing of those input and output signals, and conducting confirmation or clarification dialogues with the user.

[0168] In FIG. 12, 305 is the application, which determines the content of the response to a user request supplied by the dialogue management unit 304 by means of, for example, database search, inference, or arithmetic processing, and returns it to the dialogue management unit 304.

[0169] The dialogue management unit 304 achieves the effects of the device by referring to the gaze target information supplied as needed from the gaze target detection unit 301 and operating according to processing procedure D below.

[0170] FIG. 13 is a flowchart of processing procedure D.

[0171] <Processing Procedure D>
D1: If input I is to be received from the user through the input unit 302, proceed to step D2; if output O is to be presented to the user through the output unit 303, proceed to step D9.

[0172] D2: Reset and start timer Q.

[0173] D3: If timer Q exceeds the predetermined value H, proceed to step D1.

[0174] D4: Refer to the target information B of the gaze target information W obtained from the gaze target detection unit 301. If it shows that the user is gazing at gaze target X, a predetermined specific object or region, proceed to step D2.

[0175] D5: If input I from the user is detected by the input unit 302, proceed to step D7.

[0176] D6: Proceed to step D3.

[0177] D7: The result of processing input I in the input unit 302 is passed to the application 305 through the dialogue management unit 304.

[0178] D8: The application 305 determines the output O with which to respond to the user and passes it to the dialogue management unit 304.

[0179] D9: Start presenting output O to the user through the output unit 303.

[0180] D10: When the output O through the output unit 303 has finished, proceed to step D1.

[0181] D11: Refer to the target information B of the gaze target information V obtained from the gaze target detection unit 301. If the user's gaze at gaze target Y, a predetermined specific object or region, is detected, proceed to step D13.

[0182] D12: Proceed to step D10.

[0183] D13: Interrupt the presentation of the current output O, then present output O to the user again.

[0184] D14: If the user makes an explicit input indicating that output O was not conveyed correctly from the device to the user, for example a nonverbal utterance such as "Huh?", proceed to step D16.

[0185] D15: Proceed to step D1.

[0186] D16: Start a confirmation dialogue to check whether the user has understood output O.

[0187] D17: Proceed to step D1.

[0188] Next, the operation of the third embodiment is described using a specific example.

[0189] As an example, assume a multimodal interface device that has voice input as the input unit 302 and, as the output unit 303, audio output from a speaker and image output on a display.

[0190] Assume also that the mind's eye (described below) is set as the specific gaze target X in step D4 of processing procedure D, and that the speaker is set as the specific gaze target Y in step D11 of processing procedure D.

[0191] Suppose first that the device outputs speech such as "Please tell me the destination" to the user and then waits to receive the user's answer to this question.

[0192] To receive the user's voice input in answer to the question, processing branches from step D1 to step D2.

[0193] The loop of steps D2 to D6 is then repeated under timer Q for the duration H. Suppose that this time the user makes a voice input I1, for example "Kobe City," within that time.

[0194] The processing up to this point is the same as in conventional multimodal interface processing or dialogue devices.

[0195] Next, consider the same situation, but where the user cannot immediately answer with the information to be input (for example, the destination) and adopts the behavior known as the mind's eye in order to recall it.

[0196] The mind's eye refers to the tendency of people to look in a particular direction when trying to recall information or gather their thoughts; typically, they look diagonally upward.

[0197] In this device, when the user gazes at the predetermined specific gaze target (here, diagonally upward), the gaze target detection unit 301 outputs gaze target information W1 whose target information B contains the symbol "mind's eye".

[0198] Therefore, if the user performs the mind's eye behavior (specifically, here, gazing diagonally upward) while processing procedure D is in the loop of steps D2 to D6 waiting for user input, the gaze target detection unit 301 detects it and outputs gaze target information containing the symbol "mind's eye" in the target information B column, as in entry q251 or q252 of FIG. 5.

[0199] Processing thus proceeds from step D4 to step D2, timer Q is reset, and the time spent waiting for the user's input is extended.

[0200] Through the above processing, the device automatically extends the input waiting time when the user performs the mind's eye behavior, for example to recall the information to be input. The result is a user-friendly multimodal interface.
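The waiting loop of steps D2 to D6 can be sketched as follows. To keep the example deterministic, it replays a pre-recorded list of timestamped events instead of reading a real clock or sensors; the event format, the function name, and the label `"minds_eye"` are assumptions made for illustration only.

```python
def wait_for_input(events, timeout, extend_target="minds_eye"):
    """Replay timestamped events and return the user's input, or None on timeout.

    events: iterable of (time, kind, value) tuples sorted by time, where kind
            is "gaze" (value = gaze target) or "input" (value = user input).
    timeout: the predetermined value H of timer Q.
    """
    deadline = timeout  # D2: timer Q starts at time 0
    for t, kind, value in events:
        if t > deadline:
            # D3: timer Q exceeded H before this event arrived.
            return None
        if kind == "gaze" and value == extend_target:
            # D4 -> D2: the user is recalling something, so restart the timer.
            deadline = t + timeout
        elif kind == "input":
            # D5: input detected; it would go on to steps D7 and D8.
            return value
    return None
```

With `timeout=5`, an input arriving at time 6 is lost unless a mind's eye gaze was seen earlier (for example at time 2), in which case the deadline moves to 7 and the input is accepted.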

[0201] Next, voice input I1 causes processing to proceed from step D5 to steps D7 and D8. Suppose that the application 305 determines output O1, corresponding to the speech output "Please tell me the new postal code," as the output to the user, and passes it to the dialogue management unit 304.

[0202] Processing then proceeds to step D9, and the speech output for O1, "the new postal code...", begins to be presented to the user.

[0203] From here, the loop of steps D10 to D12 continues presenting output O1 to the user. Suppose that this time, partway through the output, the user fails to catch part of the output being presented, for example the "new postal code" portion, and therefore gazes at the speaker.

[0204] This gaze at the speaker is detected by the gaze target detection unit 301 and passed to the dialogue management unit 304 as gaze target information V1.

[0205] Because of gaze target information V1, processing branches from step D11 to step D13.

[0206] In step D13, the output in progress is interrupted and presented to the user again through the output unit 303.

[0207] If the user is able to receive the re-presented output, processing proceeds from step D15 to step D1 and returns to the initial state.

[0208] Through the above processing, even when the user fails to take in the output information, re-presentation is triggered simply by gazing at the predetermined specific gaze target, so the user can receive the output information correctly.
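The output loop of steps D9 to D13 can be sketched in the same replay style. The output is modeled as a list of chunks, and gaze observations are supplied per emitted chunk. The chunking, the function name, and the `max_restarts` limit are assumptions for illustration; the specification does not state how many times re-presentation may occur.

```python
def present_output(chunks, gaze_by_step, retrigger_target="speaker", max_restarts=1):
    """Return the sequence of chunks actually emitted to the user.

    chunks: the output O split into pieces (e.g. words of a spoken prompt).
    gaze_by_step: dict mapping the index of an emitted chunk (counted over
                  the whole presentation) to the gaze target seen at that point.
    """
    emitted = []
    restarts = 0
    step = 0  # counts every chunk emitted, including repeats
    i = 0     # position within chunks
    while i < len(chunks):  # D10: loop until the output has finished
        emitted.append(chunks[i])
        gaze = gaze_by_step.get(step)
        step += 1
        if gaze == retrigger_target and restarts < max_restarts:
            # D11 -> D13: interrupt and present the output again from the start.
            restarts += 1
            i = 0
            continue
        i += 1  # D12: keep presenting
    return emitted
```

For the prompt `["new", "postal", "code"]`, a gaze at the speaker observed while the second chunk is playing yields `["new", "postal", "new", "postal", "code"]`.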

[0209] This realizes a function for responding appropriately to a user who behaves toward the device as people do in conversation with one another: when they cannot understand or fail to hear something, they unconsciously look at the conversation partner, thereby feeding back to the partner that a problem has occurred.

[0210] Alternatively, if the user still cannot receive the output information correctly after the re-presentation in step D13, the user can explicitly indicate that the problem has occurred, and the confirmation dialogue is then started through steps D14, D16, and D17.

[0211] According to the third embodiment configured as described above, the user need not perform a deliberate operation such as pressing a button in order to extend the waiting time for information input. It is therefore possible to provide a multimodal interface device and a multimodal interface method that are natural, not cumbersome for the user, require no training to learn, and do not increase the burden on the user.

[0212] It is also possible to provide a multimodal interface device and a multimodal interface method that make effective use of nonverbal messages, which are said to play an important role in communication between people.

[0213] In addition, in order to produce appropriate output in response to user input, or to control appropriately the timing of input from the user and output to the user, it becomes possible to predict in advance when the user's utterance will start or when it will end.

[0214] When some problem in communication with the user occurs, such as a failure to recognize the user's input or a failure to convey information to the user, the occurrence of that problem can be detected appropriately.

[0215] It also becomes possible to resolve a detected problem by, for example, re-presenting information for confirmation, conducting a clarification dialogue with the user, or appropriately managing the flow of the dialogue.

[0216] The present invention is not limited to the embodiments described above.

[0217] First, the first embodiment showed a method of detecting the user's breathing state by analyzing image information obtained from a camera, but the same effect can be obtained by, for example, a method using a sensor attached to or placed near the user's body or clothing.

[0218] The first embodiment also showed examples in which the detected information on the user's breathing state is used to predict the start time of an utterance or to detect an utterance that continues from a preceding one. The device may instead be configured to detect the depth of the user's breathing as well and to predict the length of the following utterance from that depth. The relationship between breathing depth and the length of the whole subsequent utterance, or of the phrase up to the next breath, would be prepared in advance or estimated from, for example, learning data extracted from the actual usage history up to that point. The predicted length can then be used in capturing the utterance, in acoustic analysis, in linguistic analysis, or in managing turn-taking timing in the dialogue.
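As one way to picture the estimation from usage history, the sketch below fits a straight line to past (breathing depth, utterance length) pairs by ordinary least squares and uses it to predict the next utterance length. The linear model, the function names, and the units are assumptions for illustration; the specification only says that the relationship is prepared in advance or inferred from learning data.

```python
def fit_depth_to_length(depths, lengths):
    """Least-squares line through past (breathing depth, utterance length) pairs.

    Returns (slope, intercept). Requires at least two distinct depth values.
    """
    n = len(depths)
    mean_x = sum(depths) / n
    mean_y = sum(lengths) / n
    sxx = sum((x - mean_x) ** 2 for x in depths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(depths, lengths))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_length(depth, model):
    """Predicted length of the utterance following a breath of the given depth."""
    slope, intercept = model
    return slope * depth + intercept
```

A caller would refit the model as new history accumulates and pass the prediction to, for example, the component that decides how long to keep capturing audio.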

【０２１９】また、上述の第２実施形態では、＜処理手
順Ｃ＞のステップＣ２〜ステップＣ５の処理において、
各ジェスチャ入力候補の開始時間と終了時間との間の全
時間区間に対応する注視状況情報記憶部２０２ｃのエン
トリに関して、種別リスト情報Ｂを参照した条件判断を
行ない、該ジェスチャ入力候補を受理すべきかどうかを
判断するようにしているが、例えば、該ジェスチャ入力
候補の提示されている時間の内の、例えば時間比率の上
での最初の一部分であるとか、あるいは最後の一部分で
あるとか、あるいは最初の一部分と最後の一部分の双方
などといった、特定の部分に関してのみ、同様の条件判
断を行なって、該ジェスチャ入力候補を受理すべきかど
うかを判断するように構成することも可能である。In the second embodiment, in the processing of steps C2 to C5 of <processing procedure C>,
With regard to the entries in the gaze state information storage unit 202c corresponding to the entire time interval between the start time and the end time of each gesture input candidate, a condition determination is performed with reference to the type list information B, and should the gesture input candidate be accepted? Whether the gesture input candidate is the first part or the last part on the time ratio, for example, of the time when the gesture input candidate is presented, or the first part It is also possible to perform a similar condition determination only on a specific part, such as both the part and the last part, to determine whether to accept the gesture input candidate.

[0220] Furthermore, the temporal position and the number of the parts used for this condition judgment may be adjusted in advance for each use, or adjusted automatically and adaptively. This makes it possible to realize a multimodal interface device and a multimodal interface method that accept gesture input appropriately even when a user has a habit of, for example, starting a gesture while gazing in a particular direction and then looking away before finishing the gesture.
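One way the adaptive adjustment could work is sketched below. The specification does not describe an adaptation procedure, so the approach shown here (choosing the checked portion that best agrees with a labelled usage history) and the names `PORTIONS` and `choose_portion` are purely hypothetical.

```python
# Candidate settings for which part of the gesture interval is checked.
PORTIONS = ("all", "first", "last", "both")


def choose_portion(history):
    """Pick the setting whose accept/reject decisions best match the history.

    `history` is a list of (gaze_ok, intended) pairs, where `gaze_ok` maps
    each portion name to whether the gaze condition held in that portion and
    `intended` records whether the user really meant the gesture as input.
    How `intended` would be obtained in practice is left open here.
    """
    def agreement(portion):
        return sum(1 for gaze_ok, intended in history
                   if gaze_ok[portion] == intended)

    # On a tie, max() keeps the earliest entry of PORTIONS.
    return max(PORTIONS, key=agreement)


if __name__ == "__main__":
    looks_away = {"all": False, "first": True, "last": False, "both": False}
    never_looks = {"all": False, "first": False, "last": False, "both": False}
    history = [(looks_away, True)] * 3 + [(never_looks, False)] * 2
    print(choose_portion(history))
```

For a user who habitually looks at the device only at the start of a gesture, the sample history selects the `"first"` setting.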

[0221] The first to third embodiments described above show only the case in which the present invention is realized as a device. The same functions and effects can also be obtained by writing the processing procedures and flowcharts shown in the specific examples above as a program, implementing it, and executing it on a general-purpose computer system.

[0222] That is, as shown in the example configuration of a general-purpose computer in FIG. 14, a general-purpose computer consisting of a CPU 401, a memory 402, a mass storage device 403, and a communication interface 404 is provided with input interfaces 404a to 404n, input devices 405a to 405n, output interfaces 407a to 407m, and output devices 408a to 408m. Components such as a microphone, keyboard, pen tablet, OCR, mouse, switch, touch panel, camera, data glove, and data suit are used as the input devices 406a to 406n, and a display, a speaker, a force display, and the like are used as the output devices 408a to 408m. The operations described above can then be realized under software control by the CPU 401.

[0223] That is, the methods described in the first to third embodiments can be stored, as a program executable by a computer, on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory. By loading the program from the medium into a computer and having the CPU 401 execute it, the multimodal interface device of the present invention can be realized.

[0224]

[Effects of the Invention] As described above, the present invention makes it possible to realize a multimodal interface that uses each newly available input/output medium, or a plurality of input/output media, efficiently, and that is highly efficient, effective, and able to reduce the burden on the user.

[0225] It is also possible to realize a multimodal interface that does not malfunction because of misrecognition caused by insufficient accuracy in analyzing the input from each medium, or because of a failure to cut out the signal portion that the user intended as an input message, and that therefore places no extra burden on the user.

[0226] Media such as speech input and gesture input are used not only as input to the computer or other equipment the user is currently operating, but also, for example, when the user talks to other people nearby. For an interface device using such media, the invention can realize a multimodal interface in which, when the user speaks or gestures not to the interface device but to, say, another person standing beside them, the interface device does not mistakenly judge this to be input addressed to itself.

[0227] It is also possible to realize a multimodal interface that places none of the following burdens on the user: malfunctions caused by mistakenly recognizing, as input to itself, a message the user did not intend as input to the computer; the work of recovering from the effects of such malfunctions; and the load of having to pay constant attention in order to avoid them.

[0228] It is also possible to realize a multimodal interface in which the execution speed and utilization efficiency of other services running on the equipment in use are not degraded by the processing load that arises when input signals are processed continuously even in situations where such processing is not needed.

[0229] When entering speech or gestures, no mode change by a special operation such as pressing a button or selecting from a menu is required. A multimodal interface can therefore be realized that is natural, is not cumbersome for the user, requires no training to learn, and does not add to the user's burden.

[0230] It is also possible to realize a multimodal interface that exploits the inherent advantage of the speech medium: communication can be carried out using the mouth alone, so it does not interfere with, for example, work being done by hand, and both can be used at the same time.

[0231] When gestures are input from a distance or without touching the equipment, a multimodal interface can be realized that appropriately extracts only the gestures the user intended as input.

[0232] It is also possible to realize a multimodal interface that makes effective use of non-verbal messages, which are said to play an important role in communication between humans.

[0233] It is also possible to realize a multimodal interface that can appropriately predict the timing at which the user starts speaking, the timing at which the user finishes speaking, and the like, so that appropriate output can be given to the user in response to the user's input, or so that the timing of input from the user and output to the user can be controlled appropriately.

[0234] It is also possible to realize a multimodal interface that can appropriately detect a failure in communication with the user when one occurs, for example when recognition of the user's input fails or when the output of information to the user fails.

[0235] In addition, it becomes possible to realize a multimodal interface that can resolve a detected failure by, for example, presenting information again for confirmation, conducting a clarifying question dialogue with the user, or appropriately managing the flow of the dialogue. The invention thus provides great practical benefits.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of a multimodal interface device according to a first embodiment of the present invention.

FIG. 2 is a diagram showing an example of breathing state information used in the multimodal interface device of the first embodiment.

FIG. 3 is a flowchart showing the contents of processing procedure (A) of the multimodal interface device of the first embodiment.

FIG. 4 is a block diagram showing the configuration of a multimodal interface device according to a second embodiment of the present invention.

FIG. 5 is a diagram showing an example of gaze target information used in the multimodal interface device of the second embodiment.

FIG. 6 is a diagram showing an example of gesture recognition information used in the multimodal interface device of the second embodiment.

FIG. 7 is a block diagram showing an example of the internal configuration of the control unit provided in the multimodal interface device of the second embodiment.

FIG. 8 is a diagram showing an example of the contents of the gaze interpretation rule storage unit used in the multimodal interface device of the second embodiment.

FIG. 9 is a diagram showing an example of the contents of the gaze state storage unit used in the multimodal interface device of the second embodiment.

FIG. 10 is a flowchart showing the contents of processing procedure (B) of the multimodal interface device of the second embodiment.

FIG. 11 is a flowchart showing the contents of processing procedure (C) of the multimodal interface device of the second embodiment.

FIG. 12 is a block diagram showing the configuration of a multimodal interface device according to a third embodiment of the present invention.

FIG. 13 is a flowchart showing the contents of processing procedure (D) of the multimodal interface device of the third embodiment.

FIG. 14 is a block diagram showing an example configuration of a computer that realizes the multimodal interface device according to each embodiment of the present invention.

[Explanation of Symbols]

101: Respiration detection unit
102: Speech input unit
103: Control unit
104: Application
201: Gaze target detection unit
202: Gesture recognition unit
203: Control unit
204: Application
203a: Control processing unit
203b: Gaze interpretation rule storage unit
203c: Gaze state storage unit
301: Gaze target detection unit
302: Input unit
303: Output unit
304: Dialogue management unit
305: Application
401: CPU
402: Memory
403: Mass storage device
404: Communication interface
405a to 405n: Input devices
406a to 406n: Input interfaces
407a to 407m: Output devices
408a to 408m: Output interfaces

Claims

[Claims]

[Claim 1] A multimodal interface device having, as a user interface, at least one of input means for receiving input of information from a user, output means for outputting information to the user, and dialogue management means for managing a dialogue with the user, the device comprising: non-verbal message recognition means for recognizing a non-verbal message consisting of at least one of the user's facial expression, vocalization, gaze, gesture, posture, and body movement, and outputting it as non-verbal message information; and control means for controlling, based on the non-verbal message information, at least one operation of the input means, the output means, or the dialogue management means performed for the interface with the user.

[Claim 2] A multimodal interface device comprising: breathing state recognition means for observing the state of a user's breathing and outputting it as breathing state information; input speech processing means for performing at least one of capturing, recording, processing, analyzing, and recognizing speech uttered by the user; and control means for controlling the input speech processing means based on the breathing state information so as to control the operation of at least one of processing for controlling whether a speech input signal from the user is accepted, speech section estimation processing, noise reduction processing, and speech signal conversion processing.

[Claim 3] A multimodal interface device comprising: breathing state recognition means for observing the state of a user's breathing and outputting it as breathing state information; dialogue management means for managing the timing of information input from the user and information output to the user; and control means for controlling the dialogue management means based on the breathing state information so as to adjust at least one of the timing of accepting input information from the user and the timing of outputting information to the user.

[Claim 4] The multimodal interface device according to claim 2 or 3, wherein the breathing state recognition means observes the state of the user's breathing by processing image information obtained by photographing the user, or by processing sensor information obtained from a sensor worn on or placed close to the user's body.

[Claim 5] A multimodal interface device comprising: gaze detection means for detecting the gaze direction of a user and outputting it as gaze target information; gesture recognition means for analyzing the motion of a part or the whole of the user's body, by processing image information of the user or by processing signals obtained from a remote sensor such as an infrared sensor or from a worn sensor, and recognizing gesture input from the user; and control means for controlling the operation of the gesture recognition means, based on the gaze detection information, by determining whether gesture input is accepted or by adjusting parameter information used for detecting or recognizing gesture input.

[Claim 6] A multimodal interface device comprising: gaze detection means for detecting the gaze direction of a user and outputting it as gaze target information; output means for presenting at least one of audio information, image information, and force information as output to the user through at least one of an image display, a speaker, and a force presentation device (force display); and control means for controlling the operation of the output means based on the gaze target information, wherein the output means can be controlled to output the output information again when the user gazes at the output means during or immediately after the presentation of information from the output means to the user.

[Claim 7] A multimodal interface device comprising: gaze detection means for detecting the gaze direction of a user and outputting it as gaze target information; output means for presenting at least one of audio information, image information, and force information as output to the user through at least one of an image display, a speaker, and a force presentation device (force display); dialogue management means for managing a dialogue with the user and performing at least one of re-presenting information whose transmission to the user through the output means has failed and conducting a confirmation dialogue to check whether the information has been conveyed; and control means for controlling the dialogue management unit based on the gaze target information, wherein the dialogue management means can be controlled to re-present information, to conduct a confirmation dialogue to check whether the information has been conveyed, or to accept information input from the user, when the user gazes at the output means during or immediately after the presentation of information from the output means to the user.

[Claim 8] A multimodal interface device comprising: gaze detection means for detecting the gaze direction of a user and outputting it as gaze target information; dialogue management means for managing a dialogue with the user and managing at least one of the timing of information input from the user and the timing of information output to the user; and control means for controlling the dialogue management unit according to the gaze target information and, when the gaze target of the user during the dialogue lies in a specific direction or region, adjusting at least one of the start, interruption, and end timing of input from the user or output to the user.

[Claim 9] The multimodal interface device according to any one of claims 5 to 8, wherein the gaze detection means detects the gaze direction of the user by analyzing image information of the user, or by analyzing information from a sensor worn on or placed close to the user's head, eyes, or body.

[Claim 10] A multimodal interface device that uses speech information or gesture information from a user to manage the input and output of information to and from the user, or a dialogue with the user, the device comprising: non-verbal message recognition means for recognizing the user's breathing state or the user's gaze direction as a non-verbal message from the user; and control means for controlling, based on the recognition result of the non-verbal message recognition means, an information input/output operation with the user or a dialogue operation with the user.

[Claim 11] A multimodal interface method including at least one of an input step of receiving input of information from a user, an output step of outputting information to the user, and a dialogue management step of managing a dialogue with the user, the method comprising: a non-verbal message recognition step of recognizing at least one non-verbal message issued by the user, such as the user's facial expression, vocalization, gaze, gesture, posture, or body movement, and outputting it as non-verbal message information; and controlling, based on the non-verbal message information, at least one operation of the input step, the output step, or the dialogue management step performed for the interface with the user.

[Claim 12] A multimodal interface method comprising: an input speech processing step of performing at least one of capturing, recording, processing, analyzing, and recognizing speech uttered by a user; a breathing state recognition step of observing the state of the user's breathing and outputting it as breathing state information; and a control step of controlling the input speech processing step based on the breathing state information so as to control the operation of at least one of processing for controlling whether a speech input signal from the user is accepted, speech section estimation processing, noise reduction processing, and speech signal conversion processing.

[Claim 13] A multimodal interface method comprising: a dialogue management step of managing the timing of information input from a user and information output to the user; a breathing state recognition step of observing the state of the user's breathing and outputting it as breathing state information; and a control step of controlling the dialogue management step based on the breathing state information so as to adjust at least one of the timing of accepting input information from the user and the timing of outputting information to the user.

[Claim 14] A multimodal interface method comprising: a gaze detection step of detecting the gaze direction of a user and outputting it as gaze target information; a gesture recognition step of analyzing the motion of a part or the whole of the user's body and recognizing gesture input from the user; and a control step of controlling the operation of the gesture recognition means, based on the gaze detection information, by determining whether gesture input is accepted or by adjusting parameter information used for detecting or recognizing gesture input.

[Claim 15] A multimodal interface method comprising: a gaze detection step of detecting the gaze direction of a user and outputting it as gaze target information; an output step of presenting at least one of audio information, image information, and force information as output to the user through at least one device such as an image display, a speaker, or a force presentation device (force display); and a control step of controlling the operation of the output means based on the gaze target information, wherein the output step can be controlled to output the output information again when the user gazes at the device used for outputting information to the user during or immediately after the presentation of information to the user in the output step.

[Claim 16] A multimodal interface method comprising: a gaze detection step of detecting the gaze direction of a user and outputting it as gaze target information; an output step of presenting at least one of audio information, image information, and force information as output to the user through at least one device such as an image display, a speaker, or a force presentation device (force display); a dialogue management step of managing a dialogue with the user and performing at least one of re-presenting information whose transmission to the user through the output step has failed, conducting a confirmation dialogue to check whether the information has been conveyed, and the like; and a control step of controlling the dialogue management step based on the gaze target information, wherein the dialogue management step can be controlled to re-present information, to conduct a confirmation dialogue to check whether the information has been conveyed, or to accept information input from the user, when the user gazes at the device used for information output in the output step during or immediately after the presentation of information to the user in the output step.

[Claim 17] A multimodal interface method comprising: a gaze detection step of detecting the gaze direction of a user and outputting it as gaze target information; a dialogue management step of managing a dialogue with the user and managing at least one of the timing of information input from the user and the timing of information output to the user; and a control step of controlling the dialogue management step according to the gaze target information and, when the gaze target of the user during the dialogue lies in a specific direction or region, adjusting at least one of the start, interruption, and end timing of input from the user or output to the user.

[Claim 18] A multimodal interface method that uses speech information or gesture information from a user to manage the input and output of information to and from the user, or a dialogue with the user, the method comprising: a non-verbal message recognition step of recognizing the user's breathing state or the user's gaze direction as a non-verbal message from the user; and a control step of controlling, based on the recognition result of the non-verbal message recognition step, an information input/output operation with the user or a dialogue operation with the user.