JP6629172B2

JP6629172B2 - Dialogue control device, its method and program

Info

Publication number: JP6629172B2
Application number: JP2016229908A
Authority: JP
Inventors: 小林　和則; 和則小林; 弘章伊藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-11-28
Filing date: 2016-11-28
Publication date: 2020-01-15
Anticipated expiration: 2036-11-28
Also published as: JP2018087847A

Description

The present invention relates to computer-based dialogue technology, such as interactive robots and voice remote controls.

Patent Literature 1 is known as prior art in computer-based dialogue technology. FIG. 1 shows a functional block diagram of the dialogue device of Patent Literature 1. The response unit 93 obtains a word string, and also an abstracted character string, from the input character string of a user utterance. It then searches the word patterns stored in the word pattern database 91 or the abstract patterns stored in the abstract pattern database 92 for a word pattern or abstract pattern judged to match the obtained word string. When such a pattern is found, the response unit 93 responds using the word string of the utterance data that follows the retrieved word pattern or abstract pattern.

Patent Literature 1: JP 2015-46183 A

However, the prior art offers only two choices: respond or do not respond. Even when it is uncertain whether a given voice should be answered, the device can only stay silent or respond on the basis of uncertain information, so an erroneous response is likely. When a human is unsure whether he or she is being addressed, the human asks a question in return or turns toward the speaker to check.

An object of the present invention is to provide a dialogue control device, a method, and a program that control a dialogue device so that it performs the kind of confirmation behavior a human performs, as described above, and thereby reduce erroneous responses by the dialogue device.

To solve the above problem, according to one aspect of the present invention, a dialogue control device includes: a scenario storage unit that stores (i) a talking scenario, in which the dialogue device starts a dialogue by outputting a voice that triggers the dialogue, (ii) a response scenario, which responds to an utterance from the user, and (iii) a confirmation scenario, which confirms with the user whether to start a dialogue; and a scenario selection unit that receives as input a talking start index S, indicating whether the dialogue device should start a dialogue by outputting a triggering voice, and a response start index R, indicating whether a given voice should be responded to, and that, with J and K each being an integer of 1 or more, selects the talking scenario, the response scenario, or the confirmation scenario based on the magnitude relationship between the talking start index S and J thresholds Th_{s,1}, Th_{s,2}, ..., Th_{s,J} and the magnitude relationship between the response start index R and K thresholds Th_{r,1}, Th_{r,2}, ..., Th_{r,K}.

To solve the above problem, according to another aspect of the present invention, in a dialogue control method, a scenario storage unit stores (i) a talking scenario, in which the dialogue device starts a dialogue by outputting a voice that triggers the dialogue, (ii) a response scenario, which responds to an utterance from the user, and (iii) a confirmation scenario, which confirms with the user whether to start a dialogue. The method includes a scenario selection step in which a scenario selection unit receives as input a talking start index S, indicating whether the dialogue device should start a dialogue by outputting a triggering voice, and a response start index R, indicating whether a given voice should be responded to, and, with J and K each being an integer of 1 or more, selects the talking scenario, the response scenario, or the confirmation scenario based on the magnitude relationship between the talking start index S and J thresholds Th_{s,1}, Th_{s,2}, ..., Th_{s,J} and the magnitude relationship between the response start index R and K thresholds Th_{r,1}, Th_{r,2}, ..., Th_{r,K}.

According to the present invention, erroneous responses can be reduced.

FIG. 1 is a functional block diagram of a dialogue device according to the prior art.
FIG. 2 is a functional block diagram of the dialogue control device according to the first embodiment.
FIG. 3 is a diagram showing an example of the processing flow of the dialogue control device according to the first embodiment.
FIG. 4 is a functional block diagram of the response determination unit.
FIG. 5 is a diagram for explaining the scenario selection criteria.
FIG. 6 is a diagram for explaining the scenario selection criteria.
FIG. 7 is a state transition diagram of the scenario selection unit.
FIG. 8 is a functional block diagram of the start index calculation unit.
FIG. 9 is a diagram showing an example of the processing flow of the start index calculation unit.
FIG. 10 is a diagram showing the relationship between a threshold and the difference between the detection result indicating the direction of the face as seen from the camera and the sound source direction estimate.
FIG. 11 is a functional block diagram of the time correction unit.
FIG. 12 is a diagram for explaining a processing example of the time correction unit.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted.

<First embodiment>
FIG. 2 shows a functional block diagram of the dialogue control device 100 according to the first embodiment, and FIG. 3 shows its processing flow.

The dialogue control device 100 is a computer that includes a CPU, a RAM, and a ROM storing a program for executing the processing described below, and is functionally configured as follows. The dialogue control device 100 includes a start index calculation unit 110 and a correspondence determination unit 120.

The dialogue control device 100 receives as input data based on the output signal x(t₀) of the microphone array 61, data based on the output signal y(u) of the image sensor of the camera 71, and the output signal I₇(t₇) of the human sensor 81, and outputs a control signal z(t₈) for operating a dialogue device (not shown). The control signal z(t₈) may be changed as appropriate according to the action the dialogue device is to perform. For example: (i) if the dialogue device conducts a spoken dialogue, a playback signal corresponding to the utterance is output as the control signal z(t₈) and played through the speaker of the dialogue device; (ii) if the dialogue device expresses intent through non-verbal communication (for example, gestures), a drive signal corresponding to the non-verbal communication is output as the control signal z(t₈), and the motors or other actuators of the dialogue device are driven so that the desired non-verbal communication is performed mechanically; (iii) if the dialogue control device 100 conducts a dialogue through text, illustrations, or light signals with a predetermined meaning, image data, video data, or a signal that blinks an LED corresponding to the utterance is output as the control signal z(t₈) and presented on the display, LED, or similar component of the dialogue device, so that the dialogue is realized with text, illustrations, light signals, and the like.

Here, t₀, u, and t₇ denote the sample numbers of the microphone array 61, the image sensor of the camera 71, and the human sensor 81, respectively, or the times corresponding to those sample numbers. Because the sampling periods do not necessarily match, different sample numbers are used. t₈ denotes the index of the output signal.

<Data based on the output signal x(t₀) of the microphone array 61>
For example, the microphone array 61 consists of N microphones, and the output signal x(t₀) includes x₁(t₀), x₂(t₀), ..., x_N(t₀); for example, x(t₀) = {x₁(t₀), x₂(t₀), ..., x_N(t₀)}. N is an integer of 1 or more.

The sound detection unit 62 receives the output signal x(t₀), detects human vocalization contained in it, and outputs a sound detection result I₁(t₀). For example, I₁(t₀) = 1 when vocalization is present at sample time t₀, and I₁(t₀) = 0 when it is not. Any existing detection technique may be used, and the one best suited to the usage environment may be selected as appropriate.

The sound source direction estimation unit 63 receives the output signal x(t₀), estimates the sound source direction, and outputs the estimate x_D(t₀). Any existing sound source direction estimation technique may be used, and the one best suited to the usage environment may be selected as appropriate. For example, the technique described in JP 2010-175431 A can be used.

The sound level estimation unit 64 receives the output signal x(t₀), estimates the level of the speech contained in it, and outputs the estimate x_L(t₀). Any existing speech level estimation technique may be used, and the one best suited to the usage environment may be selected as appropriate.

The speech recognition unit 65 performs speech recognition on the output signal x(t₀) and outputs the result x_R(t₄). Any existing speech recognition technique may be used, and the one best suited to the usage environment may be selected as appropriate. For example, the technique described in JP 2015-1695 A can be used. Here, t₄ denotes the index of the speech recognition result. For example, the unit takes the time series of output signals x(t₀) (multiple samples) for one utterance as input and outputs one speech recognition result x_R(t₄) for that utterance. In this embodiment, the speech recognition unit 65 takes the output signal x(t₀), which is a speech signal, as input and outputs a character string that has been morphologically analyzed and split into words. The dialogue control device 100 is therefore assumed to receive a character string split into words.

Accordingly, the data based on the output signal x(t₀) of the microphone array 61 includes, for example, the sound detection result I₁(t₀), the sound source direction estimate x_D(t₀), the speech level estimate x_L(t₀), and the speech recognition result x_R(t₄).

In this embodiment, the sampling period of the microphone array 61 is the same as the output period of the sound detection result I₁(t₀), the sound source direction estimate x_D(t₀), and the speech level estimate x_L(t₀), but each may be output at a different period depending on the processing method. In that case, one output (for example, the speech recognition result x_R(t₄)) may be taken as the reference, and for each of the other outputs the value closest to that reference output may be used.
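As an illustration of pairing outputs that arrive at different periods, the following minimal sketch looks up, for a reference time, the value of another output closest to it. It assumes that "closest" means nearest in time and that each output carries a timestamp; the function name `nearest_sample` and the sample data are hypothetical and do not come from the patent.

```python
from bisect import bisect_left

def nearest_sample(times, values, t_ref):
    """Return the value whose timestamp is closest to t_ref.

    `times` must be sorted in ascending order and aligned with `values`.
    """
    if not times:
        raise ValueError("no samples available")
    i = bisect_left(times, t_ref)
    if i == 0:
        return values[0]
    if i == len(times):
        return values[-1]
    # Pick whichever neighbour is closer in time (ties go to the earlier one).
    before, after = times[i - 1], times[i]
    return values[i - 1] if t_ref - before <= after - t_ref else values[i]

# Hypothetical data: direction estimates every 10 ms, looked up at the
# time of one speech recognition result.
direction_times = [0.00, 0.01, 0.02, 0.03]
direction_values = [30.0, 31.0, 29.5, 28.0]
print(nearest_sample(direction_times, direction_values, t_ref=0.024))
```

If "closest" is instead read as "most recent output at or before the reference", the tie-breaking branch would simply always return `values[i - 1]`.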

<Data based on the output signal y(u) of the image sensor of the camera 71>
The face detection unit 72 receives the output signal y(u) of the image sensor, determines the direction, as seen from the camera 71, of the face contained in the image corresponding to y(u), and outputs that direction as the detection result y_D(u).

The face detection unit 73 receives the output signal y(u) of the image sensor, determines the size of the face contained in the image corresponding to y(u), and outputs that size as the detection result y_S(u). Any existing face detection technique may be used, and the one best suited to the usage environment may be selected as appropriate.

Accordingly, the data based on the output signal y(u) of the image sensor of the camera 71 includes, for example, the detection result y_D(u) indicating the direction of the face as seen from the camera and the detection result y_S(u) indicating the size of the face.

In this embodiment, the sampling period of the image sensor of the camera 71 is the same as the output period of the detection results y_D(u) and y_S(u), but each may be output at a different period depending on the processing method. In that case, one of the outputs may be taken as the reference, and for the other output the value closest to that reference output may be used.

<Human sensor 81>
The human sensor 81 is, for example, a sensor using infrared light, ultrasound, or visible light. It detects the presence of a person and outputs the detection result as the output signal I₇(t₇). For example, I₇(t₇) = 1 when a person is present within the sensing range of the human sensor 81 at sample time t₇, and I₇(t₇) = 0 when no person is present.

<Start index calculation unit 110>
The start index calculation unit 110 receives as input the sound detection result I₁(t₀), the sound source direction estimate x_D(t₀), the speech level estimate x_L(t₀), the speech recognition result x_R(t₄), the detection result y_D(u) indicating the direction of the face, the detection result y_S(u) indicating the size of the face, and the output signal I₇(t₇) of the human sensor 81. It analyzes these inputs together to obtain a talking start index S(u) and a response start index R(t₄) (S110), and outputs them to the correspondence determination unit 120.

A dialogue may start either when the dialogue device outputs a triggering voice or when the human makes a triggering utterance. The index indicating whether the dialogue device should start a dialogue by outputting a triggering voice is called the "talking start index". The talking start index S takes a value from 0 to 1, for example; the closer it is to 1, the more the device should start talking, and the closer it is to 0, the more it should refrain. The index indicating whether the dialogue device should respond to "a given voice" is called the "response start index". The response start index R takes a value from 0 to 1, for example; the closer it is to 1, the more the device should start responding, and the closer it is to 0, the more it should refrain. If the given voice is a human utterance that triggers a dialogue, the device should of course start responding. If the voice is not directed at the dialogue device, for example because it is not an utterance addressed to the device or because it comes from a TV with no intent to converse, the device should determine that it should not start responding.

In this embodiment, the talking start index S(u) is computed each time the detection results y_D(u) and y_S(u) indicating the direction and size of the face are obtained, and the response start index R(t₄) is computed each time a speech recognition result x_R(t₄) is obtained. The talking start index is therefore indexed by u and the response start index by t₄.

<Correspondence determination unit 120>
The correspondence determination unit 120 receives the talking start index S(u) and the response start index R(t₄), determines the action of the dialogue device based on these indices (S120), and outputs the control signal z(t₈) for operating the dialogue device. The talking start index S(u) and the response start index R(t₄) arrive at the correspondence determination unit 120 at different times, so the unit operates whenever either of them is input.

FIG. 4 shows a functional block diagram of the correspondence determination unit 120. The correspondence determination unit 120 includes a scenario selection unit 122 and a scenario storage unit 123.

(Scenario storage unit 123)
The scenario storage unit 123 stores a talking scenario, a response scenario, and confirmation scenarios in advance of use. Here, (i) a talking scenario is a dialogue scenario in which the dialogue device starts a dialogue by outputting a voice that triggers the dialogue, (ii) a response scenario is a dialogue scenario that responds to an utterance from the user, and (iii) a confirmation scenario is a dialogue scenario that confirms with the user whether to start a dialogue.

As the talking scenario, for example, an utterance from the dialogue device as in the prior art is prepared. As the response scenario, for example, a direct reaction to a question or greeting as in the prior art is prepared. Two kinds of confirmation scenario are prepared, for example. The first (hereinafter "confirmation scenario 1") asks whether the device is being addressed when that is unclear, with phrases such as "What?", "Do you need something?", "Me?", or "Hm?". The second (hereinafter "confirmation scenario 2") is used when a person is nearby but it is not certain whether a dialogue should be started, and checks this through natural behavior: for example, driving a motor to turn the face of the dialogue device toward the direction in which a face was recognized without outputting any voice, or muttering something to itself such as "I'm kind of bored".

(Scenario selection unit 122)
The scenario selection unit 122 receives the talking start index S(u) and the response start index R(t₄). Based on the magnitude relationship between S(u) and J thresholds Th_{s,1}, Th_{s,2}, ..., Th_{s,J} and the magnitude relationship between R(t₄) and K thresholds Th_{r,1}, Th_{r,2}, ..., Th_{r,K}, it selects the talking scenario, the response scenario, or a confirmation scenario, and outputs the control signal z(t₈) for operating the dialogue device according to the selected scenario. In this embodiment, the two confirmation scenarios described above (confirmation scenario 1 and confirmation scenario 2) are prepared. In addition, a "no action" scenario is prepared for the case in which the dialogue device should neither start a dialogue by outputting a triggering voice nor respond. For "no action", the control signal z(t₈) need not be output, or a control signal z(t₈) indicating that the device does not act may be output. J and K are each an integer of 1 or more.

As described above, the correspondence determination unit 120 operates whenever either the talking start index S(u) or the response start index R(t₄) is input. Normally, several talking start indices S(u) are input between one response start index R(t₄-1) and the next response start index R(t₄). When a response start index R(t₄) is input, the scenario selection unit 122 may therefore compare against the thresholds using only the latest talking start index S(u), the average of the talking start indices S(u) input since the previous response start index R(t₄-1), or the average of the latest N talking start indices S(u), S(u-1), ..., S(u-N+1). When the unit operates because a talking start index S(u) has been input, it may compare against the thresholds using the most recent response start index R(t₄).
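The three ways of summarizing the talking start index described above can be sketched as follows. This is a minimal illustration under assumed names: the class `TalkIndexBuffer`, its method names, and the default window length are hypothetical and are not taken from the patent.

```python
from collections import deque

class TalkIndexBuffer:
    """Collects talking start indices S(u) between response start indices."""

    def __init__(self, window=5):
        self._window = deque(maxlen=window)  # latest N values of S(u)
        self._since_last_r = []              # values since the previous R

    def push(self, s):
        """Record a newly computed talking start index S(u)."""
        self._window.append(s)
        self._since_last_r.append(s)

    def latest(self):
        """Option 1: use only the most recent S(u)."""
        if not self._window:
            raise ValueError("no talking start index received yet")
        return self._window[-1]

    def mean_since_last_response(self):
        """Option 2: average of S(u) received since the previous R."""
        if not self._since_last_r:
            raise ValueError("no talking start index since the previous R")
        return sum(self._since_last_r) / len(self._since_last_r)

    def mean_of_window(self):
        """Option 3: average of the latest N values of S(u)."""
        if not self._window:
            raise ValueError("no talking start index received yet")
        return sum(self._window) / len(self._window)

    def mark_response_index(self):
        """Call after an R(t4) has been processed to restart option 2."""
        self._since_last_r.clear()

buf = TalkIndexBuffer(window=2)
for s in (0.2, 0.4, 0.9):
    buf.push(s)
print(buf.latest(), buf.mean_since_last_response(), buf.mean_of_window())
```

Which summary works best is a tuning choice: the latest value reacts fastest, while the averages smooth out frame-to-frame noise in face detection.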

For example, the scenario selection unit 122 classifies the talking start index S(u) into three levels, high, medium, and low, using two preset thresholds Th_{s,1} and Th_{s,2} (J = 2). The level is high when S(u) exceeds the threshold Th_{s,1} (Th_{s,1} < S(u)), low when S(u) is at or below the threshold Th_{s,2} (S(u) ≦ Th_{s,2}), and medium otherwise (Th_{s,2} < S(u) ≦ Th_{s,1}). Any number of levels may be used as long as there are two or more.

The response start index R(t₄) is classified in the same way, for example into three levels, high, medium, and low (K = 2). The thresholds Th_{r,1} and Th_{r,2} used to classify the response start index R(t₄) are set independently of the thresholds Th_{s,1} and Th_{s,2} for the talking start index S(u).
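The three-level classification can be written directly from the inequalities in the text. The sketch below is illustrative only: the function name `classify` and the threshold values in the example are assumptions, and the same function serves for both S(u) and R(t₄) with their own, independently chosen thresholds.

```python
def classify(value, th_1, th_2):
    """Classify an index into "high", "medium", or "low".

    th_1 is the upper threshold (Th_{s,1} or Th_{r,1}) and th_2 the lower
    threshold (Th_{s,2} or Th_{r,2}); th_2 <= th_1 is required.
    """
    if th_2 > th_1:
        raise ValueError("the lower threshold must not exceed the upper one")
    if value > th_1:       # Th_1 < value
        return "high"
    if value <= th_2:      # value <= Th_2
        return "low"
    return "medium"        # Th_2 < value <= Th_1

# Hypothetical thresholds, chosen independently for S and R.
s_level = classify(0.82, th_1=0.7, th_2=0.3)
r_level = classify(0.40, th_1=0.6, th_2=0.2)
print(s_level, r_level)
```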

FIG. 5 shows the selection criteria used by the scenario selection unit 122 to select the talking scenario, the response scenario, or a confirmation scenario. When the talking start index S(u) and the response start index R(t₄) are each classified into three levels, their combinations give nine classes. The scenario to use for each class is set in advance, and a scenario is selected according to the actual input values of S(u) and R(t₄).

For example, the selection criteria are created as follows.
(i) The criteria are created so that the talking scenario is more likely to be selected when the talking start index S(u) is large (the dialogue device should start a dialogue by outputting a triggering voice), and the response scenario is more likely to be selected when the response start index R(t₄) is large (the dialogue device should respond to a given voice).
(ii) When it is unclear whether the dialogue device should start a dialogue by outputting a triggering voice, the criteria are created so that the scenario that checks through natural behavior whether to start a dialogue (confirmation scenario 2) is more likely to be selected. When it is unclear whether the dialogue device should respond, the criteria are created so that the scenario that asks whether the device is being addressed (confirmation scenario 1) is more likely to be selected.
(iii) When the dialogue device should neither start a dialogue by outputting a triggering voice nor respond, the criteria are created so that the dialogue device does not act.
(iv) The criteria are created so that the index classified into the higher level takes priority. For example, when the talking start index S(u) is classified as high and the response start index R(t₄) as medium or low, the criteria favor the talking scenario, which is the scenario selected when S(u) is high.
(v) When the response start index R(t₄) and the talking start index S(u) are classified at the same level, the criteria are created so that the response start index R(t₄) takes priority. For example, the response scenario is selected when Th_{s,1} < S(u) and Th_{r,1} < R(t₄), and confirmation scenario 1 (asking back) is selected when Th_{s,2} < S(u) ≦ Th_{s,1} and Th_{r,2} < R(t₄) ≦ Th_{r,1}. This assumes that a user who speaks to the device and gets no response (is ignored) loses more motivation to interact than a user who receives a mistaken response; creating the criteria in this way prevents the user's motivation to interact from being dampened.
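One way to satisfy conditions (i) to (v) with three levels per index is a fixed nine-entry lookup table. The table below is a hypothetical example constructed from those conditions; it is not a transcription of FIG. 5, and the scenario labels (`"talk"`, `"response"`, `"confirm1"`, `"confirm2"`, `"none"`) are names chosen for this sketch.

```python
# Hypothetical selection table; keys are (level of S, level of R).
SELECTION_TABLE = {
    ("high", "high"): "response",      # (v) same level: R takes priority
    ("medium", "high"): "response",    # (iv) R is in the higher level
    ("low", "high"): "response",       # (i), (iv)
    ("high", "medium"): "talk",        # (iv) S is in the higher level
    ("medium", "medium"): "confirm1",  # (v) same level: R wins, so ask back
    ("low", "medium"): "confirm1",     # (ii) unclear whether to respond
    ("high", "low"): "talk",           # (i), (iv)
    ("medium", "low"): "confirm2",     # (ii) unclear whether to start talking
    ("low", "low"): "none",            # (iii) neither talk nor respond
}

def select_scenario(s_level, r_level):
    """Look up the scenario for the classified levels of S(u) and R(t4)."""
    return SELECTION_TABLE[(s_level, r_level)]

print(select_scenario("medium", "medium"))
print(select_scenario("high", "low"))
```

Keeping the criteria in a table rather than in nested conditionals makes it straightforward to swap in a different table, for instance one with more levels.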

In FIG. 5, the talking start index S(u) and the response start index R(t₄) are each classified into three levels, but for other numbers of levels (for example, two levels, or four or more levels) the selection criteria may likewise be created so as to satisfy conditions (i) to (v) above.

＜効果＞
このような構成により、対話装置への話しかけかどうかあやふやな場合に、質問で聞き返したり、利用者のほうを向いて自分への話しかけであるかを確認したりすることができ、より人間らしいふるまいをすることができる。その結果、誤った応答を低減することができる。 <Effect>
With such a configuration, when it is unclear whether the user is talking to the dialogue device, the device can ask back with a question or turn toward the user to confirm whether it is being addressed, and can thus behave in a more human-like manner. As a result, erroneous responses can be reduced.

＜第二実施形態＞
第一実施形態と異なる部分を中心に説明する。 <Second embodiment>
The following description focuses on the differences from the first embodiment.

第一実施形態において、確認シナリオ１が連続して選択されてしまうと、何度も確認行為を行うことになり、不自然な対応となってしまう。これを防ぐために、本実施形態では状態を考慮する。 In the first embodiment, if the confirmation scenario 1 is continuously selected, the confirmation act is performed many times, which is an unnatural response. In order to prevent this, the present embodiment considers the state.

シナリオ選択部１２２は、(I)待ち受け状態、(II)確認シナリオを実行後の状態である確認状態、(III)話しかけシナリオまたは応答シナリオを実行後の状態である対話状態の3つの状態を持つ。シナリオ選択部１２２は、(I)待ち受け状態、(II)確認状態、(III)対話状態の何れかの状態に遷移し、待ち受け状態、確認状態、対話状態の何れかの状態に応じて、話しかけシナリオ、応答シナリオ、または、確認シナリオを選択する際の選択基準を変更する。図６は、各状態における選択基準を示す。 The scenario selection unit 122 has three states: (I) a standby state, (II) a confirmation state, which is the state after a confirmation scenario has been executed, and (III) a dialogue state, which is the state after the talking scenario or the response scenario has been executed. The scenario selection unit 122 transitions to one of these three states and, depending on its current state, changes the selection criteria used to select the talking scenario, the response scenario, or a confirmation scenario. FIG. 6 shows the selection criteria in each state.

シナリオ選択部１２２は、話しかけ開始指標SとJ個の閾値Th_s,1,Th_s,2,…,Th_s,Jとの大小関係、及び、応答開始指標RとK個の閾値Th_r,1,Th_r,2,…,Th_r,Kとの大小関係と、シナリオ選択部１２２の状態に対応する選択基準に基づき、話しかけシナリオ、応答シナリオ、または、確認シナリオを選択し（Ｓ１２２）、選択したシナリオに対応して動作させるための制御信号z(t₈)を出力する。図６ではJ=2,K=2とする。 The scenario selection unit 122 selects the talking scenario, the response scenario, or a confirmation scenario (S122) based on the magnitude relationship between the talking start index S and the J thresholds Th_s,1,Th_s,2,…,Th_s,J, the magnitude relationship between the response start index R and the K thresholds Th_r,1,Th_r,2,…,Th_r,K, and the selection criteria corresponding to the current state of the scenario selection unit 122, and outputs a control signal z(t₈) for operating the device in accordance with the selected scenario. In FIG. 6, J=2 and K=2.

図７は、本実施形態の状態遷移図を示す。待ち受け状態を初期状態とする。 FIG. 7 shows a state transition diagram of the present embodiment. Let the waiting state be the initial state.

（待ち受け状態）
待ち受け状態において、シナリオ選択部１２２は、話しかけ開始指標S(u)及び応答開始指標R(t₄)を入力とし、待ち受け状態における判定基準に基づき、応答シナリオ、話しかけシナリオ、確認シナリオ１、確認シナリオ２、動作無しの何れかを選択し、選択したシナリオに対応して動作させるための制御信号z(t₈)を出力する。 (Standby state)
In the standby state, the scenario selection unit 122 receives the talking start index S(u) and the response start index R(t₄) as inputs, selects one of the response scenario, the talking scenario, confirmation scenario 1, confirmation scenario 2, and no action based on the determination criteria for the standby state, and outputs a control signal z(t₈) for operating the device in accordance with the selected scenario.

応答シナリオまたは話しかけシナリオが選択された場合には対話状態に遷移し、確認シナリオ１または確認シナリオ２が選択された場合には確認状態に遷移し、何れのシナリオも選択されなかった場合（動作無しが選択された場合）には待ち受け状態から待ち受け状態に遷移する（待ち受け状態を維持する）。 When the response scenario or the talking scenario is selected, the state transitions to the dialogue state. When confirmation scenario 1 or confirmation scenario 2 is selected, the state transitions to the confirmation state. When no scenario is selected (when no action is selected), the state transitions from the standby state to the standby state (the standby state is maintained).

（確認状態）
シナリオ選択部１２２は、話しかけ開始指標S(u)及び応答開始指標R(t₄)を入力とし、確認状態における判定基準に基づき、応答シナリオ、話しかけシナリオ、動作無しの何れかを選択し、選択したシナリオに対応して動作させるための制御信号z(t₈)を出力する。応答シナリオまたは話しかけシナリオが選択された場合には対話状態に遷移し、何れのシナリオも選択されなかった場合（動作無しが選択された場合）には確認状態から確認状態に遷移する。但し、動作無しが選択されつづけ、確認状態のまま一定時間が経過すると(または一定回数の入力S(u),R(t₄)を受け付けると)待ち受け状態に遷移する。 (Confirmation status)
The scenario selection unit 122 receives the talking start index S(u) and the response start index R(t₄) as inputs, selects one of the response scenario, the talking scenario, and no action based on the determination criteria for the confirmation state, and outputs a control signal z(t₈) for operating the device in accordance with the selected scenario. When the response scenario or the talking scenario is selected, the state transitions to the dialogue state. When no scenario is selected (when no action is selected), the state transitions from the confirmation state to the confirmation state. However, if no action continues to be selected and a certain time elapses in the confirmation state (or a certain number of inputs S(u), R(t₄) have been received), the state transitions to the standby state.

（対話状態）
シナリオ選択部１２２は、話しかけ開始指標S(u)及び応答開始指標R(t₄)を入力とし、対話状態における判定基準に基づき、応答シナリオ、動作無しの何れかを選択し、選択したシナリオに対応して動作させるための制御信号z(t₈)を出力する。この状態では対話状態から対話状態に遷移する。但し、動作無しが選択されつづけ、一定時間が経過すると(または一定回数の入力S(u),R(t₄)を受け付けると)待ち受け状態に遷移する。 (Dialogue state)
The scenario selection unit 122 receives the talking start index S(u) and the response start index R(t₄) as inputs, selects either the response scenario or no action based on the determination criteria for the dialogue state, and outputs a control signal z(t₈) for operating the device in accordance with the selected scenario. In this state, the state transitions from the dialogue state to the dialogue state. However, if no action continues to be selected and a certain time elapses (or a certain number of inputs S(u), R(t₄) have been received), the state transitions to the standby state.

このように、確認状態では、再度確認シナリオが実行されることがないように、シナリオの選択基準から確認シナリオをなくした選択基準を用い、対話状態では、確認シナリオ及び話しかけシナリオを削除した選択基準を用いる。 In this way, in the confirmation state, selection criteria from which the confirmation scenarios have been removed are used so that a confirmation scenario is not executed again, and in the dialogue state, selection criteria from which the confirmation scenarios and the talking scenario have been removed are used.
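The three states and their transitions can be sketched as a small state machine. This is a simplified illustration under stated assumptions: instead of separate per-state criteria as in FIG. 6, it masks a candidate scenario that the current state does not allow to "no_action", and the timeout is counted in inputs with a hypothetical limit `max_idle`. All names are illustrative.

```python
STANDBY, CONFIRMATION, DIALOGUE = "standby", "confirmation", "dialogue"

# Scenarios each state may select (besides "no_action").
ALLOWED = {
    STANDBY: {"response", "talking", "confirmation_1", "confirmation_2"},
    CONFIRMATION: {"response", "talking"},
    DIALOGUE: {"response"},
}


class ScenarioStateMachine:
    def __init__(self, max_idle=3):
        self.state = STANDBY      # the standby state is the initial state
        self.idle = 0             # consecutive "no_action" inputs
        self.max_idle = max_idle  # hypothetical timeout, counted in inputs

    def step(self, candidate):
        """Mask the candidate by the current state, then update the state."""
        scenario = candidate if candidate in ALLOWED[self.state] else "no_action"
        if scenario in ("response", "talking"):
            self.state, self.idle = DIALOGUE, 0
        elif scenario in ("confirmation_1", "confirmation_2"):
            self.state, self.idle = CONFIRMATION, 0
        elif self.state != STANDBY:
            # No action outside standby: count toward the timeout.
            self.idle += 1
            if self.idle >= self.max_idle:
                self.state, self.idle = STANDBY, 0
        return scenario
```

With this masking, a second confirmation request arriving while the machine is in the confirmation state yields "no_action", which is the behavior the per-state criteria are meant to guarantee.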

＜効果＞
このような構成とすることで、第一実施形態と同様の効果を得ることができる。さらに、確認シナリオを連続して実施して不自然な対応となってしまうことを防ぐことができる。 <Effect>
With such a configuration, the same effects as in the first embodiment can be obtained. In addition, it prevents the unnatural behavior that would result from executing confirmation scenarios repeatedly.

＜第三実施形態＞
第一実施形態及び第二実施形態と異なる部分を中心に説明する。 <Third embodiment>
A description will be given focusing on portions different from the first embodiment and the second embodiment.

特許文献１等の従来技術ではユーザ発話に対応する文字列だけを用いてどのような対応を行うかを判断している。そのため、例えば、テレビの音声など、対話装置と対話をするために発せられた音声でなかったとしても、あらかじめ用意した単語パターンと一致する場合は、対話をしてしまう。つまり、従来技術では、対話とは無関係の音声などに反応した誤動作が起こってしまう。 In the related art such as Patent Document 1, it is determined what kind of response is to be performed using only a character string corresponding to a user utterance. Therefore, for example, even if the voice is not a voice, such as a sound of a television, that is emitted to interact with the interactive device, if the word matches a word pattern prepared in advance, the dialogue is performed. That is, in the related art, a malfunction occurs in response to a voice or the like unrelated to the dialogue.

そこで、本実施形態では、音声だけではなく、様々なセンサからの情報に基づき対話音声であるかの確からしさ数値化し、その確からしさに基づいて、入力音声に対する対応を決定する。このような構成により、対話とは無関係の音声などに対して反応することを防ぐことができる。 Therefore, in the present embodiment, a numerical value of the probability of being a dialogue voice is made based on information from various sensors as well as the voice, and a response to the input voice is determined based on the probability. With such a configuration, it is possible to prevent the user from reacting to a voice or the like unrelated to the conversation.

上述の効果を得るために本実施形態では、開始指標計算部１１０における処理を限定する。 In the present embodiment, the processing in the start index calculation unit 110 is limited in order to obtain the above effects.

＜開始指標計算部１１０＞
図８は開始指標計算部１１０の機能ブロック図を、図９はその処理フローの例を示す。 <Start index calculation unit 110>
FIG. 8 is a functional block diagram of the start index calculation unit 110, and FIG. 9 shows an example of the processing flow.

開始指標計算部１１０は、方向一致度計算部１１１、発話距離指標計算部１１２、キーワード検出部１１３、キーワードデータベース１１４、発話頻度計算部１１５、顔の距離指標計算部１１６、応答開始指標計算部１１７及び話しかけ開始指標計算部１１８を含む。 The start index calculation unit 110 includes a direction matching degree calculation unit 111, an utterance distance index calculation unit 112, a keyword detection unit 113, a keyword database 114, an utterance frequency calculation unit 115, a face distance index calculation unit 116, and a response start index calculation unit 117. And a talking start index calculating unit 118.

＜方向一致度計算部１１１＞
方向一致度計算部１１１は、カメラから見た顔の方向を示す検出結果y_D(u)と音源方向の推定結果x_D(t₀)とを入力とし、音源方向の推定結果と映像による顔認識方向の一致度合いI₂(u)を計算し（Ｓ１１１）、出力する。一致度合いI₂(u)は、例えば0.0〜1.0の値をとり1.0に近いほど一致していることを表す指標である。例えば、カメラから見た顔の方向を示す検出結果y_D(u)と音源方向の推定結果x_D(t₀)との差分の絶対値|(x_D(t₀))-(y_D(u))|をとり、その値があらかじめ設定した第１の閾値T₁よりも大きければI₂(u)=0を出力し、あらかじめ設定した第２の閾値T₂よりも小さければI₂(u)=1を出力し、どちらでもなければ以下の式により、差分の絶対値|(x_D(t₀))-(y_D(u))|が第１の閾値T₁の時に0になり、第２の閾値T₂のときに1となる直線上の値を出力する。
I₂(u)={|(x_D(t₀))-(y_D(u))|-(T₁)}/{(T₂)-(T₁)}
この関係をグラフにしたものを図１０に示す。つまり、
I₂(u)=0 if |(x_D(t₀))-(y_D(u))|>T₁
I₂(u)=1 if |(x_D(t₀))-(y_D(u))|<T₂
I₂(u)={|(x_D(t₀))-(y_D(u))|-(T₁)}/{(T₂)-(T₁)} if T₂≦|(x_D(t₀))-(y_D(u))|≦T₁
となる。 <Direction coincidence calculator 111>
The direction matching degree calculation unit 111 receives the detection result y_D(u) indicating the direction of the face as seen from the camera and the estimation result x_D(t₀) of the sound source direction as inputs, calculates the degree of coincidence I₂(u) between the estimated sound source direction and the face direction recognized from the video (S111), and outputs it. The degree of coincidence I₂(u) is, for example, an index that takes a value from 0.0 to 1.0, where a value closer to 1.0 indicates a closer match. For example, the unit takes the absolute value of the difference |(x_D(t₀))-(y_D(u))| between the detection result y_D(u) and the estimation result x_D(t₀), outputs I₂(u)=0 if this value is larger than a preset first threshold T₁, outputs I₂(u)=1 if it is smaller than a preset second threshold T₂, and otherwise outputs, by the following equation, a value on the straight line that is 0 when the absolute difference equals the first threshold T₁ and 1 when it equals the second threshold T₂.
I₂(u)={|(x_D(t₀))-(y_D(u))|-(T₁)}/{(T₂)-(T₁)}
FIG. 10 shows a graph of this relationship. That is,
I₂(u)=0 if |(x_D(t₀))-(y_D(u))|>T₁
I₂(u)=1 if |(x_D(t₀))-(y_D(u))|<T₂
I₂(u)={|(x_D(t₀))-(y_D(u))|-(T₁)}/{(T₂)-(T₁)} if T₂≦|(x_D(t₀))-(y_D(u))|≦T₁
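The piecewise definition of I₂(u) can be written directly as a short function. The following is a minimal sketch; the function name, the use of degrees, and the default thresholds T₁=30 and T₂=10 are assumptions for illustration, and angle wrap-around (for example 359° versus 1°) is not handled.

```python
def direction_match(x_d, y_d, t1=30.0, t2=10.0):
    """Degree of coincidence I2(u) between sound and face directions.

    x_d, y_d: directions in degrees. t1 > t2 are placeholder thresholds.
    """
    diff = abs(x_d - y_d)
    if diff > t1:
        return 0.0
    if diff < t2:
        return 1.0
    # Straight line that is 0 at diff == t1 and 1 at diff == t2.
    return (diff - t1) / (t2 - t1)
```

With these placeholder thresholds, a 20° difference lies halfway between T₂ and T₁ and gives I₂(u)=0.5.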

＜発話距離指標計算部１１２＞
発話距離指標計算部１１２は、音声のレベルの推定結果x_L(t₀)を入力とし、マイクロホンアレイ６１と発話者と距離に応じて変化する発話距離指標I₃(t₀)を計算し（Ｓ１１２）、出力する。例えば、発話距離指標I₃(t₀)を、マイクロホンアレイ６１に含まれるマイクロホンと発話者との距離が近いほど1.0に近くなり、距離が遠いほど0.0に近くなる指標とする。音は音源から受音位置までの距離に反比例して受音される音の大きさが変化する。よって、マイクロホンで観測された音声のレベルからおおよその距離を推定することができる。例えば1mの位置で標準的な音量で発話したときのマイクロホンの出力のレベルをAとした場合、推定対象音声のマイクロホンの出力のレベルがBであったとすれば、推定対象音声のマイクロホンから音源までの距離は、その比A/B(m)で推定することができる。推定された距離があらかじめ設定した第３の閾値T₃よりも大きければI₃(t₀)=0を出力し、あらかじめ設定した第４の閾値T₄よりも小さければI₃(t₀)=1を出力し、どちらでもなければ以下の式により第３の閾値T₃の時に0になりと第４の閾値T₄のときに1となる直線上の値を出力する。
I₃(t₀)={A/B-(T₃)}/{(T₄)-(T₃)} <Utterance distance index calculation unit 112>
The utterance distance index calculation unit 112 receives the estimation result x_L(t₀) of the voice level as input, calculates an utterance distance index I₃(t₀) that changes according to the distance between the microphone array 61 and the speaker (S112), and outputs it. For example, the utterance distance index I₃(t₀) is an index that approaches 1.0 as the distance between a microphone included in the microphone array 61 and the speaker becomes shorter, and approaches 0.0 as the distance becomes longer. The level of a received sound varies in inverse proportion to the distance from the sound source to the receiving position. Therefore, an approximate distance can be estimated from the level of the voice observed by the microphone. For example, if the microphone output level for an utterance at a standard volume from a position 1 m away is A, and the microphone output level for the voice under estimation is B, the distance from the microphone to the sound source can be estimated as the ratio A/B (m). If the estimated distance is larger than a preset third threshold T₃, I₃(t₀)=0 is output; if it is smaller than a preset fourth threshold T₄, I₃(t₀)=1 is output; otherwise, a value on the straight line that is 0 at the third threshold T₃ and 1 at the fourth threshold T₄ is output by the following equation.
I₃(t₀)={A/B-(T₃)}/{(T₄)-(T₃)}
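As a sketch of this calculation, the function below first estimates the distance as A/B and then maps it onto the 0 to 1 ramp. The function name, the default reference level, the thresholds T₃=3 m and T₄=1 m, and the handling of a non-positive level are all assumptions made for illustration.

```python
def utterance_distance_index(level_b, level_a=1.0, t3=3.0, t4=1.0):
    """I3(t0) from the observed level B and the 1 m reference level A.

    Levels are linear amplitudes; t3 > t4 are placeholder thresholds in metres.
    """
    if level_b <= 0:
        return 0.0  # assumption: no measurable level is treated as far away
    distance = level_a / level_b  # estimated distance A/B in metres
    if distance > t3:
        return 0.0
    if distance < t4:
        return 1.0
    # Straight line that is 0 at t3 and 1 at t4.
    return (distance - t3) / (t4 - t3)
```

The face distance index I₆(u) has the same shape, with the ratio F/G in place of A/B and the thresholds T₅ and T₆ in place of T₃ and T₄.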

＜キーワード検出部１１３及びキーワードデータベース１１４＞
キーワード検出部１１３は、音声認識結果x_R(t₄)を入力とし、音声認識結果x_R(t₄)に含まれる単語列と、キーワードデータベース１１４に格納されているキーワードとのマッチングを行い、音声認識結果x_R(t₄)に含まれる単語列の何れかがキーワードデータベース１１４にある場合には検出結果I₄(t₄)=1を出力し、無い場合には検出結果I₄(t₄)=0を出力する（Ｓ１１３）。キーワードデータベース１１４に格納されているキーワードは、話しかけるきっかけに良く使われるものである。または、キーワードデータベース１１４に格納されているキーワード毎に0.0〜1.0の数値をあらかじめ指定しておき、そのキーワードが検出された際に対応する数値を検出結果I₄(t₄)として出力する構成としてもよい。数値は、話しかけるきっかけに良く使われるキーワードほど１に近い値をあらかじめ設定しておく。 <Keyword detection unit 113 and keyword database 114>
The keyword detection unit 113 receives the speech recognition result x_R(t₄) as input, matches the word string contained in x_R(t₄) against the keywords stored in the keyword database 114, and outputs the detection result I₄(t₄)=1 if any word in the word string is in the keyword database 114, or I₄(t₄)=0 if none is (S113). The keywords stored in the keyword database 114 are words that are often used to start talking to someone. Alternatively, a numerical value from 0.0 to 1.0 may be assigned in advance to each keyword stored in the keyword database 114, and the value corresponding to a detected keyword may be output as the detection result I₄(t₄). A value closer to 1 is set in advance for keywords that are used more often to start talking.
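The weighted variant of the keyword detection can be sketched with a dictionary lookup. The keywords and scores below are made-up examples, and taking the maximum score when several keywords match is an assumption, since the text does not say how multiple matches are combined.

```python
# Hypothetical keyword database: keyword -> score in [0.0, 1.0].
KEYWORD_DB = {"hello": 1.0, "hey": 0.8, "well": 0.3}


def keyword_index(words, keyword_db=KEYWORD_DB):
    """Return I4(t4): the highest score among keywords found in `words`.

    words: the recognized word string as a list of strings.
    Returns 0.0 when no keyword is present.
    """
    scores = [keyword_db[w] for w in words if w in keyword_db]
    return max(scores, default=0.0)
```

Setting every score in the database to 1.0 reduces this to the binary detection result described first.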

＜発話頻度計算部１１５＞
発話頻度計算部１１５は、発音の検出結果I₁(t₀)と音源方向の推定結果x_D(t₀)とを入力とし、同一の方向からの発話が過去T秒の間にどのくらいあったかを計算する（Ｓ１１５）。例えば、過去T秒の間に音源方向の推定結果x_D(t₀)がθであり、かつ、発音があった時間(I₁(t₀)=1)の合計をA(θ)秒とすれば、θ方向の発音頻度を、それらの比D(θ)=A(θ)/Tとして求めることができる。発話頻度計算部１１５は、この頻度D(θ)を現時点t₀の推定結果(音源方向)x_D(t₀)について求める。例えば音源がテレビや音楽受聴用のスピーカであった場合、これらは長時間の間ほとんど無音になることなく、同じ方向から音が到来し続けることとなる。このような音源がθ方向にあった場合、発音頻度D(θ)は1に近い大きな値をとることになる。発話頻度計算部１１５は、発音頻度D(θ)があらかじめ設定した第７の閾値T₇よりも大きければ発話頻度指標I₅=0を出力し、あらかじめ設定した第８の閾値T₈よりも小さければ発話頻度指標I₅=1を出力し、どちらでもなければ以下の式により第７の閾値T₇の時にI₅=0になりと第８の閾値T₈のときにI₅=1となる直線上の値を出力する。
I₅(t₀)={D(θ)-(T₇)}/{(T₈)-(T₇)} <Speech frequency calculation unit 115>
The speech frequency calculation unit 115 receives the sound detection result I₁(t₀) and the estimation result x_D(t₀) of the sound source direction as inputs, and calculates how much speech has come from the same direction during the past T seconds (S115). For example, if A(θ) seconds is the total time during the past T seconds in which the estimated sound source direction x_D(t₀) was θ and sound was present (I₁(t₀)=1), the sounding frequency in the θ direction can be obtained as the ratio D(θ)=A(θ)/T. The speech frequency calculation unit 115 obtains this frequency D(θ) for the estimation result (sound source direction) x_D(t₀) at the current time t₀. For example, when the sound source is a television or a loudspeaker for listening to music, sound keeps arriving from the same direction for a long time with almost no silence. When such a sound source is in the θ direction, the sounding frequency D(θ) takes a large value close to 1. The speech frequency calculation unit 115 outputs the speech frequency index I₅=0 if the sounding frequency D(θ) is larger than a preset seventh threshold T₇, outputs I₅=1 if it is smaller than a preset eighth threshold T₈, and otherwise outputs, by the following equation, a value on the straight line that is I₅=0 at the seventh threshold T₇ and I₅=1 at the eighth threshold T₈.
I₅(t₀)={D(θ)-(T₇)}/{(T₈)-(T₇)}
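A minimal sketch of the frequency calculation over a sliding window follows. It assumes the past T seconds are available as a list of fixed-length frames, each holding the detection flag I₁ and the estimated direction; the angular tolerance `tol` for treating two directions as "the same", the thresholds, and all names are assumptions.

```python
def speech_frequency_index(frames, theta, frame_sec, window_sec,
                           tol=10.0, t7=0.8, t8=0.3):
    """I5(t0) for direction `theta` from recent (I1, x_D) frames.

    frames: list of (i1, direction) pairs covering the past `window_sec`.
    t7 > t8 are placeholder thresholds on the frequency D(theta).
    """
    # A(theta): total sounding time whose direction is close to theta.
    active = sum(frame_sec for i1, d in frames
                 if i1 == 1 and abs(d - theta) <= tol)
    freq = active / window_sec  # D(theta) = A(theta) / T
    if freq > t7:
        return 0.0
    if freq < t8:
        return 1.0
    # Straight line that is 0 at t7 and 1 at t8.
    return (freq - t7) / (t8 - t7)
```

A source that sounds for nine of the last ten seconds from direction θ, such as a television, gives D(θ)=0.9 and therefore I₅=0 with these placeholder thresholds.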

＜顔の距離指標計算部１１６＞
顔の距離指標計算部１１６は、顔の大きさを示す検出結果y_S(u)を入力とし、この値を用いて、利用者とカメラ７１との距離を示す距離指標I₆(u)を計算し（Ｓ１１６）、出力する。例えば、距離指標I₆(u)は、利用者とカメラ７１との距離が近いほど1.0に近くなり、距離が遠いほど0.0に近くなる指標である。 <Face distance index calculation unit 116>
The face distance index calculation unit 116 receives the detection result y _S (u) indicating the face size as an input, and uses this value to calculate a distance index I ₆ (u) indicating the distance between the user and the camera 71. Calculate (S116) and output. For example, the distance index I ₆ (u) is an index that is closer to 1.0 as the distance between the user and the camera 71 is shorter, and is closer to 0.0 as the distance is longer.

顔が近いほど大きく映像に映るので、検出された顔の大きさから距離を推定することができる。例えば1mの位置で標準的な大きさの顔が顔認識で認識された際の大きさをFとした場合、検出結果y_S(u)の大きさがGであったとすれば、顔までの距離は、その比F/G(m)で推定することができる。推定された距離があらかじめ設定した第５の閾値T₅よりも大きければI₆(u)=0を出力し、あらかじめ設定した第６の閾値よりも小さければI₆(u)=1を出力し、どちらでもなければ以下の式により第５の閾値の時に0になりと第６の閾値のときに1となる直線上の値を出力する。
I₆(u)={F/G-(T₅)}/{(T₆)-(T₅)} The closer a face is, the larger it appears in the video, so the distance can be estimated from the size of the detected face. For example, if F is the size at which a standard-sized face at a distance of 1 m is recognized by face recognition, and the size given by the detection result y_S(u) is G, the distance to the face can be estimated as the ratio F/G (m). If the estimated distance is larger than a preset fifth threshold T₅, I₆(u)=0 is output; if it is smaller than a preset sixth threshold T₆, I₆(u)=1 is output; otherwise, a value on the straight line that is 0 at the fifth threshold and 1 at the sixth threshold is output by the following equation.
I₆(u)={F/G-(T₅)}/{(T₆)-(T₅)}

＜応答開始指標計算部１１７＞
応答開始指標計算部１１７は、発音の検出結果I₁(t₀),一致度合いI₂(u),発話距離指標I₃(t₀),検出結果I₄(t₄),発話頻度指標I₅(t₀),距離指標I₆(u),人感センサ８１の出力信号I₇(t₇)を入力とし、これらの情報の全てを使って、応答するか否かを判定するための指標である応答開始指標R(t₄)を計算し（Ｓ１１７）、出力する。 <Response start index calculation unit 117>
The response start index calculation unit 117 receives the sound detection result I₁(t₀), the degree of coincidence I₂(u), the utterance distance index I₃(t₀), the detection result I₄(t₄), the speech frequency index I₅(t₀), the distance index I₆(u), and the output signal I₇(t₇) of the human sensor 81 as inputs, uses all of this information to calculate the response start index R(t₄), which is an index for determining whether to respond (S117), and outputs it.

前述の通り、発音の検出結果I₁(t₀)は、発音有の場合1となり、発音なしの場合0となる。ただし、t₀はマイクロホンアレイ６１のサンプル番号またはサンプル番号に対応する時刻を表す。一致度合いI₂(u)は、0〜1の値をとり、音による音源方向の推定結果と映像による顔認識結果が一致するほど1に近い値となる。ただし、uはカメラ７１のイメージセンサのサンプル番号またはサンプル番号に対応する時刻を表す。発話距離指標I₃(t₀)は、0〜1の値をとり、利用者とマイクロホンアレイ６１との距離が近いほど１に近い値となる。検出結果I₄(t₄)は、話しかけるきっかけに良く使われるキーワードを検出した場合1となり、検出できなかった場合0となる。ただし、t₄は音声認識結果の番号を表す。発話頻度指標I₅(t₀)は、0〜1の値をとり、過去の同一方向の発話頻度が低いほど１に近い値となる。距離指標I₆(u)は、0〜1の値をとり、利用者とカメラ７１との距離が近いほど１に近い値となる。人感センサ８１の出力信号I₇(t₇)は、人検出有の場合1となり、人検出なしの場合0となる。ただし、t₇は、人感センサ８１のサンプル番号またはサンプル番号に対応する時刻を表す。 As described above, the sound detection result I₁(t₀) is 1 when sound is present and 0 when it is not. Here, t₀ represents a sample number of the microphone array 61 or the time corresponding to that sample number. The degree of coincidence I₂(u) takes a value from 0 to 1 and is closer to 1 the better the sound source direction estimated from the sound matches the face recognition result from the video. Here, u represents a sample number of the image sensor of the camera 71 or the time corresponding to that sample number. The utterance distance index I₃(t₀) takes a value from 0 to 1 and is closer to 1 the shorter the distance between the user and the microphone array 61. The detection result I₄(t₄) is 1 when a keyword often used to start talking is detected and 0 when none is detected. Here, t₄ represents the number of the speech recognition result. The speech frequency index I₅(t₀) takes a value from 0 to 1 and is closer to 1 the lower the past speech frequency from the same direction. The distance index I₆(u) takes a value from 0 to 1 and is closer to 1 the shorter the distance between the user and the camera 71. The output signal I₇(t₇) of the human sensor 81 is 1 when a person is detected and 0 when no person is detected. Here, t₇ represents a sample number of the human sensor 81 or the time corresponding to that sample number.

応答開始指標計算部１１７の入出力間の関係式を関数Fとすれば、次式で応答開始指標R(t₄)を計算できる。
R(t₄)=F{I₁(t₀),I₂(u),I₃(t₀),I₄(t₄),I₅(t₀),I₆(u),I₇(t₇)} If the relational expression between the input and output of the response start index calculation unit 117 is a function F, the response start index R (t ₄ ) can be calculated by the following equation.
R(t₄)=F{I₁(t₀),I₂(u),I₃(t₀),I₄(t₄),I₅(t₀),I₆(u),I₇(t₇)}

関数Fは、例えば一次方程式とすることができ、各入力I₁(t₀),I₂(u),I₃(t₀),I₄(t₄),I₅(t₀),I₆(u),I₇(t₇)にあらかじめ設定した重みW_nを乗じて加算した総和にあらかじめ設定した定数Cを加算した次式が用いられる。 The function F can be, for example, a linear (first-order) equation. In that case, the following equation is used, in which a preset constant C is added to the sum of the inputs I₁(t₀),I₂(u),I₃(t₀),I₄(t₄),I₅(t₀),I₆(u),I₇(t₇) each multiplied by a preset weight W_n.
R(t₄)=Σ_n W_n I_n(t₄)+C (n=1,…,7)

ただし、I₁(t₄),I₂(t₄),I₃(t₄),I₅(t₄),I₆(u),I₇(t₄)は、I₄(t₄)の取得時からみて直近のI₁(t₀),I₂(u),I₃(t₀),I₅(t₀),I₆(u),I₇(t₇)である。音声認識結果を出力するタイミングと他の出力値が出力される周期とは、通常、一致しない。応答開始指標R(t₄)は、複数の入力値の中で、音声認識結果x_R(t₄)から得られる検出結果I₄(t₄)の影響を最も受けると考えられる。そこで、応答開始指標R(t₄)は音声認識結果x_R(t₄)の入力を契機に、その時刻t₄に最も近い他の指標をバッファから読みだして処理を実行する。 Here, I₁(t₄),I₂(t₄),I₃(t₄),I₅(t₄),I₆(u),I₇(t₄) are the most recent I₁(t₀),I₂(u),I₃(t₀),I₅(t₀),I₆(u),I₇(t₇) as of the time I₄(t₄) is acquired. The timing at which a speech recognition result is output usually does not coincide with the cycle at which the other values are output. Among the input values, the response start index R(t₄) is considered to be most affected by the detection result I₄(t₄) obtained from the speech recognition result x_R(t₄). Therefore, the calculation of the response start index R(t₄) is triggered by the input of the speech recognition result x_R(t₄): the other indices closest to that time t₄ are read from a buffer and the calculation is executed.

関数Fは、二次方程式でもよい。その場合、各入力I_n(t₄)にあらかじめ設定した重みW_nを乗じて加算した総和と、入力の２つを乗じた値I_n(t₄)I_m(t₄)にあらかじめ設定した重みV_n,mを乗じて加算した総和と、あらかじめ設定した定数Cとを加算した次式が用いられる。 The function F may also be a quadratic (second-order) equation. In that case, the following equation is used, which adds together the sum of the inputs I_n(t₄) each multiplied by a preset weight W_n, the sum of the products of two inputs I_n(t₄)I_m(t₄) each multiplied by a preset weight V_n,m, and a preset constant C.
R(t₄)=Σ_n W_n I_n(t₄)+Σ_n,m V_n,m I_n(t₄)I_m(t₄)+C

関数Fは、一次方程式や二次方程式で重み付の加算値を計算した後で、0〜1でクリッピングする関数をかけることで０〜1の間の出力値となるように制限しても良い（次式）。クリッピングをする関数はシグモイド関数G(x)などが用いられる。 The function F may also compute the weighted sum with the linear or quadratic equation and then apply a function that clips the result to the range 0 to 1, so that the output value is limited to between 0 and 1 (following equation). A sigmoid function G(x) or the like is used as the clipping function.

ただし、a、bは予め設定される定数である。 Here, a and b are constants set in advance.
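The three forms of the function F described above can be sketched as follows. The weights, the constant C, and the pair weights are supplied by the caller. The sigmoid is written as G(x)=1/(1+exp(-a(x-b))), which is one common parameterization and is an assumption here, as is summing the quadratic term over every ordered pair (n, m).

```python
import math


def linear_index(inputs, weights, c=0.0):
    """First-order form: sum of W_n * I_n plus the constant C."""
    return sum(w * i for w, i in zip(weights, inputs)) + c


def quadratic_index(inputs, weights, pair_weights, c=0.0):
    """Second-order form: adds V[n][m] * I_n * I_m for every pair (n, m)."""
    total = linear_index(inputs, weights, c)
    for n, i_n in enumerate(inputs):
        for m, i_m in enumerate(inputs):
            total += pair_weights[n][m] * i_n * i_m
    return total


def squash(x, a=1.0, b=0.0):
    """Sigmoid G(x) limiting the output to (0, 1); a and b are preset."""
    return 1.0 / (1.0 + math.exp(-a * (x - b)))
```

A clipped index is then obtained as `squash(linear_index(inputs, weights, c), a, b)`, where `inputs` holds I₁ to I₇ read at the trigger time.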

＜話しかけ開始指標計算部１１８＞
話しかけ開始指標計算部１１８は、上述のI₁(t₀),I₂(u),I₃(t₀),I₄(t₄),I₅(t₀),I₆(u),I₇(t₇)を入力とし、これらの情報の全てを使って、話しかけをするか否かを判定するための指標である話しかけ開始指標S(u)を計算し（Ｓ１１８）、出力する。話しかけ開始指標計算部１１８は、応答開始指標計算部１１７と同様の方法で話しかけ開始指標S(u)を計算することができる。ただし、あらかじめ設定した重みW_nやV_n,mの値は応答開始指標計算部１１７とは異なる数値で設定される。また、話しかけ開始指標S(u)は、外部からの話しかけがない場合に大きな値をとるので、発音の検出結果I₁(t₀)、一致度合いI₂(u)、発話距離指標I₃(t₀)、キーワード検出部１１３の出力値I₄(t₄)を、それぞれ、1から減算した値を入力するように置きなおしてもよい。つまり、I₁(t₀)を1-I₁(t₀)に、I₂(u)を1-I₂(u)に、I₃(t₀)を1-I₃(t₀)に、I₄(t₄)を1-I₄(t₄)に置き換えてもよい。 <Talking start index calculation unit 118>
The talking start index calculation unit 118 receives the above I₁(t₀),I₂(u),I₃(t₀),I₄(t₄),I₅(t₀),I₆(u),I₇(t₇) as inputs, uses all of this information to calculate the talking start index S(u), which is an index for determining whether to start talking (S118), and outputs it. The talking start index calculation unit 118 can calculate S(u) by the same method as the response start index calculation unit 117. However, the preset weights W_n and V_n,m are set to values different from those of the response start index calculation unit 117. In addition, because the talking start index S(u) takes a large value when no one is talking to the device, the sound detection result I₁(t₀), the degree of coincidence I₂(u), the utterance distance index I₃(t₀), and the output value I₄(t₄) of the keyword detection unit 113 may each be replaced by the value obtained by subtracting it from 1. That is, I₁(t₀) may be replaced with 1-I₁(t₀), I₂(u) with 1-I₂(u), I₃(t₀) with 1-I₃(t₀), and I₄(t₄) with 1-I₄(t₄).

なお、話しかけ開始指標S(u)は、複数の入力値の中で、顔の方向を示す検出結果y_D(u)及び顔の大きさを示す検出結果y_S(u)の影響を最も受けると考えられる。そこで、話しかけ開始指標S(u)は顔の方向を示す検出結果y_D(u)及び顔の大きさを示す検出結果y_S(u)の入力を契機に、その時刻uに最も近い他の指標をバッファから読みだして処理を実行する。 Among the input values, the talking start index S(u) is considered to be most affected by the detection result y_D(u) indicating the direction of the face and the detection result y_S(u) indicating the size of the face. Therefore, the calculation of the talking start index S(u) is triggered by the input of the detection results y_D(u) and y_S(u): the other indices closest to that time u are read from a buffer and the calculation is executed.
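Because the indices arrive at different rates, the trigger-and-read behavior can be sketched with a buffer that keeps only the latest value of each index. This is an illustrative simplification: the class and method names are hypothetical, and "closest to the trigger time" is approximated by "most recently stored".

```python
class LatestValueBuffer:
    """Keeps the most recent value of each index I_n."""

    def __init__(self):
        self._latest = {}

    def update(self, name, value):
        """Store a newly computed index value, overwriting the old one."""
        self._latest[name] = value

    def snapshot(self, names, default=0.0):
        """Read the latest values in the given order at a trigger event."""
        return [self._latest.get(n, default) for n in names]
```

When a new face detection result arrives, the talking start index would be computed from `snapshot(["I1", "I2", "I3", "I4", "I5", "I6", "I7"])`; the same buffer serves the response start index when a speech recognition result arrives.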

＜効果＞
このような構成により、様々なセンサの情報から、外部からの音に対して応答するか否かの指標である応答開始指標Rと、対話装置側から会話を開始すべきか否かの指標である話しかけ開始指標Sを求め、これに基づいて対話の開始の制御をすることができ、対話とは無関係の音声などに対して反応することを防ぐことができる。 <Effect>
With such a configuration, based on information from various sensors, a response start index R that is an index of whether or not to respond to an external sound and an index of whether or not to start a conversation from the interactive device side. The talk start index S is obtained, and the start of the dialogue can be controlled based on this, and it is possible to prevent the user from reacting to a voice unrelated to the dialogue.

＜変形例＞
本実施形態の開始指標計算部１１０は、話しかけ開始指標S(u)と応答開始指標R(t₄)とを求め、出力しているが、何れか一方の指標のみを求める構成としてもよい。その場合、他方の指標は、本実施形態とは異なる方法を用いて求めればよい。または、シナリオ選択部１２２は、話しかけ開始指標S(u)または応答開始指標R(t₄)を入力とし、話しかけ開始指標S(u)とJ個の閾値Th_s,1,Th_s,2,…,Th_s,Jとの大小関係、または、応答開始指標R(t₄)とK個の閾値Th_r,1,Th_r,2,…,Th_r,Kとの大小関係とに基づき、(A)話しかけシナリオ若しくは確認シナリオ（例えば確認シナリオ２（動作、独り言））、または、(B)応答シナリオ若しくは確認シナリオ（例えば確認シナリオ１（問いかけ））を選択し、選択したシナリオに対応して動作させるための制御信号z(t₈)を出力する。 <Modification>
The start index calculation unit 110 of the present embodiment obtains and outputs both the talking start index S(u) and the response start index R(t₄), but it may be configured to obtain only one of them. In that case, the other index may be obtained by a method different from that of the present embodiment. Alternatively, the scenario selection unit 122 receives the talking start index S(u) or the response start index R(t₄) as input and, based on the magnitude relationship between S(u) and the J thresholds Th_s,1,Th_s,2,…,Th_s,J, or the magnitude relationship between R(t₄) and the K thresholds Th_r,1,Th_r,2,…,Th_r,K, selects (A) the talking scenario or a confirmation scenario (for example, confirmation scenario 2 (action, talking to itself)), or (B) the response scenario or a confirmation scenario (for example, confirmation scenario 1 (question)), and outputs a control signal z(t₈) for operating the device in accordance with the selected scenario.

本実施形態では、マイクロホンアレイ６１の出力信号x(t₀)に基づくデータと、カメラ７１のイメージセンサの出力信号y(u)に基づくデータと、人感センサ８１の出力信号I₇(t₇)とを入力としているが、必要に応じて、マイクロホンアレイ６１の出力信号x(t₀)とカメラ７１のイメージセンサの出力信号y(u)と人感センサ８１の出力信号I₇(t₇)との3つの出力信号のうちの2つの出力信号を用いればよい。そのような構成とすることで、音声だけではなく、様々なセンサからの情報に基づき対話音声であるかの確からしさ数値化することができる。 In the present embodiment, data based on the output signal x(t₀) of the microphone array 61, data based on the output signal y(u) of the image sensor of the camera 71, and the output signal I₇(t₇) of the human sensor 81 are used as inputs. If necessary, however, only two of these three output signals, x(t₀), y(u), and I₇(t₇), may be used. With such a configuration, the likelihood that a sound is dialogue speech can still be quantified based on information from multiple sensors rather than on the voice alone.

本実施形態では、I₁(t₀),I₂(u),I₃(t₀),I₄(t₄),I₅(t₀),I₆(u),I₇(t₇)を全て使って、話しかけ開始指標S(u)と応答開始指標R(t₄)とを求めているが、必ずしも全て使う必要はなく、話しかけ開始指標S(u)と応答開始指標R(t₄)を求める際に影響が大きいものを適宜選択してもよい。例えば、話しかけ開始指標S(u)は、顔の方向を示す検出結果y_D(u)及び顔の大きさを示す検出結果y_S(u)の影響を大きく受けると考えられるため、y_D(u)またはy_S(u)を使って求めることが望ましい。よって、話しかけ開始指標計算部１１８は、マイクロホンアレイ６１の出力信号x(t₀)及び人感センサ８１の出力信号I₇(t₇)のうちの少なくとも１つの出力信号とカメラ７１のイメージセンサの出力信号y(u)とに基づき、話しかけ開始指標S(u)を計算する。要は、y_D(u)またはy_S(u)に基づき得られるI₂(u)またはI₆(u)と、それ以外のI₁(t₀),I₃(t₀),I₄(t₄),I₅(t₀),I₇(t₇)の中から1つ以上を用いて話しかけ開始指標S(u)を計算すればよい。一方、応答開始指標R(t₄)は、音声認識結果x_R(t₄)の影響を大きく受けると考えられるため、x_R(t₄)を使って求めることが望ましい。よって、応答開始指標計算部１１７は、カメラ７１のイメージセンサの出力信号y(u)及び人感センサ８１の出力信号I₇(t₇)のうちの少なくとも１つの出力信号とマイクロホンアレイ６１の出力信号x(t₀)とに基づき、応答開始指標R(t₄)を計算する。要は、x_R(t₄)に基づき得られるI₄(t₄)と、カメラ７１のイメージセンサの出力信号y(u)及び人感センサ８１の出力信号I₇(t₇)のうちの少なくとも１つの出力信号に基づくI₂(u),I₆(u),I₇(t₇)の中から1つ以上を用いて応答開始指標R(t₄)を計算すればよい。この場合にも、必要な重みW_n,V_n,m、定数Cを予め設定すればよい。 In the present embodiment, I ₁ (t ₀ ), I ₂ (u), I ₃ (t ₀ ), I ₄ (t ₄ ), I ₅ (t ₀ ), I ₆ (u), I ₇ (t ₇ ) Are used to obtain the talk start index S (u) and the response start index R (t ₄ ), but it is not necessary to use all of them, and the talk start index S (u) and the response start index R (t ₄ ) The one having a large influence when obtaining the above may be appropriately selected. For example, since the talking start index S (u) is considered to be greatly influenced by the detection result y _D (u) indicating the direction of the face and the detection result y _S (u) indicating the size of the face, y _D ( It is desirable to use u) or y _S (u). Therefore, the talking start index calculating unit 118 calculates at least one of the output signal x (t ₀ ) of the microphone array 61 and the output signal I ₇ (t ₇ ) of the human sensor 81 and the output signal of the image sensor of the camera 71. Based on the output signal y (u), a talking start index S (u) is calculated. 
In short, the talking start index S(u) may be calculated using I₂(u) or I₆(u), obtained from y_D(u) or y_S(u), together with one or more of the remaining I₁(t₀),I₃(t₀),I₄(t₄),I₅(t₀),I₇(t₇). On the other hand, the response start index R(t₄) is considered to be strongly affected by the speech recognition result x_R(t₄), so it is desirable to obtain it using x_R(t₄). Accordingly, the response start index calculation unit 117 calculates the response start index R(t₄) based on the output signal x(t₀) of the microphone array 61 and at least one of the output signal y(u) of the image sensor of the camera 71 and the output signal I₇(t₇) of the human sensor 81. In short, R(t₄) may be calculated using I₄(t₄), obtained from x_R(t₄), together with one or more of I₂(u),I₆(u),I₇(t₇), which are based on at least one of the output signal y(u) of the image sensor of the camera 71 and the output signal I₇(t₇) of the human sensor 81. In this case as well, the necessary weights W_n, V_n,m and the constant C may be set in advance.

In short, by configuring the input signals so that information from at least two of the three sensors (the microphone array 61, the image sensor of the camera 71, and the human sensor 81) is included, the likelihood that a sound is dialogue speech can be quantified from a variety of sensor information, which improves performance.
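As a minimal sketch of the input selection described above, the following Python snippet checks that a chosen subset of inputs satisfies the stated conditions and then combines the selected inputs into a single value. The combining formula (the constant C, plus linear terms weighted by W_n, plus pairwise product terms weighted by V_{n,m}) is an assumption made here for illustration only, as are the function names, the dictionary-based representation, and all numeric values.

```python
# Inputs derived from the image sensor: I2 from y_D(u), I6 from y_S(u).
IMAGE_INPUTS = {"I2", "I6"}
# Input derived from the speech recognition result x_R(t4).
SPEECH_RECOGNITION_INPUT = "I4"


def valid_for_talking_index(names):
    """S(u): I2 or I6, plus at least one of I1, I3, I4, I5, I7."""
    names = set(names)
    return bool(names & IMAGE_INPUTS) and bool(names - IMAGE_INPUTS)


def valid_for_response_index(names):
    """R(t4): I4, plus at least one of I2, I6, I7."""
    names = set(names)
    return SPEECH_RECOGNITION_INPUT in names and bool(names & {"I2", "I6", "I7"})


def combine(inputs, w, v, c=0.0):
    """Assumed form: C + sum_n W_n*I_n + sum_(n,m) V_nm*I_n*I_m.

    Only the selected inputs are passed in; weights that refer to an
    input that was not selected are ignored.
    """
    total = c
    for name, value in inputs.items():
        total += w.get(name, 0.0) * value
    for (n, m), weight in v.items():
        if n in inputs and m in inputs:
            total += weight * inputs[n] * inputs[m]
    return total


# Hypothetical example: a talking start index computed from the face
# direction input (I2) and the human sensor input (I7) only.
selected = {"I2": 0.8, "I7": 1.0}
assert valid_for_talking_index(selected)
s = combine(selected, w={"I2": 0.5, "I7": 0.25}, v={("I2", "I7"): 0.5}, c=0.1)
print(round(s, 2))
```

Passing only the selected inputs keeps the unused weights out of the calculation, which mirrors the statement that only the necessary weights need to be set.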

<Fourth embodiment>
The following description focuses on the differences from the third embodiment.

In the dialogue control device 100 of the third embodiment, the response start index calculation unit 117 calculates the response start index R(t₄) using a response start index model (S117), and the talking start index calculation unit 118 calculates the talking start index S(u) using a talking start index model (S118).

This embodiment adds a response start index model learning unit 211, which learns the response start index model, and a talking start index model learning unit 212, which learns the talking start index model (indicated by broken lines in FIG. 8).

The response start index model learning unit 211 learns in advance, from training data, the relationship between the input signals I₁(t₀), I₂(u), I₃(t₀), I₄(t₄), I₅(t₀), I₆(u), and I₇(t₇) of the response start index calculation model and the response start index R(t₄). The training data is, for example, real input data I₁(t₀), I₂(u), I₃(t₀), I₄(t₄), I₅(t₀), I₆(u), and I₇(t₇) acquired in a real environment, to which the correct value of the response start index R(t₄) has been assigned manually. From such data, a machine learning method is used to train the model so that its input-output relationship approximates that of the data. For example, a model composed of a neural network may be trained by backpropagation.

In the same way, the talking start index model learning unit 212 learns in advance, from training data, the relationship between the input signals I₁(t₀), I₂(u), I₃(t₀), I₄(t₄), I₅(t₀), I₆(u), and I₇(t₇) and the talking start index S(u).

The response start index calculation unit 117 uses the response start index model learned by the response start index model learning unit 211 to calculate the response start index R(t₄) from the input signals I₁(t₀), I₂(u), I₃(t₀), I₄(t₄), I₅(t₀), I₆(u), and I₇(t₇).

The talking start index calculation unit 118 uses the talking start index model learned by the talking start index model learning unit 212 to calculate the talking start index S(u) from the input signals I₁(t₀), I₂(u), I₃(t₀), I₄(t₄), I₅(t₀), I₆(u), and I₇(t₇).
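The learning and calculation steps can be illustrated with a small, self-contained sketch. The network size, the tanh and sigmoid activations, the squared-error loss, the learning rate, the class name `TinyIndexModel`, and the four hand-labelled samples are all assumptions chosen for illustration; the description only states that a neural network trained by backpropagation is one possible method.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyIndexModel:
    """One-hidden-layer network mapping I_1..I_7 to an index in (0, 1)."""

    def __init__(self, n_in=7, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update for the loss 0.5 * (y - target)**2."""
        h, y = self.forward(x)
        dy = (y - target) * y * (1.0 - y)  # gradient at the output pre-activation
        for j, hj in enumerate(h):
            # Propagate the error to hidden unit j before w2[j] is changed.
            dh = dy * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * dy * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
            self.b1[j] -= lr * dh
        self.b2 -= lr * dy


def mean_squared_error(model, data):
    return sum((model.predict(x) - t) ** 2 for x, t in data) / len(data)


# Hypothetical training data: (I_1..I_7, manually assigned correct index).
data = [
    ([1, 0.9, 0.8, 0.9, 0.7, 0.8, 1], 1.0),
    ([1, 0.1, 0.2, 0.1, 0.3, 0.2, 0], 0.0),
    ([0, 0.8, 0.1, 0.0, 0.1, 0.9, 1], 0.0),
    ([1, 0.7, 0.9, 0.8, 0.6, 0.7, 1], 1.0),
]

model = TinyIndexModel()
before = mean_squared_error(model, data)
for _ in range(500):
    for x, t in data:
        model.train_step(x, t)
after = mean_squared_error(model, data)
print(f"mean squared error: {before:.3f} -> {after:.3f}")
```

In this sketch, training corresponds to the role of the learning units 211 and 212, and `predict` corresponds to the role of the calculation units 117 and 118 once a model has been learned. A separate model would be trained for each index.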

<Effect>
With this configuration, the calculation formulas and the weights W_n and V_{n,m}, which were set manually in the third embodiment, can be set automatically and optimally from real data, so that more accurate indices can be output.

<Fifth embodiment>
The following description focuses on the differences from the third and fourth embodiments.

In this embodiment, a time correction unit 310 is added to the dialogue control device 100 of the third or fourth embodiment (indicated by a broken line in FIG. 8). Speech recognition and face detection introduce processing delays, and these delays are not constant. Unless the delay is corrected, the information from the various sensors will correspond to different times, and an incorrect response start index R(t₄) or talking start index S(u) may be output. To prevent this, the time correction unit 310 is added; it buffers the information from each sensor together with its time and determines the read-out position in accordance with the information that has the largest delay.

<Time correction unit 310>
FIG. 11 is a functional block diagram of the time correction unit 310.

The time correction unit 310 includes seven buffers 311-n and a corresponding time selection unit 312. The seven buffers respectively store the sound detection result I₁(t₀), the sound source direction estimation result x_D(t₀), the speech level estimation result x_L(t₀), the speech recognition result x_R(t₄), the detection result y_D(u) indicating the face direction, the detection result y_S(u) indicating the face size, and the output signal I₇(t₇) of the human sensor 81.

Each input signal is buffered in its own buffer 311-n on a FIFO (first in, first out) basis. Each buffer 311-n stores the data of the input signal together with the time of that data.

The corresponding time selection unit 312 searches the FIFO outputs for the most recent time (the latest time, that is, the time with the largest delay), then reads out from each FIFO the data corresponding to the time closest to that time and outputs it. Data older than the data read out is discarded from the buffer. In the case of FIG. 12, for example, the unit first searches for the data with the most recent time and obtains the data x_R(1) at time 00:04. It then reads out and outputs the data corresponding to the times closest to 00:04, namely I₁(3) (time 00:05), x_D(3) (time 00:05), x_L(3) (time 00:05), y_D(2) (time 00:05), y_S(2) (time 00:05), and I₇(4) (time 00:04). Data older than the data read out is then discarded. The unit repeats this operation, next searching for the most recent time among the FIFO outputs that follow the data read out.

In this way, data at the same time as the most delayed data can be output from each buffer, which prevents malfunctions caused by time misalignment.
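The read-out procedure can be sketched as follows. This is one possible reading of the description, not the implementation of FIG. 11: timestamps are plain numbers in seconds, each buffer is a `collections.deque` of `(time, value)` pairs, ties in distance are resolved toward the older entry, and the entry read out is removed together with the older ones. The function name `select_aligned` and the sample data are hypothetical.

```python
from collections import deque


def select_aligned(buffers):
    """Read one time-aligned entry from every FIFO buffer.

    buffers maps a signal name to a deque of (time, value) pairs in
    arrival order. Returns (reference_time, {name: (time, value)}),
    or None if any buffer is empty.
    """
    if any(len(buf) == 0 for buf in buffers.values()):
        return None

    # The FIFO output (head) with the latest time belongs to the most
    # delayed signal; its time becomes the reference time.
    reference = max(buf[0][0] for buf in buffers.values())

    aligned = {}
    for name, buf in buffers.items():
        # Index of the entry whose time is closest to the reference time.
        idx = min(range(len(buf)), key=lambda i: abs(buf[i][0] - reference))
        # Discard everything older than the entry to be read out.
        for _ in range(idx):
            buf.popleft()
        # Read out (and remove) the selected entry.
        aligned[name] = buf.popleft()
    return reference, aligned


# Hypothetical buffers: a slow speech recognizer and two faster signals.
buffers = {
    "x_R": deque([(4.0, "hello")]),
    "I1": deque([(0.5, 0), (2.5, 1), (4.5, 1)]),
    "I7": deque([(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1)]),
}
print(select_aligned(buffers))
```

Calling `select_aligned` again on the same buffers performs the next iteration of the search; in this sketch it returns `None` until every buffer holds at least one entry again.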

Note that, although buffering is applied to the input values of the start index calculation unit 110 in this embodiment, the same effect can be obtained by buffering the input values I₁ to I₇ of the response start index calculation unit 117 and the talking start index calculation unit 118.

<Other modifications>
The present invention is not limited to the embodiments and modifications described above. For example, the various processes described above need not be executed in chronological order as described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as otherwise required. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the various processing functions of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing a process, the computer reads the program stored in its own storage unit and executes the process according to the program it has read. As another way of executing the program, the computer may read the program directly from the portable recording medium and execute the process according to it. Alternatively, each time a program is transferred from the server computer to the computer, the computer may execute the process according to the received program in sequence. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and the acquisition of results. The term program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each device is described as being configured by executing a predetermined program on a computer, at least part of the processing content may instead be realized in hardware.

Claims

A dialogue control device comprising:
a scenario storage unit that stores (i) a talking scenario in which a dialogue is started by outputting, from the dialogue device side, speech that triggers the dialogue, (ii) a response scenario for responding to an utterance from the user side, and (iii) a confirmation scenario for confirming with the user whether or not to start a dialogue; and
a scenario selection unit that receives as input a talking start index S, which indicates whether or not a dialogue should be started by outputting speech that triggers the dialogue from the dialogue device side, and a response start index R, which indicates whether or not a response should be made to a given sound, and that, where J and K are each an integer of 1 or more, selects the talking scenario, the response scenario, or the confirmation scenario based on the magnitude relationship between the talking start index S and J thresholds Th_{s,1}, Th_{s,2}, ..., Th_{s,J} and the magnitude relationship between the response start index R and K thresholds Th_{r,1}, Th_{r,2}, ..., Th_{r,K}.

The dialogue control device according to claim 1, wherein
the scenario selection unit transitions to one of (I) a standby state, (II) a confirmation state, which is the state after the confirmation scenario has been executed, and (III) a dialogue state, which is the state after the talking scenario or the response scenario has been executed, and changes the selection criteria used when selecting the talking scenario, the response scenario, or the confirmation scenario according to which of the standby state, the confirmation state, and the dialogue state it is in.

The dialogue control device according to claim 1 or claim 2, further comprising
a talking start index calculation unit that calculates the talking start index based on an output signal of an image sensor and at least one of an output signal of a microphone and an output signal of a human sensor, wherein
the talking start index calculation unit calculates the talking start index using at least one of a detection result indicating a face direction and a detection result indicating a face size, both obtained using the output signal of the image sensor.

The dialogue control device according to any one of claims 1 to 3, further comprising
a response start index calculation unit that calculates the response start index based on an output signal of a microphone and at least one of the output signals of an image sensor and a human sensor, wherein
the response start index calculation unit calculates the response start index using a speech recognition result obtained using the output signal of the microphone.

A dialogue control method in which a scenario storage unit stores (i) a talking scenario in which a dialogue is started by outputting, from the dialogue device side, speech that triggers the dialogue, (ii) a response scenario for responding to an utterance from the user side, and (iii) a confirmation scenario for confirming with the user whether or not to start a dialogue, the method comprising
a scenario selection step in which a scenario selection unit receives as input a talking start index S, which indicates whether or not a dialogue should be started by outputting speech that triggers the dialogue from the dialogue device side, and a response start index R, which indicates whether or not a response should be made to a given sound, and, where J and K are each an integer of 1 or more, selects the talking scenario, the response scenario, or the confirmation scenario based on the magnitude relationship between the talking start index S and J thresholds Th_{s,1}, Th_{s,2}, ..., Th_{s,J} and the magnitude relationship between the response start index R and K thresholds Th_{r,1}, Th_{r,2}, ..., Th_{r,K}.

The dialogue control method according to claim 5, further comprising
a talking start index calculation step in which a talking start index calculation unit calculates the talking start index based on an output signal of an image sensor and at least one of an output signal of a microphone and an output signal of a human sensor, wherein
in the talking start index calculation step, the talking start index is calculated using at least one of a detection result indicating a face direction and a detection result indicating a face size, both obtained using the output signal of the image sensor.

The dialogue control method according to claim 5 or claim 6, further comprising
a response start index calculation step in which a response start index calculation unit calculates the response start index based on an output signal of a microphone and at least one of the output signals of an image sensor and a human sensor, wherein
in the response start index calculation step, the response start index is calculated using a speech recognition result obtained using the output signal of the microphone.

A program for causing a computer to function as the dialogue control device according to any one of claims 1 to 4.