JP2018094645A

JP2018094645A - Behavior command generation system, response system and behavior command generation method

Info

Publication number: JP2018094645A
Application number: JP2016238910A
Authority: JP
Inventors: Changliu Chung (チャリュウチュン); Fairchild Glass Dylan (フェアチャイルドグラスディラン); Takayuki Kanda (神田　崇行)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2016-12-08
Filing date: 2016-12-08
Publication date: 2018-06-21
Anticipated expiration: 2036-12-08
Also published as: JP6886651B2

Abstract

PROBLEM TO BE SOLVED: To provide a system that generates, in a data-driven manner, behavior commands for a robot to carry out human-robot interaction in a specific environment.
SOLUTION: A behavior command generation system enables a robot 1000 to communicate through behavior with a customer. Based on data acquired in a situation where a store owner and a customer communicate through behavior, and stored in a storage unit 300, the system generates robot behavior commands so that the robot 1000 acts in place of the store owner. A combined state vector is generated from the states of the customer and the store owner, based on the clustering results and the time-series behavior data, and each action vector corresponds to a combined state vector and represents the store owner's subsequent representative action. A predictor is trained to take the combined state vector as input and to output an action vector.
SELECTED DRAWING: Figure 5

Description

The present invention relates to a behavior command generation system, a response system, and a behavior command generation method for enabling natural communication between a human and a device such as a robot.

As robots become more widespread in modern society, the field of human-robot interaction (HRI) faces the need to integrate robots into daily life.

Such so-called "service robots" are gaining a presence in museums, offices, elderly care, shopping malls, and healthcare facilities.

For example, a robot that supports store operations needs to be able to greet customers, respond to their questions, recommend products, explain various products, and assist customers in a variety of situations.

One approach to designing the logic for such robot-human interaction is to explicitly program the behaviors the robot should perform, the inputs expected from the environment, and the behavior rules the robot should follow.

For example, Patent Document 1 discloses a robot system that includes a robot and a motion capture system. This robot system analyzes the spatial formation of the robot and a person using their positions, body orientations, and gaze directions, and recognizes the dialogue participation state of the robot and the person. The robot then takes appropriate actions according to the recognized participation state so that both the robot and the person enter the dialogue participation state, after which the robot utters a greeting to the person.

Here, once the dialogue participation state of the robot and the person is recognized, the robot's action (motion) is determined according to the recognition result. That is, the robot acts to form a predetermined spatial formation in which both parties participate (in other words, a spatial formation for starting a dialogue), but what the robot utters once that formation is achieved is based on a predetermined scenario.

Patent Document 2 discloses a voice dialogue control device that improves the flexibility of communication between a user and a voice dialogue device by executing a filler action suited to the length of the waiting time. This voice dialogue control device includes a waiting time prediction unit that predicts the waiting time, and a filler action determination unit that selects a filler action based on the waiting time and on the time required to execute each of a plurality of action candidates that the voice dialogue device can perform. In this case as well, the action candidates are assumed to be prepared in advance as a table.

However, explicitly programming the expected inputs from the environment and the behavior rules the robot should follow in this way is generally a difficult process. It depends heavily on the designer's ability to use intuition to, for example, anticipate every question people might ask the robot, imagine a wide range of social situations, and specify social behaviors and execution rules for the robot that are hard to articulate clearly.

This process is therefore highly labor-intensive, and creating robust interactions becomes even more difficult when errors due to sensor noise and the natural variability of human behavior are taken into account.

Accordingly, there are also techniques that obtain the actions a robot should take toward a person from some source and make use of them.

For example, Patent Document 3 discloses a system that acquires comments about the program being viewed from a social media server 7, determines the content the robot should utter from the comments of speakers whose personality matches the personality set for the robot, and extracts from an action database the action the robot should perform based on the dialogue state of the utterance content and the emotional state of the robot. This allows the robot to perform actions suited to the content of the program being viewed. Narrowing down the comments by the personality set for the robot when determining the utterance content allows the robot to produce consistent utterances and actions.

Patent Document 1: JP 2012-161851 A
Patent Document 2: JP 2016-126293 A
Patent Document 3: JP 2015-148701 A

However, the technique disclosed in Patent Document 3 merely has the robot utter, from among the comments posted on social media about the program being broadcast, those comments whose personality matches that of the robot. It therefore presupposes that other people are posting comments that fit the current situation, and the robot's actions still follow a scenario prepared in advance. The technique is thus unsuitable for applications such as having a robot serve customers.

The present invention has been made to solve the problems described above, and its object is to provide a system that generates, in a data-driven manner, behavior commands for a robot to carry out human-robot interaction in a specific environment.

According to one aspect of the present invention, there is provided a behavior command generation system for enabling a device to communicate through behavior with a first participant in a first situation. Based on data acquired in a second situation in which a second participant and a third participant communicate through behavior, the device acts in place of the second participant in the first situation. The system comprises: a plurality of sensors for collecting time-series data on human behavior; behavior pattern clustering means for clustering the time-series behavior data in the second situation and determining a representative action for each cluster; vector generation means for associating combined state vectors with respective action vectors, wherein each combined state vector is generated, in the second situation, from the state of the third participant and the state of the second participant based on the clustering results and the time-series behavior data, and each action vector corresponds to a combined state vector and represents the subsequent representative action of the second participant; predictor generation means for generating a predictor that takes a combined state vector as input and outputs an action vector; and command generation means for generating, in the first situation, a command to the device according to the action vector that the generated predictor predicts in response to the behavior of the first participant.

Preferably, the representative action includes a representative utterance and a representative motion.

Preferably, the behavior pattern clustering means includes utterance clustering means for classifying the observed utterances of the second participant into utterance clusters, and typical utterance extraction means for selecting one representative utterance per utterance cluster by choosing the utterance that has the highest level of lexical similarity to the largest number of other utterances in the cluster.
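
As a rough illustration of the typical utterance extraction described above, the following Python sketch picks, within one utterance cluster, the utterance that is lexically close to the largest number of other utterances. The word-set Jaccard measure, the 0.5 threshold, the tie-break on total similarity, and the function names are assumptions made for this example; this passage does not specify the similarity measure.

```python
from __future__ import annotations


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity (an assumed stand-in for lexical similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def representative_utterance(cluster: list[str], threshold: float = 0.5) -> str:
    """Return the utterance similar (>= threshold) to the most other utterances."""
    if not cluster:
        raise ValueError("cluster must not be empty")

    def score(i: int) -> tuple[int, float]:
        sims = [jaccard(cluster[i], u) for j, u in enumerate(cluster) if j != i]
        # Primary key: number of close neighbours; tie-break: total similarity.
        return (sum(s >= threshold for s in sims), sum(sims))

    return cluster[max(range(len(cluster)), key=score)]


# Hypothetical cluster of store owner utterances.
cluster = [
    "this camera has a big zoom",
    "this camera has big zoom",
    "the camera has a big zoom",
    "it is waterproof",
]
print(representative_utterance(cluster))
```

Choosing an utterance that was actually observed, rather than synthesizing a new one, keeps the representative utterance natural while reducing the influence of outliers such as speech recognition errors.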

Preferably, the vector generation means includes: discretization means for detecting the boundaries of the actions of the second and third participants and discretizing the time-series behavior data; combined state extraction means for extracting, in response to detection of a delimited action of the third participant, the state of the third participant and the state of the second participant as a combined state vector; and action vector extraction means for extracting, as an action vector, the subsequent representative action of the second participant corresponding to the extracted combined state vector.

Preferably, the state of the second or third participant in the combined state vector includes the spatial state of the second participant, the spatial state of the third participant, and one of a set of predetermined common proximity arrangements between two people.

Preferably, the behavior pattern clustering means includes trajectory segmentation means for segmenting the observed trajectory of the second or third participant into stop segments and moving segments, spatial clustering means for clustering the stop segments into stop clusters, and stop position extraction means for identifying stop positions, each representing a corresponding stop cluster.
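
The trajectory segmentation step can be pictured with the following minimal Python sketch, which labels each sampling interval of a tracked trajectory as "stop" or "move" by comparing the speed against a threshold and then merges consecutive intervals with the same label. The speed threshold of 0.2 m/s, the (t, x, y) sample layout, and the function name are assumptions for illustration; this passage does not state how the segments are detected.

```python
import math


def segment_trajectory(points, speed_threshold=0.2):
    """Split a trajectory into stop and move segments.

    points: list of (t, x, y) samples in seconds and metres.
    Returns a list of (label, start_index, end_index); each segment spans
    points[start_index] through points[end_index] inclusive.
    """
    if len(points) < 2:
        return []

    # Label every interval between consecutive samples by its speed.
    labels = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dt = t1 - t0
        speed = math.hypot(x1 - x0, y1 - y0) / dt if dt > 0 else 0.0
        labels.append("stop" if speed < speed_threshold else "move")

    # Merge runs of identical labels into segments.
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


# Hypothetical trajectory: stand still, walk 2 m, stand still again.
track = [(0, 0.0, 0.0), (1, 0.0, 0.0), (2, 1.0, 0.0), (3, 2.0, 0.0), (4, 2.0, 0.0)]
print(segment_trajectory(track))
```

The stop segments produced this way would then be passed to spatial clustering, and the moving segments to trajectory clustering.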

Preferably, the behavior pattern clustering means includes trajectory clustering means for clustering the moving segments into moving clusters, and typical trajectory extraction means for identifying trajectories, each representing a corresponding moving cluster.

Preferably, the action vector includes information identifying the utterance cluster that contains the recognized utterance of the second participant.

Preferably, the action vector includes a predetermined common proximity arrangement between two people, and the command generation means generates the command based on generation models that correspond respectively to the common proximity arrangements.

According to another aspect of the present invention, there is provided a response system for enabling communication through behavior with a first participant. The response system comprises a device for presenting human-like behavior to the first participant in a first situation, based on time-series data on the behavior of the first participant collected by a plurality of sensors. The device acts in place of a second participant in the first situation, based on data acquired in a second situation in which the second participant and a third participant communicate through behavior. The device includes: a storage device for storing combined state vectors generated from the data acquired in the second situation in association with action vectors corresponding to representative actions of the second participant; a predictor that takes a combined state vector as input and outputs an action vector; and command generation means for generating, in the first situation, a behavior command for the device according to the action vector that the predictor predicts in response to the behavior of the first participant. Each representative action is determined as a discretized unit action for each cluster by clustering the time-series data in the second situation. Each combined state vector is determined by detecting the boundaries of the actions of the second and third participants, discretizing the time-series behavior data, and, using each delimited action of the third participant as a search key, combining the state of the third participant with the state of the second participant.

According to yet another aspect of the present invention, there is provided a behavior command generation method for enabling a device to communicate through behavior with a first participant in a first situation. The method comprises the steps of: collecting time-series data on human behavior in a second situation in which a second participant and a third participant communicate through behavior; clustering the time-series behavior data in the second situation and determining a representative action for each cluster; associating combined state vectors with respective action vectors, wherein each combined state vector is generated, in the second situation, from the state of the third participant and the state of the second participant based on the clustering results and the time-series behavior data, and each action vector corresponds to a combined state vector and represents the subsequent representative action of the second participant; generating a predictor that takes a combined state vector as input and outputs an action vector; and generating, in the first situation, a command to the device according to the action vector that the generated predictor predicts in response to the behavior of the first participant, so that the device acts in place of the second participant in the first situation.
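
The predictor generation and prediction steps of the method can be sketched in Python as follows. This passage does not state what kind of predictor is used, so the example substitutes a simple lookup table that returns the most frequent action observed for a combined state and falls back to the stored state with the fewest differing features. The class name, the three-feature state layout (store owner location, customer location, proximity arrangement), and all feature values are hypothetical.

```python
from collections import Counter, defaultdict


class LookupPredictor:
    """Toy stand-in for the predictor: combined state vector -> action label."""

    def __init__(self):
        self._table = defaultdict(Counter)

    def fit(self, combined_states, actions):
        # Count how often each action followed each combined state.
        for state, action in zip(combined_states, actions):
            self._table[tuple(state)][action] += 1
        return self

    def predict(self, combined_state):
        if not self._table:
            raise ValueError("predictor has not been trained")
        key = tuple(combined_state)
        if key not in self._table:
            # Unseen state: use the stored state with the fewest differing features.
            key = min(self._table, key=lambda k: sum(a != b for a, b in zip(k, key)))
        return self._table[key].most_common(1)[0][0]


# Hypothetical training pairs taken from observed human-human interaction.
states = [
    ("service_counter", "brand_A", "face_to_face"),
    ("service_counter", "brand_A", "face_to_face"),
    ("service_counter", "brand_A", "face_to_face"),
    ("brand_A", "brand_A", "present_object"),
    ("service_counter", "door", "waiting"),
]
actions = ["greet", "greet", "ask", "explain_brand_A", "wait"]

predictor = LookupPredictor().fit(states, actions)
print(predictor.predict(("service_counter", "brand_B", "face_to_face")))
```

In the first situation, the predicted action label would then be translated into a concrete command for the device, for example a representative utterance to speak and a target position to move to.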

According to the present invention, behavior commands for a robot to carry out human-robot interaction can be generated, under a similar environment, from data on human-human interaction behavior actually observed in a specific environment.

Further, according to the present invention, the system designer does not need to create scenarios, so the designer's workload for generating robot behavior can be greatly reduced.

Further, according to the present invention, behavior commands for robust interaction can be created even when the natural variability of human behavior is taken into account.

A conceptual diagram showing the space in which data on "communication through speech and behavior" between people is acquired in the present embodiment.
A top view showing the arrangement of the 3D range sensors 32.1 to 32.m (m: a natural number of 2 or more) placed on the ceiling of the area shown in FIG. 1.
A diagram showing an example of a 3D range sensor mounted on the ceiling.
A block diagram illustrating the hardware configuration of the arithmetic device 100.
A functional block diagram showing the functions that the CPU 56 of the arithmetic device 100 of the present embodiment implements by executing software.
A flowchart illustrating the procedure for automatically generating interaction logic.
A functional block diagram illustrating a configuration that automatically clusters participants' utterances.
A diagram illustrating the distribution of utterances in one of the obtained clusters.
A conceptual diagram illustrating the processing performed by the typical utterance extraction unit 2210.
A functional block diagram of the processing of the motion abstraction processing unit 230, which abstracts participants' motion elements by discretization and clustering.
A diagram showing the identified "stop positions".
A diagram showing an example of a trajectory cluster.
A diagram showing an example of a trajectory cluster.
A diagram showing an example of a trajectory cluster.
A conceptual diagram showing the "product presentation state" among the interaction states.
A functional block diagram illustrating the processing of the action discretization unit 420.
A functional block diagram illustrating the operation of the combined state vector generation unit 430 and the robot action generation unit 440.
A conceptual diagram illustrating the action identification processing performed by the action pair identification unit 4302.
A conceptual diagram showing the features in the combined state vector and the features in the robot action vector.
A functional block diagram illustrating the operation of the predictor training unit 450.
A diagram showing an example of the feature values that the features in the combined state vector can take.
A functional block diagram illustrating the operation of the online processing unit 500.
A conceptual diagram showing examples of human-human interaction states.
A conceptual diagram showing, for the "presentation state", the relationship between the positional relationship expressed by the combined state vector and the generation of the corresponding robot action.
A conceptual diagram contrasting a "recognition model" with a "generation model".
A conceptual diagram showing, for the "face-to-face state", the relationship between the positional relationship expressed by the combined state vector and the generation of the corresponding robot action.
A conceptual diagram showing, for the "waiting state", the relationship between the positional relationship expressed by the combined state vector and the generation of the corresponding robot action.
A diagram of actually observed human-human positional relationships and the corresponding human-robot positional relationships.

In the following embodiments, components and processing steps denoted by the same reference numerals are identical or equivalent, and their description is not repeated unless necessary.

In the following description, a 3D range sensor such as Microsoft Kinect (registered trademark) is assumed as the three-dimensional distance sensor (hereinafter, "3D range sensor"). A 2D range sensor such as a laser range finder that performs two-dimensional scanning is described as an example of the two-dimensional distance sensor (hereinafter, "2D range sensor"). The present invention is not limited to these distance sensors, however, and is also applicable to range-finding sensors that measure the distance to an object two-dimensionally or three-dimensionally by other methods.

(Description of Embodiment)
In the present embodiment, a data-driven approach as described below is applied to the design of how a robot and a person communicate (hereinafter, "interaction design"), thereby providing a solution to the conventional problems.

That is, by directly acquiring behavioral elements such as utterances, social situations, and transition rules from interactions between people (hereinafter, "human-human interaction") at many real-world sites, a set of behaviors and interaction logic usable by a robot is collected automatically.

This reduces the difficulty and burden of interaction design, and because sensor errors and behavioral variability are taken into account implicitly, it enables the generation of more robust interaction logic.

Data-driven interaction design based on real-world interactions is carried out using various sensors placed in the environment (sensors that capture human speech, sensors that detect human position and posture, and so on).

For detecting the positions of people, high-precision tracking systems have been installed in public spaces, enabling passive collection of natural human interaction data.

Such a tracking system is disclosed, for example, in the following document.

Reference 1: D. Brscic, T. Kanda, T. Ikeda, and T. Miyashita, "Person Tracking in Large Public Spaces Using 3-D Range Sensors," IEEE Transactions on Human-Machine Systems, vol. 43, pp. 522-534, 2013.

Technologies such as microphone arrays can also provide sound source localization and speech recognition usable in noisy real-world environments. Such a microphone array is disclosed in the following document.

Reference 2: Japanese Patent Laid-Open No. 2016-50872

FIG. 1 is a conceptual diagram showing the space in which, in the present embodiment, data on "communication through speech and behavior" between people (hereinafter called "interaction", including that between a robot and a person) is acquired.

Here, the actions and utterances exchanged between a "store owner" and a "customer" in a store are described as an example of "interaction".

FIG. 1 is a plan view of the inside of a store (for example, a camera store), in which a store owner p1 and a customer p2 interact.

The store is assumed to contain a service counter and display areas for different camera brands (brand A, brand B, and brand C). The customer is assumed to enter through a door and leave through the same door.

Note that FIG. 1 is merely an example, and there may be more store staff and more customers.

FIG. 2 is a top view showing the arrangement of the 3D range sensors 32.1 to 32.m (m: a natural number of 2 or more) placed on the ceiling of the area shown in FIG. 1.

As shown in FIG. 2, the 3D range sensors are typically mounted upside down and arranged in rows so as to face alternating directions. A computer 100 that performs tracking of people (and/or moving objects) is placed outside the area in which the range sensors are arranged.

The sensors are arranged so as to minimize mutual interference and maximize the area covered.

FIG. 3 is a diagram showing an example of a 3D range sensor mounted on the ceiling.

As FIG. 3 shows explicitly, the 3D range sensor is mounted on the ceiling upside down.

Although not particularly limited, these 3D range sensors are used, for example, to detect the tops of people's heads rather than to track each person as a whole.

Details of the head-top detection algorithm are described, for example, in the following documents.

Reference 1: JP-A-2012-215555
Reference 2: D. Brscic, T. Kanda, T. Ikeda, and T. Miyashita, "Person Tracking in Large Public Spaces Using 3-D Range Sensors," IEEE Transactions on Human-Machine Systems, vol. 43, pp. 522-534, 2013.

To cover the top-of-head region optimally, each 3D sensor is oriented manually at an angle of roughly 30 to 60 degrees from the horizontal, with the exact angle chosen to suit the particular room and sensor configuration.

If this angle is too shallow (close to horizontal), the sensor cannot observe the tops of people's heads; if it is too steep, the effective detection area becomes very small.
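
The effect of a steep tilt on the detection area can be checked with a simplified side-view calculation, sketched below in Python. The model considers only the vertical field of view and a horizontal plane at head height; it ignores the horizontal field of view and occlusion. The function name, the 40-degree vertical field of view, the 4 m sensing range, and the 1 m drop from sensor to head height are placeholder assumptions, not values taken from this document.

```python
import math


def plane_coverage(drop_m, tilt_deg, vfov_deg, max_range_m):
    """Return (near, far) horizontal distances in metres covered on a plane.

    drop_m: vertical distance from the sensor down to the plane (e.g. head height).
    tilt_deg: downward tilt of the optical axis, measured from horizontal.
    vfov_deg: vertical field of view of the sensor.
    max_range_m: maximum sensing range along a ray.
    """
    if not 0 < drop_m < max_range_m:
        raise ValueError("drop_m must be positive and within sensing range")
    if not (0 <= tilt_deg <= 90 and 0 < vfov_deg < 180):
        raise ValueError("tilt_deg must be in [0, 90] and vfov_deg in (0, 180)")

    # Farthest horizontal distance at which a ray of max_range_m reaches the plane.
    reach = math.sqrt(max_range_m ** 2 - drop_m ** 2)
    steep_deg = tilt_deg + vfov_deg / 2    # lower edge of the field of view
    shallow_deg = tilt_deg - vfov_deg / 2  # upper edge of the field of view

    near = 0.0 if steep_deg >= 90 else min(reach, drop_m / math.tan(math.radians(steep_deg)))
    far = reach if shallow_deg <= 0 else min(reach, drop_m / math.tan(math.radians(shallow_deg)))
    return near, far


for tilt in (45, 80):
    near, far = plane_coverage(drop_m=1.0, tilt_deg=tilt, vfov_deg=40, max_range_m=4.0)
    print(f"tilt {tilt} deg: covers {near:.2f} m to {far:.2f} m")
```

Under these assumptions the covered strip shrinks markedly as the tilt approaches vertical, which is consistent with the trade-off described above.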

In the room shown in FIG. 1, 2D laser range finders are mounted, for example on metal poles 86 cm high, as 2D distance sensors for tracking the positions of people in the room.

This height was chosen for optimal visibility: the sensors target the waist of the person being observed, and the waist is a larger target than the legs and is therefore identified more easily at greater distances.

As long as the 2D sensor stands on a level floor, it is mounted rigidly so that its pitch and roll are effectively fixed at zero and its scan covers a horizontal plane.

なお、適切な形状モデルと２Ｄレーザーレンジファインダーおよび／または３Ｄレーザーレンジファインダーにより計測された対象物の形状とのマッチングをとることにより、人の向き・姿勢を検知することも可能である。このような方法については、たとえば、上述した文献１に開示がある。 It is also possible to detect the orientation / posture of a person by matching an appropriate shape model with the shape of the object measured by the 2D laser range finder and / or the 3D laser range finder. Such a method is disclosed in, for example, Document 1 described above.

[ハードウェアブロック]
図４は、演算装置１００のハードウェア構成を説明するためのブロック図である。 [Hardware block]
FIG. 4 is a block diagram for explaining a hardware configuration of the arithmetic device 100.

演算装置１００としては、たとえば、パーソナルコンピュータなどを使用することができる。 As the arithmetic device 100, for example, a personal computer or the like can be used.

図４に示されるように、演算装置１００は、外部記録媒体６４に記録されたデータを読み取ることができるドライブ装置５２と、バス６６に接続された中央演算装置（ＣＰＵ：Central Processing Unit）５６と、ＲＯＭ（Read Only Memory) ５８と、ＲＡＭ（Random Access Memory）６０と、不揮発性記憶装置３００と、２Ｄレンジセンサ３０．１〜３０．ｍおよび３Ｄレンジセンサ３２．１〜３２．ｍからの測距データや、スマートフォン３４．１〜３４．ｐからの音声テキストデータ、図示しない入力装置からの入力データを取込むためのデータ入力インタフェース（以下、データ入力Ｉ／Ｆ）６８とを含んでいる。 As shown in FIG. 4, the arithmetic device 100 includes a drive device 52 that can read data recorded on an external recording medium 64; a central processing unit (CPU) 56 connected to a bus 66; a ROM (Read Only Memory) 58; a RAM (Random Access Memory) 60; a nonvolatile storage device 300; and a data input interface (hereinafter, data input I/F) 68 for taking in ranging data from the 2D range sensors 30.1 to 30.n and the 3D range sensors 32.1 to 32.m, speech text data from the smartphones 34.1 to 34.p, and input data from an input device (not shown).

外部記録媒体６４としては、たとえば、ＣＤ−ＲＯＭ、ＤＶＤ−ＲＯＭのような光ディスクやメモリカードを使用することができる。ただし、記録媒体ドライブ５２の機能を実現する装置は、光ディスクやフラッシュメモリなどの不揮発性の記録媒体に記憶されたデータを読み出せる装置であれば、対象となる記録媒体は、これらに限定されない。また、不揮発性記憶装置３００の機能を実現する装置も、不揮発的にデータを記憶し、かつ、ランダムアクセスできる装置であれば、ハードディスクのような磁気記憶装置を使用してもよいし、フラッシュメモリなどの不揮発性半導体メモリを記憶装置として用いるソリッドステートドライブ（ＳＳＤ：Solid State Drive）を用いることもできる。 As the external recording medium 64, for example, an optical disc such as a CD-ROM or DVD-ROM, or a memory card, can be used. However, the recording media handled by the device that realizes the function of the recording medium drive 52 are not limited to these, as long as the device can read data stored on a nonvolatile recording medium such as an optical disc or flash memory. Likewise, the device that realizes the function of the nonvolatile storage device 300 may be any device that stores data in a nonvolatile manner and allows random access: a magnetic storage device such as a hard disk may be used, or a solid state drive (SSD) that uses nonvolatile semiconductor memory such as flash memory as its storage.

このような演算装置１００の主要部は、コンピュータハードウェアと、ＣＰＵ５６により実行されるソフトウェアとにより実現される。一般的にこうしたソフトウェアは、マスクＲＯＭやプログラマブルＲＯＭなどにより、演算装置１００製造時に記録されており、これが実行時にＲＡＭ６０に読みだされる構成としてもよいし、ドライブ装置５２により記録媒体６４から読取られて不揮発性記憶装置３００に一旦格納され、実行時にＲＡＭ６０に読みだされる構成としてもよい。または、当該装置がネットワークに接続されている場合には、ネットワーク上のサーバから、一旦、不揮発性記憶装置３００にコピーされ、不揮発性記憶装置３００からＲＡＭ６０に読出されてＣＰＵ５６により実行される構成であってもよい。 The main part of such an arithmetic device 100 is realized by computer hardware and software executed by the CPU 56. In general, such software may be recorded in a mask ROM, a programmable ROM, or the like when the arithmetic device 100 is manufactured and read into the RAM 60 at execution time, or it may be read from the recording medium 64 by the drive device 52, stored once in the nonvolatile storage device 300, and read into the RAM 60 at execution time. Alternatively, when the device is connected to a network, the software may be copied once from a server on the network to the nonvolatile storage device 300, read from the nonvolatile storage device 300 into the RAM 60, and executed by the CPU 56.

図４に示したコンピュータのハードウェア自体およびその動作原理は一般的なものである。したがって、本発明の最も本質的な部分の１つは、不揮発性記憶装置３００等の記録媒体に記憶されたソフトウェアである。 The computer hardware itself and its operating principle shown in FIG. 4 are general. Therefore, one of the most essential parts of the present invention is software stored in a recording medium such as the nonvolatile storage device 300.

以下で説明するシステムは、データ駆動型のロジック生成により、第１の状況において、装置（たとえば、ロボット）が第１の参加者と行動によるコミュニケーションを可能とするための装置対するコマンドを生成するための行動コマンド生成システムである。行動コマンド生成システムは、時間軸の少なくとも一部において第１の状況に先行して、第２の参加者（店主）および第３の参加者（顧客）が行動によるコミュニケーションをとる第２の状況において取得されたデータに基づき、第１の状況において第２の参加者の代わりとして装置が行動するためのコマンドを、機械学習により学習（訓練）した予測器の出力に基づいて、生成可能とするものである。 The system described below is a behavior command generation system that uses data-driven logic generation to generate commands for a device (for example, a robot) so that, in a first situation, the device can communicate through actions with a first participant. Based on data acquired in a second situation, which precedes the first situation on at least part of the time axis and in which a second participant (store owner) and a third participant (customer) communicate through actions, the system can generate commands that let the device act in place of the second participant in the first situation, using the output of a predictor trained by machine learning.

［システムの機能ブロック］
図５は、本実施の形態の演算装置１００において、上述したＣＰＵ５６がソフトウェアを実行するにより実現する機能を示す機能ブロック図である。 [System functional blocks]
FIG. 5 is a functional block diagram showing functions realized by the above-described CPU 56 executing software in the arithmetic device 100 according to the present embodiment.

以下に説明するとおり、本実施の形態の演算装置１００では、２Ｄレーザレンジファインダ３０．１〜３０．ｎからの出力（２Ｄ距離データ）および３Ｄレンジセンサ３２．１〜３２．ｍからの出力（３Ｄ測距データ）ならびに人の発話を収集するセンサ（後述するように、たとえば、スマートホンでもよい）からのデータに基づいて、データ収集モジュール２００が、時間と同期させて、室内の人の位置・姿勢および発話の情報を収集する。 As described below, in the arithmetic device 100 of the present embodiment, the data collection module 200 collects, synchronized in time, information on the positions, postures, and utterances of people in the room, based on the outputs of the 2D laser range finders 30.1 to 30.n (2D distance data), the outputs of the 3D range sensors 32.1 to 32.m (3D ranging data), and data from sensors that capture human speech (which, as described later, may be smartphones, for example).

ここでは、人・人間インタラクションのデータについて、まず、インタラクション・ロジックを自動的に収集する過程（状況）を、「学習データ収集過程」と呼び、データ駆動型のアプローチにより、ロボット・人間インタラクション・ロジックをコンピュータが実行可能な形式に組み立てる過程（状況）を、「ロジック学習過程」と呼ぶことにする。 Here, for human-human interaction data, the process (situation) in which the interaction logic is first automatically collected is called the “learning data collection process”, and the process (situation) in which the robot-human interaction logic is assembled into a computer-executable form by a data-driven approach is called the “logic learning process”.

そして、組み立てられたロボット・インタラクション・ロジックに基づいて、人に対するロボットの行動を制御する過程（状況）を、「オンライン処理過程」と呼ぶ。 A process (situation) for controlling the behavior of the robot with respect to a person based on the assembled robot interaction logic is called an “online processing process”.

学習データ収集過程とロジック学習過程の双方において、運動トラッキング部２０４は、対象物（例示的には、人）の位置のトラッキングおよび当該対象物の姿勢のデータの取得を行う。この対象物のトラッキングは、たとえば、パーティクルフィルタなどの技術を用いて、対象の位置および速度を推定することにより実行される。 In both the learning data collection process and the logic learning process, the motion tracking unit 204 tracks the position of an object (for example, a person) and acquires posture data of the object. The tracking of the object is executed by estimating the position and speed of the object using a technique such as a particle filter.
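
The tracking step above names a particle filter only as one possible technique. The following is a minimal sketch of one predict, update, and resample cycle for a single person, assuming a 2D constant-velocity motion model and a direct position measurement; the function names, noise parameters, and state layout are illustrative assumptions and are not taken from the patent.

```python
import math
import random


def particle_filter_step(particles, measurement, dt=0.1,
                         accel_noise=0.5, meas_noise=0.2, rng=random):
    """Run one predict/update/resample cycle.

    particles: list of (x, y, vx, vy) tuples (hypothetical state layout).
    measurement: observed (x, y) position of the tracked person.
    """
    # Predict: perturb each velocity, then advance the position.
    predicted = []
    for x, y, vx, vy in particles:
        vx += rng.gauss(0.0, accel_noise) * dt
        vy += rng.gauss(0.0, accel_noise) * dt
        predicted.append((x + vx * dt, y + vy * dt, vx, vy))

    # Update: weight each particle by a Gaussian likelihood of the measurement.
    mx, my = measurement
    weights = [
        math.exp(-((x - mx) ** 2 + (y - my) ** 2) / (2 * meas_noise ** 2))
        for x, y, _, _ in predicted
    ]
    if sum(weights) == 0.0:
        # Every particle is far from the measurement: fall back to uniform.
        weights = [1.0] * len(predicted)

    # Resample with replacement in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Return the mean (x, y, vx, vy) over all particles."""
    n = len(particles)
    return tuple(sum(p[i] for p in particles) / n for i in range(4))
```

Calling `particle_filter_step` once per time step and reading `estimate` afterwards yields the position and velocity estimates that the later processing stages consume.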

このような対象物（人）のトラッキングの処理については、たとえば、上述の文献１の他、以下の文献に開示があるとおり、周知な方法で実行することが可能である。 Such tracking processing of an object (person) can be performed by a well-known method as disclosed in the following document in addition to the above document 1, for example.

文献３：特開２０１３−６４６７１号公報明細書
以下では、このような対象物（人）のトラッキングの処理を行う前提として、複数の３Ｄおよび２Ｄ距離センサの位置および向きは、グローバル座標系において、較正されているものとする。 Document 3: JP 2013-64671 A. In the following, as a premise for performing such tracking of an object (person), the positions and orientations of the multiple 3D and 2D distance sensors are assumed to have been calibrated in a global coordinate system.

また、後述するように、学習データ収集過程とロジック学習過程の双方において、人の発話のデータは、マイクロフォンにより収集され、テキストデータに変換されるものとする。このような音声データをテキストデータに変換する音声認識部２０２については、ローカルに演算装置１００がその機能を実行するものとしてもよい。ただし、たとえば、マイクロフォンとしては、人（参加者）が、各々保持するスマートフォンに装着されているものを使用することとし、音声データをテキストデータに変換する音声認識部２０２は、ネットワーク越しに、スマートフォンからの音声特徴データをサーバが受信して、サーバが変換したテキストデータをスマートフォンに返信する構成であってもよい。以下では、基本的に、音声認識処理は、ネットワークを介したサーバ側で実施されるものとして説明する。 As described later, in both the learning data collection process and the logic learning process, people's speech is captured by microphones and converted to text data. The speech recognition unit 202, which converts the speech data to text data, may be executed locally by the arithmetic device 100. Alternatively, for example, the microphones may be those of the smartphones carried by the individual people (participants), and the speech recognition unit 202 may be configured so that a server receives speech feature data from a smartphone over the network and returns the converted text data to the smartphone. The description below basically assumes that speech recognition is performed on the server side over the network.

なお、図５に示した機能ブロックのうちのＣＰＵ５６が実現する機能ブロックとしては、ソフトウェアでの処理に限定されるものではなく、その一部または全部がハードウェアにより実現されてもよい。 Of the functional blocks shown in FIG. 5, the functional blocks realized by the CPU 56 are not limited to software processing, and part or all of the functional blocks may be realized by hardware.

図５を参照して、２Ｄレーザレンジファインダ３０．１〜３０．ｎからの測距信号および３Ｄレンジファインダ３２．１〜３２．ｍからの測距信号、ならびに、スマートフォン３４．１〜３４．ｐ（ｎ，ｍ，ｐは自然数）からの音声テキストデータ（または音声データ）は、データ収集モジュール２００により制御されてデジタルデータとして入力され、不揮発性記憶装置３００のような記憶装置に、各レーザレンジファインダごとならびにスマートフォンごとに時系列のデータとして格納される。時系列にデータを取得する際に、演算装置１００の制御の下に、データの取り込みが行われる時刻を示すデータを、以下では、「タイムステップ」と呼ぶ。特に限定されないが、タイムステップのデータは、２Ｄレーザレンジファインダ３０．１〜３０．ｎからの測距データおよび３Ｄレンジファインダ３２．１〜３２．ｍ、音声テキストデータの各々に関連付けられて、不揮発性記憶装置３００に格納される。また、後述するように、各参加者は、音声認識を実行するためのスマートフォンを各自保持しているものとする。 Referring to FIG. 5, the ranging signals from the 2D laser range finders 30.1 to 30.n and from the 3D range finders 32.1 to 32.m, and the speech text data (or speech data) from the smartphones 34.1 to 34.p (n, m, and p are natural numbers), are input as digital data under the control of the data collection module 200 and stored as time-series data, per laser range finder and per smartphone, in a storage device such as the nonvolatile storage device 300. The data indicating the time at which data is captured under the control of the arithmetic device 100 during time-series acquisition is hereinafter called a “time step”. Although not particularly limited, time step data is stored in the nonvolatile storage device 300 in association with each of the ranging data from the 2D laser range finders 30.1 to 30.n, the ranging data from the 3D range finders 32.1 to 32.m, and the speech text data. As described later, each participant is assumed to carry a smartphone for performing speech recognition.
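
One way to picture the per-sensor, time-step-tagged storage described above is the small in-memory sketch below. The class names, field names, and source labels such as `"phone-34.1"` are hypothetical; the patent does not specify a storage schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Sample:
    time_step: float  # capture time assigned by the arithmetic device
    source_id: str    # hypothetical label, e.g. "2d-30.1" or "phone-34.1"
    payload: Any      # a range scan, or recognized speech text


class TimeSeriesStore:
    """Keeps one time-ordered list of samples per sensor or smartphone."""

    def __init__(self):
        self._by_source = defaultdict(list)

    def append(self, sample):
        self._by_source[sample.source_id].append(sample)

    def window(self, source_id, start, end):
        """Return samples from one source with start <= time_step < end."""
        samples = self._by_source.get(source_id, [])
        return [s for s in samples if start <= s.time_step < end]
```

Tagging every sample with the same clock is what lets later stages align a customer's utterance with the positions observed at the same moment.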

学習データ収集過程において、データ収集モジュール２００から記憶装置３００に格納された時系列データに対して、抽象化処理部２１０は、以下に説明すように、人の発話や、人の位置、人・人間の近接配置について、それぞれ、発話抽象化部２２０、運動抽象化部２３０、近接配置抽象化部２４０が、抽象化の処理を実行する。 In the learning data collection process, the abstraction processing unit 210 abstracts the time-series data stored in the storage device 300 by the data collection module 200: as described below, the utterance abstraction unit 220, the motion abstraction unit 230, and the proximity arrangement abstraction unit 240 perform abstraction of people's utterances, people's positions, and the proximity arrangement between people, respectively.

ここで、「抽象化」とは、観測されたデータをクラスタ化し、各クラスタについて、代表発話、代表運動を抽出する処理をいう。ここでは、一例として、近接配置抽象化部２４０については、たとえば、ユーザからの指示に基づいて所定の近接配置のモデルを複数個選択して近接配置モデルとして使用するものとする。 Here, “abstraction” refers to a process of clustering observed data and extracting representative utterances and representative motions for each cluster. Here, as an example, for the proximity arrangement abstraction unit 240, for example, a plurality of predetermined proximity arrangement models are selected and used as the proximity arrangement model based on an instruction from the user.

抽象化処理の結果、抽象化処理部２１０から記憶装置３００にそれぞれ、代表発話データ３１０、停止位置データ３１２、代表軌道データ３１４、近接配置モデルデータ３１６が格納される。 As a result of the abstraction processing, representative speech data 310, stop position data 312, representative trajectory data 314, and proximity arrangement model data 316 are stored in the storage device 300 from the abstraction processing unit 210, respectively.

次に、ロジック学習過程においては、訓練処理部４００において、動作要素抽出部４１０が、観測された訓練データ（時系列データ）から、人が移動しているか、停止しているかを識別し、発話要素データ３２０、運動要素データ３２２、インタラクション要素データ３２４を抽出する。 Next, in the logic learning process, the motion element extraction unit 410 of the training processing unit 400 identifies, from the observed training data (time-series data), whether a person is moving or stopped, and extracts utterance element data 320, motion element data 322, and interaction element data 324.
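
The moving-versus-stopped decision can be illustrated with a simple speed threshold over consecutive tracked positions. This is only a sketch under assumptions: the threshold value, the `(t, x, y)` sample layout, and the function name are hypothetical, and the patent does not state which criterion is actually used.

```python
import math


def segment_trajectory(points, speed_threshold=0.3):
    """Split a track into alternating 'stopped' and 'moving' segments.

    points: list of (t, x, y) samples with strictly increasing t.
    Returns (label, first_point_index, last_point_index) tuples.
    """
    if len(points) < 2:
        return []

    # Label each interval between consecutive samples by its speed.
    labels = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        speed = math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
        labels.append("moving" if speed > speed_threshold else "stopped")

    # Merge runs of identical labels into segments.
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments
```

A real tracker output would usually be smoothed first, since raw position noise can otherwise flip the label between adjacent samples.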

続いて、行動離散化部４２０は、抽象化処理部２１０によるクラスタリングによって特定された停止位置を基準にして、対象（たとえば、店主と顧客）の行動を離散化する。離散化された行動は、第１対象行動データ（たとえば、顧客行動データ）３３０と第２対象行動データ（たとえば、店主行動データ）３３２として記憶装置３００に格納される。 Subsequently, the behavior discretization unit 420 discretizes the behavior of the target (for example, the store owner and the customer) based on the stop position specified by the clustering by the abstraction processing unit 210. The discretized behavior is stored in the storage device 300 as first target behavior data (for example, customer behavior data) 330 and second target behavior data (for example, storekeeper behavior data) 332.

結合状態ベクトル生成部４３０は、後にロボットがその代りに行動するように制御されることになる第２対象（「代替行動対象」と呼ぶ：たとえば、店主）の相手である第１対象（「ロボットインタラクション対象」と呼ぶ：たとえば、顧客）の行動が検出されることに応じて、双方の対象の状態を結合状態ベクトルとして特定する。結合状態ベクトル３４０は、記憶装置３００に格納される。 In response to detection of an action by the first target (called the “robot interaction target”; for example, the customer), who is the counterpart of the second target (called the “alternative action target”; for example, the store owner) in whose place the robot will later be controlled to act, the combined state vector generation unit 430 identifies the states of both targets as a combined state vector. The combined state vector 340 is stored in the storage device 300.

ここで、「結合状態ベクトル」とは、ロボットインタラクション対象の発話を特定する情報（発話ベクトル）、ロボットインタラクション対象の空間状態を特定する情報、代替行動対象の空間状態を特定する情報、インタラクション状態を特定する情報の組を意味する。 Here, the “combined state vector” is information that specifies the utterance of the robot interaction target (utterance vector), information that specifies the spatial state of the robot interaction target, information that specifies the spatial state of the alternative action target, and the interaction state. A set of information to be identified.

発話ベクトルは、発話を特定する特定情報（ＩＤ情報）とキーワード情報とを含む。空間状態を特定する情報は、「現在位置」、「運動起点」および「運動目標位置」（いずれもクラスタ化により得られた停止位置のいずれか）を含む。インタラクション状態を特定する情報（インタラクション状態）は、両対象の空間配置を特定する情報と、インタラクションする両対象以外に、そのインタラクション状態に関わり当該インタラクション状態を定義するオブジェクトを特定する情報とを含む。 The utterance vector includes specific information (ID information) for specifying the utterance and keyword information. The information specifying the spatial state includes “current position”, “movement start point”, and “movement target position” (all of which are stop positions obtained by clustering). The information for specifying the interaction state (interaction state) includes information for specifying the spatial arrangement of both objects, and information for specifying an object that defines the interaction state related to the interaction state in addition to the both objects to be interacted with.
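
As a rough illustration of how the components listed above could be packed into one flat vector, the sketch below concatenates an utterance vector with one-hot encodings of each party's spatial state and of the interaction state. The one-hot encoding, the field names, and the reduction of the interaction state to a single index are assumptions made for this example, not details given in the patent.

```python
from dataclasses import dataclass


@dataclass
class SpatialState:
    current: int  # index of a clustered stop position, or -1 while moving
    origin: int   # stop position where the current motion started
    target: int   # stop position the motion is heading to


def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at `index` (all zeros if out of range)."""
    vec = [0.0] * size
    if 0 <= index < size:
        vec[index] = 1.0
    return vec


def joint_state_vector(utterance_vec, customer, store_owner,
                       interaction_state, n_locations, n_states):
    """Concatenate utterance, both spatial states, and the interaction state."""
    vec = list(utterance_vec)
    for state in (customer, store_owner):
        vec += one_hot(state.current, n_locations)
        vec += one_hot(state.origin, n_locations)
        vec += one_hot(state.target, n_locations)
    vec += one_hot(interaction_state, n_states)
    return vec
```

With this layout the vector length is fixed at `len(utterance_vec) + 6 * n_locations + n_states`, which is convenient for a classifier that expects fixed-size input.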

ロボット行動生成部４４０は、訓練データ中において、ロボットインタラクション対象となるべき人（たとえば、顧客）に対応する行動が検知されたときに、それと対となる代替行動対象となるべき人（たとえば、店主）の行動（「ロボット行動」と呼ぶ）を特定する。ロボット行動を特定する情報であるロボット行動ベクトル３４２は、記憶装置３００に格納される。 When an action by the person who is to be the robot interaction target (for example, the customer) is detected in the training data, the robot action generation unit 440 identifies the paired action of the person who is to be the alternative action target (for example, the store owner); this action is called the “robot action”. The robot action vector 342, which is information specifying the robot action, is stored in the storage device 300.

予測器学習部４５０は、結合状態ベクトルを入力とし、ロボット行動ベクトルを出力とするように、の機械学習により予測器を生成する。生成された予測器を特定するためのパラメータ等の情報は、予測器特定情報３５０として、記憶装置３００に格納される。 The predictor learning unit 450 generates a predictor by machine learning so that the combined state vector is an input and the robot behavior vector is an output. Information such as parameters for specifying the generated predictor is stored in the storage device 300 as predictor specifying information 350.
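
The text above fixes only the predictor's interface: a combined state vector goes in and a robot action comes out. The sketch below uses a 1-nearest-neighbor classifier purely as a stand-in to show that interface; the choice of classifier, the class name, and the use of integer action IDs are assumptions and should not be read as the method used in the patent.

```python
class NearestNeighborPredictor:
    """Stand-in predictor: returns the action paired with the closest stored state."""

    def __init__(self):
        self._examples = []

    def fit(self, joint_states, action_ids):
        # Store (combined state vector, robot action ID) training pairs.
        self._examples = list(zip(joint_states, action_ids))
        return self

    def predict(self, joint_state):
        def squared_distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        _, action = min(
            self._examples,
            key=lambda example: squared_distance(example[0], joint_state),
        )
        return action
```

Whatever model is used, the stored "predictor specifying information" would correspond to its learned parameters; here that is simply the list of training pairs.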

次に、オンライン処理過程においては、オンライン処理部５００の動作要素抽出部５１０は、データ収集モジュール２００からの時系列データを基に、代替行動対象として行動するロボットと、ロボットインタラクション対象として行動する人について、発話および運動の時系列データから、発話要素、運動要素、インタラクション要素を抽出する。 Next, in the online processing process, the motion element extraction unit 510 of the online processing unit 500 extracts utterance elements, motion elements, and interaction elements from the time-series utterance and motion data supplied by the data collection module 200, for the robot acting as the alternative action target and for the person acting as the robot interaction target.

ロボットインタラクション対象として行動する人の所定の行動が検出されることに応じて、現在のロボットとロボットインタラクション対象との状態を結合状態ベクトルとして表現された入力を受けて、予測器特定情報３５０により特定される予測器５２０は、ロボット行動を予測する。 In response to detection of a predetermined action by the person acting as the robot interaction target, the predictor 520, which is specified by the predictor specifying information 350, receives the current states of the robot and the robot interaction target expressed as a combined state vector and predicts the robot action.

ロボット行動生成部５３０は、予測器５２０から出力されるロボット行動と、現在のロボットの状態とを比較することにより、ロボット行動コマンドを生成して、ロボット１０００に出力する。 The robot behavior generation unit 530 generates a robot behavior command by comparing the robot behavior output from the predictor 520 with the current robot state, and outputs the robot behavior command to the robot 1000.
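
A minimal sketch of the comparison step follows, assuming the predicted action carries an optional utterance cluster ID and a target stop position, and that the robot state is just its current stop position. The command tuples `("move_to", ...)` and `("say", ...)` and all type names are hypothetical; the actual command format sent to the robot 1000 is not given in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobotAction:
    utterance_id: Optional[int]  # index of an utterance cluster, or None for silence
    target_location: int         # index of a clustered stop position


@dataclass
class RobotState:
    location: int                # stop position the robot currently occupies


def generate_commands(predicted, state, utterances):
    """Emit only the commands needed to get from the current state to the predicted action."""
    commands = []
    if predicted.target_location != state.location:
        commands.append(("move_to", predicted.target_location))
    if predicted.utterance_id is not None:
        commands.append(("say", utterances[predicted.utterance_id]))
    return commands
```

Comparing against the current state avoids issuing a redundant move when the robot is already where the predicted action places it.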

なお、オンライン処理部５００により制御される対象は、発話と行動の双方が可能なロボットである必要は必ずしもなく、たとえば、発話のみを行うような機器であってもよい。 Note that the target controlled by the online processing unit 500 is not necessarily a robot that can both speak and act, and may be a device that only speaks, for example.

以下、学習データ収集過程、ロジック学習過程、オンライン処理過程について、さらに詳しく説明する。
［学習データ収集過程］
以下では、例として、図１に示した空間的な環境下で、店主と顧客との行動について収集した訓練データにより、ロボットと顧客との間のインタラクションを可能とする構成について説明する。
（センサー環境）
ロボットが人とのインタラクションに対する学習用の人・人間のインタラクション・データを集める環境（「学習データ収集環境」）のために、人の運動および発話が、システム１０よりキャプチャされる。ここでは、学習データ収集環境において、人・人間のインタラクションを実行する主体を「参加者」と呼ぶことにする。 Hereinafter, the learning data collection process, the logic learning process, and the online processing process will be described in more detail.
[Learning data collection process]
Hereinafter, as an example, a configuration that enables interaction between a robot and a customer based on training data collected on the behavior of the store owner and the customer in the spatial environment illustrated in FIG. 1 will be described.
(Sensor environment)
In the environment in which human-human interaction data is gathered so that the robot can learn how to interact with people (the “learning data collection environment”), people's movements and utterances are captured by the system 10. Here, a person who carries out human-human interaction in the learning data collection environment is called a “participant”.

そして、学習データ収集環境においては、上述のとおり、参加者の位置および姿勢をトラッキングするための、２Ｄレンジセンサ３０．１〜３０．ｎおよび３Ｄレンジセンサ３２．１〜３２．ｍと、各参加者が保持して自身の発話の音声認識を実行するためのセンサとしてのスマートフォン３４．１〜３４．ｐを含むセンサネットワークを備えたデータ収集環境が準備される。 As described above, the learning data collection environment is prepared with a sensor network that includes the 2D range sensors 30.1 to 30.n and the 3D range sensors 32.1 to 32.m for tracking the positions and postures of the participants, and the smartphones 34.1 to 34.p, which the participants carry as sensors for performing speech recognition of their own utterances.

位置トラッキングシステムは、列に整列した、１６個の天井にマウントされた３Ｄレンジセンサが使用される。このようなレンジセンサ（測距センサ）としては、特に限定されないが、たとえば、マイクロソフトＫｉｎｅｃｔ（登録商標）センサーを使用することができる。 The position tracking system uses 16 ceiling mounted 3D range sensors aligned in a row. Such a range sensor (ranging sensor) is not particularly limited. For example, a Microsoft Kinect (registered trademark) sensor can be used.

位置トラッキングのためには、パーティクル・フィルタ技術が室内の各人の位置および体の向きを推測するために使用される。 For position tracking, particle filter technology is used to infer the position and body orientation of each person in the room.

各参加者の発話のデータを収集する方法としては、たとえば、マイクロホンアレイ技術を用いて、各人の発話を分離して、受動的に収集できるシステムを用いることも可能である。しかし、以下の説明では、バックグラウンドの騒音が環境に存在する場合にロバストに発話データを収集するために、各参加者が携帯電話（スマートフォン）を保持する構成とする。 As a method for collecting utterance data of each participant, for example, a system capable of passively collecting utterances of each person using microphone array technology can be used. However, in the following description, each participant holds a mobile phone (smart phone) in order to collect utterance data robustly when background noise exists in the environment.

すなわち、スマートフォン３４．１〜３４．ｐにインストールしたアプリケーションソフトウェアにより、ハンズフリーのヘッドセットから発話を直接キャプチャし、かつ、無線ＬＡＮによってサーバー（図示せず）へ音声特徴データを送って、発話を認識してテキストデータに変換する音声認識ＡＰＩを使用する。 Specifically, application software installed on the smartphones 34.1 to 34.p captures utterances directly from a hands-free headset and sends speech feature data over a wireless LAN to a server (not shown), using a speech recognition API that recognizes the utterance and converts it to text data.

ユーザーはハンズフリーのヘッドセットを着用していて、自身の発話の始まりおよび終わりを示すためにモバイルのスクリーンのどこかに触れる。したがって、視覚的な注意は必要なく、アイコンタクトを壊さずに、自然な顔合わせのインタラクションを行なうことを可能にする。
（データ駆動型のインタラクションロジックの生成の概要）
以下に説明する通り、サービス・ロボットが、人・人間のインタラクションからキャプチャされたデータを使用して、人間行動を再現することを可能にするために、システム１０が、完全にデータ駆動型で、インタラクション・ロジックを生成する。 The user is wearing a hands-free headset and touches somewhere on the mobile screen to indicate the beginning and end of his speech. Therefore, visual attention is not necessary, and natural face-to-face interaction can be performed without breaking the eye contact.
(Overview of generating data-driven interaction logic)
As described below, to enable service robots to reproduce human behavior using data captured from human-human interactions, system 10 is fully data driven, Generate interaction logic.

システム１０は、以下のように、いくつかの抽象化によってインタラクション・ロジックを表現する。 The system 10 represents the interaction logic with several abstractions as follows.

なお、ここで、「インタラクション・ロジック」とは、ロボットと人とが、所定の環境下で、インタラクションをする場合に、ロボット側の制御において使用するデータであり、その所定の環境下で起こり得る様々なインタラクションを、以下のようにしてデータベース化したものである。そして、システム１０は、このようなデータベースを、現実に人・人間で行われるインタラクションの観測結果に基づいて、自動的に生成する。 Here, “interaction logic” is data used for control on the robot side when a robot and a person interact in a given environment; it is a database, built as described below, of the various interactions that can occur in that environment. The system 10 automatically generates such a database based on observations of interactions actually carried out between people.

図６は、インタラクション・ロジックを自動生成する手順を説明するためのフローチャートである。 FIG. 6 is a flowchart for explaining the procedure for automatically generating the interaction logic.

１）まず、システム１０は、学習データ収集過程において観測され収集された時系列の訓練データに対して、抽象化処理を実行する（Ｓ１０２）。ここでは、全ての訓練データに対するバッチ処理により、代表発話、停止位置、代表軌道などの要素が特定される。 1) First, the system 10 performs an abstraction process on time-series training data observed and collected in the learning data collection process (S102). Here, elements such as a representative utterance, a stop position, and a representative trajectory are specified by batch processing for all training data.

２）次に、システム１０は、参加者の行動を離散化する（Ｓ１０４）。 2) Next, the system 10 discretizes the behavior of the participant (S104).

２−１）顧客としての参加者（以下、単に「顧客」と呼ぶ）の「発話」データは、潜在意味解析（ＬＳＡ：Latent Semantic Analysis）および他の文章処理技術を使用して、ベクトル化される。 2-1) The “utterance” data of the participant acting as the customer (hereinafter simply the “customer”) is vectorized using latent semantic analysis (LSA) and other text processing techniques.

２−２）店主としての参加者（以下、単に「店主」と呼ぶ）の「発話」も、同様の手順によりベクトル化され、さらに、語彙的に同様の離散的な発話を表現する発話クラスタへ分類される。 2-2) The “utterances” of the participant acting as the store owner (hereinafter simply the “store owner”) are vectorized by the same procedure and are further classified into utterance clusters, each representing lexically similar discrete utterances.

２−３）顧客と店主の軌道は、停止セグメントと移動セグメントへ分けられ、その後、典型的な停止位置および典型的な運動軌道を識別するためにクラスタに分けられる。 2-3) The customer and shopkeeper trajectories are divided into stop segments and moving segments, and then divided into clusters to identify typical stop positions and typical motion trajectories.

２−４）インタラクション状態（空間配置）は、他のヒューマンロボットインタラクションの研究および近接学研究から得られた、１組の２人の空間配置に基づいた、顧客と店主の相対的位置のモデルに基づいて定義される。 2-4) The interaction state (spatial arrangement) is defined based on models of the relative positions of the customer and the store owner, which rest on a set of two-person spatial formations drawn from other human-robot interaction research and from proxemics research.

２−５）その後、顧客か店主の発話および／または移動で構成される、離散的な行動を識別するために訓練データを分析し、観察された顧客行動の入力に引き続いて起こるような適切な店主行動出力を予測するように、予測器（機械学習分類器）を訓練する。 2-5) The training data is then analyzed to identify discrete behavior, consisting of customer or shopkeeper utterances and / or movements, and appropriate as occurs following the input of observed customer behavior Train predictors (machine learning classifiers) to predict storekeeper behavior output.

３）続いて、ロジック学習過程において、システム１０は、予測器として動作する分類器を機械学習により訓練する（Ｓ１０６）。分類器への入力は、訓練データを処理して生成される結合状態ベクトルである。ここで、結合状態ベクトルは、顧客の発話ベクトル、顧客および店主に対する空間的な状態、および顧客と店主の現在のインタラクション状態から成るベクトルである。 3) Subsequently, in the logic learning process, the system 10 trains a classifier operating as a predictor by machine learning (S106). The input to the classifier is a combined state vector generated by processing the training data. Here, the combined state vector is a vector composed of the customer's utterance vector, the spatial state with respect to the customer and the store owner, and the current interaction state between the customer and the store owner.

なお、ここで、「ベクトル」とは、一定の関連性のあるデータを１まとまりのグループとして表現したデータ群を、コンピュータの処理に適した形式とした「１次元の配列として表現されるデータ構造」のことをいうものとする。また、１次元の配列のデータ構造に等価な構成を有するデータ群であれば、「ベクトル」と呼ぶことにする。 Here, the “vector” is a data structure expressed as a “one-dimensional array” in which a data group in which certain related data is expressed as one group is in a format suitable for computer processing. ". A data group having a configuration equivalent to the data structure of a one-dimensional array will be referred to as a “vector”.

４）オンライン処理過程において、予測器からの出力に基づいて、ロボットの行動を制御するコマンドが、システム１０からロボットに出力される。予測器からの出力は、後述するように、発話クラスタ、空間配置、状態ターゲットで構成される、離散的な店主行為に対応するロボット行動である。 4) In the online processing process, a command for controlling the behavior of the robot is output from the system 10 to the robot based on the output from the predictor. As will be described later, the output from the predictor is a robot action corresponding to a discrete shopkeeper action composed of an utterance cluster, a spatial arrangement, and a state target.
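
Step 2-1 above turns each utterance into a vector. As a simplified illustration, the sketch below builds plain term-frequency vectors and compares them by cosine similarity. This is deliberately not full LSA, which would additionally apply a truncated singular value decomposition to the term-document matrix; the tokenizer and function names are assumptions for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(utterances):
    """Return the sorted list of distinct tokens across all utterances."""
    return sorted({token for u in utterances for token in tokenize(u)})


def vectorize(text, vocabulary):
    """Return the term-frequency vector of `text` over `vocabulary`."""
    counts = Counter(tokenize(text))
    return [float(counts[term]) for term in vocabulary]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Utterances that share most of their words score close to 1.0 under `cosine`, which is the kind of lexical similarity that the clustering in step 2-2 relies on.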

なお、ここでは、店主と顧客は、仮想的に、実験参加者が役割を演じることで、学習データ収集環境を構成するものとして以下説明する。ただし、たとえば、センサ群を現実の店舗に配置することで、学習データ収集環境を構築することも可能である。
（学習データ収集環境でのインタラクションの具体例）
図１に示したように、学習データ収集環境の一例として、カメラ店の設定における買い物をするシナリオを選び、参加者のうちの一人に店主としてロールプレイをすることを依頼し、参加者の他の一人には顧客としてロールプレイすることが依頼された。 In the following description, it is assumed that the store owner and the customer configure the learning data collection environment by virtually playing the role of the experiment participant. However, for example, it is possible to construct a learning data collection environment by arranging sensor groups in an actual store.
(Specific examples of interaction in the learning data collection environment)
As shown in FIG. 1, as an example of the learning data collection environment, a scenario for shopping in the setting of a camera store is selected, and one of the participants is asked to perform role play as a store owner. One of them was asked to role-play as a customer.

１組のセットの訓練するインタラクションを作成するために、図１に示される店舗スペースで、たとえば、３つの製品展示をセット・アップし、異なるディジタル・カメラ・モデルを表現した。 To create a set of training interactions, for example, in the store space shown in FIG. 1, three product exhibits were set up to represent different digital camera models.

各製品展示には、それぞれ「光学ズーム」あるいは「メガピクセル」のようなカメラに関連する特徴のショートリストを記載した特徴シートがつけらている。 Each product display has a feature sheet with a short list of features related to the camera, such as “optical zoom” or “megapixel”.

さらに、サービスカウンターをセット・アップし、店主ｐ１は、各インタラクションの最初にこの位置に立つように命じられた。 In addition, a service counter was set up and store owner p1 was ordered to stand in this position at the beginning of each interaction.

ここでは、例として、参加者は、英語で互いに対話するものとして説明する。 Here, as an example, participants will be described as interacting with each other in English.

以下の説明では、流暢な英語の４人の話者が店主としてロールプレーする状況であるものとする。 In the following description, it is assumed that four fluent English speakers are role-playing as shopkeepers.

また、７人の流暢な英語の話者を含む１０人の参加者が、顧客ｐ２の役割を果たすものとする。 Also, 10 participants including 7 fluent English speakers shall play the role of customer p2.

顧客ｐ２は、それぞれ合計１７８の試行に対して、１０〜２０のインタラクションに参加した。 Each customer participant took part in 10 to 20 interactions, for a total of 178 trials.

各試行では、顧客ｐ２は次のシナリオのうちの１つの中でロールプレイをするように指示された:
（１）特定の特徴を持ったカメラを探している顧客（４つの試行）、
（２）多数のカメラに興味を持っている好奇心の強い顧客（４つの試行）あるいは
（３）一人で見て回るのが好きなウィンドウショッピングの顧客（２つの試行）。 In each trial, customer p2 was instructed to play role-playing in one of the following scenarios:
(1) Customers looking for a camera with specific characteristics (4 trials),
(2) A curious customer who is interested in many cameras (4 trials), or
(3) A window-shopping customer who likes to look around alone (2 trials).

参加者が自然に特定のタイプの顧客としてのロールプレイするのを助けるために、各試行で、顧客に異なる特徴をもたせるようにした。 To help participants play role-playing as a specific type of customer naturally, each trial was given different characteristics to the customer.

選ばれたシナリオのことが店主には知らされず、顧客が商品を見て回るのはそのままにし、顧客からのどんな質問にも答えて、適切なときに、丁寧に製品を紹介するように指示された。 The store owner was not told which scenario had been chosen, and was instructed to let the customer browse, to answer any questions from the customer, and to introduce products politely when appropriate.

実験の前に、参加者は、アンドロイドの電話機を使用するように訓練され、尋ねるべきカメラの特徴のリストを与えられた。 Prior to the experiment, participants were trained to use an Android phone and given a list of camera features to ask.

店主は、各カメラの１セットの特徴仕様を含んでいる仕様説明書を与えられた。 The shopkeeper was given a specification that included a set of feature specifications for each camera.

練習試行は、スマートフォンの使用に慣れた参加者を助け、またインタラクション・シナリオ間の違いを示せるように設計された。 The practice trials were designed to help participants become familiar with using the smartphones and to show the differences between the interaction scenarios.

データ収集の目標は、反復可能なインタラクションをキャプチャすることであり、カメラに関する情報の提供に焦点を合せるようにシナリオの範囲を制限した。 The goal of data collection was to capture repeatable interactions, limiting the scope of the scenario to focus on providing information about the camera.

この理由で、カメラの価格交渉（例えば、「まけてください。」）をするような他のトピックを回避するように、インタラクションを単純にしておいてくれるように参加者に依頼した。 For this reason, participants were asked to keep the interactions simple and to avoid other topics such as negotiating the price of a camera (for example, “Can you give me a discount?”).

更に、シナリオに存在しなかった新しい情報を作り出さないことを参加者に思い起こさせることが必要であった。 Furthermore, it was necessary to remind the participants not to create new information that did not exist in the scenario.

例えば、店主役の参加者が、シナリオ中で定義されていない「どんな種類の保証をしてくれるのか。」と尋ねられたならば、彼らは答えをアドリブで作らなければならないことになる。 For example, if a store owner participant were asked “What kind of warranty do you offer?”, which is not defined in the scenario, they would have to improvise an answer.

これらの即興での応対は、時間上の不整合のために学習するのには役立たない（事前の試行では、ある店主参加者が、この店は１年の保証をいたしますといったが、後の試行では、５年の保証であると言ったりした）。 Such improvised responses are not useful for learning because they are inconsistent over time (for example, in an earlier trial one store owner participant said the store offered a one-year warranty, but in a later trial said it was a five-year warranty).

（人・人間のインタラクション）
定義されたシナリオ内では、参加者は、自然な会話形言語の使用し、自由形式で対話した。人々の用語や語法の合理的な多様性の変化が観察された。 (Human-human interaction)
Within the defined scenarios, participants interacted freely using natural conversational language. A reasonable degree of variation in people's vocabulary and phrasing was observed.

たとえば、このような多様性は、同じ参加者による以下の２つの試行から観測される。 For example, such variation can be seen in the following two trials by the same participant.
（１）大きなメモリを備えたカメラを探す顧客、および (1) A customer looking for a camera with a large memory, and
（２）バッテリの持ちがよいカメラに興味を持っている好奇心の強い顧客。 (2) A curious customer interested in a camera with good battery life.
［システム１０の処理の詳細］ [Details of processing of system 10]
以下では、システム１０が行う処理をさらに詳しく説明する。 The processing performed by the system 10 is described in more detail below.
（抽象化） (Abstraction)
人・人間のインタラクションから学習するために、データ駆動型のアプローチを使用するときの課題の一つは、仮に、簡単のために、視線、身振りおよび顔の表情のような社会的行動についての考慮を省略し、単に発話および移動のみを考慮するとした場合でさえ、対象となる人間行動は、非常に高い次元の特徴空間を占めるということである。 One of the challenges in using a data-driven approach to learn from human-human interaction is that the target human behavior occupies a very high-dimensional feature space, even if, for simplicity, social behaviors such as gaze, gestures and facial expressions are left out and only speech and movement are considered.

しかしながら、人間行動の多様性は、この高い次元の空間内の小さな領域を占めるに過ぎない。すなわち、人々は、通常予測可能な方法で行動し、共通パターンに従っている。 However, the variation actually found in human behavior occupies only a small region of this high-dimensional space. That is, people usually act in predictable ways and follow common patterns.

ここで、学習の問題の次元の数を縮小し、かつセンサー・ノイズの影響を縮小するために、これらのパターンをキャプチャするために、以下に説明するような「抽象化技術」を導入する。 Here, in order to reduce the number of dimensions of the learning problem and reduce the influence of sensor noise, an “abstraction technique” as described below is introduced to capture these patterns.

第１に、訓練データ中の典型的な行動（アクション）の組を識別するために教師なしクラスタリングを行なう。 First, unsupervised clustering is performed to identify the set of typical behaviors (actions) in the training data.

第２に、クラスタリングは、音声認識に関連する大きなノイズに対処するために会話データに対して実行されるとともに、また、トラッキングシステムによって観察された運動軌道が、環境下で典型的な停止位置および運動軌跡を識別するために、実行される。 Second, clustering is performed on the speech data to cope with the large noise associated with speech recognition, and it is also performed on the motion trajectories observed by the tracking system to identify the typical stop positions and motion trajectories in the environment.

その上で、静的な複数の「インタラクション状態」が時系列として並んだものとして、店主と顧客との各インタラクションをモデル化する。ここで、「インタラクション状態」とは、「向かいあって話す」、「製品を示す」というような個別の店主と顧客との空間配置によって認識可能である対話状態において、いくつかの発話のやり取りの間続くものである。 Each interaction between the shopkeeper and the customer is then modeled as a time series of static "interaction states". Here, an "interaction state" is a conversational state that can be recognized from the spatial arrangement of the shopkeeper and the customer, such as "talking face to face" or "presenting a product", and that lasts over several exchanges of utterances.

インタラクション状態のモデリングは、安定した方法で移動を生成し、詳細なレベルでロボット近接行動を指定し、より多くのロバストな行動予測にコンテキストを提供するのを助ける。 Modeling interaction states helps generate movement in a stable way, specify the robot's proxemic behavior at a detailed level, and provide context for more robust behavior prediction.
１）発話クラスタリング 1) Utterance clustering
図７は、参加者の発話を自動的にクラスタリングする処理を実行する構成を説明するための機能ブロック図である。 FIG. 7 is a functional block diagram illustrating a configuration that performs the processing of automatically clustering the participants' utterances.

多くの多様性は、訓練データ中でキャプチャされた発話の中にあり、たとえば、「価格はいくらですか」に対して「値段はいくらですか」のような相互に代替可能なフレーズであったり、「このカメラ（this camera）の値段はいくら？」ではなく、「その詐欺師（the scammer）の値段はいくら？」というような音声認識エラーを同様に含む。 The utterances captured in the training data vary widely. They include interchangeable phrasings, such as "What is the price?" versus "How much does it cost?", as well as speech recognition errors, such as "How much is the scammer?" instead of "How much is this camera?".

音声処理のクラスタリングとは、意味論的な意味を有する句の間の類似点を保持するような方法で、これらの発話を表現することである。 Clustering in the speech processing here means representing these utterances in a way that preserves the semantic similarity between phrases.

発話がキャプチャされると、音声データファイル３０２として記憶装置３００に格納され、音声認識部２２０２により、音声認識が行なわれる。上述したとおり、音声認識部２２０２の処理は、外部のサーバ上で実行されてもよい。 When an utterance is captured, it is stored in the storage device 300 as a speech data file 302, and the speech recognition unit 2202 performs speech recognition on it. As described above, the processing of the speech recognition unit 2202 may be executed on an external server.

次に、キーワード抽出部２２０４は、キーワードを抽出し、また潜在意味解析部２２０６は、発話結果およびキーワードのベクトル化された表現を作成し、発話ベクトル３０４として記憶装置３００に格納する。 Next, the keyword extraction unit 2204 extracts keywords, and the latent semantic analysis unit 2206 creates a vectorized representation of the recognition result and the keywords and stores it in the storage device 300 as an utterance vector 304.

発話のベクトル化の後に、クラスタ化部２２０８は、同様の発話のクラスタへグループ化するために教師なしクラスタリングを実行し、発話をクラスタに分類するための情報である発話クラスタ情報３０６を記憶装置３００に格納する。そして、代表発話抽出部２２１０は、合成音声出力の内容として使用されるために、典型的な発話を各クラスタから選択する。 After the utterances are vectorized, the clustering unit 2208 performs unsupervised clustering to group similar utterances into clusters, and stores utterance cluster information 306, which is the information used to classify utterances into clusters, in the storage device 300. The representative utterance extraction unit 2210 then selects a typical utterance from each cluster to be used as the content of synthesized speech output.
（音声認識） (Speech recognition)
ここで、音声認識部２２０２が実行する自動音声認識については、既存の発話ＡＰＩを使用することができる。 An existing speech API can be used for the automatic speech recognition performed by the speech recognition unit 2202.

具体的に実施した結果では、たとえば、トレーニング・インタラクションからの４００個の発話の解析は、53%は正確に、そして、30%は、たとえば”can it shoot video”を”can it should video”とするというような些細なエラーで認識されたことを示し、１７％は、例えば、”is the lens include North Florida.”というように、完全に無意味な結果であった。 In one concrete result, an analysis of 400 utterances from the training interactions showed that 53% were recognized correctly, 30% were recognized with minor errors, such as "can it shoot video" becoming "can it should video", and 17% were completely meaningless results, such as "is the lens include North Florida."
（キーワード抽出） (Keyword extraction)
「私は大きなメモリー容量のカメラを捜している。」また、「私は大きなLCDサイズのカメラを捜している」というような句は、語彙の類似性にもかかわらず異なる意味を持っている。 Phrases such as "I'm looking for a camera with a large memory capacity" and "I'm looking for a camera with a large LCD size" have different meanings despite their lexical similarity.

キーワード抽出については、たとえば、以下の文献に開示されているような周知の技術を利用することが可能である。 For keyword extraction, for example, a well-known technique as disclosed in the following documents can be used.

文献：特開２０１５−２００９６２号 Document: Japanese Patent Application Laid-Open No. 2015-200962
あるいは、クラウドサービスとして公開されている、句の中のキーワードをキャプチャするためAlchemyAPIなどを使用することもできる。このＡＰＩは、ディープラーニングに基づいたテキスト解析のためのクラウドに基づいたサービスである。 Alternatively, a service published in the cloud, such as AlchemyAPI, can be used to capture the keywords in a phrase. This API is a cloud-based service for text analysis based on deep learning.
（潜在意味解析） (Latent semantic analysis)
潜在意味解析部２２０６は、潜在意味解析(LSA)を使用して、各発話を表わすためのベクトルを作成する。ここで、ＬＳＡは、テキストマイニング用途において、ドキュメントの類似性の分類のために一般的に使用される技術である。 The latent semantic analysis unit 2206 uses latent semantic analysis (LSA) to create a vector representing each utterance. LSA is a technique commonly used in text mining applications to classify documents by similarity.

これを達成するために、文章に対する処理では、たとえば、以下の文献に開示されているような標準的な数ステップを行なう構成とすることができる。 In order to achieve this, the processing for the sentence can be configured to perform several standard steps as disclosed in the following document, for example.

文献 (Document): M. F. Porter, "An algorithm for suffix stripping," Program: electronic library and information systems, vol. 14, pp. 130-137, 1980.
潜在意味解析部２２０６は、各発話ごとに返されたキーワードのリストに対して、ＬＳＡの特徴ベクトル（発話ベクトル３０４）を生成する。 The latent semantic analysis unit 2206 generates an LSA feature vector (utterance vector 304) from the list of keywords returned for each utterance.
（店主発話のクラスタリング） (Clustering of shopkeeper utterances)
クラスタ化処理部２２０８は、観察された店主発話をユニークな発話エレメントを表わすクラスタ（発話クラスタ３０６）へグループ化する。このような処理のためには、たとえば、以下の文献に開示されているダイナミックな階層的クラスタ分割を使用することができる。 The clustering unit 2208 groups the observed shopkeeper utterances into clusters (utterance clusters 306), each representing a unique utterance element. For this processing, for example, the dynamic hierarchical clustering disclosed in the following document can be used.

文献 (Document): P. Langfelder, B. Zhang, and S. Horvath, "Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R," Bioinformatics, vol. 24, pp. 719-720, 2008.
図８は、得られたクラスタのうちの１つのクラスタについての発話の分布を説明する図である。 FIG. 8 is a diagram illustrating the distribution of utterances for one of the obtained clusters.

図８に示すように、実験では、１６６のクラスタが得られ、たとえば、交換レンズに関する発話のクラスタ中には、以下のようなものがある。 As shown in FIG. 8, 166 clusters were obtained in the experiment. For example, the cluster of utterances about interchangeable lenses includes the following.

「このカメラでは２８種のレンズを使うことができます。ですから、おそらく、お客様が撮ろうとされるどんなタイプの画像も撮影することができます。（well you can use 28 lenses with this camera so you probably can shoot any type of images you are looking for）」 "You can use 28 different lenses with this camera, so you can probably shoot any type of image you are looking for."
（典型的な発話抽出） (Typical utterance extraction)
典型発話抽出部２２１０は、各店主の発話クラスタから、１つの発話を、典型発話として、行動生成で使用するために選択する。 The typical utterance extraction unit 2210 selects one utterance from each shopkeeper utterance cluster as the typical utterance to be used in behavior generation.

ここで、単純に、クラスタの重心に近い発話を単に選ぶのでは、多くの場合問題がある。すなわち、しばしば、図８に示されるように、このベクトルはクラスタ内で他の発話に実際上、語彙的に類似しておらず、多くのエラーを含んでいる場合がある。 Simply choosing the utterance closest to the cluster centroid is often problematic. As shown in FIG. 8, that vector is often not actually lexically similar to the other utterances in the cluster and may contain many errors.

図９は、典型発話抽出部２２１０の実行する処理を説明するための概念図である。 FIG. 9 is a conceptual diagram for explaining the processing executed by the typical utterance extraction unit 2210.

典型発話抽出部２２１０は、クラスタ内で最も多くの他の発話と語彙上の類似度が最高レベルである発話を選択する。このような選択により、この発話は、ランダムな誤りを含む可能性が最も小さいと考えられる。 The typical utterance extraction unit 2210 selects the utterance that has the highest level of lexical similarity to the largest number of other utterances in the cluster. An utterance selected in this way is considered the least likely to contain random errors.

類似度に関しては、各発話に対して、同じクラスタ内の１発話ごとのその用語頻度ベクトルのコサイン類似度を計算し、これらの類似度を合計する。 To measure similarity, for each utterance, the cosine similarity between its term frequency vector and that of every other utterance in the same cluster is calculated, and these similarities are summed.

ここで、「コサイン類似度」とは、文書についての「ベクトル空間モデル」において、文書同士を比較する際に用いられる類似度計算手法である。 Here, “cosine similarity” is a similarity calculation method used when comparing documents in a “vector space model” for documents.

たとえば、２つの文書がベクトル空間内で、ベクトルｐとベクトルｑとして表現される場合は、コサイン類似度は、以下の数式で定義される。なお、ベクトルｘという場合、ｘの頭部に→を付して表し、記号・は、内積を表す。 For example, when two documents are represented in the vector space as vectors p and q, the cosine similarity is defined by the following equation. A vector x is written with an arrow (→) over x, and the symbol · denotes the inner product.
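In its standard form, and using the notation just introduced, the cosine similarity can be written as below. This is the usual textbook definition; the norm notation |·| for the length of a vector is an assumption introduced here.

```latex
\cos(\vec{p}, \vec{q}) = \frac{\vec{p} \cdot \vec{q}}{\lvert \vec{p} \rvert \, \lvert \vec{q} \rvert}
```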

そして、周りの文（発話）に対して最も高い積算類似度を有する発話が典型的な発話として選択される。 Then, an utterance having the highest accumulated similarity with respect to surrounding sentences (utterances) is selected as a typical utterance.
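The selection rule described above can be sketched as follows. This is only a minimal illustration: it assumes whitespace tokenization and raw term counts as the term frequency vector, and the function names and the sample cluster are hypothetical rather than taken from the embodiment.

```python
import math
from collections import Counter


def term_frequency(utterance):
    """Return a bag-of-words term-frequency vector for one utterance."""
    return Counter(utterance.lower().split())


def cosine_similarity(p, q):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * q[term] for term, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def select_typical_utterance(cluster):
    """Pick the utterance with the highest summed similarity to the others."""
    vectors = [term_frequency(u) for u in cluster]
    best_index, best_score = 0, float("-inf")
    for i, vec in enumerate(vectors):
        # Sum the similarity to every other utterance in the same cluster.
        score = sum(cosine_similarity(vec, other)
                    for j, other in enumerate(vectors) if j != i)
        if score > best_score:
            best_index, best_score = i, score
    return cluster[best_index]


# Hypothetical cluster containing one speech recognition error.
cluster = [
    "how much is this camera",
    "how much is the scammer",
    "how much is this camera please",
    "what is the price of this camera",
]
typical = select_typical_utterance(cluster)
```

Because a misrecognized utterance shares fewer terms with its neighbours, its summed similarity tends to be lower, which is why this rule tends to avoid picking it.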

（運動クラスタリング） (Motion clustering)
以上の説明は、参加者の発話のクラスタリングと、各クラスタについての代表発話の選択について説明した。 The description above covered the clustering of participants' utterances and the selection of a representative utterance for each cluster.

以下では、参加者の空間的な運動についてのクラスタリングと、各クラスタについての代表運動について説明する。 The following describes the clustering of participants' spatial motion and the representative motion for each cluster.

なお、以下では、「行動」とは、「発話」と「運動」とを含むものとする。したがって、「代表行動」には、「代表発話」と「代表運動」とが含まれる。 In the following, "behavior" includes both "utterance" and "motion". Accordingly, "representative behavior" includes "representative utterance" and "representative motion".

図１０は、参加者の運動要素を離散化およびクラスタ化により抽象化するための運動抽象化処理部２３０の処理に対する機能ブロック図である。 FIG. 10 is a functional block diagram for processing of the motion abstraction processing unit 230 for abstracting the motion elements of participants by discretization and clustering.

参加者の運動要素を抽象化するために、以下の観点から処理を実行する。 In order to abstract the movement elements of the participants, processing is performed from the following viewpoints.

（１）参加者の運動の表現を離散化できるように対象空間（たとえば、店舗内）での共通の停止位置を識別すること。このような停止位置は、後述する「結合状態ベクトル」の中で使用される。 (1) Identifying a common stop position in the target space (eg, in a store) so that the expression of the participant's motion can be discretized. Such a stop position is used in a “combined state vector” described later.

（２）参加者の運動目標位置を評価することができるように典型的な軌道形を識別すること。 (2) Identify typical trajectory shapes so that participants' motion target positions can be evaluated.

そこで、システム１０の運動抽象化ユニット２３０は、運動トラッキング部２０４からの運動データを、運動データファイル３１１０として、一旦、記憶装置３００に格納し、運動ファイル３１１０中において、センサにより検知された参加者の運動のデータに存在する停止位置および運動軌道の全体的な組を特徴づけるために、動きデータを分析しクラスタに分ける。 The motion abstraction unit 230 of the system 10 therefore first stores the motion data from the motion tracking unit 204 in the storage device 300 as a motion data file 3110. It then analyzes and clusters the motion data in order to characterize the overall set of stop positions and motion trajectories present in the participants' motion data, as detected by the sensors and recorded in the motion file 3110.

軌道離散化処理部２３０２は、運動データ・セットにおける軌道の分布を分析し、一定のしきい値速度を設定して、軌道をセグメントに分ける。ここで、たとえば、しきい速度としては、０．５５ｍ／ｓを選択することができる。軌道離散化処理部２３０２は、運動ファイル３１１０中のデータ中の観察されたすべての運動の軌道を「停止セグメント」および「移動セグメント」に分類し、停止セグメントデータ３１１２および移動セグメントデータ３１１４として、記憶装置３００に格納する。 The trajectory discretization unit 2302 analyzes the distribution of trajectories in the motion data set, sets a fixed speed threshold, and divides the trajectories into segments. For example, 0.55 m/s can be chosen as the threshold speed. The trajectory discretization unit 2302 classifies all observed motion trajectories in the motion file 3110 into "stop segments" and "moving segments" and stores them in the storage device 300 as stop segment data 3112 and moving segment data 3114.
（停止位置） (Stop positions)
続いて、空間クラスタリング部２３０４は、停止セグメントをクラスタリングして、各クラスタごとに、停止位置を抽出し、停止位置データ３１２として、記憶装置３００に格納する。ここでは、各クラスタの重心を「停止位置」として定義する。 The spatial clustering unit 2304 then clusters the stop segments, extracts a stop position for each cluster, and stores it in the storage device 300 as stop position data 312. Here, the centroid of each cluster is defined as the "stop position".

特に限定されないが、たとえば、停止セグメントは、k-ミーンズクラスタリング法により空間的にクラスタに分けられ、例示した具体例では、顧客に対しては６か所の、店主に対しては５か所の、典型的な停止位置を特定された。 Although not limited to this, the stop segments can, for example, be spatially clustered with the k-means method. In the concrete example, six typical stop positions were identified for the customer and five for the shopkeeper.
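The thresholding step that precedes this clustering can be sketched as follows. The sketch assumes that speed is estimated from consecutive position samples and that each sample is a (time, x, y) tuple; the function names and the sample track are hypothetical. The per-segment mean computed by `centroid` is only a convenient input point for a subsequent k-means step, not the stop position itself.

```python
import math

SPEED_THRESHOLD = 0.55  # m/s, the example threshold given in the text


def segment_trajectory(samples, threshold=SPEED_THRESHOLD):
    """Split time-ordered (t, x, y) samples into labelled segments.

    Returns a list of (label, samples) pairs, label being "stop" or "move".
    """
    segments = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur[0] - prev[0]
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = math.hypot(cur[1] - prev[1], cur[2] - prev[2]) / dt
        label = "move" if speed >= threshold else "stop"
        if segments and segments[-1][0] == label:
            segments[-1][1].append(cur)  # extend the current segment
        else:
            segments.append((label, [prev, cur]))  # start a new segment
    return segments


def centroid(points):
    """Mean (x, y) of a list of (t, x, y) samples."""
    n = len(points)
    return (sum(p[1] for p in points) / n, sum(p[2] for p in points) / n)


# Hypothetical track sampled at 1 Hz: pause, walk about 2 m, pause again.
track = [(0, 0.0, 0.0), (1, 0.1, 0.0), (2, 1.1, 0.0), (3, 2.1, 0.0), (4, 2.2, 0.0)]
segments = segment_trajectory(track)
labels = [label for label, _ in segments]
```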

図１１は、特定された「停止位置」を示す図である。 FIG. 11 is a diagram showing the specified “stop position”.

通常、これらの停止位置はカメラかサービスカウンターのような、対象空間の中で人が立ち止まる「重要な場所」に相当する。図１に示した例では、説明の簡単のために、図１に示される名前でこれらの位置を参照する。 These stop positions usually correspond to "important places" in the target space where people stop, such as a camera or the service counter. In the example shown in FIG. 1, these positions are referred to by the names shown in FIG. 1 for simplicity of explanation.
（軌道クラスタ） (Trajectory clusters)
また、軌道クラスタリング部２３０６は、訓練データの中にある典型的な運動軌道を識別するためにそれらのセグメントをクラスタに分ける。 The trajectory clustering unit 2306 also clusters the segments in order to identify the typical motion trajectories in the training data.

図１２、図１３、図１４は、軌道クラスタの例を示す図である。 12, 13, and 14 are diagrams showing examples of trajectory clusters.

図１２に示すように、クラスタ２４は、ブランドＣ（運動起点）からブランドＢ（運動目標位置）へ向かう軌道である。 As shown in FIG. 12, the cluster 24 is a trajectory from brand C (movement start point) to brand B (movement target position).

図１３に示すように、クラスタ２７は、ブランドＡ（運動起点）からブランドＣ（運動目標位置）へ向かう軌道である。 As shown in FIG. 13, the cluster 27 is a trajectory from the brand A (movement start point) to the brand C (movement target position).

図１４に示すように、クラスタ３５は、サービスカウンタ（運動起点）からブランドＣ（運動目標位置）へ向かう軌道である。 As shown in FIG. 14, the cluster 35 is a trajectory from the service counter (movement start point) to the brand C (movement target position).

軌道クラスタ部２３０６は、たとえば、動的時間伸縮法（ダイナミック時間ワーピング(ＤＴＷ：dynamic time warping)）を使用して軌道間で計算された距離に基づき、k-medoidクラスタリング法を使用して、店主と顧客に対してそれぞれ個別に、移動セグメントを５０個の軌道クラスタに分ける。 The trajectory clustering unit 2306 divides the moving segments into 50 trajectory clusters, separately for the shopkeeper and for the customer, using, for example, the k-medoids clustering method with distances between trajectories computed by dynamic time warping (DTW).

k-means法では、クラスタの中心（centroid）を代表(represented object)とするのに対し、k-medoids法では“medoid”をクラスタの代表とする。“medoid”とは、クラスタ内の点で、その点以外のクラスタ内の点との非類似度の総和が最小になる点である。k-means法は、クラスタを代表するセントロイドとクラスタ内のデータ点の距離の二乗の総和を最小にする。一方、k-medoids法では、medoidとデータ点の距離の総和（二乗の総和ではない）を最小化する。 In the k-means method, the centroid of a cluster is its representative object, whereas in the k-medoids method the "medoid" is. The medoid is the point in a cluster for which the sum of dissimilarities to all other points in the cluster is smallest. The k-means method minimizes the sum of squared distances between each cluster's centroid and the data points in the cluster, while the k-medoids method minimizes the sum of distances (not squared distances) between the medoid and the data points.
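The two ingredients named above, a DTW distance between trajectories and the choice of a medoid, can be sketched as follows. This is a textbook DTW with Euclidean point cost and a brute-force medoid search over one cluster, not the full k-medoids iteration; the function names and sample trajectories are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two (x, y) point sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def medoid(trajectories):
    """Return the trajectory with the smallest summed DTW distance to the rest."""
    return min(trajectories,
               key=lambda t: sum(dtw_distance(t, other) for other in trajectories))


# Hypothetical trajectories: t2 is t1 with one extra sample, t3 is offset by 1 m.
t1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
t2 = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (2.0, 0.0)]
t3 = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
representative = medoid([t1, t2, t3])
```

Unlike a centroid, the medoid is always one of the observed trajectories, so it can be replayed directly as a typical path.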

また、ここで、ＤＴＷについては、たとえば、以下の文献に開示がある。 Here, DTW is disclosed in, for example, the following documents.

文献：特開２０１６−１６２０７５号 Document: Japanese Patent Application Laid-Open No. 2016-162075
各クラスタのメドイド（medoid）軌道は、軌道クラスタ特徴抽出部２３０８により「典型的な軌道」（代表軌道データ３１４）として指定され、その典型的な軌道のスタートポイントおよびエンドポイントへの最も近い停止位置（運動起点データ３１４６および運動目標位置データ３１４８）が特定される。 The medoid trajectory of each cluster is designated as the "typical trajectory" (representative trajectory data 314) by the trajectory cluster feature extraction unit 2308, and the stop positions closest to the start point and end point of that typical trajectory are identified (motion start point data 3146 and motion target position data 3148).
（インタラクション状態） (Interaction states)
次に、人・人間のインタラクション状態の抽象化処理について説明する。 Next, the abstraction of human-human interaction states is described.

例として取り上げたような店舗内での人・人間のインタラクションでは、参加者が向かいあって話すか、カメラのところで一緒に立っているというような少数の静的な空間配置において、多くの時間が過ごされることを観察される。 In human-human interaction in a shop such as the one in this example, much of the time is observed to be spent in a small number of static spatial arrangements, such as the participants talking face to face or standing together at a camera.

空間的な行動のこの様相を分類するために、各インタラクションを、「向かいあって話す」あるいは「製品を示す」というような共通的な近接的配置により特徴づけられる、一連の「インタラクション状態」から成るものとしてモデル化する。 To categorize this aspect of spatial behavior, each interaction is modeled as a sequence of "interaction states", each characterized by a common proxemic arrangement such as "talking face to face" or "presenting a product".

顧客および／または店主の移動とは、これらのインタラクション状態間を移行するための手段として機能するものとみなすことができる。 Customer and / or storekeeper movement can be viewed as acting as a means for transitioning between these interaction states.

訓練データの中では、以下に説明するような所定のインタラクション状態が、時間とともにシーケンスとして移行することが観測される。 In the training data, it is observed that a predetermined interaction state as described below shifts as a sequence with time.

たとえば、商品の購買を目的としている顧客であれば、以下のようなシーケンスが典型的である。 For example, the following sequence is typical for a customer who intends to purchase merchandise.

「待機（サービスカウンタ）」→「向かいあって話す」→「移動」→「特定の製品の提示」→「移動」 "Waiting (service counter)" → "Talking face to face" → "Moving" → "Presenting a specific product" → "Moving"
ここで、「待機」とは、特定の停止位置で、一人で停止している状態を意味し、「移動」とは、いずれかの典型軌道に代表される移動軌道で移動している状態を意味する。 Here, "waiting" means being stopped alone at a specific stop position, and "moving" means moving along a trajectory represented by one of the typical trajectories.

一方で、たとえば、ウィンドショッピングをしに来た顧客では、たとえば、以下のようなシーケンスが典型的である。 On the other hand, for example, the following sequence is typical for a customer who has come for window shopping.

「待機（ブランドＡ）」→「移動」→「特定の製品の提示（ブランドＢ）」→「移動」→「待機（ブランドＣ）」 "Waiting (Brand A)" → "Moving" → "Presenting a specific product (Brand B)" → "Moving" → "Waiting (Brand C)"
また、好奇心の強い顧客の場合は、たとえば、以下のようなシーケンスが典型的である。 For a curious customer, for example, the following sequence is typical.

「待機（ブランドＡ）」→「移動」→「特定の製品の提示（ブランドＢ）」→「移動」→「特定の製品の提示（ブランドＢ）」→「移動」 "Waiting (Brand A)" → "Moving" → "Presenting a specific product (Brand B)" → "Moving" → "Presenting a specific product (Brand B)" → "Moving"
すなわち、ウィンドショッピングの場合は、各ブランドの製品を見て回っている途中で、たまたま、店員がいた場合に、その説明を受ける、という状態遷移をするものの、「特定の製品の提示」に費やす時間は、比較的短い。 In other words, a window-shopping customer, while looking around the products of each brand, transitions to receiving an explanation only if a shop clerk happens to be there, and the time spent in "presenting a specific product" is relatively short.

これに対して、好奇心の強い顧客は、むしろ、店員がいるブランドの製品のところに移動して、「特定の製品の提示」を受けるという行動をとることになり、各々の「特定の製品の提示」に費やす時間も比較的長い。 In contrast, a curious customer instead moves to the products of the brand where the shop clerk is in order to receive a "presentation of a specific product", and the time spent on each such presentation is relatively long.

そして、このようなインタラクション状態については、人・ロボット間のインタラクション（ＨＲＩ：human-robot interaction）としても、研究されてきている。 Such interaction states have been studied as human-robot interaction (HRI).

ここで、人・ロボット間のインタラクションを予め分類しモデル化したものをＨＲＩモデルと呼ぶ。 A model in which human-robot interactions have been classified and modeled in advance is called an HRI model.

ＨＲＩモデルは、会話を始める、あるいは、オブジェクトを相手に示すというような特定の社会的な状況において適切な「近接的な行動」をロボットに生成させるために開発されてきたものである。 HRI models have been developed so that a robot can generate appropriate "proxemic behavior" in specific social situations, such as starting a conversation or showing an object to the other party.

インタラクション状態によって移動の目標位置について記述するだけでなく、ロボットの制御に対する詳細なレベルで、近接的な拘束条件および他の行動を指定することを可能とするので、これらのモデルは有用な抽象化である。 These models are a useful abstraction because they make it possible not only to describe the target position of a movement in terms of the interaction state, but also to specify proxemic constraints and other behaviors at a level of detail suited to robot control.

本実施の形態では、観測する人・人間のインタラクション状態を、人・ロボット間のインタラクション状態として利用するために、所定のインタラクション状態として、以下の３つを考慮することとする。もちろん、考慮するインタラクション状態の数は、もっと多くてもよい。 In the present embodiment, in order to use the human / human interaction state to be observed as the human / robot interaction state, the following three are considered as predetermined interaction states. Of course, more interaction states may be considered.

ｉ）オブジェクトの（相手への）提示状態（以下、「製品の提示状態」：二人の人が、特定の製品の所定の距離以内の位置に停止している状態） i) Object presentation state (hereinafter the "product presentation state": two people are stopped within a predetermined distance of a specific product)
ｉｉ）人と人との間の距離に基づく対面状態（以下、「対面状態」：二人の人が、所定の距離以内で向き合っているが製品とは所定の距離以上離れて停止している状態）、 ii) Face-to-face state based on interpersonal distance (hereinafter the "face-to-face state": two people are stopped facing each other within a predetermined distance, but at least a predetermined distance away from any product),
ｉｉｉ）待機状態（一人の人が特定の停止位置に停止しており、他者とはインタラクションをしていない状態） iii) Waiting state (one person is stopped at a specific stop position and is not interacting with anyone else)
図１５は、インタラクション状態のうち、「製品の提示状態」を示す概念図である。 FIG. 15 is a conceptual diagram showing the "product presentation state" among the interaction states.

図中、製品(□で示す)の所定の距離以内に停止している状態には、二人の位置関係としては、図１５（ａ）〜（ｃ）のような様々な可能性があるものの、ここでは、これらの状態を全て、「製品の提示状態」という１つの概念に抽象化して分類するものとする。 In the figure, when the two people are stopped within a predetermined distance of a product (indicated by □), various positional relationships between them are possible, as in FIGS. 15(a) to 15(c). Here, all of these are abstracted and classified into the single concept of the "product presentation state".

したがって、システム１０は、インタラクションの主体とそれらの場所の間の距離に基づいて、これらのインタラクション状態の各々を以下のルールで識別する。 Accordingly, the system 10 identifies each of these interaction states with the following rules, based on the distances between the interacting parties and these locations.

たとえば、インタラクションの主体の両方が同じカメラに対応する停止位置にいた場合、インタラクション状態は「製品の提示」として分類される。 For example, if both interacting parties are at the stop position corresponding to the same camera, the interaction state is classified as "product presentation".

インタラクションの主体がカメラの近傍ではなく、互いのたとえば１．５ｍ以内にいる場合、それは「対面状態」としてモデル化される。また、店主がサービスカウンターにおり顧客がそうでなかった場合は、店主のインタラクション状態は、「待機状態」として分類される。 If the interacting parties are not near a camera but are within, for example, 1.5 m of each other, the state is modeled as the "face-to-face state". If the shopkeeper is at the service counter and the customer is not, the shopkeeper's interaction state is classified as the "waiting state".
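These rules can be sketched as a small classifier. The order in which the rules are tested, the treatment of "near a camera" as "at a camera stop position", the stop position name "Door", and all identifiers are assumptions made for illustration only. The second element of the returned pair is the brand at which a product presentation takes place.

```python
import math

CAMERA_STOPS = {"Brand A", "Brand B", "Brand C"}  # camera stop positions in the example shop
SERVICE_COUNTER = "Service Counter"
FACE_TO_FACE_RANGE = 1.5  # metres, the example value given in the text


def classify_interaction_state(customer_stop, shopkeeper_stop,
                               customer_xy, shopkeeper_xy):
    """Return (state, target); both are None when no rule applies."""
    # Rule 1: both parties stopped at the same camera.
    if customer_stop == shopkeeper_stop and customer_stop in CAMERA_STOPS:
        return ("present_product", customer_stop)
    # Rule 2: neither party at a camera, and the two are close together.
    near_camera = customer_stop in CAMERA_STOPS or shopkeeper_stop in CAMERA_STOPS
    if not near_camera and math.dist(customer_xy, shopkeeper_xy) <= FACE_TO_FACE_RANGE:
        return ("face_to_face", None)
    # Rule 3: shopkeeper at the service counter, customer elsewhere.
    if shopkeeper_stop == SERVICE_COUNTER and customer_stop != SERVICE_COUNTER:
        return ("waiting", None)
    return (None, None)


state = classify_interaction_state("Brand A", "Brand A", (0.0, 0.0), (0.5, 0.0))
```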

さらに、特定のインタラクション状態に対して状態ターゲットを定義する。 In addition, state targets are defined for specific interaction states.

「製品の提示」というインタラクション状態の状態ターゲットは、「ブランドＡ」、「ブランドＢ」あるいは「ブランドＣ」のいずれかである。一方で、「対面状態」および「待機状態」のインタラクション状態の状態ターゲットは空欄である。すなわち、「状態ターゲット」とは、インタラクションする二人の人（ロボットと人）以外に、そのインタラクション状態に関わり当該インタラクション状態を定義するオブジェクトのことを意味する。 The state target of the "product presentation" interaction state is one of "Brand A", "Brand B" or "Brand C", whereas the state target of the "face-to-face" and "waiting" interaction states is empty. In other words, the "state target" is an object, other than the two interacting parties (the robot and the person), that is involved in the interaction state and defines it.
［ベクトル化］ [Vectorization]
以下に説明するように、オフライン・トレーニングあるいはオンライン・インタラクション用の時系列のセンサー・データの処理においては、抽象化が実行され、離散的な顧客および店主の行動を後述するようなベクトルで表現する。 As described below, when time-series sensor data is processed for offline training or online interaction, abstraction is performed and discrete customer and shopkeeper behaviors are represented by the vectors described later.

最初に、動作分析が、典型的な軌道との比較に基づいて行なわれる。 Initially, a motion analysis is performed based on a comparison with a typical trajectory.

その後、移動と発話の検知に基づき行動を離散化することができる。 Thereafter, the behavior can be discretized based on the detection of movement and utterance.

各々の顧客行動は、そのアクションの時における参加者双方の抽出された状態について記述する結合状態ベクトルによって表わされ、各々の店主行動は、ロボットが後ほどそのアクションを再現するために必要な情報を含んでいるロボット行動ベクトルによって表わされる。 Each customer action is represented by a combined state vector that describes the extracted state of both participants at the time of the action, and each shopkeeper action captures information necessary for the robot to later reproduce the action. Represented by the containing robot action vector.

ここで示されたすべてのプロセスについては、センサー・データは、たとえば１Ｈｚの一定の割合でサンプリングされる。 For all the processes described here, the sensor data is sampled at a fixed rate, for example 1 Hz.
［訓練データからの予測器の生成］ [Generating a predictor from training data]
以下では、以上のようにして、訓練データについて、移動と発話を含む行動についての抽象化の処理が終了した後に、訓練処理部４００によって、訓練データから予測器が生成される処理について説明する。 The following describes the processing by which the training processing unit 400 generates a predictor from the training data once the abstraction of behaviors, including movement and utterances, has been completed for the training data as described above.
（動作分析） (Motion analysis)
抽象化処理部２１０によるオフラインの軌道解析の中で使用されるのと同じ速度しきい値を用いて、動作要素抽出部４１０は、人が移動しているか、停止しているかを識別し、行動離散化部４２０は、動作要素抽出部４１０の識別結果に応じて、以下のように、対象の行動を離散化する。 Using the same speed threshold as in the offline trajectory analysis by the abstraction processing unit 210, the motion element extraction unit 410 identifies whether a person is moving or stopped, and the behavior discretization unit 420 discretizes the target's behavior as follows according to that result.

図１６は、行動離散化部４２０の処理を説明するための機能ブロック図である。 FIG. 16 is a functional block diagram for explaining the processing of the behavior discretization unit 420.

行動離散化部４２０における行動分析部４２０２は、まず、３つのパラメーターを含んでいるベクトルを使用して、人の運動を特徴づける。 The behavior analysis unit 4202 in the behavior discretization unit 420 first characterizes a human motion using a vector including three parameters.

このような３つのパラメーターとは、「現在位置」、「運動起点」および「運動目標位置」であり、これらは、抽象化処理部２１０によるクラスタリングによって特定された停止位置のいずれかに対応する。 These three parameters are “current position”, “movement start point”, and “movement target position”, and these correspond to any of the stop positions specified by clustering by the abstraction processing unit 210.

「停止位置軌道」に対しては、現在位置は最も近い停止位置にセットされ、運動起点および運動目標位置は、空欄である。 For a "stopped trajectory", the current position is set to the nearest stop position, and the motion start point and motion target position are left empty.

「移動軌道」に対しては、現在位置は空欄であり、運動起点は最も最近の現在位置にセットされる。 For a "moving trajectory", the current position is empty, and the motion start point is set to the most recent current position.

顧客に対しては、運動目標位置の欄が評価されるものの、後に説明するように、店主に対しては、これは不要である。 The motion target position field is estimated for the customer, but, as explained later, this is unnecessary for the shopkeeper.
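The three-parameter motion vector and its update rules can be sketched as follows. The class and function names are hypothetical, and `None` is assumed as the encoding of an empty field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MotionState:
    current: Optional[str]  # nearest stop position while stopped
    origin: Optional[str]   # motion start point while moving
    target: Optional[str]   # motion target position while moving, once estimated


def update_motion_state(previous, is_moving, nearest_stop, estimated_target=None):
    """Apply the stopped / moving rules to obtain the next motion state."""
    if not is_moving:
        # Stopped: only the current position is filled in.
        return MotionState(current=nearest_stop, origin=None, target=None)
    # Moving: the start point is the most recent current position, carried
    # over unchanged while the movement continues.
    origin = previous.current if previous.current is not None else previous.origin
    return MotionState(current=None, origin=origin, target=estimated_target)


s0 = MotionState("Service Counter", None, None)
s1 = update_motion_state(s0, True, None)
s2 = update_motion_state(s1, True, None, "Brand C")
s3 = update_motion_state(s2, False, "Brand C")
```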

行動分析部４２０２は、顧客の運動目標位置を評価するにあたり、クラスタリングにおいて識別された典型的な軌道（代表軌道）と、分析対象となっている顧客の軌道の類似度を算出する。 To estimate the customer's motion target position, the behavior analysis unit 4202 calculates the similarity between the typical trajectories (representative trajectories) identified in the clustering and the trajectory of the customer being analyzed.

すなわち、行動分析部４２０２は、上述したＤＴＷを使用して、顧客の軌道と、訓練データからの典型的な軌道の各々の間の空間時間上の距離を計算する。その後、行動分析部４２０２は、そのクラスタ中の場合の数によって各軌道クラスタに対する計算された距離を確率として重みづけし、同じ到達場所で終了する軌道に対して確率を積算する。 Specifically, the behavior analysis unit 4202 uses the DTW described above to compute the spatiotemporal distance between the customer's trajectory and each of the typical trajectories from the training data. It then converts the distance computed for each trajectory cluster into a probability, weighted by the number of instances in that cluster, and sums the probabilities over the trajectories that end at the same destination.

ある目標位置についての確率が、所定の値、たとえば、５０％以上に一旦なると、行動分析部４２０２は、その位置を運動目標位置として特定する。 Once the probability for a target position reaches a predetermined value, for example 50% or more, the behavior analysis unit 4202 identifies that position as the motion target position.
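The estimation step can be sketched as follows. The text does not state how a distance is turned into a probability, so the mapping used here (a weight of cluster size times exp(-distance), normalized over all clusters) is purely an assumption, as are the function name and the sample numbers. The cluster numbers and their destinations follow the examples of FIGS. 12 to 14.

```python
import math
from collections import defaultdict


def estimate_target(distances, cluster_sizes, cluster_targets, threshold=0.5):
    """Return (target or None, probability per target).

    distances:       cluster id -> DTW distance to that cluster's typical trajectory
    cluster_sizes:   cluster id -> number of training trajectories in the cluster
    cluster_targets: cluster id -> stop position where the typical trajectory ends
    """
    # Assumed mapping: closer and larger clusters get more weight.
    weights = {k: cluster_sizes[k] * math.exp(-d) for k, d in distances.items()}
    total = sum(weights.values())
    if total == 0.0:
        return None, {}
    by_target = defaultdict(float)
    for k, w in weights.items():
        by_target[cluster_targets[k]] += w / total  # accumulate per destination
    best = max(by_target, key=by_target.get)
    chosen = best if by_target[best] >= threshold else None
    return chosen, dict(by_target)


# Hypothetical distances and cluster sizes for clusters 24, 27 and 35.
target, probabilities = estimate_target(
    distances={24: 0.5, 27: 3.0, 35: 2.5},
    cluster_sizes={24: 10, 27: 8, 35: 12},
    cluster_targets={24: "Brand B", 27: "Brand C", 35: "Brand C"},
)
```

Returning `None` below the threshold reflects the text's rule that a target is only committed to once its probability reaches the predetermined value.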

一方で、センサー・データによる運動目標位置の推定は店主にとって不必要である。すなわち、オンラインでロボットを制御する場合は、ロボットに送られるコマンドに基づいて、ロボットの目標位置をシステム側は確実に知ることができるので、店主の訓練データ中の運動目標位置は、意図される運動目標位置についてのこのような知識が反映される。 On the other hand, the estimation of the movement target position by the sensor data is unnecessary for the shopkeeper. That is, when controlling a robot online, the system side can reliably know the target position of the robot based on the command sent to the robot, so the target position of movement in the shopkeeper's training data is intended. Such knowledge about the exercise target position is reflected.

訓練データに対してその反映を行うために、センサー・データからの推定に依存するのではなく、事前知識を参照することで店主の最終的な目標位置をみつけ、いつでも店主の実際の運動目標位置を決定することができる。そうすることによって、訓練データ、および実時間データに基づく、店主運動目標位置を一貫させることができる。 Instead of relying on estimation from sensor data to reflect on training data, the store owner's final target position is found by referring to prior knowledge and the store owner's actual motion target position at any time. Can be determined. By doing so, the shopkeeper movement target position based on training data and real-time data can be made consistent.

空間配置検出部４２０４は、停止状態にある対象について、上述したいずれのインタラクション状態であるかを特定する。なお、空間配置検出部４２０４は、設定により、「移動中」の人・人間の空間配置について、「インタラクション状態」を特定する構成としてもよい。 The spatial arrangement detection unit 4204 identifies which of the interaction states described above applies to a target that is stopped. Depending on the settings, the spatial arrangement detection unit 4204 may also be configured to identify an "interaction state" for the spatial arrangement of people who are "moving".

音声認識部２０２からの生の時系列データと、行動分析部４２０２および空間配置検出部４２０４から出力される対象の行動の状態についての情報の時系列データとは、相互に、タイムスタンプにより時間的に関連付けられて、記憶装置３００に、発話要素データ３２０、運動要素データ３２２およびインタラクション要素データ３２４として格納される。 The raw time-series data from the speech recognition unit 202 and the time-series data of the information about the behavior state of the target output from the behavior analysis unit 4202 and the spatial arrangement detection unit 4204 are mutually temporally based on time stamps. Are stored in the storage device 300 as utterance element data 320, movement element data 322, and interaction element data 324.

たとえば、１つのタイムスタンプ（所定時間、たとえば、１秒の間隔）に対して、人Ａの位置、人Ｂの位置、人Ａの発話、人Ｂの発話、人Ａの「現在位置」、「運動起点」および「運動目標位置」、人Ｂの「現在位置」、「運動起点」および「運動目標位置」ならびに人Ａ・Ｂ間のインタラクション状態（空間配置：ＨＲＩモデル）の各項目の情報が、格納される。 For example, for each timestamp (a fixed interval, for example 1 second), the following items are stored: the position of person A, the position of person B, the utterance of person A, the utterance of person B, the "current position", "movement origin" and "movement target position" of person A, the "current position", "movement origin" and "movement target position" of person B, and the interaction state between persons A and B (spatial arrangement: HRI model).

（行動の離散化） (Discretization of behavior)

さらに、行動離散化部４２０の位置追跡部４２０６は、参加者のうちの一人が話し、かつ／または、新しい場所へ移動し始める場合、離散的な「顧客行動」および「店主行動」を決定する。 Further, the position tracking unit 4206 of the behavior discretization unit 420 determines a discrete "customer behavior" or "shopkeeper behavior" whenever one of the participants speaks and/or begins to move to a new location.

発話行動は、音声認識結果が受信される時点で決定され、運動行動は、運動目標位置が決定される時点で決定される。 An utterance behavior is determined at the moment a speech recognition result is received, and a movement behavior is determined at the moment a movement target position is determined.

位置追跡部４２０６により決定された「顧客行動」および「店主行動」は、タイムスタンプと関連付けられて、顧客行動データ３３０、店主行動データ３３２として、記憶装置３００に格納される。 The “customer behavior” and “storekeeper behavior” determined by the position tracking unit 4206 are stored in the storage device 300 as customer behavior data 330 and storekeeper behavior data 332 in association with the time stamp.
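The discretization described above can be sketched as follows. This is only an illustration: the `Frame` and `ActionEvent` records and their field names are hypothetical, and the sketch assumes that each one-second frame already carries, per participant, any speech recognition result received and any movement target decided in that interval.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """One participant's data for one timestamp (hypothetical layout)."""
    t: int                       # timestamp in seconds
    actor: str                   # "customer" or "shopkeeper"
    utterance: Optional[str]     # speech recognition result received in this interval
    new_target: Optional[str]    # movement target decided in this interval


@dataclass
class ActionEvent:
    """A discrete customer or shopkeeper behavior."""
    t: int
    actor: str
    utterance: Optional[str]
    target: Optional[str]


def discretize(frames: List[Frame]) -> List[ActionEvent]:
    """Emit one event per participant whenever that participant speaks
    and/or starts moving to a new location."""
    events = []
    for fr in frames:
        if fr.utterance is not None or fr.new_target is not None:
            events.append(ActionEvent(fr.t, fr.actor, fr.utterance, fr.new_target))
    return events
```

Because every frame belongs to a single participant, a customer event and a shopkeeper event that fall in the same interval stay separate.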

同じ１秒間隔で受信される顧客と店主のイベントは、２つの別個のイベントとして分類され、したがって、いずれのイベントも、顧客の発話および店主の発話の双方を含むことはない。 Customer and shopkeeper events received within the same 1-second interval are classified as two separate events, so no single event contains both a customer utterance and a shopkeeper utterance.

（結合状態ベクトルの生成） (Generation of combined state vectors)

図１７は、結合状態ベクトル生成部４３０とロボット行動生成部４４０の動作を説明するための機能ブロック図である。 FIG. 17 is a functional block diagram for explaining the operations of the combined state vector generation unit 430 and the robot behavior generation unit 440.

結合状態ベクトル生成部４３０の行動ペア特定部４３０２は、顧客行動が検知されたときの対となる店主行動を特定する。顧客行動の結合状態ベクトル生成部４３０４は、顧客行動が検知された場合、現在の顧客と店主との状態に基づき、結合状態ベクトルを生成し、記憶装置３００に結合状態ベクトル３４０として格納する。一方ロボット行動生成部４４０は、顧客行動に対応する店主行動をロボット行動として生成し、上記顧客行動と関連付けて、ロボット行動ベクトル３４２として、記憶装置３００に格納する。この結合状態ベクトル３４０とロボット行動ベクトル３４２とが、予測器５２０に対する訓練データとなる。 The action pair specifying unit 4302 of the combined state vector generating unit 430 specifies a storekeeper action to be paired when a customer action is detected. When the customer behavior is detected, the customer behavior combined state vector generation unit 4304 generates a combined state vector based on the current state of the customer and the store owner, and stores the combined state vector in the storage device 300 as the combined state vector 340. On the other hand, the robot behavior generation unit 440 generates the store owner behavior corresponding to the customer behavior as the robot behavior, and stores it in the storage device 300 as the robot behavior vector 342 in association with the customer behavior. The combined state vector 340 and the robot action vector 342 serve as training data for the predictor 520.

図１８は、行動ペア特定部４３０２による行動の特定処理を説明するための概念図である。 FIG. 18 is a conceptual diagram for explaining action specifying processing by the action pair specifying unit 4302.

図１８を参照して、行動ペア特定部４３０２は、検知され記憶装置３００に格納された行動の時間系列を調べることによって、顧客行動とそれに続く店主行動との対応を識別する。 With reference to FIG. 18, the behavior pair identification unit 4302 identifies the correspondence between the customer behavior and the subsequent storekeeper behavior by examining the time series of the behavior detected and stored in the storage device 300.

図１８（ａ）は、このような行動イベントの時系列を示す。 FIG. 18A shows a time series of such action events.

ここで、Ｃ１，…，Ｃ３は、顧客の行動（発話または移動）を意味し、Ｓ１，…，Ｓ３は、店主の行動（発話または移動）を意味する。 Here, C1,..., C3 mean customer actions (utterance or movement), and S1,..., S3 mean shopkeeper actions (utterance or movement).

ここで、社会的相互関係は、例えば、２つの顧客行動あるいは２つの店主行動が連なって生じる場合など、必ずしも行動と応答のペアにきれいに分割されるとは限らない。 Here, the social interrelationship is not always neatly divided into action-response pairs, for example, when two customer actions or two shopkeeper actions occur in succession.

図１８（ｂ）は、図１８（ａ）の時系列の行動をペアに分類する手続きを示す。 FIG. 18B shows a procedure for classifying the time-series actions of FIG. 18A into pairs.

原則として、顧客行動Ｃ１に連続する店主行動Ｓ１は、１組の行動ペア組み合わせられる。すなわち、行動Ｃ１と行動Ｓ１は、店主行動が後続する顧客行動という通常の場合を示しており、これらは予測器のためのトレーニング入力および出力としてペアになる。 In principle, a customer behavior C1 and the shopkeeper behavior S1 that immediately follows it are combined into one behavior pair. That is, C1 and S1 illustrate the normal case of a customer behavior followed by a shopkeeper behavior, and they are paired as a training input and output for the predictor.

一方で、店主行為が続かない顧客行動Ｃ２は、予測器を訓練する目的のために、「店主行動なし」との要素に関連づけられたペアとなる。 On the other hand, the customer behavior C2 in which the storekeeper action does not continue is a pair associated with the element “no storekeeper behavior” for the purpose of training the predictor.

３番目の顧客行動Ｃ３には、２つの店主行動が続くが、それらは単一の店主行為とするためにマージされる。 The third customer action C3 is followed by two shopkeeper actions, which are merged into a single shopkeeper action.
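The three pairing cases above (C1 followed by S1, C2 with no response, C3 followed by two merged responses) can be sketched as follows. The `Action` record, the `merge` helper, and the way merged actions are represented (labels joined with " + ") are assumptions made for illustration; the text does not specify how merging is encoded, nor what happens to shopkeeper actions that precede any customer action (they are skipped here).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Action:
    actor: str    # "customer" or "shopkeeper"
    t: float      # timestamp in seconds
    label: str    # e.g. an utterance or a movement target


def merge(actions: List[Action]) -> Action:
    """Merge consecutive shopkeeper actions into one (first timestamp kept)."""
    return Action("shopkeeper", actions[0].t, " + ".join(a.label for a in actions))


def pair_actions(events: List[Action]) -> List[Tuple[Action, Optional[Action]]]:
    """Pair each customer action with the shopkeeper action(s) that follow it.
    None stands for the "no shopkeeper behavior" element."""
    pairs = []
    current = None    # customer action still waiting for a response
    responses = []    # shopkeeper actions seen since `current`
    for ev in sorted(events, key=lambda e: e.t):
        if ev.actor == "customer":
            if current is not None:
                pairs.append((current, merge(responses) if responses else None))
            current, responses = ev, []
        elif current is not None:
            responses.append(ev)
    if current is not None:
        pairs.append((current, merge(responses) if responses else None))
    return pairs
```

Each resulting pair supplies one training example: the customer side determines the input, and the (possibly merged or empty) shopkeeper side determines the output.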

対顧客行動の結合状態ベクトル生成部４３０２は、顧客行動が検知されたときに、顧客と店主の両方のインタラクションの主体の状態を、結合状態ベクトル３４０として記憶装置３００に格納する。 When a customer behavior is detected, the combined state vector generation unit 4302 for customer behavior stores the states of both interaction participants, the customer and the shopkeeper, in the storage device 300 as the combined state vector 340.

この結合状態ベクトルは、ロボットが行なうべき最も適切な行動を識別するように、予測器を訓練するために、予測器の入力として使用される。 This combined state vector is used as the predictor input to train the predictor to identify the most appropriate action the robot should take.

ロボット行動生成部４４０は、顧客行動が検知されたときの結合状態ベクトル３４０に対応する店主の行動である。 The robot behavior produced by the robot behavior generation unit 440 is the shopkeeper behavior corresponding to the combined state vector 340 at the time the customer behavior was detected.

ロボット行動の各々は、発話（たとえば、１６６の可能性）および目標インタラクション状態（たとえば、５つの可能性）で構成される。 Each robot action consists of an utterance (eg, 166 possibilities) and a target interaction state (eg, 5 possibilities).

店主行動をマージした後に、店主行動の各々をロボット行動ベクトルに翻訳する。 After merging storekeeper behavior, each storekeeper behavior is translated into a robot behavior vector.

図１に示したような店舗における顧客と店主とのインタラクションを例にとると、訓練データ・セットに対するロボット行動ベクトルの最終リストは、発話およびインタラクション状態の４６７個の異なる組合せを含む。 Taking the customer-storekeeper interaction in the store as shown in FIG. 1 as an example, the final list of robot behavior vectors for the training data set includes 467 different combinations of utterances and interaction states.

図１９は、結合状態ベクトルにおける特徴量およびロボット行動ベクトルにおける特徴量を示す概念図である。 FIG. 19 is a conceptual diagram illustrating the feature amount in the combined state vector and the feature amount in the robot action vector.

まず、図１９（ａ）に示すように、結合状態ベクトルは、顧客発話ベクトル、顧客の空間状態および店主の空間状態、インタラクション状態とを含んでいる。 First, as shown in FIG. 19A, the combined state vector includes a customer utterance vector, a customer space state, a store owner space state, and an interaction state.

また、図１９（ｂ）に示すように、顧客発話ベクトルは、たとえば、発話とキーワードの両方に対するＬＳＡベクトルを含んでおり、たとえば、説明した訓練条件では、合計３４６次元のベクトルとなっている。また、空間状態の各々は、現在位置、運動起点および運動目標位置を含んでいる。インタラクション状態は、空間配置および状態ターゲットを含んでいる。 As shown in FIG. 19(b), the customer utterance vector contains, for example, LSA vectors for both the utterance and the keywords; under the training conditions described, it has 346 dimensions in total. Each spatial state contains a current position, a movement origin, and a movement target position. The interaction state contains a spatial arrangement and a state target.

より詳しくは、店主行為が検知される場合、それはロボット行動ベクトルとして表現される。そして、ロボット行動ベクトルは、後でロボットのためのコマンドに翻訳される。 More specifically, when a storekeeper action is detected, it is expressed as a robot action vector. The robot action vector is then translated into a command for the robot.

上述したような具体例では、ロボットに、発話および移動を再現することに関心がある。したがって、ロボット行動ベクトルは、以下の２つの特性を含んでいる。 In the specific example described above, the aim is for the robot to reproduce speech and movement. The robot behavior vector therefore contains the following two properties.

ｉ）(発話クラスタＩＤから成る)発話 i) Utterance (consisting of an utterance cluster ID)
ｉｉ）(空間配置および状態ターゲットから成る)インタラクション状態 ii) Interaction state (consisting of a spatial arrangement and a state target)

ここで、「ロボット発話」のフィールドは、ロボットが店主発話を再現することを可能にするための情報を含んでいる。 Here, the "robot utterance" field contains the information that allows the robot to reproduce the shopkeeper's utterance.

このとき、店主行動が、発話コンポーネントを含んでいる場合、それは結合状態ベクトルに含められる。そうでなければ、それはブランクのままとなる。 At this time, if the store owner behavior includes an utterance component, it is included in the combined state vector. Otherwise it will remain blank.

「ロボット発話」については、店主の発話のキャプチャデータには、しばしば、それが音声認識エラーを含んでいるので、音声認識からの生のテキスト出力を直接使用することは、ロボット発話を生成するのに適切ではない。 For the "robot utterance", the captured data of the shopkeeper's utterances often contains speech recognition errors, so directly using the raw text output of speech recognition is not suitable for generating robot utterances.

この理由で、「ロボット発話」のフィールドには、検知された発話を含んでいる店主発話クラスタのＩＤを記録する。 For this reason, the ID of the storekeeper utterance cluster including the detected utterance is recorded in the field of “robot utterance”.

例えば、音声から認識された発話が「what does it has 28 different lenses」である場合、クラスタＩＤ２９２が、図８に例示したように、代表的な店主発話クラスタとして選択される。クラスタの典型的な発話は、各店主発話クラスタから抽出される。クラスタの典型的な発話は、認識された発話の典型的な例より少ないランダム誤りを含むことが期待される。クラスタＩＤからのロボット言語行動を生成するために、ロボットの音声合成装置に送られるべきテキストとして、この典型的な発話を使用する。 For example, when the utterance recognized from the voice is “what does it has 28 different lenses”, the cluster ID 292 is selected as a representative storekeeper utterance cluster as illustrated in FIG. A typical utterance of the cluster is extracted from each storekeeper utterance cluster. A typical utterance of a cluster is expected to contain fewer random errors than a typical example of a recognized utterance. This typical utterance is used as the text to be sent to the robot's speech synthesizer to generate robot language behavior from the cluster ID.

上記の例において、選ばれたロボット発話は、「このカメラに利用可能な２８個の異なる交換レンズがあります。」ということになる。 In the above example, the chosen robot utterance would be "There are 28 different interchangeable lenses available for this camera."

また、「ロボット行動」のフィールドについては、店主のインタラクション状態は、与えられた時刻における、２つのインタラクションの主体の近接的な配置の情報（空間配置）が含まれる。また、「ロボット行動」のフィールドに、店主の「状態ターゲット」を記録することにより、ロボット運動を生成するためにこの情報を使用することができる。 In the “robot action” field, the store owner's interaction state includes information on the close arrangement (spatial arrangement) of the subjects of the two interactions at a given time. Also, this information can be used to generate robot motion by recording the store owner's “state target” in the “robot behavior” field.

ここで、顧客の行動が検知される時に店主が移動していなければ、店主の現在のインタラクション状態が空間配置として記録される。 Here, if the store owner is not moving when the behavior of the customer is detected, the current interaction state of the store owner is recorded as a spatial arrangement.

一方、店主が移動していれば、時間的に予め予測をして、店主の目標位置を決定する。その後、店主が目標位置に着く時のインタラクションの主体の空間配置の評価により、「状態ターゲット」を決定する。 On the other hand, if the store owner is moving, the target position of the store owner is determined by predicting in advance in time. After that, the “state target” is determined by evaluating the spatial arrangement of the subject of the interaction when the store owner reaches the target position.

店主が先行して顧客を案内し、先に停止位置に到着する場合について調整を実行する以外は、インタラクション状態は、上述したのと同じ方法で識別される。顧客の現在位置あるいは顧客の運動目標位置のいずれかが、店主の現在位置と同じ対象である場合、目標状態を「製品の提示」として分類する。 The interaction state is identified in the same manner as described above, except that the store owner guides the customer ahead of time and makes adjustments for the case where it first arrives at the stop position. When either the customer's current position or the customer's exercise target position is the same object as the shop owner's current position, the target state is classified as “product presentation”.

図２０は、予測器訓練部４５０の動作を説明するための機能ブロック図である。 FIG. 20 is a functional block diagram for explaining the operation of the predictor training unit 450.

図２０を参照して、予測器訓練部４５０の分類器機械学習部４５０２は、記憶装置３００内に格納された結合状態ベクトルデータ３４０を入力とし、ロボット行動データ３４２を出力とするような分類器を機械学習により生成する。分類器機械学習部４５０２は、生成した分類器を特定するための情報を、予測器特定情報３５０として、記憶装置３００に格納する。 Referring to FIG. 20, the classifier machine learning unit 4502 of the predictor training unit 450 uses machine learning to generate a classifier that takes the combined state vector data 340 stored in the storage device 300 as input and produces the robot behavior data 342 as output. The classifier machine learning unit 4502 stores the information that specifies the generated classifier in the storage device 300 as predictor specifying information 350.

すなわち、予測器とは、結合状態ベクトルで表されるような状態を分類することで、特定の結合状態ベクトルの状態にあるときに、人である店主がとる可能性が最も高いと考えられる行動を、ロボット行動データとして予測するものである。 In other words, the predictor classifies the state represented by a combined state vector and thereby predicts, as robot behavior data, the behavior that a human shopkeeper would most likely take in the state given by that combined state vector.

より具体的には、分類器機械学習部４５０２は、一旦訓練データ中の行動ペアがすべて特定され結合状態ベクトルが生成された後に、各顧客行動を分類器の訓練入力とし、それに続く店主行動に対応するロボット行動ベクトルを訓練出力とするために、結合状態ベクトルを入力として使用して、ロボット行動ベクトルを出力とするような分類器を生成するために、たとえば、ナイーブベイズ分類器の訓練を実行する。 More specifically, once all behavior pairs in the training data have been identified and the combined state vectors generated, the classifier machine learning unit 4502 trains, for example, a naive Bayes classifier that takes the combined state vector as input and produces the robot behavior vector as output. Each customer behavior thus serves as a training input, and the robot behavior vector corresponding to the subsequent shopkeeper behavior serves as the training output.

ただし、分類器としては、機械学習により訓練するものであれば、他のものであってもよい。 However, other classifiers may be used as long as they are trained by machine learning.

ナイーブベイズ分類器は、１セットの特徴値ペアから成る事例を分類するために以下の式を使用する。 The Naive Bayes classifier uses the following formula to classify cases consisting of a set of feature value pairs.
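The formula itself is not reproduced in this text. Assuming the standard naive Bayes decision rule, and using the symbols a_j, f_i, ν_i and C defined in the surrounding paragraphs, Equation (1) would take the following form (a reconstruction, not the patent's verbatim formula):

```latex
a_{NB} = \operatorname*{arg\,max}_{a_j \in C} \; P(a_j) \prod_{i} P(f_i = \nu_i \mid a_j) \tag{1}
```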

ここで、ａ_jは、ロボット行動ベクトルを示し、ｆ_iは、結合状態ベクトルにおける特徴量を示す。また、Ｃは、ロボット行動のすべての可能な場合（クラス）を意味する。 Here, a_j denotes a robot behavior vector and f_i denotes a feature in the combined state vector. C denotes the set of all possible robot behaviors (classes).

ナイーブベイズ分類器は、各特徴量ｆ_iに対して特徴値ν_iが与えられたときのロボット行動と分類される確率が最も大きなロボット行動ａ_NBをとりだす。 The naive Bayes classifier selects the robot behavior a_NB that has the highest probability of being the correct class, given the feature value ν_i for each feature f_i.

結合状態ベクトル中で、各特徴量ｆ_iに対応する特徴値ｖ_iは、１組の要素ｔ_ikから成る多次元の量である。すなわち、項ｔ_ikは、ｉ番目の特徴量のｋ番目の要素である。 In the combined state vector, the feature value ν_i corresponding to each feature f_i is a multidimensional quantity consisting of a set of elements t_ik. That is, the term t_ik is the k-th element of the i-th feature.

例えば、顧客発話ベクトルは３４６次元を持っており、一方で、顧客の空間状態は、２１次元である。 For example, the customer utterance vector has 346 dimensions, while the customer's spatial state is 21 dimensions.

したがって、以下の式（２）のように、各特徴に対する値の間の部分的一致を考慮するように分類器方程式を書き直すことができる。ここで、各特徴の各項の条件付き確率は、ロボット行動ａ_jが与えられたときに、訓練中において計算されるものである。 Therefore, the classifier equation can be rewritten as Equation (2) below so that partial matches between the values of each feature are taken into account. Here, the conditional probability of each term of each feature, given the robot behavior a_j, is computed during training.
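Equation (2) is likewise not reproduced in this text. A form consistent with the description, in which each feature contributes the per-term probabilities P(t_ik appears in f_i | a_j) and is weighted by a coefficient w_i, would be the following. Placing w_i as an exponent on each feature's product is an assumption.

```latex
a_{NB} = \operatorname*{arg\,max}_{a_j \in C} \; P(a_j) \prod_{i} \Bigl[ \prod_{k} P\bigl(t_{ik} \text{ appears in } f_i \mid a_j\bigr) \Bigr]^{w_i} \tag{2}
```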

ここで、「ｔ_ik appears in ｆ_i」とは、特徴量ｆ_iに対応する特徴値ν_i中のｋ番目の要素が、ｔ_ikであることを意味する。 Here, "t_ik appears in f_i" means that the k-th element of the feature value ν_i corresponding to the feature f_i is t_ik.

そして、ロボット行動の分類において、最も特徴的な特徴量の値に、より高い優先順位を与えるために、結合状態ベクトル中の与えられた特徴がどれくらい重要かを表す利得比を考慮する。ここで、利得比は、訓練データ中における特定の特徴量とロボット行動との間の関連性の他の相関値に対する相対的な大きさを表す量として、たとえば、訓練データに基づいて、その大きさを決定するものとし、たとえば、両者の相関値を規格化して大きさを設定することなどが可能である。 In order to give higher priority to the most characteristic feature value in the robot action classification, a gain ratio representing how important the given feature in the combined state vector is considered. Here, the gain ratio is a quantity representing a relative magnitude of the correlation between the specific feature quantity and the robot behavior in the training data with respect to other correlation values, for example, based on the training data. For example, it is possible to set the size by standardizing the correlation value between the two.

式（２）において、重みｗ_iは、各特徴量の利得比から計算された分類器用の重み付け係数である。 In Equation (2), the weight w_i is a weighting coefficient for the classifier computed from the gain ratio of each feature.
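A minimal sketch of a feature-weighted naive Bayes classifier of the kind described here is given below. Several details are assumptions rather than the patent's method: the computation is done in log space so that each weight w_i multiplies the log-probabilities of its feature's terms, the term probabilities use Laplace smoothing with a hypothetical parameter `alpha`, and the weights are passed in as a ready-made dictionary (the gain-ratio computation is not shown). The names `train` and `predict` are illustrative.

```python
import math
from collections import defaultdict


def train(samples, alpha=1.0):
    """samples: list of (features, action), where features maps a feature
    name to the set of terms that appear in that feature."""
    action_count = defaultdict(int)   # action -> number of examples
    term_count = defaultdict(int)     # (action, feature, term) -> number of examples
    for feats, action in samples:
        action_count[action] += 1
        for f, terms in feats.items():
            for t in terms:
                term_count[(action, f, t)] += 1
    return action_count, term_count, alpha


def predict(model, feats, weights):
    """Return the action with the highest weighted log-score."""
    action_count, term_count, alpha = model
    total = sum(action_count.values())
    best, best_score = None, -math.inf
    for a, n_a in action_count.items():
        score = math.log(n_a / total)            # log prior P(a)
        for f, terms in feats.items():
            w = weights.get(f, 1.0)              # per-feature weight
            for t in terms:
                # Smoothed estimate of P(t appears in f | a)
                p = (term_count[(a, f, t)] + alpha) / (n_a + 2 * alpha)
                score += w * math.log(p)
        if score > best_score:
            best, best_score = a, score
    return best
```

With a low weight on a feature, a partial match on a highly weighted feature (for example the customer utterance) can outweigh a full match on the low-weight one.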

図２１は、結合状態ベクトルにおける特徴量がとり得る特徴値の一例を示す図である。 FIG. 21 is a diagram illustrating an example of feature values that can be taken by the feature amount in the combined state vector.

図２１（ａ）に示すように、所定の空間領域内に６つの停止位置が特定されているとする。 As shown in FIG. 21A, it is assumed that six stop positions are specified in a predetermined space area.

図２１（ｂ）は、図２１（ａ）の場合における各特徴量ｆ_i（顧客空間状態、店主空間状態、インタラクション状態、顧客発話）を構成する特徴値ν_iと、特徴値ν_iに対応する要素ｔ_ikの次元との関係を示す図である。 FIG. 21(b) shows, for the case of FIG. 21(a), the relationship between the feature values ν_i that make up each feature f_i (customer spatial state, shopkeeper spatial state, interaction state, customer utterance) and the dimensions of the elements t_ik corresponding to each feature value ν_i.

この場合において、顧客のとり得る停止位置は、６つの全ての位置（黒丸）であり、これにこれらいずれの位置でもない状態を考慮して、全部で７つの状態をとり得るものとする。したがって、顧客の空間状態の特徴量は、現在位置ν_１、運動起点ν_２、運動目標位置ν_３の３つから成り、それぞれは、７つの位置をとり得る。 In this case, the customer can stop at all six positions (black circles); adding the state of being at none of them gives seven possible states in total. The customer's spatial state therefore consists of three features, the current position ν_1, the movement origin ν_2, and the movement target position ν_3, each of which can take seven values.

一方、店主のとり得る停止位置は、５つの位置（白丸）であり、これにこれらいずれの位置でもない状態を考慮して、全部で６つの状態をとり得るものとする。したがって、店主の空間状態の特徴量は、現在位置ν_４、運動起点ν_５、運動目標位置ν_６の３つから成り、それぞれは、６つの位置をとり得る。 The shopkeeper, on the other hand, can stop at five positions (white circles); adding the state of being at none of them gives six possible states in total. The shopkeeper's spatial state therefore consists of three features, the current position ν_4, the movement origin ν_5, and the movement target position ν_6, each of which can take six values.

さらに、インタラクション状態は、特徴量として「状態ν_７」と「状態ターゲットν_８」とを含む。「状態ν_７」は、「対面状態」「製品の提示状態」「待機状態」「いずれでもない」の４つの状態のいずれかをとり得る。また、「状態ターゲットν_８」は、「ブランドＡ」、「ブランドＢ」、「ブランドＣ」、「いずれでもない」の４つの状態のいずれかをとり得る。 Furthermore, the interaction state includes the features "state ν_7" and "state target ν_8". The "state ν_7" takes one of four values: "face-to-face state", "product presentation state", "standby state", or "none". The "state target ν_8" takes one of four values: "brand A", "brand B", "brand C", or "none".

顧客発話の特徴量は、「発話ＩＤ ν_９」と「キーワードν_１０」とを含む。発話ＩＤについては、顧客発話をクラスタ化した際に決定された発話ＩＤの個数Ｎｕの次元を有し、キーワードについては、顧客発話をクラスタ化した際に抽出された個数Ｎｋの次元を有する。 The customer utterance features include "utterance ID ν_9" and "keyword ν_10". The utterance ID has as many dimensions as the number Nu of utterance IDs determined when the customer utterances were clustered, and the keyword has as many dimensions as the number Nk of keywords extracted during that clustering.
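One possible layout of the spatial and interaction features listed above is a concatenation of one-hot blocks, sketched below. The stop-position names (`stop1` to `stop6`), the dictionary keys, and the function names are hypothetical. The utterance ID and keyword blocks are left out because their sizes Nu and Nk depend on the clustering result.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, choices: Sequence[str]) -> List[int]:
    """1 at the position of `value`, 0 elsewhere."""
    return [1 if c == value else 0 for c in choices]


# Value sets taken from the description of FIG. 21 (names are illustrative).
CUSTOMER_POS = [f"stop{i}" for i in range(1, 7)] + ["none"]     # 7 values
SHOPKEEPER_POS = [f"stop{i}" for i in range(1, 6)] + ["none"]   # 6 values
STATES = ["face_to_face", "present_product", "standby", "none"]  # 4 values
TARGETS = ["brandA", "brandB", "brandC", "none"]                 # 4 values


def encode_spatial(state: Dict[str, str]) -> List[int]:
    """Encode features nu_1 .. nu_8 as concatenated one-hot blocks."""
    vec: List[int] = []
    for key in ("cust_current", "cust_origin", "cust_target"):
        vec += one_hot(state[key], CUSTOMER_POS)      # 3 x 7 = 21 dimensions
    for key in ("shop_current", "shop_origin", "shop_target"):
        vec += one_hot(state[key], SHOPKEEPER_POS)    # 3 x 6 = 18 dimensions
    vec += one_hot(state["state"], STATES)            # 4 dimensions
    vec += one_hot(state["state_target"], TARGETS)    # 4 dimensions
    return vec
```

The customer block has 21 dimensions, which matches the dimensionality of the customer's spatial state stated earlier.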

特に限定されないが、たとえば、結合状態ベクトルは、これらの特徴値ν_iを、とり得る次元について並べ、該当する要素が“１”で、それ以外の要素が“０”であるベクトルとして表現することができる。 Although not limited to this, the combined state vector can be expressed, for example, as a vector in which these feature values ν_i are laid out over their possible dimensions, with the applicable element set to "1" and all other elements set to "0".

（ロボット行動の生成処理） (Robot behavior generation process)

図２２は、オンライン処理部５００の動作を説明するための機能ブロック図である。 FIG. 22 is a functional block diagram for explaining the operation of the online processing unit 500.

すなわち、人の顧客と人の店主との間のインタラクションの観測（「学習データ収集過程」）による訓練データに基づいて訓練処理部４００により予測器が生成（「ロジック学習過程」）された後に、オンライン処理過程において、オンライン処理部５００は、人（この場合は、顧客）に対するロボットのインタラクションを制御する処理を実行する。 That is, after a predictor is generated (“logic learning process”) by the training processing unit 400 based on the training data obtained by observing the interaction between the human customer and the shop owner (“learning data collection process”), In the online processing process, the online processing unit 500 executes a process for controlling the robot's interaction with a person (in this case, a customer).

人間の顧客とロボット店主の間のライブのインタラクション中においても、たとえば、センサネットワーク３０．１〜３０．ｎ，３２．１〜３２．ｍ，３４．１〜３４．ｐは、１秒間隔で顧客の運動および発話を記録する。また、ロボット自身に対しても位置計測センサと、発話計測センサが設けられる。 During live interaction between a human customer and the robot shopkeeper as well, the sensor networks 30.1 to 30.n, 32.1 to 32.m, and 34.1 to 34.p record the customer's movement and utterances at, for example, 1-second intervals. The robot itself is also provided with a position measurement sensor and an utterance measurement sensor.

動作要素抽出部５１０は、センサネットワークからのデータに基づき、顧客の動作解析を行う動作解析部５１０２と、顧客とロボットとの空間内の配置を検出する空間配置検知部５１０４と、顧客の発話を認識する音声認識部５１０６とを含む。動作解析部５１０２，空間配置検知部５１０４および音声認識部５１０６との識別結果により、顧客行動が検知された場合、顧客行動の結合状態ベクトル生成部５１０８は、現在の顧客とロボットとの状態に基づき、結合状態ベクトルを生成する。生成された結合状態ベクトルは、訓練されたナイーブベイズ予測器５２０に入力される。ここで、予測器５２０は、記憶装置３００中の予測器情報のパラメータに基づいて特定されるものであり、予測処理を実行する。 The motion element extraction unit 510 includes a motion analysis unit 5102 that analyzes the customer's motion based on data from the sensor network, a spatial arrangement detection unit 5104 that detects the arrangement of the customer and the robot in the space, and a speech recognition unit 5106 that recognizes the customer's utterances. When a customer behavior is detected from the results of the motion analysis unit 5102, the spatial arrangement detection unit 5104, and the speech recognition unit 5106, the combined state vector generation unit 5108 for customer behavior generates a combined state vector based on the current states of the customer and the robot. The generated combined state vector is input to the trained naive Bayes predictor 520. The predictor 520 is specified by the parameters in the predictor information stored in the storage device 300 and executes the prediction process.

予測器５２０は、訓練データ中から分類された所定の個数、たとえば、４６７個のロボット行動のうちの１つのＩＤを出力するか、あるいは、それは、「行動しない」という予測を返す。 The predictor 520 outputs an ID of one of a predetermined number, for example, 467 robot actions classified from the training data, or it returns a prediction of “no action”.

ロボット行動生成部５３０は、ロボット行動を生成するために、ロボットの現在位置と目標インタラクション状態を達成するのに必要な場所とを比較する。そして、必要ならば移動コマンドをロボット１０００のロボット行動実行モジュール１００２に対して発行する。 The robot action generation unit 530 compares the current position of the robot with a place necessary to achieve the target interaction state in order to generate the robot action. If necessary, a movement command is issued to the robot action execution module 1002 of the robot 1000.

「待機状態」については、店主は、待機状態では、サービスカウンターにいることになるので、状態ターゲットはサービスカウンターになる。製品の提示状態については、状態ターゲットは、顧客の興味のある対象物の位置になり、対面状態に対しては、状態ターゲットは、固定された位置ではなく、むしろ、顧客の目の前の位置ということになる。 For the "standby state", the shopkeeper is at the service counter while on standby, so the state target is the service counter. For the product presentation state, the state target is the position of the object the customer is interested in. For the face-to-face state, the state target is not a fixed position but rather the position directly in front of the customer.

ロボット行動生成部５３０は、ロボット１０００が、状態ターゲットの位置にまだいない場合、その位置の近くの地点に移動することをロボットに命じる。 If the robot 1000 is not yet at the position of the state target, the robot action generation unit 530 instructs the robot to move to a point near that position.

ロボット行動が指定される場合、ロボット行動生成部５３０は、その行動に対応する遅延時間テーブル中で指定された時間だけ待ってから、ロボット１０００に対して、目標地点まで移動せよ、あるいは、発話をせよ、というコマンドを送出する。 When a robot behavior is specified, the robot behavior generation unit 530 waits for the time specified for that behavior in the delay time table and then sends the robot 1000 a command to move to the target point or to speak.

ロボット行動が「製品の提示状態」または「対面状態」のインタラクション状態を含んでいる場合、正確な目標位置が、空間配置の近接的モデルによって計算される。 If the robot behavior includes an interaction state of “product presentation state” or “face-to-face state”, the exact target position is calculated by a proximity model of spatial arrangement.

運動状態である間は、ロボットは、目標位置に到着するまでは、１秒ごとに、顧客の将来位置を予測し、近接モデルによって目標位置を再計算する。 While in motion, the robot predicts the customer's future position every second until it reaches the target position and recalculates the target position with the proximity model.

（遅れのモデル化） (Delay modeling)

まず、上述したロボット行動生成部５３０の処理のうち、遅れ（遅延時間）のモデル化について説明する。 First, among the processes of the robot behavior generation unit 530 described above, the modeling of delay (delay time) is described.

顧客行動と店主応答の間には自然な遅延時間がある。たとえば、ロボットがあまりに速くあるいはあまりにゆっくり答える場合、それは、相手の人間に不自然な印象を与える。 There is a natural delay between customer behavior and storekeeper response. For example, if a robot answers too fast or too slowly, it gives the other person an unnatural impression.

顧客行動と店主からのレスポンスの間の遅延時間を再現するために、各ロボット行動に対応する訓練データからの顧客と店主の行動間の平均の時間遅れを計算し、ロボット行動と平均遅延時間とをマッピングするルックアップテーブルを構築しておく。 To reproduce the delay between a customer behavior and the shopkeeper's response, the average time delay between customer and shopkeeper behaviors is computed from the training data for each robot behavior, and a lookup table mapping each robot behavior to its average delay time is built in advance.
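Building such a lookup table amounts to grouping the observed delays by robot behavior and averaging them. The sketch below assumes the training pairs are available as hypothetical `(robot_action_id, customer_time, shopkeeper_time)` tuples with times in seconds; the function name is illustrative.

```python
from collections import defaultdict
from statistics import mean


def build_delay_table(pairs):
    """pairs: iterable of (robot_action_id, customer_time, shopkeeper_time).
    Returns a dict mapping each robot action ID to its average delay in seconds."""
    delays = defaultdict(list)
    for action_id, t_customer, t_shopkeeper in pairs:
        delays[action_id].append(t_shopkeeper - t_customer)
    return {action_id: mean(ds) for action_id, ds in delays.items()}
```

At run time, the delay for a predicted behavior would be read from this table before the command is sent to the robot.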

たとえば、質問に直接答えるというような、ロボット行動については、遅延時間は通常０〜２．５秒の範囲である。 For example, for robot behavior such as answering questions directly, the delay time is typically in the range of 0 to 2.5 seconds.

また、いくつかの行動については、より長い休止が観察される。例えば、顧客が何も言わない間に、直接、ブランドＡのカメラの領域に入り移動した時は、ロボット行動生成部５３０は、予測器５２０の予測結果に基づいて、ロボット１０００が１７秒の遅れの後に、接近して、顧客に「何かお探しですか」などというような発話による支援を提供することを指示する。 For some behaviors, longer pauses are observed. For example, when a customer walks directly into the brand A camera area without saying anything, the robot behavior generation unit 530, based on the prediction of the predictor 520, instructs the robot 1000 to approach after a delay of 17 seconds and offer spoken assistance such as "Are you looking for something?"

顧客がこの間に別の行動を行なったならば、ロボットはその行動に応答する。したがって、このように、ロボット１０００は、例えば「ウィンドウショッピング」シナリオなどにおいて、生じる長い休止に応答することができる。 If the customer takes another action during this time, the robot responds to that action. Thus, in this way, the robot 1000 can respond to long pauses that occur, such as in a “window shopping” scenario.

（空間配置についての行動生成） (Behavior generation for spatial arrangements)

以下では、「インタラクション状態」についてさらに詳しく説明する。 The "interaction state" is described in more detail below.

図２３は、人・人間のインタラクション状態の例を示す概念図である。 FIG. 23 is a conceptual diagram illustrating an example of a human / human interaction state.

一般には、図２３（ａ）に示すような「オブジェクトの提示状態（本実施の形態では製品の提示）」、図２３（ｂ）に示すような「対面状態」、図２３（ｃ）に示すような「横に並んだ歩行状態」、図２３（ｄ）に示すような「待機状態」などが、インタラクション状態として想定される。 In general, the “presentation state of an object (product presentation in the present embodiment)” as shown in FIG. 23A, the “facing state” as shown in FIG. 23B, and the state shown in FIG. Such “walking state side by side”, “standby state” as shown in FIG. 23D, and the like are assumed as the interaction state.

ただし、本実施の形態では、簡単のために、移動中のインタラクションについては考慮から外すことする。 However, in this embodiment, for the sake of simplicity, the moving interaction is excluded from consideration.

図２４は、「提示状態」について、結合状態ベクトルで表現される位置関係と対応するロボットへの行動生成との関係を示す概念図である。 FIG. 24 is a conceptual diagram showing the relationship between the positional relationship expressed by the combined state vector and the action generation to the corresponding robot for the “presentation state”.

「提示状態」について、センサネットワークから取得され、結合状態ベクトルで表現される位置関係のモデルとしては、たとえば、図２４（ａ）のように、以下のような位置関係であるとする。 As for the “presentation state”, the positional relationship model acquired from the sensor network and expressed by the coupling state vector is assumed to be the following positional relationship as shown in FIG.

ｉ）ロボット・人間の間隔Dist(RH)〜１．２ｍ i) Robot-human distance Dist(RH) ~ 1.2 m
ｉｉ）ロボット・製品間の間隔Dist(RO)〜１．１ｍ ii) Robot-product distance Dist(RO) ~ 1.1 m
ｉｉｉ）人の移動速度Speed(H)〜０ iii) Human moving speed Speed(H) ~ 0
ｉｖ）ロボットの移動速度Speed(R)〜０ iv) Robot moving speed Speed(R) ~ 0
ｖ）製品−ロボット−人の角度Angle(ORH)＜１５０° v) Product-robot-human angle Angle(ORH) < 150°
ｖｉ）製品−人−ロボットの角度Angle(OHR)＜９０° vi) Product-human-robot angle Angle(OHR) < 90°

これに対して、図２４（ｂ）のように、ロボット行動生成部５３０は、対応するロボットへの行動を生成するモデルとして以下のものを採用する。 In contrast, as shown in FIG. 24(b), the robot behavior generation unit 530 adopts the following as the model for generating the corresponding behavior for the robot.

すなわち、まず、ロボットの目標ターゲットを、以下のような条件を満たす位置Ｒ１またはＲ２とする。なお、条件を満たす位置が複数箇所ある場合は、現在位置から最も近い位置を選択するか、あるいは、ランダムにいずれかを選択する構成とすることができる。 That is, first, the target target of the robot is set to a position R1 or R2 that satisfies the following conditions. If there are a plurality of positions that satisfy the condition, a position closest to the current position may be selected, or any one may be selected at random.

ｉ）ロボット・人間の間隔Dist(RH)〜１．２ｍ i) Robot-human distance Dist(RH) ~ 1.2 m
ｉｉ）ロボット・製品間の間隔Dist(RO)〜１．１ｍ ii) Robot-product distance Dist(RO) ~ 1.1 m
ｉｉｉ）ロボットの移動速度Speed(R)〜０ iii) Robot moving speed Speed(R) ~ 0
ｉｖ）（ロボットの向き）Angle(R)＝１／２×（製品−ロボット−人の角度）Angle(ORH)となる向き iv) Robot orientation Angle(R) such that Angle(R) = 1/2 × Angle(ORH), the product-robot-human angle

すなわち、図２４（ａ）のように、センサネットワークからの２人の人間の位置関係および姿勢についての入力に対して、人・人間のインタラクション状態を識別するためのモデルを、「認識モデル」と呼ぶ。したがって、システムは、センサネットワークからのセンシング結果を入力として、空間配置検出部４２０４は、人・人間のインタラクション状態（近接配置）がいずれのパターンであるかを判断する。 That is, a model such as that in FIG. 24(a), which identifies the human-human interaction state from sensor-network input on the positional relationship and posture of two people, is called a "recognition model". In the system, the spatial arrangement detection unit 4204 takes the sensing results from the sensor network as input and determines which pattern the human-human interaction state (proximity arrangement) matches.

また、図２４（ｂ）のように、認識モデルに基づいて検出されたインタラクション状態に対して、ターゲットとなる空間配置（近接配置）となるように、ロボット行動生成部５３０がロボット行動を生成するためのモデルを「生成モデル」と呼ぶ。 Also, as shown in FIG. 24B, the robot behavior generation unit 530 generates robot behavior so that the target spatial arrangement (proximity arrangement) is obtained with respect to the interaction state detected based on the recognition model. The model for this is called a “generation model”.
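The recognition and generation models for the presentation state can be sketched geometrically. The thresholds 1.2 m, 1.1 m, 150° and 90° come from the conditions listed above. The tolerances `tol` and `speed_tol`, the function names, and the 2-D point representation are assumptions, since the text gives only approximate values. Candidate robot positions are taken as the intersections of a circle around the human and a circle around the product.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def angle_at(vertex: Point, p: Point, q: Point) -> float:
    """Unsigned angle p-vertex-q in degrees."""
    a1 = math.atan2(p[1] - vertex[1], p[0] - vertex[0])
    a2 = math.atan2(q[1] - vertex[1], q[0] - vertex[0])
    d = abs(a1 - a2) % (2 * math.pi)
    return math.degrees(min(d, 2 * math.pi - d))


def is_presenting(robot, human, obj, speed_r, speed_h, tol=0.3, speed_tol=0.1) -> bool:
    """Recognition model: check the six presentation-state conditions."""
    return (abs(dist(robot, human) - 1.2) <= tol
            and abs(dist(robot, obj) - 1.1) <= tol
            and speed_h <= speed_tol and speed_r <= speed_tol
            and angle_at(robot, obj, human) < 150
            and angle_at(human, obj, robot) < 90)


def presentation_targets(human: Point, obj: Point, d_rh=1.2, d_ro=1.1) -> List[Point]:
    """Generation model: positions at d_rh from the human and d_ro from the product."""
    d = dist(human, obj)
    if d == 0 or d > d_rh + d_ro or d < abs(d_rh - d_ro):
        return []                                  # circles do not intersect
    a = (d_rh ** 2 - d_ro ** 2 + d ** 2) / (2 * d)
    h = math.sqrt(max(d_rh ** 2 - a ** 2, 0.0))
    mx = human[0] + a * (obj[0] - human[0]) / d
    my = human[1] + a * (obj[1] - human[1]) / d
    ox = h * (obj[1] - human[1]) / d
    oy = h * (obj[0] - human[0]) / d
    return [(mx + ox, my - oy), (mx - ox, my + oy)]


def choose_target(robot: Point, candidates: List[Point]) -> Point:
    """Pick the candidate closest to the robot's current position."""
    return min(candidates, key=lambda p: dist(robot, p))


def bisector_heading(robot: Point, human: Point, obj: Point) -> float:
    """Heading (radians) that bisects the angle product-robot-human."""
    du, dv = dist(robot, obj), dist(robot, human)
    ux, uy = (obj[0] - robot[0]) / du, (obj[1] - robot[1]) / du
    vx, vy = (human[0] - robot[0]) / dv, (human[1] - robot[1]) / dv
    return math.atan2(uy + vy, ux + vx)
```

Choosing the nearest candidate is one of the two selection options mentioned in the text; picking one at random would be the other.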

図２５は、「認識モデル」と「生成モデル」とを対比して説明する概念図である。 FIG. 25 is a conceptual diagram illustrating the comparison between the “recognition model” and the “generation model”.

図２５（ａ）は、認識モデルにより、「提示状態」であると判断される人・人間の近接配置の例である。これに対して、図２５（ｂ）は、認識モデルにより、「提示状態」ではないと判断される人・人間の近接配置の例である。 FIG. 25A is an example of the proximity arrangement of people / humans determined to be in the “presentation state” based on the recognition model. On the other hand, FIG. 25B is an example of the proximity arrangement of a person / human that is determined not to be in the “presentation state” by the recognition model.

図２５（ａ）に示されるように、複数の少しずつ異なる配置も「提示状態」として検出されることになり、これは、つまり、人・人間の近接配置が、抽象化されていることを意味する。一方で、人の姿勢や人の配置が、「認識モデル」から外れている場合は、図２５（ｂ）のように、配置、または、姿勢のいずれか一方は、「認識モデル」の範囲に一致していても、「提示状態」とは検知されない。 As shown in FIG. 25A, a plurality of slightly different arrangements are also detected as “presentation states”, which means that the proximity arrangement of people / humans is abstracted. means. On the other hand, if the posture of the person or the placement of the person is out of the “recognition model”, either the placement or the posture is within the range of the “recognition model” as shown in FIG. Even if they match, the “presentation state” is not detected.

一方で、図２５（ｃ）に示すように、「生成モデル」は、まず、人間の位置（オブジェクト（製品）に対する相対位置）を特定して、この人間の位置に対して、ロボットが、目的とする配置を形成するための移動・運動のコマンドを生成するために使用される。目的とする配置によっては、ロボットがその配置を形成するために移動すべき位置については、複数の可能性がモデルの中に、予め含まれている場合もあり得る。 On the other hand, as shown in FIG. 25(c), the "generation model" first identifies the human's position (relative to the object (product)) and is then used to generate the movement commands with which the robot forms the intended arrangement with respect to that position. Depending on the intended arrangement, the model may contain in advance several possible positions to which the robot can move to form it.

後述する他の空間配置（近接配置）についても、認識モデルと生成モデルが、それぞれ配置の検知とコマンドの生成にそれぞれ使用される。 For other spatial arrangements (proximity arrangements) to be described later, the recognition model and the generation model are respectively used for the arrangement detection and the command generation.

図２６は、「対面状態」について、結合状態ベクトルで表現される位置関係と対応するロボットへの行動生成との関係を示す概念図である。 FIG. 26 is a conceptual diagram illustrating the relationship between the positional relationship represented by the combined state vector and the action generation to the corresponding robot for the “face-to-face state”.

「対面状態」について、センサネットワークから取得され、結合状態ベクトルで表現される位置関係のモデルとしては、たとえば、図２６（ａ）のように、以下のような位置関係であるとする。 For the “face-to-face state”, the model of the positional relationship acquired from the sensor network and expressed by the combined state vector is assumed to be, for example, the following, as shown in FIG. 26A.

ｉ）ロボット・人間の間隔〜１．５ｍ i) Distance between robot and human ~ 1.5 m
ｉｉ）人の移動速度〜０ ii) Moving speed of the human ~ 0
ｉｉｉ）ロボットの移動速度〜０ iii) Moving speed of the robot ~ 0
ｉｖ）人の向き〜人・ロボットの向き iv) Orientation of the human ~ direction from the human to the robot
ｖ）ロボットの向き〜ロボット・人の向き v) Orientation of the robot ~ direction from the robot to the human

これに対して、図２６（ｂ）のように、ロボット行動生成部５３０は、対応するロボットへの行動を生成するモデルとして以下のものを採用する。 In contrast, as shown in FIG. 26B, the robot behavior generation unit 530 adopts the following as the model for generating the corresponding robot behavior.

すなわち、まず、ロボットの目標ターゲットを、以下のような条件を満たす位置Ｒとする。 That is, first, the robot's target is set to a position R that satisfies the following conditions.

ｉ）人の向き〜人・ロボットの向き i) Orientation of the human ~ direction from the human to the robot
ｉｉ）ロボット・人間の間隔〜１．５ｍ ii) Distance between robot and human ~ 1.5 m
ｉｉｉ）ロボットの移動速度〜０ iii) Moving speed of the robot ~ 0
ｉｖ）ロボットの向き〜ロボット・人の向き iv) Orientation of the robot ~ direction from the robot to the human

図２７は、「待機状態」について、結合状態ベクトルで表現される位置関係と対応するロボットへの行動生成との関係を示す概念図である。 FIG. 27 is a conceptual diagram showing, for the “standby state”, the relationship between the positional relationship expressed by the combined state vector and the generation of the corresponding robot behavior.

「待機状態」について、センサネットワークから取得され、結合状態ベクトルで表現される位置関係のモデルとしては、たとえば、図２７（ａ）のように、以下のような位置関係であるとする。 For the “standby state”, the model of the positional relationship acquired from the sensor network and expressed by the combined state vector is assumed to be, for example, the following, as shown in FIG. 27A.

ｉ）ロボット・待機位置間の距離〜０ｍ i) Distance between robot and standby position ~ 0 m
ｉｉ）ロボットの移動速度〜０ ii) Moving speed of the robot ~ 0
ｉｉｉ）ロボット・人間の間隔＞１．５ｍ iii) Distance between robot and human > 1.5 m

すなわち、ロボットは、待機位置に停止しており、人は、ロボットから離れた位置にいるという状態である。 That is, the robot is stopped at the standby position, and the person is at a position away from the robot.
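
The recognition conditions listed for the face-to-face state and the standby state can be read as simple predicates over positions, speeds and orientations. The sketch below is only one possible reading: the `Agent` type, the function names and the tolerance values `DIST_TOL`, `SPEED_TOL` and `ANGLE_TOL` are assumptions introduced here, since the source gives the conditions only as approximate relations ("~").

```python
import math
from dataclasses import dataclass

# Tolerances for the "~" relations; these values are assumptions.
DIST_TOL = 0.3                  # metres
SPEED_TOL = 0.1                 # metres per second
ANGLE_TOL = math.radians(30.0)  # radians


@dataclass
class Agent:
    x: float
    y: float
    speed: float    # m/s
    heading: float  # radians, global frame


def _angle_diff(a, b):
    """Smallest absolute difference between two angles."""
    return abs(math.atan2(math.sin(a - b), math.cos(a - b)))


def _dist(ax, ay, bx, by):
    return math.hypot(ax - bx, ay - by)


def is_face_to_face(human, robot, spacing=1.5):
    """Conditions i) to v) of the face-to-face state."""
    to_robot = math.atan2(robot.y - human.y, robot.x - human.x)
    to_human = math.atan2(human.y - robot.y, human.x - robot.x)
    return (
        abs(_dist(human.x, human.y, robot.x, robot.y) - spacing) <= DIST_TOL
        and human.speed <= SPEED_TOL
        and robot.speed <= SPEED_TOL
        and _angle_diff(human.heading, to_robot) <= ANGLE_TOL
        and _angle_diff(robot.heading, to_human) <= ANGLE_TOL
    )


def is_standby(human, robot, standby_xy, spacing=1.5):
    """Conditions i) to iii) of the standby state."""
    return (
        _dist(robot.x, robot.y, standby_xy[0], standby_xy[1]) <= DIST_TOL
        and robot.speed <= SPEED_TOL
        and _dist(human.x, human.y, robot.x, robot.y) > spacing
    )
```

Because each predicate accepts a range of poses rather than a single one, several slightly different arrangements map to the same state, which matches the abstraction described for the recognition model.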

これに対して、図２７（ｂ）のように、ロボット行動生成部５３０は、対応するロボットへの行動を生成するモデルとして以下のものを採用する。 In contrast, as shown in FIG. 27B, the robot behavior generation unit 530 adopts the following as the model for generating the corresponding robot behavior.

ｉ）ロボット・待機位置間の距離〜０ｍ i) Distance between robot and standby position ~ 0 m
ｉｉ）ロボットの移動速度〜０ ii) Moving speed of the robot ~ 0
ｉｉｉ）ロボットの向き：所定の方向（たとえば、グローバル座標で－９０°） iii) Orientation of the robot: a predetermined direction (for example, -90° in global coordinates)

図２８は、実際に観測された人・人間の位置関係およびそれに対応する人・ロボットの位置関係の図である。 FIG. 28 is a diagram of actually observed human-human positional relationships and the corresponding human-robot positional relationships.

図２８（ａ）に示すように、「提示状態」では、店主（Ｓ）と顧客（Ｃ）とが、製品（Ｏ）の近傍にやや斜めに向かって位置する。これに対応して、顧客（Ｃ）とロボット（Ｒ）とが、製品（Ｏ）の近傍にやや斜めに向かって位置する。 As shown in FIG. 28A, in the “presentation state”, the store owner (S) and the customer (C) are positioned slightly obliquely in the vicinity of the product (O). Correspondingly, the customer (C) and the robot (R) are positioned slightly obliquely in the vicinity of the product (O).

図２８（ｂ）は、対面状態の店主（Ｓ）と顧客（Ｃ）との位置関係を示す。 FIG. 28B shows the positional relationship between the store owner (S) and the customer (C) in a face-to-face state.

図２８（ｃ）は、「待機状態」において、店主（Ｓ）は待機位置（サービスカウンタ）で停止しており、顧客（Ｃ）は、所定のブランドの製品の近傍に停止している。これに対応して、顧客（Ｃ）は、所定のブランドの製品の近傍に停止し、ロボット（Ｒ）は、待機位置に停止している。 In FIG. 28C, in the “standby state”, the store owner (S) stops at the standby position (service counter), and the customer (C) stops near the product of the predetermined brand. Correspondingly, the customer (C) stops near the product of the predetermined brand, and the robot (R) stops at the standby position.

以上説明したように、本実施の形態のシステム１０では、どのような人間の行動に応答して、どのようにロボットが行動を行なわなければならないかを決定するために、離散的になった行動データを調べて、訓練データ中において、顧客と店主の行動の連続する組である行動ペアを特定する。 As described above, in the system 10 of the present embodiment, in order to determine how the robot should act in response to which human behavior, the discretized behavior data is examined and action pairs, each a consecutive pair of a customer action and a store owner action, are identified in the training data.

そして、各行動ペアについては、顧客と店主の行動に対応する結合状態ベクトルおよびロボット行動ベクトルを使用して、機械学習を使用して、予測器を訓練する。 Then, for each action pair, a predictor is trained by machine learning using the combined state vector and the robot action vector that correspond to the actions of the customer and the store owner.

最後に、検知された顧客行動に応じてロボット行動を生成するために、この予測器はオンラインとして使用される。 Finally, this predictor is used online to generate robot behavior in response to detected customer behavior.
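
The three steps summarized above (identify action pairs, train a predictor, use it online) can be sketched as follows. This is a toy illustration only: a k-nearest-neighbour vote stands in for the trained predictor, which is a substitution made here and not the classifier of the embodiment, and the names `train`, `predict` and the example action labels are hypothetical.

```python
from collections import Counter


def train(action_pairs):
    """Keep the (combined_state_vector, robot_action) pairs found in the
    training data. A real system would fit a classifier here instead."""
    return list(action_pairs)


def predict(model, state, k=3):
    """Return the action label most common among the k stored pairs whose
    combined state vector is closest to `state` (squared Euclidean distance)."""
    nearest = sorted(
        model,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], state)),
    )[:k]
    votes = Counter(action for _, action in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical training pairs: two-dimensional state vectors and labels.
    model = train([
        ((0.0, 0.0), "greet"),
        ((0.0, 1.0), "greet"),
        ((5.0, 5.0), "present_product"),
        ((5.0, 6.0), "present_product"),
        ((6.0, 5.0), "present_product"),
    ])
    # Online use: a newly detected customer state selects the robot action.
    print(predict(model, (5.0, 5.0)))
```

In online use, the selected action label would be handed to the generation model for the corresponding proximity arrangement to produce the actual robot command.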

このような構成により、特定の環境において、実際に観測された人・人間のインタラクション行動のデータに基づいて、同様の環境下で、人・ロボット間のインタラクションを行うためのロボットへの行動コマンドを生成することができる。 With this configuration, behavior commands that let a robot carry out human-robot interaction in a similar environment can be generated from data on human-human interaction behavior actually observed in a specific environment.

また、本実施の形態によれば、システムの設計者がシナリオを作成する必要がないため、ロボットの行動生成のための設計者の負荷を大幅に低減できる。 In addition, according to the present embodiment, since it is not necessary for the system designer to create a scenario, it is possible to greatly reduce the designer's load for generating robot behavior.

また、本実施の形態によれば、人間行動の自然な多様性が考慮される場合にも、ロバストなインタラクションのための行動コマンドを作成することが可能である。 Further, according to the present embodiment, it is possible to create a behavior command for robust interaction even when natural diversity of human behavior is considered.

今回開示された実施の形態は、本発明を具体的に実施するための構成の例示であって、本発明の技術的範囲を制限するものではない。本発明の技術的範囲は、実施の形態の説明ではなく、特許請求の範囲によって示されるものであり、特許請求の範囲の文言上の範囲および均等の意味の範囲内での変更が含まれることが意図される。 The embodiment disclosed herein is an example of a configuration for concretely carrying out the present invention and does not limit the technical scope of the present invention. The technical scope of the present invention is indicated by the claims rather than by the description of the embodiment, and is intended to include modifications within the literal scope of the claims and within the range of equivalents.

１０システム、３０．１〜３０．ｎ２Ｄレーザレンジファインダ、３２．１〜３２．ｍ３Ｄレンジファインダ、３４．１〜３４．ｐスマートフォン、２００データ収集モジュール、３００記憶装置、３４０結合状態ベクトルデータ、３４２ロボット行動データ、４００訓練処理部、４１０動作要素抽出部、４２０行動離散化部、４３０結合状態ベクトル生成部、４５０予測器訓練部、４５０２分類器機械学習部。 10 system; 30.1 to 30.n 2D laser range finders; 32.1 to 32.m 3D range finders; 34.1 to 34.p smartphones; 200 data collection module; 300 storage device; 340 combined state vector data; 342 robot action data; 400 training processing unit; 410 motion element extraction unit; 420 behavior discretization unit; 430 combined state vector generation unit; 450 predictor training unit; 4502 classifier machine learning unit.

Claims

第１の状況において装置が第１の参加者と行動によるコミュニケーションを可能とするための行動コマンド生成システムであって、第２の参加者および第３の参加者が行動によるコミュニケーションをとる第２の状況において取得されたデータに基づき、前記装置は、前記第１の状況において前記第２の参加者の代わりとして行動するものであり、
人の行動に関する時系列データを収集するための複数のセンサと、
前記第２の状況において、前記行動の時系列データをクラスタリングして、各クラスタごとに代表行動を決定する行動パターンクラスタ化手段と、
結合状態ベクトルと行動ベクトルとをそれぞれ関連付けるためのベクトル生成手段とを備え、前記結合状態ベクトルは、前記第２の状況において、前記クラスタリングの結果と前記行動の時系列データに基づき、前記第３の参加者の状態と前記第２の参加者の状態とから生成され、各前記行動ベクトルは、前記結合状態ベクトルに対応し前記第２の参加者の後続する代表行動を表し、
前記結合状態ベクトルを入力とし、前記行動ベクトルを出力とする予測器を生成するための予測器生成手段と、
前記第１の状況において、生成された前記予測器により予測された、前記第１の参加者の行動に応答する前記行動ベクトルに応じて、前記装置へのコマンドを生成するためのコマンド生成手段とを備える、行動コマンド生成システム。 A behavior command generation system for enabling a device to communicate through behavior with a first participant in a first situation, wherein, based on data acquired in a second situation in which a second participant and a third participant communicate through behavior, the device acts in place of the second participant in the first situation, the system comprising:
a plurality of sensors for collecting time-series data on human behavior;
behavior pattern clustering means for clustering the time-series behavior data in the second situation and determining a representative action for each cluster;
vector generation means for associating combined state vectors with action vectors, wherein each combined state vector is generated, in the second situation, from the state of the third participant and the state of the second participant based on the result of the clustering and the time-series behavior data, and each action vector corresponds to a combined state vector and represents the subsequent representative action of the second participant;
predictor generation means for generating a predictor that takes the combined state vector as input and outputs the action vector; and
command generation means for generating, in the first situation, a command to the device according to the action vector predicted by the generated predictor in response to the behavior of the first participant.

前記代表行動は、代表発話と代表運動とを含む、請求項１記載の行動コマンド生成システム。 The behavior command generation system according to claim 1, wherein the representative action includes a representative utterance and a representative motion.

前記行動パターンクラスタ化手段は、
観測された前記第２の参加者の発話を発話クラスタに分類する発話クラスタ化手段と、
前記クラスタ内で最も多くの他の発話と語彙上の類似度が最高レベルである発話を選ぶことで、前記発話クラスタごとに１つの代表発話を選択する典型発話抽出手段とを含む、請求項２記載の行動コマンド生成システム。 The behavior command generation system according to claim 2, wherein the behavior pattern clustering means includes:
utterance clustering means for classifying observed utterances of the second participant into utterance clusters; and
typical utterance extraction means for selecting one representative utterance for each utterance cluster by choosing the utterance that has the highest level of lexical similarity to the largest number of other utterances in that cluster.

前記ベクトル生成手段は、
前記第２および第３の参加者の前記行動の区切りを検出して、前記行動の時系列データを離散化するための離散化手段と、
前記区切られた前記第３の参加者の行動を検出したことに応じて、前記第３の参加者の状態と前記第２の参加者の状態とを結合状態ベクトルとして抽出する結合状態抽出手段と、
前記抽出された結合状態ベクトルに対応する前記第２の参加者の後続する代表行動を前記行動ベクトルとして抽出するための行動ベクトル抽出手段と、を含む、請求項２または３記載の行動コマンド生成システム。 The behavior command generation system according to claim 2 or 3, wherein the vector generation means includes:
discretization means for detecting boundaries between actions of the second and third participants and discretizing the time-series behavior data;
combined state extraction means for extracting, in response to detecting a delimited action of the third participant, the state of the third participant and the state of the second participant as a combined state vector; and
action vector extraction means for extracting, as the action vector, the subsequent representative action of the second participant corresponding to the extracted combined state vector.

前記結合状態ベクトルにおける前記第２または第３の参加者の状態は、
前記第２の参加者の空間状態と、
前記第３の参加者の空間状態と、
２人の人間間についての所定の共通の近接配置のうちの１つを含む、請求項４記載の行動コマンド生成システム。 The behavior command generation system according to claim 4, wherein the state of the second or third participant in the combined state vector includes:
the spatial state of the second participant;
the spatial state of the third participant; and
one of a set of predetermined common proximity arrangements between two people.

前記行動パターンクラスタ化手段は、
前記第２または第３の参加者の観測された軌道を、停止セグメントと移動セグメントにセグメント化する軌道セグメント化手段と、
前記停止セグメントを停止クラスタにクラスタ化する空間クラスタ化手段と、
対応する停止クラスタを各々代表する停止位置を特定する停止位置抽出手段とを含む、請求項２記載の行動コマンド生成システム。 The behavior command generation system according to claim 2, wherein the behavior pattern clustering means includes:
trajectory segmentation means for segmenting the observed trajectories of the second or third participant into stop segments and moving segments;
spatial clustering means for clustering the stop segments into stop clusters; and
stop position extraction means for identifying stop positions, each representing a corresponding stop cluster.

前記行動パターンクラスタ化手段は、
前記移動セグメントを移動クラスタにクラスタ化する軌道クラスタ化手段と、
対応する移動クラスタを各々代表する軌道を特定する典型軌道抽出手段とを含む、請求項２記載の行動コマンド生成システム。 The behavior command generation system according to claim 2, wherein the behavior pattern clustering means includes:
trajectory clustering means for clustering the moving segments into moving clusters; and
typical trajectory extraction means for identifying trajectories, each representing a corresponding moving cluster.

前記行動ベクトルは、前記第２の参加者の認識された発話を含む発話クラスタを特定するための情報を含む、請求項３記載の行動コマンド生成システム。 The behavior command generation system according to claim 3, wherein the action vector includes information for identifying the utterance cluster that contains a recognized utterance of the second participant.

前記行動ベクトルは、２人の人間間についての所定の共通の近接配置を含み、
前記コマンド生成手段は、前記共通の近接配置にそれぞれ対応する生成モデルに基づいて、前記コマンドを生成する、請求項８記載の行動コマンド生成システム。 The behavior command generation system according to claim 8, wherein the action vector includes a predetermined common proximity arrangement between two people, and
the command generation means generates the command based on generation models that correspond respectively to the common proximity arrangements.

第１の参加者と行動によるコミュニケーションを可能とするための応答システムであって、
第１の状況において、複数のセンサにより収集された前記第１の参加者の行動に関する時系列データに基づき、人に類似の行動を前記第１の参加者に提示するための装置を備え、前記装置は、第２の参加者（店主）および第３の参加者（顧客）が行動によるコミュニケーションをとる第２の状況において取得されたデータに基づき、前記第１の状況において前記第２の参加者の代わりとして行動するものであり、
前記装置は、
前記第２の状況において前記取得されたデータに基づき生成された結合状態ベクトルと前記第２の参加者の代表行動に対応する行動ベクトルとを関連付けて格納するための記憶装置と、
前記結合状態ベクトルを入力とし、前記行動ベクトルを出力とする予測器と、
前記第１の状況において、生成された前記予測器により予測された、前記第１の参加者の行動に応答する前記行動ベクトルに応じて、前記装置の行動コマンドを生成するためのコマンド生成手段とを含み、
前記代表行動は、前記第２の状況において、前記時系列データをクラスタリングして、各クラスタごとに離散化された単位行動として決定されたものであり、
前記結合状態ベクトルは、前記第２および第３の参加者の前記行動の区切りを検出し前記行動の時系列データを離散化して、区切られた前記第３の参加者の行動を検索キーとして、前記第３の参加者の状態と前記第２の参加者の状態との結合として決定されたものである、応答システム。 A response system for enabling communication through behavior with a first participant, comprising:
a device for presenting human-like behavior to the first participant in a first situation, based on time-series data on the behavior of the first participant collected by a plurality of sensors, wherein the device acts in place of a second participant in the first situation based on data acquired in a second situation in which the second participant (store owner) and a third participant (customer) communicate through behavior,
wherein the device includes:
a storage device for storing, in association with each other, combined state vectors generated based on the data acquired in the second situation and action vectors corresponding to representative actions of the second participant;
a predictor that takes the combined state vector as input and outputs the action vector; and
command generation means for generating, in the first situation, a behavior command for the device according to the action vector predicted by the predictor in response to the behavior of the first participant,
wherein each representative action is determined as a discretized unit action for each cluster by clustering the time-series data in the second situation, and
wherein each combined state vector is determined as the combination of the state of the third participant and the state of the second participant, by detecting boundaries between actions of the second and third participants, discretizing the time-series behavior data, and using each delimited action of the third participant as a search key.

第１の状況において装置が第１の参加者と行動によるコミュニケーションを可能とするための行動コマンド生成方法であって、
第２の参加者および第３の参加者が行動によるコミュニケーションをとる第２の状況において、人の行動に関する時系列データを収集するステップと、
前記第２の状況において、前記行動の時系列データをクラスタリングして、各クラスタごとに代表行動を決定するステップと、
結合状態ベクトルと行動ベクトルとをそれぞれ関連付けるステップとを備え、前記結合状態ベクトルは、前記第２の状況において、前記クラスタリングの結果と前記行動の時系列データに基づき、前記第３の参加者の状態と前記第２の参加者の状態とから生成され、各前記行動ベクトルは、前記結合状態ベクトルに対応し前記第２の参加者の後続する代表行動を表し、
前記結合状態ベクトルを入力とし、前記行動ベクトルを出力とする予測器を生成するステップと、
前記第１の状況において、生成された前記予測器により予測された、前記第１の参加者の行動に応答する前記行動ベクトルに応じて、前記装置が、前記第１の状況において前記第２の参加者の代わりとして行動するように、前記装置へのコマンドを生成するステップとを備える、行動コマンド生成方法。 A behavior command generation method for enabling a device to communicate through behavior with a first participant in a first situation, the method comprising the steps of:
collecting time-series data on human behavior in a second situation in which a second participant and a third participant communicate through behavior;
clustering the time-series behavior data in the second situation and determining a representative action for each cluster;
associating combined state vectors with action vectors, wherein each combined state vector is generated, in the second situation, from the state of the third participant and the state of the second participant based on the result of the clustering and the time-series behavior data, and each action vector corresponds to a combined state vector and represents the subsequent representative action of the second participant;
generating a predictor that takes the combined state vector as input and outputs the action vector; and
generating, in the first situation, a command to the device according to the action vector predicted by the generated predictor in response to the behavior of the first participant, so that the device acts in place of the second participant in the first situation.

前記代表行動は、代表発話と代表運動とを含む、請求項１１記載の行動コマンド生成方法。 The behavior command generation method according to claim 11, wherein the representative action includes a representative utterance and a representative motion.