JP6719745B2

JP6719745B2 - Model learning device, speech word estimation device, model learning method, speech word estimation method, program

Info

Publication number: JP6719745B2
Application number: JP2017058796A
Authority: JP
Inventors: 大塚 和弘 (Kazuhiro Otsuka); 岡田 将吾 (Shogo Okada)
Original assignee: Nippon Telegraph and Telephone Corp; Tokyo Institute of Technology NUC
Current assignee: Nippon Telegraph and Telephone Corp; Tokyo Institute of Technology NUC
Priority date: 2017-03-24
Filing date: 2017-03-24
Publication date: 2020-07-08
Anticipated expiration: 2037-03-24
Also published as: JP2018163400A

Description

The present invention relates to a model learning device, a spoken word estimation device, a model learning method, a spoken word estimation method, and a program.

Technology for recognizing and generating the natural gestures used in communication is indispensable for realizing interfaces, conversational robots, agents, and the like. Research is also under way on modeling the relationship between expressed non-verbal information and its effect on communication, including presentation skill.

However, building recognition and generation models for non-verbal information, hand gestures in particular, is difficult for two reasons. First, hand gestures are produced in relation not only to the utterance content but also to various contexts such as the speaker's attitude and discourse regulation. Second, the way the hands move, how often they move, and the timing of gestures observed during conversation differ from person to person, which makes a general-purpose model hard to build.

Because of this problem, conventional gesture recognition research instructed subjects in advance to perform the same gestures, collected training data, and built models from it. With this data collection approach it was difficult to model the natural gestures that occur during conversation.

As conventional techniques in this field that define categories of motions or gestures in advance and then recognize them, methods are known that extract hand motion features using devices such as motion capture systems or cameras and apply learning models able to capture the structure of time-series data, such as conditional random fields (Non-Patent Document 1) and latent-dynamic conditional random fields (Non-Patent Document 2). In recent years, deep learning (Non-Patent Document 3) has also been used.

Methods that analyze gestures with an unsupervised learning approach, discovering patterns in a series of motion data, have also been proposed. Zhou et al. proposed the HACA (hierarchical aligned cluster analysis, Non-Patent Document 4) algorithm, which discovers patterns by alternating between segmenting continuous time-series data and clustering the segmented patterns. Bozkurt et al. proposed parallel hidden Markov models (PHMMs, Non-Patent Document 5) and a method for discovering primitive patterns of hand motion. Joshi et al. proposed a method that segments time-series data on the basis of gesture classification (Non-Patent Document 6).

Non-Patent Document 1: S. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random fields for gesture recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1521-1527, 2006.
Non-Patent Document 2: L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2007.
Non-Patent Document 3: S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221-231, 2013.
Non-Patent Document 4: F. Zhou, F. De la Torre, and J. K. Hodgins. Aligned cluster analysis for temporal segmentation of human motion. In Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on, pages 1-7. IEEE, 2008.
Non-Patent Document 5: E. Bozkurt, S. Asta, S. Özkul, Y. Yemez, and E. Erzin. Multimodal analysis of speech prosody and upper body gestures using hidden semi-Markov models. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3652-3656. IEEE, 2013.
Non-Patent Document 6: A. Joshi, C. Monnier, M. Betke, and S. Sclaroff. A random forest approach to segmenting and classifying gestures. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 1, pages 1-7. IEEE, 2015.
Non-Patent Document 7: Y. Li, C. Fermuller, Y. Aloimonos, and H. Ji. Learning shift-invariant sparse representation of actions. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2630-2637, 2010.

Although the studies above differ in whether they take a supervised or an unsupervised approach, none of them focuses on recognizing and understanding the gestures that appear during dialogue.

The present invention therefore provides a model learning device that generates a model for estimating spoken words from gesture features, based on the temporal correspondence between the features of gestures made during an utterance and the word vectors of the words contained in that utterance.

The model learning device of the present invention includes a gesture feature acquisition unit, a word vector acquisition unit, and a gesture word association unit.

The gesture feature acquisition unit acquires gesture features, which are features of a gesture, that is, of time-series data of body motion. The word vector acquisition unit acquires word vectors of the words extracted from an utterance. The gesture word association unit associates gesture features with word vectors based on their temporal co-occurrence and generates one model per word. Each model takes a gesture feature as input and classifies whether or not that gesture feature corresponds to the word associated with the model.

According to the model learning device of the present invention, a model that estimates spoken words from gesture features can be generated based on the temporal correspondence between the features of gestures made during an utterance and the word vectors of the words contained in that utterance.

FIG. 1 is a block diagram showing the configuration of the model learning device of the first embodiment.
FIG. 2 is a flowchart showing the operation of the model learning device of the first embodiment.
FIG. 3 is a flowchart showing the operation of the gesture feature acquisition unit of the first embodiment.
FIG. 4 is a flowchart showing the operation of the word vector acquisition unit of the first embodiment.
FIG. 5 is a block diagram showing the configuration of the spoken word estimation device of the second embodiment.
FIG. 6 is a flowchart showing the operation of the spoken word estimation device of the second embodiment.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and redundant description is omitted.

The model learning device of the first embodiment is described below with reference to FIG. 1. As shown in the figure, the model learning device 1 of this embodiment includes a gesture feature acquisition unit 11, a word vector acquisition unit 12, a gesture word association unit 13, and a model storage unit 14.

The gesture feature acquisition unit 11 includes a gesture input unit 111, a gesture section extraction unit 112, and a gesture feature extraction unit 113. The gesture feature extraction unit 113 includes a motion trajectory feature extraction unit 1131, a gesture phase feature extraction unit 1132, and a motion primitive pattern feature extraction unit 1133.

The word vector acquisition unit 12 includes a speech signal input unit 121, a speech section detection unit 122, a speech recognition unit 123, and a word vector construction unit 124.

<Outline of operation of model learning device 1>
An outline of the operation of the model learning device 1 is given below with reference to FIG. 2. As shown in the figure, the gesture feature acquisition unit 11 acquires gesture features, which are features of a gesture, that is, of time-series data of body motion (S11). The word vector acquisition unit 12 acquires word vectors of the words extracted from an utterance (S12). The gesture word association unit 13 associates gesture features with word vectors (words) based on their temporal co-occurrence and generates one model per word, each of which takes a gesture feature as input and classifies whether or not that gesture feature corresponds to the word associated with the model. The generated models are stored in the model storage unit 14 (S13).

The operation of each component of the gesture feature acquisition unit 11 is described below with reference to FIG. 3.

<Gesture input unit 111>
The gesture input unit 111 acquires a gesture (S111). The gesture input unit 111 may be, for example, an optical motion capture system. In that case, time-series data of the three-dimensional coordinates of markers attached to both wrists of the subject can be acquired as the gesture. More specifically, the three-dimensional coordinates obtained from the markers on both arms form a six-dimensional vector, and the time series of this vector can be used as the gesture.

<Gesture section extraction unit 112>
The gesture section extraction unit 112 extracts the motion sections of the gesture from the input gesture (S112). More specifically, the gesture section extraction unit 112 defines a section in which the hand remains at rest on or near the knee as a rest section and every other section as a motion section, and classifies the input gesture at each time into one of these two sections. A hidden Markov model can be used as the method for this two-class classification.
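The two-class labeling described above can be sketched with a two-state hidden Markov model decoded by the Viterbi algorithm. This is only an illustration under stated assumptions: the observation is taken to be a single scalar per frame (the distance of the wrist from its resting position near the knee), the emission distributions are fixed Gaussians, and the function name, the parameter values, and the `stay_prob` transition probability are all hypothetical. The patent does not specify the observation or the parameters, and in practice they would be learned from data.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def viterbi_rest_motion(distances, stay_prob=0.95,
                        rest=(0.0, 1.0), motion=(10.0, 5.0)):
    """Label each frame 0 (rest) or 1 (motion) with a 2-state Gaussian HMM.

    distances: per-frame distance of the wrist from its resting position
    (hypothetical observation). rest/motion: assumed (mean, std) pairs.
    """
    if not distances:
        return []
    params = [rest, motion]
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)
    # Uniform initial distribution over the two states.
    score = [math.log(0.5) + gaussian_logpdf(distances[0], *params[s]) for s in (0, 1)]
    back = []
    for x in distances[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[p] + (log_stay if p == s else log_switch) for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + gaussian_logpdf(x, *params[s]))
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the better final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Consecutive frames labeled 1 would then form one motion section. The high self-transition probability is what keeps isolated noisy frames from splitting a section.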

<Gesture feature extraction unit 113>
The gesture feature extraction unit 113 extracts the features of the gesture contained in each motion section (S113). It computes any one or all of three types of features: features of the hand motion trajectory, features of the gesture phases, and features of the motion primitive patterns. These are computed by the motion trajectory feature extraction unit 1131, the gesture phase feature extraction unit 1132, and the motion primitive pattern feature extraction unit 1133, respectively.

<Motion trajectory feature extraction unit 1131>
The motion trajectory feature extraction unit 1131 extracts features of the hand motion trajectory (S1131). A hidden Markov model may be used as the learning model for configuring the motion trajectory feature extraction unit 1131.

More specifically, the motion trajectory feature extraction unit 1131 first converts the six-dimensional vector time-series data acquired by the gesture input unit 111 into a coordinate system whose origin is the speaker's center of gravity (the mean of the three-dimensional coordinates of the markers on both shoulders). It then smooths the time-series data by applying a filter with a fixed window width, for example a Gaussian filter with a window width of 50 ms. When a gap no longer than a certain duration is observed in the time-series data, the unit interpolates the data. A duration of 500 ms or less can be used as the gap condition, and linear interpolation can be used as the interpolation method.
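The gap-filling step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes one coordinate axis at a time, missing samples marked as `None`, and a known frame period `frame_ms`. The function name and the `None` convention are hypothetical. Only the 500 ms limit and the use of linear interpolation come from the text.

```python
def interpolate_short_gaps(values, frame_ms, max_gap_ms=500):
    """Linearly fill runs of None no longer than max_gap_ms; leave longer runs as None."""
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i  # first valid index after the gap, or n if the gap reaches the end
        gap_ms = (end - start) * frame_ms
        # Interpolation needs valid samples on both sides of the gap.
        if start > 0 and end < n and gap_ms <= max_gap_ms:
            left, right = out[start - 1], out[end]
            span = end - (start - 1)
            for j in range(start, end):
                t = (j - (start - 1)) / span
                out[j] = left + t * (right - left)
    return out
```

Gaps at the very start or end of a recording are left untouched here because there is no second endpoint to interpolate toward. That choice is an assumption of this sketch.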

The three-dimensional coordinates of the left and right markers in frame t are denoted p, and their sequence over the frames forms the coordinate time-series data (P_l for the left arm). From p, a velocity vector is computed as the inter-frame difference, giving the velocity time series VP_l. Next, coordinate time-series data in which individual differences are normalized, IP_l, is computed using meanI and stdI, where meanI is the mean vector of the time-series data observed during the session of subject i and stdI is its standard deviation. For the right-arm marker, P_r, VP_r, and IP_r are computed in the same way as P_l, VP_l, and IP_l. Finally, the difference vector between the marker coordinates of the left and right arms is computed.

The motion trajectory feature extraction unit 1131 defines these seven types of time-series data, 21 dimensions in total, as the motion trajectory feature MTF and computes it.
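As a concrete reading of the seven streams, the sketch below builds one 21-dimensional vector per frame from two wrist tracks. The exact formulas are not legible in this text, so several choices are assumptions: velocity is the plain difference between consecutive frames with a zero vector for the first frame, normalization is a per-axis z-score over the whole session, and the streams are concatenated in the order shown. All function names are hypothetical.

```python
from statistics import mean, pstdev

def velocity(track):
    """Inter-frame difference; the first frame gets a zero vector (assumption)."""
    out = [(0.0, 0.0, 0.0)]
    for prev, cur in zip(track, track[1:]):
        out.append(tuple(c - p for p, c in zip(prev, cur)))
    return out

def normalize(track):
    """Z-score each axis using the session mean and standard deviation."""
    axes = list(zip(*track))
    mus = [mean(a) for a in axes]
    sds = [pstdev(a) or 1.0 for a in axes]  # guard against a constant axis
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sds)) for p in track]

def motion_trajectory_features(left, right):
    """Return one 21-dimensional tuple per frame from two (x, y, z) wrist tracks.

    Layout: left P, VP, IP (0-8), right P, VP, IP (9-17), left-right difference (18-20).
    """
    diff = [tuple(l - r for l, r in zip(pl, pr)) for pl, pr in zip(left, right)]
    streams = [left, velocity(left), normalize(left),
               right, velocity(right), normalize(right), diff]
    return [sum((tuple(s[t]) for s in streams), ()) for t in range(len(left))]
```

Seven streams of three dimensions each give the 21 dimensions stated in the text.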

<Gesture phase feature extraction unit 1132>
The gesture phase feature extraction unit 1132 extracts features of the gesture phases (S1132).

More specifically, the gesture phase feature extraction unit 1132 uses a hidden Markov model to classify the motion sections extracted by the gesture section extraction unit 112 into gesture phases called stroke and hold. For each of stroke and hold, it computes as features the time length, the frequency, and the proportion of the whole section that the phase occupies.

The gesture phases can also include preparation and retraction phases. Specifically, the gesture phase feature extraction unit 1132 computes the time length of the gesture segment (MT) as the number of frames TD in the motion section. As the frequency of stroke and hold segments, it computes, separately for strokes and for holds, the number of such segments contained in the motion section divided by MT. As the proportion of time occupied by strokes and holds, it computes the temporal co-occurrence ratio of hold gestures from the frame lengths of the hold segments, where T^H_i is the frame length of the i-th hold segment, and likewise computes the total time length Sd_L of the stroke segments. These quantities together define the gesture phase feature.
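The phase statistics can be illustrated from a per-frame label sequence. The sketch assumes the phase classification has already been done and that each frame of a motion section carries a label, `"S"` for stroke or `"H"` for hold. The label encoding, the dictionary keys, and the function name are hypothetical. The frequency is the segment count divided by the section length MT, as the text describes, and the ratio is taken here to be the fraction of frames spent in each phase.

```python
from itertools import groupby

def gesture_phase_features(labels):
    """labels: per-frame phase labels of one motion section, e.g. 'S' or 'H'."""
    mt = len(labels)  # section length in frames (MT)
    if mt == 0:
        raise ValueError("empty motion section")
    # Collapse the label sequence into (phase, run_length) segments.
    runs = [(phase, len(list(group))) for phase, group in groupby(labels)]
    feats = {"MT": mt}
    for phase, name in (("S", "stroke"), ("H", "hold")):
        lengths = [n for p, n in runs if p == phase]
        feats[name + "_freq"] = len(lengths) / mt    # segments per frame
        feats[name + "_ratio"] = sum(lengths) / mt   # share of frames in this phase
    return feats
```

Because the result has a fixed number of entries regardless of section length, it fits the description of a fixed-length feature vector.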

<Motion primitive pattern feature extraction unit 1133>
The motion primitive pattern feature extraction unit 1133 extracts features of the short time-series patterns that recur in the hand motion trajectory (motion primitive patterns) (S1133).

The motion primitive pattern feature extraction unit 1133 extracts features of the motion primitive patterns from the motion sequence by unsupervised learning. For this purpose it can use features extracted by shift-invariant sparse coding (SISC). A support vector machine can be used for learning the shift-invariant sparse coding.

SISC is a dictionary learning method in which each code in the dictionary corresponds to a short time-series pattern, so that time-series data is decomposed into a set of short time-series patterns. Each such pattern is called a primitive.

SISC is formulated so that the times at which the primitives occur and the shapes of the primitives are learned alternately. When a meaningless movement is observed in only one hand, values in irrelevant dimensions could be picked up as noise. The method of this embodiment therefore learns primitive patterns separately for each dimension rather than as multidimensional patterns. The proximal gradient method of Non-Patent Document 7 (Gradient Descent, GD) can be used to optimize the parameters.

The 21-dimensional time-series data MTF is the input to SISC. Let f_m[n] be the representation of the m-th dimension of MTF by fixed-length primitive patterns. With signal length N and 0<n<N, f_m[n] is expressed in terms of the primitives φ^k_d[m] and the activation sequences α^k_d[n], combined by the convolution operator *.

Here, φ^k_d[m] is the d-th primitive (0<d<D, 0<m<M). Its pattern length (vector length) is generally short, so that M≪N. The activation sequence α^k_d[n] forms a sparse response and represents the times at which the d-th primitive occurs. The length of α^k_d[n] equals the length N of the input time-series data. The symbol * denotes the convolution operator. Convolving α^k_d[n] with each primitive computes the correlation between each primitive and its activation at each time.

Training optimizes the model parameters φ^k_d[m] and α^k_d[n] by minimizing the squared error between the actual input time-series data and f_m[n]. Adopting the l₁ norm as the regularization term drives many values of α to 0. In the overall optimization problem, λ is a Lagrange multiplier that controls the weight of the l₁ norm of α, and ||φ||²_F≦1 is used as a constraint. The objective function (equation (3)) is non-convex, but it is known to become convex when either α or φ is held fixed, so the two are optimized alternately. After SISC training finishes, features are constructed from α and φ.

The primitive feature SF is computed for each motion section i of time length S_i, using the Dirac delta function δ. The dictionary feature DF is likewise computed for each motion section i of time length S_i. SF_i,d,k indicates the degree of sparseness of primitive pattern d, and df_i,d,k indicates the degree of activation of primitive pattern d. Together these define the gesture primitive feature.
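For orientation, shift-invariant sparse coding of the kind described above is commonly written in the following form. This is a hedged reconstruction from the surrounding definitions and from the standard formulation in the literature, not a transcription of the patent's own equations. In particular, the symbol x^k for the k-th dimension of the input MTF, the summation ranges, and the exact placement of the indices are assumptions.

```latex
% Reconstruction of each input dimension as a sum of primitives
% convolved with their sparse activation sequences (assumed form):
f^{k}[n] = \sum_{d=1}^{D} \alpha^{k}_{d}[n] * \phi^{k}_{d}[m]

% Squared reconstruction error plus an l1 penalty on the activations,
% with the primitives constrained to bounded norm (assumed form):
\min_{\alpha,\,\phi} \;
  \sum_{k} \Bigl\| x^{k}[n] - \sum_{d=1}^{D} \alpha^{k}_{d}[n] * \phi^{k}_{d}[m] \Bigr\|_{2}^{2}
  + \lambda \sum_{k} \sum_{d=1}^{D} \bigl\| \alpha^{k}_{d} \bigr\|_{1}
\quad \text{s.t.} \quad \bigl\| \phi^{k}_{d} \bigr\|_{F}^{2} \le 1
```

With φ fixed, the problem in α is an l₁-regularized least-squares problem, and with α fixed, the problem in φ is a constrained least-squares problem. This is why alternating between the two yields a sequence of convex subproblems.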

The operation of each component of the word vector acquisition unit 12 is described below with reference to FIG. 4.

<Speech signal input unit 121>
The speech signal input unit 121 acquires a speech signal (S121). A microphone, for example, can be used as the speech signal input unit 121.

<Speech section detection unit 122>
The speech section detection unit 122 detects speech sections in the input speech signal (S122). For example, it may extract candidate speech sections by the zero-crossing method and then detect the speech sections using a Gaussian mixture model trained in advance on speech and non-speech sections.
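The candidate-extraction step can be sketched as an amplitude-gated zero-crossing count per frame. The patent names only "the zero-crossing method", so the details below are assumptions for illustration: samples quieter than `min_amp` are ignored, a frame is a candidate when it contains at least `min_crossings` sign changes, and frames do not overlap. The function names and threshold values are hypothetical, and the subsequent Gaussian mixture model stage is not shown.

```python
def count_crossings(frame, min_amp):
    """Count sign changes among samples whose magnitude is at least min_amp."""
    count = 0
    last_sign = 0
    for x in frame:
        if abs(x) < min_amp:
            continue  # too quiet to count
        sign = 1 if x > 0 else -1
        if last_sign and sign != last_sign:
            count += 1
        last_sign = sign
    return count

def candidate_frames(samples, frame_len, min_amp=0.02, min_crossings=2):
    """Return one boolean per non-overlapping frame: True marks a speech candidate."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(count_crossings(frame, min_amp) >= min_crossings)
    return flags
```

Runs of consecutive candidate frames would then be passed to the pretrained speech/non-speech model for the final decision.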

<Speech recognition unit 123>
The speech recognition unit 123 extracts the words contained in each speech section (S123). Automatic speech recognition based on the speech signal can be used for the speech recognition unit 123, or human transcription can be used in place of automatic processing. For example, the result of removing short speech fragments of 700 ms or less may be extracted as the speech sections and then annotated manually.

<Word vector construction unit 124>
The word vector construction unit 124 constructs a word vector for each speech section based on morphological analysis of the recognized word string (S124). More specifically, it takes the character string obtained by the speech recognition unit 123 as input, builds the set of words contained in the utterances by morphological analysis, and constructs a word vector (Bag of Words: BoW) from the set of words contained in each utterance fragment.
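A bag-of-words vector of this kind can be sketched as follows, assuming the morphological analysis has already produced a list of words for each utterance fragment. The function names and the choice of a sorted vocabulary are hypothetical. The example words are illustrative only.

```python
from collections import Counter

def build_vocabulary(fragments):
    """Sorted list of every distinct word across all utterance fragments."""
    return sorted({w for words in fragments for w in words})

def bag_of_words(words, vocabulary):
    """Count vector with one entry per vocabulary word, in vocabulary order."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]
```

For example, with the fragments `["cat", "climbs", "pipe"]` and `["bird", "hits", "cat", "cat"]`, the vocabulary has five entries and the second fragment maps to a vector with a count of 2 in the position for "cat".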

The operation of the gesture word association unit 13 is described in detail below.

<Gesture word association unit 13>
The gesture word association unit 13 generates models by associating gesture features with words based on the temporal co-occurrence between the motion sections of the gesture and the utterance fragments (S13), and stores the models in the model storage unit 14. The following rules can be used.
(1) If an utterance fragment U_y and a motion section G_x co-occur in time, they are associated with each other.
(2) Given the word vector contained in the utterance fragment U_y, every word that is active in it (w_n>1) is paired with the features MF_x of the co-occurring motion section G_x.
(3) w_n and MF_x correspond to the target variable (Y) and the input vector (X) of supervised learning, respectively.

The gesture word association unit 13 carries out steps (1) to (3) above for all utterance fragments to build the data set.
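The pairing rules above can be sketched with simple interval arithmetic. The sketch assumes each utterance fragment and each motion section carries a (start, end) time interval, treats any overlap of the half-open intervals as temporal co-occurrence, and represents the active words of a fragment as a set. The data layout and function names are hypothetical, and the patent does not state a minimum overlap.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def build_dataset(utterances, gestures):
    """Pair gesture features with the words of every co-occurring utterance.

    utterances: list of (interval, words) with words an iterable of active words.
    gestures:   list of (interval, features).
    Returns a list of (features, set_of_words) training pairs.
    """
    pairs = []
    for g_interval, features in gestures:
        for u_interval, words in utterances:
            if overlaps(g_interval, u_interval):
                pairs.append((features, set(words)))
    return pairs
```

A motion section that overlaps no utterance fragment contributes no pair in this sketch. Whether such sections should serve as negative examples is left open here.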

<Models and model storage unit 14>
Each model is a classifier that takes gesture features as input, classifies whether or not the gesture corresponds to a given word (binary classification), and outputs the result. One model is trained for each word.

In the example of FIG. 1, the model storage unit 14 stores N models 14-1 to 14-N (model W1, ..., model WN). The models 14-1 to 14-N can be built by machine learning with the extracted features MF_x as training data.

Since MTF_x is time-series data, a hidden Markov model, which is a learning model for time-series data, can be used as the model. Since GFF_x and GPF_x are fixed-length multivariate vectors, a linear support vector machine can be used as the model. For each model, the gesture fragments are time-synchronized and associated with the word vectors, each word is treated as a category, and a binary classifier is trained to separate the motion fragments that correspond to that word from the motion fragments that correspond to other words.
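The one-classifier-per-word scheme can be illustrated with a small linear model. To keep the sketch free of external libraries, a perceptron is used here as a stand-in for the linear support vector machine named in the text. It is a different learning rule and is chosen only for brevity. The training pairs are assumed to be (fixed-length feature vector, set of co-occurring words), and the function names and the `epochs` parameter are hypothetical.

```python
def train_word_models(pairs, vocabulary, epochs=20):
    """Train one linear binary classifier per word (perceptron stand-in for a linear SVM).

    pairs: list of (feature_vector, set_of_words). For each word, fragments whose
    word set contains it are positives and all other fragments are negatives.
    """
    dim = len(pairs[0][0])
    models = {}
    for word in vocabulary:
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for x, words in pairs:
                y = 1 if word in words else -1
                score = sum(wi * xi for wi, xi in zip(w, x)) + b
                if y * score <= 0:  # misclassified: perceptron update
                    w = [wi + y * xi for wi, xi in zip(w, x)]
                    b += y
        models[word] = (w, b)
    return models

def predict(model, x):
    """True if the model judges that feature vector x corresponds to its word."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0
```

Each word's model sees the same feature vectors with different labels, which matches the description of training one binary classifier per word against all remaining fragments.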

<Effect>
According to the model learning device 1 of this embodiment, a model for estimating spoken words from gesture features can be learned. By using the model learning device 1 to analyze scenes in which a speaker produces gestures, such as when giving an explanation, the correspondence between the words being spoken and the features of the gestures and hand motions is learned, and a model that estimates the words spoken together with a gesture can be built.

The configuration and operation of the spoken word estimation device of the second embodiment are described below with reference to FIGS. 5 and 6. As shown in FIG. 5, the spoken word estimation device 2 includes a gesture feature acquisition unit 11, a spoken word estimation unit 23, and a model storage unit 14. The gesture feature acquisition unit 11 and the model storage unit 14 have the same functions as in the first embodiment.

The spoken word estimation device 2 of this embodiment uses the models learned by the model learning device 1 of the first embodiment. The gesture feature acquisition unit 11 operates as in the first embodiment and acquires gesture features (S11). The spoken word estimation unit 23 estimates the words corresponding to the input gesture features based on the models stored in advance in the model storage unit 14 (S23).
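The estimation step can be sketched by running the gesture features through every stored per-word model and collecting the words whose model accepts them. The sketch assumes each stored model is a linear classifier represented as a (weights, bias) pair and that a positive score means "corresponds to this word". The representation, the ranking by score, and the function name are hypothetical.

```python
def estimate_words(models, features):
    """Return the words whose binary classifier accepts the gesture features.

    models: dict mapping word -> (weights, bias) of a linear classifier.
    The result is ordered from the highest score to the lowest.
    """
    accepted = []
    for word, (weights, bias) in models.items():
        score = sum(w * x for w, x in zip(weights, features)) + bias
        if score > 0:
            accepted.append((score, word))
    return [word for score, word in sorted(accepted, reverse=True)]
```

Because the models are independent binary classifiers, a single gesture may be mapped to several words or to none.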

A device that combines the functions of the model learning device 1 and the spoken word estimation device 2 can also be realized. In that case, the function of the spoken word estimation unit 23 may be added to the gesture word association unit 13 in FIG. 1.

<Effect>
According to the spoken word estimation device 2 of this embodiment, spoken words can be estimated from gesture features using a model learned in advance.

<Performance evaluation experiment>
A performance evaluation experiment on the model learning device 1 and the spoken word estimation device 2 described above is explained below. In this experiment, the group dialogue task was one in which a subject (the explainer) who had watched a video in advance explained its content to a subject (the listener) who had not seen it. The video was the animation "Canary Row" by Warner Bros. (registered trademark). During the group dialogue task, hand gestures expressing the scenes of the animation and the actions of characters such as the cat and the bird were observed accompanying the speech. Directional wireless microphones and recording equipment were used to capture the subjects' speech. The optical motion capture system Mac3D made by Motion Analysis (registered trademark) was used to sense the subjects' face orientation and hand gestures.

From the motion data of the 16 subjects, 443 motion fragments that temporally co-occurred with utterance intervals were extracted. Morphological analysis of the utterance data showed that the explanations by the 16 subjects contained a total vocabulary of 889 words. Of these 889 words, case particles such as "は" (wa) and "の" (no) were removed. Low-frequency words that occurred fewer than 10 times were also excluded because sufficient training data could not be obtained for them. In the end, 356 words were targeted.
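The vocabulary selection above (drop particles, then drop words seen fewer than 10 times) could look like the following sketch. It assumes the morphological analyzer yields (surface form, part of speech) pairs. The tag name `"particle"`, the function name, and the romanized toy tokens are illustrative assumptions.

```python
from collections import Counter


def select_target_words(tokens, min_count=10, excluded_pos=("particle",)):
    """Keep words that are not particles and occur at least min_count times.

    tokens: list of (surface, part_of_speech) pairs from morphological analysis
    """
    counts = Counter(surface for surface, pos in tokens
                     if pos not in excluded_pos)
    return sorted(w for w, c in counts.items() if c >= min_count)


# Hypothetical token stream: a particle, two frequent words, one rare word.
tokens = ([("neko", "noun")] * 12 + [("wa", "particle")] * 30
          + [("iku", "verb")] * 10 + [("tori", "noun")] * 9)

targets = select_target_words(tokens)
```

With `min_count=10`, a word seen exactly 10 times is kept and a word seen 9 times is dropped, which matches the "fewer than 10 times" exclusion rule.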

The motion fragments co-occurring with each of the 356 words were paired with that word as training data, and binary classification was performed. That is, 356 binary classifiers were trained, constructed, and evaluated. The working hypothesis was that if the category of a word can be classified with high accuracy from gesture features, then a common gesture feature expressing that word exists in the speech. The experiment was carried out with 5-fold cross-validation. Since the correspondence between a motion fragment and its co-occurring words is one-to-many, the problem is a multi-label classification task. Because a simple multi-class classification accuracy is difficult to compute in this setting, the evaluation measure adopted was the mean of the classification accuracies of the positive and negative categories.
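The evaluation measure described above, the mean of the accuracy on positive examples and the accuracy on negative examples, together with a 5-fold split, can be sketched as follows. The function names are hypothetical, and the contiguous, unshuffled fold assignment is an assumption; the source does not say how fragments were assigned to folds.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the accuracy on positive examples and on negative examples."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    pos_acc = sum(1 for p in pos if p == 1) / len(pos)
    neg_acc = sum(1 for p in neg if p == 0) / len(neg)
    return (pos_acc + neg_acc) / 2


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size


# Imbalanced toy example: 2 positives, 6 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)

# Fold sizes for the 443 motion fragments mentioned in the experiment.
folds = list(k_fold_indices(443, k=5))
```

Averaging the two per-class accuracies keeps a classifier that always answers "negative" from scoring well when only a few fragments co-occur with a given word.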

As a result of the evaluation, of the 356 word classifiers, the classifiers for 80 words achieved a classification performance of 60.0% or higher. The baseline for binary classification was 56.1% when the total number of samples is taken into account. This value is the threshold above which the accuracy is higher than random (50%) at the p<0.01 level, and it is hereinafter referred to as the random baseline. For the 80 words (classifiers) whose best accuracy was 60.0% or higher, the mean accuracy of the method using the gesture phase and the SISC-based one-dimensional primitive features was 63.4%, and 76 words could be classified with an accuracy of 60% or higher. This exceeds the random baseline of 56.1%, and also exceeds the mean accuracy of 58.0% for the model using only the SISC one-dimensional primitive features and the mean accuracy of 44.9% for the model using the SISC multidimensional primitive features. These results show the effectiveness of the devices and methods described in the embodiments.
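A sample-size-dependent chance threshold of the kind described above can be derived from a binomial test. The sketch below is only one possible way to do this. The source does not state which test or which sample count produced the 56.1% figure, so the one-sided exact binomial test and the function names here are assumptions, and the result need not reproduce that number.

```python
from math import comb


def upper_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 0.5), computed exactly."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def min_significant_correct(n, alpha=0.01):
    """Smallest number of correct answers out of n whose upper-tail
    probability under chance (p = 0.5) is at most alpha."""
    for k in range(n + 1):
        if upper_tail(n, k) <= alpha:
            return k
    return None  # n is too small for any outcome to reach alpha


# Small worked case: 10 trials at the 5% level.
k_small = min_significant_correct(10, alpha=0.05)

# Threshold accuracy for a hypothetical test set of n samples.
n = 443
threshold_accuracy = min_significant_correct(n, alpha=0.01) / n
```

The threshold moves toward 50% as the number of samples grows, which is why the baseline has to be stated together with the sample count.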

In addition, among the 80 words, those classified with an accuracy of 65% or higher were analyzed. The nouns with an accuracy of 65% or higher were "それ" (that), "次" (next), "よう" (manner), "服" (clothes), "上" (top, above), "そこ" (there), "感じ" (feeling), and "時" (time). In most cases the highest accuracy was obtained with the model using the gesture phase features; only for "よう" was it obtained with the model using the SISC-based features.

The verbs with an accuracy of 65% or higher were "入る" (enter), "異なる" (differ), "する" (do), "行く" (go), "試す" (try), "つかまえる" (catch), "くる" (come), "かんがえる" (think), and "たたく" (hit). In most cases the highest accuracy was obtained with the model using the gesture phase features. These verbs describe the actions of the characters in the original video of the explanation task, and the features that represented them well were the gesture phase features and the primitive pattern features.

<Additional notes>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is an example of a physical entity equipped with such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed for the processing of this program. Storage is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device. Data obtained through the processing of these programs is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for the processing of each program are read into memory as necessary and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the constituent elements referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, according to the processing capability of the device that executes them or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or, each time a program is transferred from the server computer to the computer, the computer may execute processing according to the received program in sequence. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.

Claims

A model learning device comprising:
a gesture feature acquisition unit that acquires a gesture feature, which is a feature value of a gesture that is time-series data of body motion;
a word vector acquisition unit that acquires a word vector of a word extracted from an utterance; and
a gesture word association unit that associates gesture features with word vectors based on their temporal co-occurrence and generates a model, the model being a binary classifier that is learned one per word and trained between the motion fragments corresponding to that word and the motion fragments corresponding to other words, the model taking the gesture feature as input and classifying whether or not the input gesture feature corresponds to the word associated with the model.

The model learning device according to claim 1, wherein
the gesture feature includes at least one of a feature value related to a motion trajectory of a hand, a feature value related to a gesture phase, and a feature value related to a motion primitive pattern, which is a common short time-series pattern included in the motion trajectory of the hand.

A spoken word estimation device comprising:
a gesture feature acquisition unit that acquires a gesture feature, which is a feature value of a gesture that is time-series data of body motion; and
a spoken word estimation unit that estimates the word corresponding to the gesture feature based on a model, the model being a binary classifier that is learned one per word and trained between the motion fragments corresponding to that word and the motion fragments corresponding to other words, the model taking the gesture feature as input and classifying whether or not the input gesture feature corresponds to the word associated with the model.

The spoken word estimation device according to claim 3, wherein
the gesture feature includes at least one of a feature value related to a motion trajectory of a hand, a feature value related to a gesture phase, and a feature value related to a motion primitive pattern, which is a common short time-series pattern included in the motion trajectory of the hand.

A model learning method executed by a model learning device, comprising:
a step of acquiring a gesture feature, which is a feature value of a gesture that is time-series data of body motion;
a step of acquiring a word vector of a word extracted from an utterance; and
a step of associating gesture features with word vectors based on their temporal co-occurrence and generating a model, the model being a binary classifier that is learned one per word and trained between the motion fragments corresponding to that word and the motion fragments corresponding to other words, the model taking the gesture feature as input and classifying whether or not the input gesture feature corresponds to the word associated with the model.

A spoken word estimation method executed by a spoken word estimation device, comprising:
a step of acquiring a gesture feature, which is a feature value of a gesture that is time-series data of body motion; and
a step of estimating the word corresponding to the gesture feature based on a model, the model being a binary classifier that is learned one per word and trained between the motion fragments corresponding to that word and the motion fragments corresponding to other words, the model taking the gesture feature as input and classifying whether or not the input gesture feature corresponds to the word associated with the model.

A program that causes a computer to function as the model learning device according to claim 1 or 2.

A program that causes a computer to function as the spoken word estimation device according to claim 3 or 4.