JP2021072059A - Gesture detection device

Info

Publication number: JP2021072059A
Application number: JP2019200201A
Authority: JP
Inventors: Benitez Gibran; Munenori Ukita; Yoshiyuki Tsuda
Original assignee: Denso Corp; Toyota Gakuen
Current assignee: Denso Corp; Toyota Gakuen
Priority date: 2019-11-01
Filing date: 2019-11-01
Publication date: 2021-05-06
Anticipated expiration: 2039-11-01
Also published as: JP7216373B2

Abstract

To provide a gesture detection device capable of detecting a gesture operation without setting a specific area to turn on a gesture function. SOLUTION: A gesture detection device identifies a hand area, in which a hand of an operator is present, from imaged data of a distance image obtained by capturing the hand and a wide range other than the hand. Then, the gesture detection device extracts a gesture section in which a gesture operation is performed on the basis of the feature quantity of a hand area image 41a, obtained by cutting out the hand area from a whole image 40, and the feature quantity of the whole image 40, and detects the gesture operation. Thus, when the gesture operation is performed on a wide screen such as a WSD, the gesture detection device can detect the gesture operation without setting a specific area to turn on the gesture function. SELECTED DRAWING: Figure 3B

Description

The present invention relates to a gesture detection device that detects a gesture motion made by the movement of an operator's hand or fingers.

Patent Document 1 discloses a conventional gesture detection device for detecting a gesture motion. This device turns on the gesture function when the operator's finger enters a specific area, so that gesture recognition can be switched on and off without the user performing any special operation other than the gesture.

In this specification, a "gesture motion" means a predetermined motion made with the hand or fingers in order to operate a device or the like, for example the operator moving the hand or fingers in one predetermined direction such as up, down, left, or right, or moving a finger clockwise or counterclockwise.

Japanese Unexamined Patent Publication No. 2013-077229

Applying device operation based on gesture motions (hereinafter referred to as gesture operation) to a wide screen such as a windshield display in a vehicle (hereinafter referred to as WSD) is under study.

However, if the gesture motion detection method of the gesture detection device of Patent Document 1 is applied, the specific area for turning on the gesture function must be widened to match the screen. As a result, motions other than gesture motions are erroneously detected as gesture motions.

In view of the above, an object of the present invention is to provide a gesture detection device capable of detecting a gesture motion without setting a specific area for turning on the gesture function.

To achieve the above object, the gesture detection device according to claim 1 has: a hand detection unit (11) that receives captured data of an entire image (40) including the operator's hand and identifies, from the captured data, a hand region (41) in which the hand is present; an overall feature extraction unit (12a) that extracts the feature amount of the entire image on the basis of the captured data of the entire image; a hand region feature extraction unit (12b) that cuts out a hand region image (41a), which is the image of the hand region, from the entire image and extracts the feature amount of the hand region image; a feature combining unit (12c) that combines the feature amount of the entire image extracted by the overall feature extraction unit with the feature amount of the hand region image extracted by the hand region feature extraction unit; a time-series pattern classification unit (12d) that classifies time-series patterns of the combined feature amount and extracts a gesture section from the start frame to the end frame of a gesture motion; and a gesture identification unit (14) that identifies the gesture motion on the basis of the image frames in the gesture section extracted by the time-series pattern classification unit.

Such a gesture detection device identifies the hand region, in which the operator's hand is present, from captured data of a distance image covering the hand and a wide range other than the hand. It then extracts the gesture section in which a gesture motion is performed, on the basis of the feature amount of the hand region image cut out from the entire image and the feature amount of the entire image, and detects the gesture motion. Consequently, when a gesture motion is performed on a wide screen such as a WSD, the gesture motion can be detected without setting a specific area for turning on the gesture function.

The parenthesized reference numerals attached to each component indicate an example of the correspondence between that component and the specific components described in the embodiments below.

A block diagram of the gesture detection device according to the first embodiment of the present invention.
An example of the entire image captured by the imaging device.
The entire image captured by the imaging device with the hand region masked.
The hand region being cut out from the entire image.
The hand region cut out from the entire image and resized to the image size of the entire image.
A flowchart of the method for specifying the hand region.
Feature amounts calculated from hand images and from non-hand images plotted on XY coordinates, with a threshold set.
A small region being cut out from the entire image to extract the feature amount of the small region.
Extraction of feature amounts from the captured data.
The time-series pattern classification method.
A flowchart of the process executed by the gesture detection device.
The combination of feature amounts described in the second embodiment.
A flowchart of time-series pattern classification.
The feature amounts from the current frame back to N frames earlier, prepared in step S300 in FIG. 11A.
A block diagram of the gesture detection device according to the fourth embodiment.

Embodiments of the present invention are described below with reference to the drawings. In the following embodiments, parts that are identical or equivalent to each other are given the same reference numerals.

(First Embodiment)
The first embodiment will be described. The gesture detection device according to the present embodiment is applied, for example, to gesture operation of an information device in a vehicle. One example of the operated device is a WSD: the gesture detection device according to the present embodiment is used to detect gesture motions when a gesture operation is performed on the display image shown on the display unit of the WSD.

As shown in FIG. 1, the gesture detection device 10 according to the present embodiment detects gesture motions on the basis of captured data from the imaging device 20 provided in the vehicle interior. The gesture detection device 10 then outputs gesture detection data indicating the detection result to the information device 30, so that the information device 30 is operated in accordance with the gesture motion.

The imaging device 20 continuously captures images of an area including the operator's hand and inputs the captured data to the gesture detection device 10. It is a device capable of capturing a distance image, for example an in-vehicle camera that can capture 2D near-infrared images or visible-light images.

The information device 30 may be any device that uses the detection results of the gesture detection device 10; as noted above, one example is a WSD. A WSD shows, for example, the map display and route guidance display of a navigation system, and the operator can use gesture motions to switch the displayed map between wide-area and detailed views, set route guidance, and so on. During automated driving, for instance, the WSD can also display a game, and the operator can use gesture motions to operate icons in the game.

The gesture detection device 10 is composed of a microcomputer provided with a CPU, ROM, RAM, I/O, and the like, and constitutes a control unit. In the present embodiment, the gesture detection device 10 includes a hand detection unit 11, a gesture section extraction unit 12, a memory 13, and a gesture identification unit 14.

The hand detection unit 11 identifies the region in which the operator's hand is present (hereinafter referred to as the hand region) from the captured data input from the imaging device 20, and passes data on that region to the gesture section extraction unit 12. The method by which the hand detection unit 11 specifies the hand region is described later.

The gesture section extraction unit 12 extracts the gesture section, which is the section in which a gesture motion is being performed. Specifically, the gesture section extraction unit 12 includes an overall feature extraction unit 12a, a hand region feature extraction unit 12b, a feature combining unit 12c, and a time-series pattern classification unit 12d.

The overall feature extraction unit 12a takes the captured data of the entire image captured by the imaging device 20 as input and extracts the feature amount of the entire image. The feature amount may be extracted from the entire image 40 itself as captured by the imaging device 20, as shown in FIG. 2A, or from the entire image 40 with the hand region 41 detected by the hand detection unit 11 masked, as shown in FIG. 2B. When masking, the hand region 41 detected by the hand detection unit 11 may be filled with an image that is unlikely to appear in the images captured by the imaging device 20, so that information on the hand position is embedded in the feature amount. Examples of such an image include one whose color differs greatly from the colors that can be captured, such as black or white. As for the color difference, it is sufficient that at least one of the hue, saturation, and lightness differs by a threshold or more from the color assumed for the hand region 41. The feature extraction method used by the overall feature extraction unit 12a is described later.
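The masking option can be illustrated with a minimal sketch. The function name `mask_hand_region`, the `(top, left, bottom, right)` box convention, and the use of a single fill value standing in for "black" are assumptions made for illustration; the source does not specify how the mask is implemented.

```python
def mask_hand_region(image, box, fill=0):
    """Return a copy of `image` (a list of rows) with `box` filled by `fill`.

    `box` is (top, left, bottom, right) with bottom/right exclusive.
    This box layout is an assumed convention, not taken from the source.
    """
    top, left, bottom, right = box
    masked = [row[:] for row in image]  # copy so the original frame is kept
    for r in range(top, bottom):
        for c in range(left, right):
            masked[r][c] = fill
    return masked


# Toy 4x6 grayscale frame with a hand box covering rows 1-2, columns 2-4.
whole = [[5] * 6 for _ in range(4)]
masked = mask_hand_region(whole, (1, 2, 3, 5))
```

Because the filled rectangle sits where the hand was, its position remains visible to the feature extractor even though the hand's appearance is removed.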

As shown in FIG. 3A, the hand region feature extraction unit 12b cuts out the portion of the entire image 40 captured by the imaging device 20 that corresponds to the hand region 41 detected by the hand detection unit 11. The hand region feature extraction unit 12b then takes the cut-out image (hereinafter referred to as the hand region image) 41a as input and extracts its feature amount. As shown in FIG. 3B, the hand region image 41a may first be resized to the image size of the entire image 40. This gives the feature amount of the hand region image 41a the same weight as that of the entire image 40, so that the hand region features are emphasized and appear readily instead of being buried in the features of the entire image 40.
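The cut-out and resize steps might look like the following sketch. Nearest-neighbour resampling is an assumption chosen for brevity; the source only says the hand region image is matched to the size of the entire image. The helper names `crop` and `resize_nearest` are hypothetical.

```python
def crop(image, box):
    """Cut out box = (top, left, bottom, right), bottom/right exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image, out_h, out_w):
    """Resize by nearest-neighbour sampling (an assumed interpolation)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy 4x6 frame whose pixel value encodes its position (10 * row + column).
whole = [[10 * r + c for c in range(6)] for r in range(4)]
hand = crop(whole, (1, 2, 3, 5))                        # 2 rows x 3 columns
hand_full = resize_nearest(hand, len(whole), len(whole[0]))
```

After resizing, `hand_full` has the same shape as `whole`, so the same feature extractor configuration can be applied to both inputs.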

When 2D near-infrared images are used for the imaging device 20, the image shows, for example, colors corresponding to the temperature distribution as a shading pattern; with visible-light images, brightness appears as the shading pattern. In either case, the feature amount can be extracted by the same method.

The feature combining unit 12c combines the feature amount of the entire image 40 extracted by the overall feature extraction unit 12a with the feature amount of the hand region image 41a extracted by the hand region feature extraction unit 12b. Preferably, this combination also compresses the data. Various methods of combining feature amounts exist and any of them may be used; examples include combining by a weighted sum and simply concatenating the data representing the feature amounts. Combining by a weighted sum also achieves data compression.
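The two combination methods named above can be compared in a short sketch. The element-wise weighted sum with fixed weights of 0.5 is a simplifying assumption; in a trained network the weights would be learned and each output node could depend on every input node.

```python
def combine_weighted_sum(whole_feat, hand_feat, w_whole=0.5, w_hand=0.5):
    """Element-wise weighted sum; output length equals the input length."""
    if len(whole_feat) != len(hand_feat):
        raise ValueError("feature vectors must have the same length")
    return [w_whole * a + w_hand * b for a, b in zip(whole_feat, hand_feat)]


def combine_concat(whole_feat, hand_feat):
    """Simple concatenation; output length is the sum of the input lengths."""
    return list(whole_feat) + list(hand_feat)


summed = combine_weighted_sum([1.0, 2.0], [3.0, 4.0])
joined = combine_concat([1.0, 2.0], [3.0, 4.0])
```

The weighted sum keeps the combined vector at the size of one input, which is the data compression effect mentioned in the text, whereas concatenation doubles it.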

The time-series pattern classification unit 12d classifies the time-series pattern of the feature amounts combined by the feature combining unit 12c and detects the start frame of the gesture motion from among the past image frames. Taking the current image frame (hereinafter referred to as the current frame) as the end frame of the gesture motion, it then extracts the gesture section running from the start frame to the end frame. Various methods can be used to classify time-series patterns; here, a recurrent neural network (hereinafter referred to as RNN) is used.

The RNN is trained so that, each time a new feature amount is input, it assumes the current frame is the end frame of the gesture motion and outputs N likelihoods, one for each assumption that the start frame lies {1, 2, ..., N} frames earlier. The start frame of the gesture motion is determined from the largest of the N likelihoods, and the gesture section from the start frame to the end frame is extracted with the current frame as the end frame. Specifically, when the i-th node, which holds the largest of the N likelihoods, exceeds a threshold, the frame i frames earlier is taken as the start frame and the gesture section is extracted. This time-series pattern classification method is also described later.

The memory 13 constitutes a non-transitory tangible storage medium, such as a ROM or RAM, that stores the programs for the various processes executed by the hand detection unit 11, the gesture section extraction unit 12, and the gesture identification unit 14, as well as various data such as matching data. The various processes are executed using the programs and data stored in the memory 13.

The gesture identification unit 14 identifies the gesture motion from the images of the hand region detected by the hand detection unit 11, on the basis of each frame in the gesture section extracted by the gesture section extraction unit 12, that is, the frames from the start frame to the end frame. Because the hand region has already been specified, a conventional identification method may be used. For example, the gesture identification unit 14 can identify the gesture motion by comparing it with the various gesture motion patterns stored in advance in the memory 13 and selecting the closest pattern. When the gesture detection device 10 detects a gesture motion in this way, the detection result is passed to the information device 30, which is then operated in accordance with the gesture motion.

Next, the method for specifying the hand region, the feature extraction method, and the time-series pattern classification method mentioned above are described in order.

(1) Method for specifying the hand region
The hand region is specified according to the flowchart shown in FIG. 4. Specifically, in step S100, a part of the entire image 40 input from the imaging device 20 is cut out and its feature amount is extracted; then, in step S110, the feature amount is collated with a pre-learned pattern. This yields the "hand-likeness" of the cut-out partial image, that is, the likelihood that it is a hand.
The collation is performed by reading the learned pattern from the memory 13, where it is stored. When many hand images and many non-hand images are prepared and the feature amount of each image is calculated, the feature amounts of the hand images differ from those of the non-hand images, and a value located midway between them is used as the learned pattern.

For example, as shown in FIG. 5, when the feature amounts calculated from hand images and those calculated from non-hand images are plotted on XY coordinates, they fall into separate regions, and the boundary line dividing the two regions is used as the learned pattern.

Then, in step S120, the hand region is specified on the basis of the hand-likeness of the cut-out partial image. Specifically, whether the part cut out from the entire image 40 corresponds to the hand region is judged by how far its feature amount lies from the boundary line stored as the learned pattern, toward the side of the feature amounts calculated from hand images. If the likelihood is equal to or greater than a threshold, the cut-out part is judged to be the hand region.
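A minimal sketch of this decision, assuming the learned boundary is a straight line `weights . x + bias = 0` in a two-dimensional feature space as in FIG. 5, and that the hand side is the positive side. The names `handness` and `is_hand`, the linear form of the boundary, and the default threshold of 0 are illustrative assumptions, not details given in the source.

```python
import math


def handness(feature, weights, bias):
    """Signed distance from the boundary; positive means the hand side."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, feature)) + bias) / norm


def is_hand(feature, weights, bias, threshold=0.0):
    """Judge the patch to be a hand if its score reaches the threshold."""
    return handness(feature, weights, bias) >= threshold


# Assumed boundary: the vertical line x = 2, with hands on the right.
weights, bias = (1.0, 0.0), -2.0
score_right = handness((5.0, 1.0), weights, bias)
score_left = handness((1.0, 1.0), weights, bias)
```

Raising `threshold` above zero would require a patch to lie further inside the hand side before it is accepted.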

For example, as shown in FIG. 6, the entire image 40 is divided into m × n pixels, a small region 42 containing the pixel of interest and its surrounding pixels is set, for example a region of 50 × 50 or 100 × 100, and the feature amount is extracted from the image of the small region 42. The feature extraction method described later may also be used here.
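Cutting out a small region around a pixel of interest could be sketched as follows. Clamping the window so that it stays inside the image near the borders is an assumption; the source does not say how border pixels are handled. The function name `small_region` is hypothetical.

```python
def small_region(image, row, col, size):
    """Return a size x size window roughly centred on (row, col).

    Near the borders the window is shifted inward so that it stays
    inside the image (an assumed border policy).
    """
    half = size // 2
    height, width = len(image), len(image[0])
    top = min(max(row - half, 0), height - size)
    left = min(max(col - half, 0), width - size)
    return [r[left:left + size] for r in image[top:top + size]]


# Toy 5x5 image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(5)] for r in range(5)]
centre = small_region(image, 2, 2, 3)
corner = small_region(image, 4, 4, 3)
```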

(2) Feature extraction method
Various methods can be used to extract feature amounts; here, a convolutional neural network (hereinafter referred to as CNN) is used. A CNN extracts image features by convolution operations, applies a process called pooling, and stacks these into a network of many layers to extract the feature amount.

In the convolution layer, as in FIG. 6, the feature amount is obtained by calculating a weighted sum of the data in a small region containing the pixel of interest and its surrounding pixels. For a 3 × 3 region, for example, the weighted sum of 9 pixels is calculated. The small region is then shifted little by little, and the weighted sum is calculated over the whole of the target image. When the overall feature extraction unit 12a extracts features from the entire image 40, the weighted sum is calculated over the whole of the entire image 40; when the hand region feature extraction unit 12b extracts features from the hand region image 41a, the weighted sum is calculated over the whole of the hand region.

For example, the weighted sum of the data in an arbitrary small region of the entire image 40 is calculated, as shown in state A of FIG. 7. The whole of the target image is then scanned and the weighted sums are calculated, giving the feature amount of the whole image shown in state B of FIG. 7.
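The scanning weighted sum can be written out directly. This sketch uses a stride of one pixel and no padding, both of which are assumptions; the kernel values stand in for learned weights.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and take a weighted sum at each position.

    Stride 1 and no padding are assumed, so the output is smaller than
    the input by (kernel size - 1) in each dimension.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 3x3 kernel of ones sums the 9 pixels under it.
image = [[1] * 4 for _ in range(4)]
kernel = [[1] * 3 for _ in range(3)]
feature_map = convolve2d(image, kernel)
```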

In the pooling layer, the data in each small region is summarized into a representative value, such as the maximum or the average, to reduce the data. In other words, the amount of data is reduced while the characteristics of the weighted sums obtained in the convolution layer are retained. Here the small regions are shifted so that they do not overlap each other. The pooling layer shown in state C of FIG. 7 is obtained in this way.
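A sketch of pooling with non-overlapping windows, taking the maximum as the representative value. The 2 × 2 window size is an assumed default, and any rows or columns left over when the size does not divide evenly are dropped in this sketch.

```python
def max_pool(feature_map, size=2):
    """Replace each non-overlapping size x size block by its maximum."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool(feature_map)
```

Replacing `max` with an average over the block would give the other representative value mentioned in the text.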

By stacking these convolution and pooling layers, the feature amount based on the shading pattern of the target image is extracted, forming the feature data of the target image shown in state D of FIG. 7.

An activation function can also be applied when obtaining each layer. The activation function applies a non-linear transformation to the output values of each layer to improve expressive power. It can be applied to either the convolution layer or the pooling layer, or to both. Applying an activation function makes it possible to cope even when the boundary line used to set the threshold is more linear.

In the fully connected layer, the feature data of the target image obtained by stacking the convolution and pooling layers is combined into a single node, and feature variables are output. For example, the fully connected layer calculates a weighted sum of all the feature data. The fully connected layer shown in state E of FIG. 7 is obtained in this way.
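The fully connected step can be sketched as flattening the stacked feature maps and taking one weighted sum per output node. The bias terms and the specific weight values are assumptions for illustration; real weights would come from training.

```python
def fully_connected(feature_maps, weights, biases):
    """Flatten all feature maps, then compute one weighted sum per output."""
    flat = [v for fmap in feature_maps for row in fmap for v in row]
    return [
        sum(w * x for w, x in zip(node_weights, flat)) + b
        for node_weights, b in zip(weights, biases)
    ]


# One 2x2 feature map and two output nodes (weights are placeholders).
feature_maps = [[[1, 2], [3, 4]]]
weights = [[1, 1, 1, 1], [1, 0, 0, -1]]
biases = [0, 10]
features = fully_connected(feature_maps, weights, biases)
```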

(3) Time-series pattern classification method
Using the methods above, as shown in FIG. 8, the overall feature extraction unit 12a extracts the feature amount of the entire image and the hand region feature extraction unit 12b extracts the feature amount of the hand region image 41a, and the feature combining unit 12c combines the two. The time-series pattern is then classified on the basis of the combined feature amount. Here, the nodes of the combined feature amount are obtained by calculating a weighted sum of the feature nodes of the entire image (hereinafter referred to as overall feature nodes) and the feature nodes of the hand region image (hereinafter referred to as hand region feature nodes).

The time-series pattern is classified by the RNN. Specifically, the internal state of the current frame is calculated as a weighted sum of the nodes of the combined feature amount and the nodes representing the previous internal state, obtained at the frame immediately before the current frame. The output of the RNN is then obtained by calculating a weighted sum of all the data representing the calculated internal state. The internal state of the current frame obtained here is stored as the nodes representing the previous internal state and is used when calculating the internal state of the next frame.
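One recurrent step as described above might be sketched as follows. The `tanh` applied to the internal state and the sigmoid applied to the outputs are assumptions (the source only describes the weighted sums), and the weight matrices `w_in`, `w_rec`, and `w_out` are hypothetical names for learned parameters.

```python
import math


def weighted_sum(weights, values):
    return sum(w * v for w, v in zip(weights, values))


def rnn_step(x, h_prev, w_in, w_rec, w_out):
    """Compute the new internal state and the N output likelihoods.

    x      : combined feature amount for the current frame
    h_prev : internal state stored from the previous frame
    """
    h = [
        math.tanh(weighted_sum(w_in[k], x) + weighted_sum(w_rec[k], h_prev))
        for k in range(len(h_prev))
    ]
    y = [1.0 / (1.0 + math.exp(-weighted_sum(row, h))) for row in w_out]
    return h, y


# With all-zero weights the state stays 0 and every output is 0.5.
h0, y0 = rnn_step([1.0], [0.0], [[0.0]], [[0.0]], [[0.0], [0.0]])
# With a non-zero input weight the state moves away from 0.
h1, y1 = rnn_step([1.0], [0.0], [[1.0]], [[0.0]], [[1.0], [-1.0]])
```

The returned state would be passed back in as `h_prev` on the next frame, which is how information from earlier frames reaches the current output.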

As shown in the figure, the output of the RNN is represented as nodes 1 to N. With the current frame assumed to be the end frame, the data of each node indicates the likelihood that the start frame lies {1, 2, ..., N} frames earlier. The largest of these likelihood values is selected, and it is judged whether it exceeds a threshold. If it does, the number of the node holding the largest value indicates how many frames before the current frame, taken as the end frame, the start frame lies.

As an example, the likelihood of each node is expressed as a value from 0 to 1, and the threshold is set to, for example, 0.5. If the likelihood of the second node of the RNN output is the largest among all nodes and exceeds the threshold of 0.5, the start frame is calculated to be two frames before the current frame. The gesture section from the start frame to the end frame of the gesture motion is thus extracted, and the extraction result is passed to the gesture identification unit 14, which identifies the gesture motion.
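The selection rule in this example can be expressed as a small function. The name `extract_gesture_section` and the representation of the section as a `(start, end)` pair of frame indices are assumptions for illustration.

```python
def extract_gesture_section(likelihoods, current_frame, threshold=0.5):
    """Return (start_frame, end_frame), or None if no gesture is detected.

    likelihoods[k] is the likelihood that the gesture started k + 1
    frames before the current frame, which is assumed to be the end frame.
    """
    best = max(range(len(likelihoods)), key=likelihoods.__getitem__)
    if likelihoods[best] <= threshold:
        return None
    frames_back = best + 1  # node numbers start at 1
    return (current_frame - frames_back, current_frame)


# The second node is the largest and exceeds 0.5: start is 2 frames back.
section = extract_gesture_section([0.1, 0.8, 0.3], current_frame=100)
nothing = extract_gesture_section([0.1, 0.2, 0.3], current_frame=100)
```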

The gesture detection device 10 according to the present embodiment is configured as described above, and performs hand region identification, feature quantity extraction, time-series pattern classification, and gesture operation identification.

FIG. 9 is a flowchart of the processing executed by the gesture detection device 10. This processing is executed at every predetermined control cycle based on the imaging data from the imaging device 20. When the gesture detection device 10 is mounted on a vehicle, it is powered on when a vehicle start switch such as an ignition switch is turned on, and each process shown in FIG. 9 is then executed at every predetermined control cycle.

First, in step S200, the imaging data of the distance image from the imaging device 20 is input. Then, in step S210, the hand region is identified. This processing is performed by the hand detection unit 11 based on the hand region identification method described above. After that, the feature quantity of the whole image 40 is extracted in step S220, and the feature quantity of the hand region image is extracted in step S230. This processing is performed by the overall feature extraction unit 12a and the hand region feature extraction unit 12b based on the feature quantity extraction method described above.

Next, the process proceeds to step S240, where the feature quantity of the whole image 40 and the feature quantity of the hand region image 41a extracted in steps S220 and S230 are combined. This processing is performed by the feature coupling unit 12c, for example by calculating a weighted sum. The process then proceeds to step S250, where the certainty is calculated for each case in which the current frame is assumed to be the end frame and the frame {1, 2, ..., N} frames earlier is assumed to be the start frame. Subsequently, in step S260, taking the i-th node to be the node with the maximum certainty among those obtained in step S250, it is determined whether the certainty of the i-th node exceeds the threshold. If the determination is affirmative, the process proceeds to step S270; if it is negative, the process ends and returns to step S200. In step S270, because the number i of the node with the maximum value indicates how many frames before the current frame the start frame lies when the current frame is taken as the end frame, the frames from i frames earlier up to the current frame are extracted as the gesture section. The processing of steps S240 to S270 is performed by the time-series pattern classification unit 12d based on the time-series pattern classification method described above.

Finally, the process proceeds to step S280, where the gesture operation in the gesture section extracted in step S270 is identified. In this way, the gesture detection device 10 detects the gesture operation.
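One control cycle of the flowchart, steps S210 to S280, can be outlined as follows. This is only a structural sketch: each stage is passed in as a callable, and the names `run_cycle`, `detect_hand`, `extract_whole`, `extract_hand`, `combine`, `classify`, and `identify` are hypothetical stand-ins for the units 11, 12a to 12d, and 14. The classifier is assumed to return at least one certainty value.

```python
from typing import Any, Callable, Optional, Sequence


def run_cycle(
    frame: Any,
    frame_index: int,
    detect_hand: Callable[[Any], Any],                       # S210
    extract_whole: Callable[[Any, Any], Sequence[float]],    # S220
    extract_hand: Callable[[Any, Any], Sequence[float]],     # S230
    combine: Callable[[Sequence[float], Sequence[float]], Sequence[float]],  # S240
    classify: Callable[[Sequence[float]], Sequence[float]],  # S250
    identify: Callable[[int, int], Any],                     # S280
    threshold: float = 0.5,
) -> Optional[Any]:
    """Process one input frame (S200) and return the identified gesture, if any."""
    hand_region = detect_hand(frame)
    whole_features = extract_whole(frame, hand_region)
    hand_features = extract_hand(frame, hand_region)
    combined = combine(whole_features, hand_features)
    certainties = classify(combined)  # one value per candidate start frame
    # S260: find the node with the maximum certainty and test it.
    i = max(range(len(certainties)), key=lambda k: certainties[k])
    if certainties[i] <= threshold:
        return None  # negative determination: wait for the next cycle
    # S270: node i + 1 means the gesture started i + 1 frames ago.
    start_frame = frame_index - (i + 1)
    return identify(start_frame, frame_index)
```

Passing the stages as parameters keeps the sketch independent of any particular feature extractor or classifier, which matches the text's remark that other methods may be substituted.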

As described above, the gesture detection device 10 according to the present embodiment identifies the hand region in which the operator's hand is present from the imaging data of a distance image that captures a wide range including areas other than the hand. It then extracts the gesture section in which a gesture operation is being performed, based on the feature quantity of the hand region image 41a cut out from the whole image 40 and the feature quantity of the whole image 40, and detects the gesture operation. As a result, when a gesture operation is performed on a wide screen such as a WSD, the gesture operation can be detected without setting a specific area for turning on the gesture function.

(Second Embodiment)
The second embodiment will be described. This embodiment changes the method of combining feature quantities relative to the first embodiment and is otherwise the same as the first embodiment, so only the differences from the first embodiment are described.

In this embodiment, the feature coupling unit 12c combines the feature quantities not by a weighted sum as in the first embodiment but by a dimensionality reduction method, for example principal component analysis (hereinafter, PCA).

For example, as shown in FIG. 10, the feature quantity of the whole image 40 and the feature quantity of the hand region image 41a are simply concatenated. The combined feature quantity is then obtained by reducing the dimensionality with PCA.

This makes it possible to suppress overfitting and to reduce memory usage and computation. The combined feature quantity obtained by PCA is then used to classify time-series patterns in the same way as described with reference to FIG. 8 in the first embodiment.
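The concatenate-then-reduce step can be illustrated with a small pure-Python sketch. It is an assumption-laden illustration rather than the method of the embodiment: the principal axes are estimated here by power iteration with deflation (a real system would more likely use a linear algebra library), and the names `fit_pca` and `combine_features` are hypothetical. Training samples are assumed to be concatenated feature vectors collected in advance.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def _mat_vec(m: Sequence[Sequence[float]], v: Sequence[float]) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def fit_pca(samples: Sequence[Sequence[float]], k: int, iters: int = 100) -> Tuple[Vector, List[Vector]]:
    """Estimate the mean and the top-k principal axes by power iteration."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)] for a in range(d)]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    axes: List[Vector] = []
    for _axis in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _step in range(iters):
            w = _mat_vec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, _mat_vec(cov, v)))
        axes.append(v)
        # Deflation: remove the found axis so the next pass finds the next one.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)] for a in range(d)]
    return mean, axes


def combine_features(
    whole: Sequence[float], hand: Sequence[float], mean: Vector, axes: List[Vector]
) -> Vector:
    """Concatenate the two feature vectors, then project onto the PCA axes."""
    z = list(whole) + list(hand)  # simple concatenation, as in FIG. 10
    centered = [zi - mi for zi, mi in zip(z, mean)]
    return [sum(a * c for a, c in zip(axis, centered)) for axis in axes]
```

The output has `k` values regardless of the length of the concatenated vector, which is where the memory and computation savings mentioned in the text would come from.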

(Third Embodiment)
The third embodiment will be described. This embodiment changes the time-series pattern classification method relative to the first and second embodiments and is otherwise the same as those embodiments, so only the differences from the first and second embodiments are described.

In the first embodiment, an RNN is used as the time-series pattern classification method. In this embodiment, pattern classification is instead performed by combining the feature quantities of a plurality of frames.

Specifically, in this embodiment, time-series patterns are classified according to the flowchart shown in FIG. 11A.

First, in step S300, the combined feature quantities obtained by the feature coupling unit 12c are prepared for a plurality of frames. Specifically, as shown in FIG. 11B, the combined feature quantities obtained from the current frame back to N frames earlier are prepared. So that the combined feature quantities for a plurality of frames can be prepared in this way, the memory 13 stores the combined feature quantities of a plurality of past frames.

Next, in step S310, the combined feature quantities for the plurality of frames are further combined. For this combination, a weighted sum may be calculated as in the first embodiment, the feature quantities may simply be concatenated, or dimensionality reduction such as PCA may additionally be applied.

Then, in step S320, the result is collated with a pre-learned pattern. This yields the certainty of the "gesture-likeness" of the combined feature quantities for the plurality of frames, that is, the certainty that they represent a gesture operation.

The collation is performed by reading the learned pattern from the memory 13, where it is stored. A large number of combined feature quantities obtained during gesture operations and a large number obtained at other times are prepared in advance; because the calculation results differ between the two cases, a value located midway between them is used as the learned pattern.

Then, in step S330, whether a gesture operation is being performed is determined based on the certainty of the "gesture-likeness" of the combined feature quantity obtained in step S310. Specifically, if the certainty of the "gesture-likeness" is equal to or greater than a threshold, the frames from N frames earlier up to the current frame are determined to be the gesture section.

In this way, the feature quantities for a plurality of frames may be combined and the "gesture-likeness" calculated based on the learned pattern of gesture operations in order to determine that a gesture operation is being performed.

Although N frames are used here as an example of a plurality of frames, several values of N may be tried. For example, the certainty of the "gesture-likeness" may be calculated in the same way for N-1 frames. In that case, the certainty obtained from the combined feature quantities from the current frame back to N-1 frames earlier is compared with the certainty obtained from the combined feature quantities from the current frame back to N frames earlier to see which is higher. The oldest frame in the window with the higher certainty is then taken as the start frame.
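Trying several window lengths and keeping the most certain one can be sketched as follows. The names `best_window` and `score_fn` are hypothetical; `score_fn` stands in for the collation against the learned pattern, and simple concatenation is used for the further combination of step S310. Treating the window length as the number of frames in the window is also an assumption.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def best_window(
    history: Sequence[Sequence[float]],        # combined features, newest last
    score_fn: Callable[[List[float]], float],  # returns the gesture-likeness certainty
    window_lengths: Iterable[int],             # candidate values, e.g. (N - 1, N)
    threshold: float,
) -> Optional[Tuple[int, float]]:
    """Return (window_length, certainty) of the most certain window, or None."""
    best: Optional[Tuple[int, float]] = None
    for n in window_lengths:
        if n < 1 or n > len(history):
            continue  # not enough stored frames for this window
        window = history[-n:]  # the n most recent frames
        stacked = [value for features in window for value in features]
        certainty = score_fn(stacked)
        # Keep the window only if it reaches the threshold and beats the rest.
        if certainty >= threshold and (best is None or certainty > best[1]):
            best = (n, certainty)
    return best
```

The oldest frame of the returned window would be taken as the start frame and the current frame as the end frame. `history` is assumed to be a list or tuple here, because slicing is used to take the most recent frames.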

(Fourth Embodiment)
The fourth embodiment will be described. This embodiment takes additional factors into account in recognizing gesture operations relative to the first to third embodiments and is otherwise the same as those embodiments, so only the differences from the first to third embodiments are described.

When an operator performs a gesture operation, characteristic behavior also appears in parts of the body other than the hand. For example, when performing a gesture operation, the operator turns the face and line of sight toward the operation target, while the movement of parts other than the fingers performing the gesture becomes small. In this embodiment, as one example, the operator's face is therefore also detected, and time-series patterns are classified taking the feature quantity of the face into account as well.

As shown in FIG. 12, the gesture detection device 10 of this embodiment includes a face detection unit 15 in addition to the configuration described in the first embodiment, and the gesture section extraction unit 12 includes a face region feature extraction unit 12e.

The face detection unit 15 identifies the region in which the operator's face is present (hereinafter, the face region) from the imaging data input from the imaging device 20 and passes data on that region to the gesture section extraction unit 12. The face detection unit 15 identifies the face region by the same method as the hand region identification method described above.

The face region feature extraction unit 12e cuts out, from the image captured by the imaging device 20, the portion corresponding to the face region detected by the face detection unit 15, takes the cut-out image (hereinafter, the face region image) as input, and extracts the feature quantity of that face region image. The feature quantity may be extracted by the same method as for the hand region image 41a.

Providing the face detection unit 15 and the face region feature extraction unit 12e in this way allows the feature quantity of the face region image to be extracted. In addition, when the overall feature extraction unit 12a calculates the feature quantity of the whole image 40, the face region is masked in addition to the hand region 41. The feature coupling unit 12c then combines the feature quantities extracted by the overall feature extraction unit 12a, the hand region feature extraction unit 12b, and the face region feature extraction unit 12e, and the time-series pattern classification unit 12d classifies time-series patterns using the combined feature quantity.
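Masking both the hand region and the face region before whole-image feature extraction can be illustrated as below. The sketch assumes the image is a list of pixel rows and that each region is a rectangle `(top, left, bottom, right)` with exclusive bottom and right edges; the function name `mask_regions` and this rectangle convention are hypothetical. A fill value of 0 corresponds to a black mask.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int, int, int]  # (top, left, bottom, right), edges exclusive


def mask_regions(
    image: Sequence[Sequence[int]],
    regions: Sequence[Region],
    fill: int = 0,
) -> List[List[int]]:
    """Return a copy of `image` with every listed rectangle overwritten by `fill`."""
    masked = [list(row) for row in image]  # leave the input image untouched
    for top, left, bottom, right in regions:
        # Clamp the rectangle to the image bounds before filling.
        for r in range(max(top, 0), min(bottom, len(masked))):
            for c in range(max(left, 0), min(right, len(masked[r]))):
                masked[r][c] = fill
    return masked
```

Calling `mask_regions(whole_image, [hand_region, face_region])` would produce the input for the overall feature extraction, while the unmasked crops go to the hand and face feature extractors.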

In this way, the gesture operation can be detected while also taking into account the feature quantities of parts other than the hand that are related to the gesture operation. This makes it possible to detect gesture operations more accurately.

(Other Embodiments)
Although the present disclosure has been described in accordance with the embodiments above, it is not limited to those embodiments and also encompasses various modifications and variations within the range of equivalents. In addition, various combinations and forms, as well as other combinations and forms that include only one element, more elements, or fewer elements, also fall within the scope and spirit of the present disclosure.

For example, although a CNN was given as an example of the feature quantity extraction method, other methods are also applicable. For example, feature quantities may be extracted using HOG (Histogram of Oriented Gradients) or LBP (Local Binary Pattern). Although the use of an RNN as the time-series pattern classification method was described, other time-series pattern classification methods such as a 3D CNN may also be used. In the second embodiment, PCA was given as an example of combining feature quantities by a dimensionality reduction method, but a dimensionality reduction method other than PCA may be used. Furthermore, dimensionality reduction itself is not essential; for example, when sufficient training data is available, it may be omitted.

The immediately preceding image frame may also be stored in the memory 13 and used in combination with the current frame when extracting feature quantities from the current frame. For example, optical flow, which is motion information between the immediately preceding image frame and the current frame, may be used as input to a CNN or the like to extract feature quantities. When the immediately preceding image frame is combined with the current frame, HoOF (Histogram of Optical Flow), a 3D CNN, or the like may also be used.

Any method may be used for the feature quantity extraction performed in the hand detection unit 11 and in each part of the gesture section extraction unit 12, but it is preferable to use the same extraction method throughout, as in the embodiments above. Doing so reduces the amount of machine learning dictionary data that must be stored in the memory 13.

The control unit and its method described in the present disclosure may be implemented by a dedicated computer provided by configuring a processor and memory programmed to execute one or more functions embodied by a computer program. Alternatively, the control unit and its method described in the present disclosure may be implemented by a dedicated computer provided by configuring a processor with one or more dedicated hardware logic circuits. Alternatively, the control unit and its method described in the present disclosure may be implemented by one or more dedicated computers configured as a combination of a processor and memory programmed to execute one or more functions and a processor configured with one or more hardware logic circuits. The computer program may be stored, as instructions to be executed by a computer, on a computer-readable non-transitory tangible recording medium.

11 Hand detection unit
12 Gesture section extraction unit
12a Overall feature extraction unit
12b Hand region feature extraction unit
12c Feature coupling unit
12d Time-series pattern classification unit
12e Face region feature extraction unit
14 Gesture identification unit
40 Whole image
41a Hand region image

Claims

A gesture detection device comprising:
a hand detection unit (11) that receives imaging data of a whole image (40) including an operator's hand and identifies, from the imaging data, a hand region (41) that is the region in which the hand is present;
an overall feature extraction unit (12a) that extracts a feature quantity of the whole image based on the imaging data of the whole image;
a hand region feature extraction unit (12b) that cuts out, from the whole image, a hand region image (41a) that is an image of the hand region and extracts a feature quantity of the hand region image;
a feature coupling unit (12c) that combines the feature quantity of the whole image extracted by the overall feature extraction unit and the feature quantity of the hand region image extracted by the hand region feature extraction unit;
a time-series pattern classification unit (12d) that classifies time-series patterns for the feature quantity combined by the feature coupling unit and extracts a gesture section from a start frame to an end frame of a gesture operation performed by the operator moving the hand; and
a gesture identification unit (14) that identifies the gesture operation based on the image frames in the gesture section extracted by the time-series pattern classification unit.

The gesture detection device according to claim 1, wherein the time-series pattern classification unit detects the start frame from among the image frames in past imaging data and extracts the gesture section with the current image frame as the end frame.

The gesture detection device according to claim 1 or 2, wherein the overall feature extraction unit masks the hand region image in the whole image and extracts the feature quantity of the masked whole image.

The gesture detection device according to claim 3, wherein the overall feature extraction unit masks the hand region image with black or white.

The gesture detection device according to any one of claims 1 to 4, further comprising:
a face detection unit (15) that identifies, from the imaging data, a face region that is the region of the whole image in which the operator's face is present; and
a face region feature extraction unit (12e) that cuts out, from the whole image, a face region image that is an image of the face region and extracts a feature quantity of the face region image,
wherein the feature coupling unit combines the feature quantity of the face region image extracted by the face region feature extraction unit in addition to the feature quantity of the whole image extracted by the overall feature extraction unit and the feature quantity of the hand region image extracted by the hand region feature extraction unit.