WO2015030264A1

WO2015030264A1 - Device, method, and program for detecting click operation

Info

Publication number: WO2015030264A1
Application number: PCT/JP2014/073415
Authority: WO
Inventors: 正広豊浦; 篤志杉浦; 暁陽茅
Original assignee: 国立大学法人山梨大学
Priority date: 2013-08-30
Filing date: 2014-08-29
Publication date: 2015-03-05
Also published as: JPWO2015030264A1; JP6524589B2

Abstract

An image of a previously created virtual object (31) and a dynamic image inputted from an image-capturing device (11) are synthesized and displayed (31A) on a display device; at least part of a hand region appearing in each frame image constituting the dynamic image is extracted; an amount of difference pertaining to a specific portion of the extracted hand region between frames adjacent in time to each other is obtained; and a temporal transition of states represented by the amount of difference is examined, whereby a click operation represented by the motion of the specific portion of the hand region is detected.

Description

クリック動作検出装置，方法およびプログラムClick motion detection device, method and program

　この発明は，仮想物体（ｖｉｒｔｕａｌ　ｏｂｊｅｃｔｓ）を手（指，指先を含む）でクリックする動作（操作）（ｃｌｉｃｋ　ｇｅｓｔｕｒｅ，ｃｌｉｃｋ　ｏｐｅｒａｔｉｏｎ）を検出する装置，方法およびプログラムに関し，たとえばカメラ付携帯端末装置，ヘッドマウントディスプレイ（特に透過型），大型，中型，小型を問わず表示装置を有する各種機器またはカメラを備えた各種機器において利用される。 The present invention relates to a device, a method, and a program for detecting an operation (operation) (click gesture, click operation) of clicking a virtual object with a hand (including a finger and a fingertip), for example, a mobile terminal device with a camera, It is used in various devices having a display device or a camera equipped with a head mounted display (especially transmission type), large size, medium size, and small size.

　ＡＲ（Ａｕｇｍｅｎｔｅｄ　Ｒｅａｌｉｔｙ：拡張現実）においては，仮想物体と実際の映像を合成して表示し，あたかも仮想物体が現実に存在するかのような印象を利用者に与えることが要求される。単純に仮想物体を表示するだけでは不十分であり，仮想物体に対し利用者が何らかの操作をすることが求められる。
　クリックはコンピュータなどを操作するための基本的な動作である。透過型ヘッドマウントディスプレイなどの実体を伴わない表示画面を用いる場合に，その表示画面に表示したボタンなどの仮想物体をクリック操作することは，空中でクリック動作を行うことになり，このクリック動作を検出することは難しかった。
　１台のカメラからの映像に基づいて空中でのクリック動作を検出しようとすると，指先の正確な三次元位置を推定することが困難で，三次元空間における動きである仮想物体との接触を判定することができない。
　従来技術として，指に付されたマーカから指の三次元位置を推定し，仮想物体との三次元的な衝突を判定するものがある。たとえば特許文献１では，肌色検出やエッジ抽出などを行い，指を検出することができるとされている。しかしながら，二次元画像上の手指の領域から推定できる奥行位置には精度の限界があり，仮想物体との衝突を正しく判定することは難しいので，指にマーカなどを付して検出することが開示されている。この場合にはマーカの画像上の大きさによってある程度正しく奥行き位置を推定できると考えられるが，指先にマーカをつけることが装置利用上の制約となることがある。
　特許文献２では，複数台のカメラから出力される画像に基づいて得られる指先位置から指の三次元位置を獲得し，仮想物体と三次元位置との衝突を判定している。この場合にも，大きさを持った手指の領域から三角測量の原理で指先位置を推定することになり，高い精度で奥行き位置を推定することは望めない。複数台のカメラを用いることが装置利用上の制約となることもある。
　また非特許文献１では，画像上で検出しやすいジェスチャである「親指と人差し指で挟む」（ｐｉｎｃｈ，ピンチ）ことに着目して，指先の三次元位置を推定することなく，仮想物体指定を実現している。このジェスチャは理解しやすいものの，一般に行われている現実物体を指定する方法とは異なるために，仮想物体とのやり取りをしている印象が強くなり，自然なインタフェースを実現する上で用途によっては不十分であることがある。 In AR (Augmented Reality), a virtual object and an actual video are synthesized and displayed, and it is required to give the user an impression as if the virtual object actually exists. Simply displaying a virtual object is not sufficient, and the user is required to perform some operation on the virtual object.
Clicking is a basic operation for operating a computer or the like. When using a display screen without an entity, such as a transmissive head-mounted display, clicking on a virtual object such as a button displayed on the display screen results in a click operation in the air. It was difficult to detect.
It is difficult to estimate the exact 3D position of the fingertip when trying to detect a click motion in the air based on the video from one camera, and it determines contact with a virtual object that is movement in 3D space. Can not do it.
As a conventional technique, there is one that estimates a three-dimensional position of a finger from a marker attached to the finger and determines a three-dimensional collision with a virtual object. For example, in Patent Document 1, it is assumed that a finger can be detected by performing skin color detection, edge extraction, or the like. However, the depth position that can be estimated from the finger area on the two-dimensional image is limited in accuracy, and it is difficult to correctly determine the collision with the virtual object. Has been. In this case, it is considered that the depth position can be estimated to some extent correctly depending on the size of the marker on the image. However, attaching the marker to the fingertip may be a limitation in using the device.
In Patent Document 2, a three-dimensional position of a finger is obtained from a fingertip position obtained based on images output from a plurality of cameras, and a collision between a virtual object and the three-dimensional position is determined. In this case as well, the fingertip position is estimated by the principle of triangulation from the finger area having a size, and it is not possible to estimate the depth position with high accuracy. Use of multiple cameras may be a restriction on the use of the apparatus.
Non-Patent Document 1 realizes virtual object specification without estimating the three-dimensional position of the fingertip, focusing on the pinch, which is a gesture that is easy to detect on an image (pinch, pinch). is doing. Although this gesture is easy to understand, it differs from the commonly used method of specifying a real object, so the impression of interacting with a virtual object becomes stronger, and depending on the application, a natural interface may be realized. It may be insufficient.

特開２０１３−４１４３１号公報JP 2013-41431 A 特開平６−３１４３３２号公報JP-A-6-314332

　発明が解決しようとする課題
　上述したように，指先にマーカを付したり，複数のカメラを用いたりする場合には，奥行位置を把握する精度が十分でなく，クリック動作の検出精度が低いことや，余分な装置を用いることで制約が生じたり，製造コストが高くなるというような問題がある。
　また，現実物体の操作方法と異なるジェスチャを利用する場合には，不自然な操作方法となり違和感を与えるというような問題がある。
　発明の開示
　この発明は，最も一般的で現実的なクリック動作に着目して，これを検出しようとするものである。
　この発明はクリック操作する手，指等にマーカ等を付すことなく，素の状態で行うクリック動作を検出しようとするものである。
　この発明はまた，一台のカメラからの動画像に基づいてクリック動作を検出することができるようにするものである。
　この発明によるクリック動作検出装置は，あらかじめ作成された仮想物体の画像と撮像装置から入力される動画像とを合成し，表示装置に表示する表示制御手段，前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出する手領域抽出手段，時間的に隣接する（または，近い。以下同じ。）フレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求める差分算出手段，および前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するクリック動作検出手段を備えるものである。
　この発明によるクリック動作検出方法は，あらかじめ作成された仮想物体の画像と撮像装置から入力される動画像とを合成して表示装置に表示し，前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，時間的に隣接するフレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求め，そして前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するものである。
　この発明によるクリック動作検出のためのプログラムは，あらかじめ作成された仮想物体の画像と撮像装置から入力される動画像とを合成し，表示装置に表示させ，前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，時間的に隣接するフレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求め，そして前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するようにコンピュータを制御するものである。
　手領域は手領域の一部，たとえば特定の指や指先部分を含む。
　この発明によると，コンピュータの操作において最も一般的なクリック動作と類似の動作を検出することができる。しかも，手，指，指先等にマーカ等を付ける必要はない。そして，カメラは１台ですむ。
　好ましい実施態様では，前記表示装置に表示されている仮想物体の画像の所定領域内で，前記抽出された手領域の前記特定部分が，所定時間以上，動きを停止していることを検出してクリック可能状態と判断する（クリック可能状態検出手段）。
　クリック可能状態と判断したときにその旨を報知するとよい（クリック可能状態報知手段）。この報知は，表示画像上で行ってもよいし，音等を発生することにより行ってもよい。これにより，ユーザは目的とする仮想物体を選択できたことを認識することができる。
　特に望ましい実施態様では，前記クリック可能状態検出手段がクリック可能状態と判断したときに，前記表示制御手段は前記仮想物体の画像の所定領域に関連する部分の表示態様（色，大きさ，形）を変化させる。これにより，ユーザはクリックしようとする仮想物体を表示画面上で認識することができ，誤ったクリック対象を回避することができる。
　さらに，前記クリック動作検出手段がクリック動作を検出したときにその旨を報知すると一層好ましい（クリック動作報知手段）。ユーザは自分がクリック動作を正しく行なえたことを認識することができる。
　前記差分算出手段が算出する差分量の例は，抽出した手領域の特定部分の速度情報および加速度情報である。
　クリック動作検出の態様にはさまざまある。その一は，前記手領域の前記特定部分の動きが，動いている状態から，急減速して停止したことを検出してクリック動作と判断するものである。その二は，前記手領域の前記特定部分の動きに停止状態，運動状態，急減速，そして停止状態の遷移があったときにクリック動作と判断するものである。
　上記において，急減速から，表示されている仮想物体の画像の所定領域内で停止したときにクリック動作と判断するようにすると，特定の仮想物体が確かにクリックされたことを確認できる。
　あらかじめ作成された仮想物体の画像を必ずしも表示しなくてもよい。仮想物体の画像の表示を要しないこの発明のクリック動作検出装置は，撮像装置から入力される動画像を表示装置に表示する表示制御手段，前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出する手領域抽出手段，時間的に隣接するフレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求める差分算出手段，および前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するクリック動作検出手段を備えているものである。
　この発明によるクリック動作検出方法は，撮像装置から入力される動画像を表示装置に表示し，前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，時間的に隣接するフレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求め，そして前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するものである。
　この発明によるクリック動作検出のためのプログラムは，撮像装置から入力される動画像を表示装置に表示させ，前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，時間的に隣接するフレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求め，そして前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するようにコンピュータを制御するものである。
　上記は特に仮想物体が一つでユーザが仮想物体を選択する必要がない場合に有効である。
　クリック可能状態の検出も可能である。すなわち，前記表示装置の表示画面内の所定領域内で，前記抽出された手領域の前記特定部分が所定時間以上，動きを停止していることを検出してクリック可能状態と判断する。
　表示装置を必要とせず，撮像装置からの動画像信号に基づいてクリック動作を検出するようにすることもできる。これは，クリック動作を何らかの合図とするような場合に有効である。
　この発明によるクリック動作検出装置は，撮像装置から入力される動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出する手領域抽出手段，時間的に隣接するフレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求める差分算出手段，および前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するクリック動作検出手段を備えているものである。
　この発明によるクリック動作検出方法は，撮像装置から入力される動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，時間的に隣接するフレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求め，そして前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するものである。
　この発明によるクリック動作検出のためのプログラムは，撮像装置から入力される動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，時間的に隣接するフレーム（フレーム画像）間の（２フレーム間，３フレーム間などの），抽出した手領域の特定部分に関する差分量を求め，そして前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するようにコンピュータを制御するものである。
　この発明は，上記のプログラムを格納したコンピュータ読取可能な記録（記憶）媒体も提供している。 Problems to be Solved by the Invention As described above, when a marker is attached to a fingertip or a plurality of cameras are used, the accuracy of grasping the depth position is not sufficient, and the detection accuracy of the click motion is low. In addition, there are problems such as limitations due to the use of an extra device and an increase in manufacturing cost.
In addition, when using a gesture different from the operation method of the real object, there is a problem that it becomes an unnatural operation method and gives a sense of incongruity.
DISCLOSURE OF THE INVENTION The present invention focuses on the most common and realistic click action and tries to detect it.
The present invention is intended to detect a click operation performed in a plain state without attaching a marker or the like to a hand, finger or the like to be clicked.
The present invention also makes it possible to detect a click operation based on a moving image from one camera.
The click motion detection device according to the present invention combines a virtual object image created in advance with a moving image input from an imaging device and displays it on a display device, within each frame image constituting the moving image. Hand region extraction means for extracting at least a part of the hand region appearing in the frame, extraction between frames (frame images) adjacent in time (or close, the same applies hereinafter) (between 2 frames, 3 frames, etc.) A difference calculating means for obtaining a difference amount relating to a specific portion of the hand region, and a click operation for detecting a click operation represented by the movement of the specific portion of the hand region by examining a time transition of a state represented by the difference amount A detection means is provided.
According to the click motion detection method of the present invention, a virtual object image created in advance and a moving image input from an imaging device are combined and displayed on a display device, and a hand appearing in each frame image constituting the moving image. Extract at least a part of the area, obtain a difference between temporally adjacent frames (frame images) (between 2 frames, 3 frames, etc.), a specific part of the extracted hand area, and the difference The click action represented by the movement of the specific part of the hand region is detected by examining the time transition of the state represented by.
A program for detecting a click motion according to the present invention combines a virtual object image created in advance with a moving image input from an imaging device, displays the synthesized image on a display device, and displays the frame image in each frame image constituting the moving image. Extract at least a part of the hand region appearing in, determine the amount of difference between the temporally adjacent frames (frame images) (such as between two frames, between three frames, etc.), and a specific portion of the extracted hand region; and By checking the time transition of the state represented by the difference amount, the computer is controlled to detect a click operation represented by the movement of the specific part of the hand region.
The hand area includes a part of the hand area, for example, a specific finger or a fingertip part.
According to the present invention, an operation similar to the most common click operation in the operation of a computer can be detected. In addition, there is no need to attach a marker or the like to the hand, finger or fingertip. And only one camera is required.
In a preferred embodiment, it is detected that the specific part of the extracted hand region has stopped moving for a predetermined time or more within a predetermined region of the image of the virtual object displayed on the display device. A clickable state is determined (clickable state detecting means).
When it is determined that the clickable state is determined, this fact may be notified (clickable state notifying means). This notification may be performed on a display image, or may be performed by generating a sound or the like. Accordingly, the user can recognize that the target virtual object has been selected.
In a particularly desirable embodiment, when the clickable state detecting unit determines that the clickable state is present, the display control unit displays a display mode (color, size, shape) of a portion related to a predetermined area of the image of the virtual object. To change. Thereby, the user can recognize the virtual object to be clicked on the display screen, and can avoid an erroneous click target.
Furthermore, it is more preferable to notify that when the click motion detection means detects a click motion (click motion notification means). The user can recognize that he / she has correctly performed the click operation.
An example of the difference amount calculated by the difference calculation means is speed information and acceleration information of a specific part of the extracted hand region.
There are various modes of detecting the click motion. The first is to detect that the movement of the specific part of the hand region is suddenly decelerated from the moving state and stop, thereby determining a click operation. Second, when the movement of the specific portion of the hand region includes a stop state, a motion state, a sudden deceleration, and a transition of the stop state, it is determined that the click operation is performed.
In the above description, it is possible to confirm that a specific virtual object has been clicked by determining that the click operation is performed when the stop is performed within a predetermined region of the image of the displayed virtual object from sudden deceleration.
It is not always necessary to display a virtual object image created in advance. The click motion detection device of the present invention that does not require display of a virtual object image includes display control means for displaying a moving image input from the imaging device on the display device, and a hand region that appears in each frame image constituting the moving image. Hand region extraction means for extracting at least a part of a difference, calculating a difference between temporally adjacent frames (frame images) (for example, between two frames, between three frames, etc.) and a difference amount for a specific portion of the extracted hand region And a click action detecting means for detecting a click action represented by the movement of the specific portion of the hand region by examining time transition of the state represented by the difference amount.
The click motion detection method according to the present invention displays a moving image input from an imaging device on a display device, extracts at least a part of a hand region appearing in each frame image constituting the moving image, and is temporally adjacent. By calculating a difference amount between specific frames of the extracted hand region between frames (frame images) (such as between two frames or between three frames) and examining the time transition of the state represented by the difference amount, A click operation represented by the movement of the specific portion of the hand region is detected.
A program for detecting a click motion according to the present invention displays a moving image input from an imaging device on a display device, extracts at least a part of a hand region appearing in each frame image constituting the moving image, and extracts time. Determining the amount of difference between adjacent frames (frame images) (for example, between two frames, between three frames, etc.) and a specific part of the extracted hand region, and examining the time transition of the state represented by the difference amount Thus, the computer is controlled so as to detect a click operation represented by the movement of the specific portion of the hand region.
The above is particularly effective when there is one virtual object and the user does not need to select a virtual object.
A clickable state can also be detected. That is, it is determined that a clickable state is detected by detecting that the specific portion of the extracted hand region has stopped moving for a predetermined time or more within a predetermined region in the display screen of the display device.
It is also possible to detect a click operation based on a moving image signal from an imaging device without requiring a display device. This is effective when the click operation is a signal.
The click motion detection device according to the present invention comprises a hand region extracting means for extracting at least a part of a hand region appearing in each frame image constituting a moving image input from an imaging device, and temporally adjacent frames (frame images). A difference calculating means for obtaining a difference amount relating to a specific part of the extracted hand region between two frames (between two frames, three frames, etc.), and examining the time transition of the state represented by the difference amount, A click motion detecting means for detecting a click motion represented by the motion of the specific portion is provided.
In the click motion detection method according to the present invention, at least a part of a hand region appearing in each frame image constituting a moving image input from an imaging device is extracted, and temporally adjacent frames (frame images) (2 This is represented by the movement of the specific part of the hand region by determining the amount of difference for the specific part of the extracted hand region (between frames, between three frames, etc.) and examining the time transition of the state represented by the difference amount. The click operation to be detected is detected.
A program for detecting a click motion according to the present invention extracts at least a part of a hand region appearing in each frame image constituting a moving image input from an imaging device, and temporally adjacent frames (frame images). Of the particular part of the hand region by determining the difference amount for the particular part of the extracted hand region (such as between two frames, three frames, etc.) and examining the time transition of the state represented by the difference amount The computer is controlled so as to detect a click action represented by a movement.
The present invention also provides a computer-readable recording (storage) medium storing the above program.

　第１図は，クリック動作検出装置をヘッドマウントディスプレイに応用した例を示す斜視図である。
　第２図は，実施例によるクリック動作検出装置の電気的構成を示すブロック図である。
　第３図は，仮想物体を表示する表示画面の一例を示す。
　第４図は，カメラで撮像した画面の一例を示す。
　第５図は，仮想物体の表示画面とカメラの撮像画面とを合成して得られる画面の例を示す。
　第６図は，クリック動作可能状態の通知例を示す。
　第７図は，指先の検出処理を説明するためのものである。
　第８図は，指先の運動，静止の遷移図である。
　第９Ａ図および第９Ｂ図は，クリック動作時における指先の動きの遷移を示す。
　第１０図は，クリック動作検出の処理手順を示すフローチャートである。
　第１１図は，携帯端末装置への応用例を示す斜視図である。
　第１２図は，大型ディスプレイへの応用例を示す斜視図である。 FIG. 1 is a perspective view showing an example in which the click motion detection device is applied to a head mounted display.
FIG. 2 is a block diagram showing an electrical configuration of the click motion detection device according to the embodiment.
FIG. 3 shows an example of a display screen for displaying a virtual object.
FIG. 4 shows an example of a screen imaged by the camera.
FIG. 5 shows an example of a screen obtained by synthesizing a virtual object display screen and a camera imaging screen.
FIG. 6 shows a notification example of the clickable state.
FIG. 7 is a diagram for explaining the fingertip detection process.
FIG. 8 is a transition diagram of fingertip movement and rest.
9A and 9B show the transition of the fingertip movement during the click operation.
FIG. 10 is a flowchart showing the procedure for detecting the click operation.
FIG. 11 is a perspective view showing an application example to a portable terminal device.
FIG. 12 is a perspective view showing an application example to a large display.

　クリック動作検出装置の概要の理解を促進するためにその使用例について第１図を参照して説明する。
　ユーザがヘッドマウントディスプレイ（以下，ＨＭＤという）２０を頭部に装着している。ＨＭＤは表示装置を有しており，ユーザは表示装置に映された画像を見る。画像の中に現われている仮想物体（ｖｉｒｔｕａｌ　ｏｂｊｅｃｔ）がユーザの目の前の適当な距離はなれた位置（空中）に存在するかのように見える。第１図に示す例では仮想物体は複数個の配列されたボタン３１であり，それらのボタン３１が表示装置の表示画面３０Ａ内に表示されている。
　コンピュータ上での従来のクリック動作は，表示画面上の特定の位置（対象）または領域（対象）内にカーソルを位置決めし，マウス上のボタンを押すことにより，該当する対象を選択する，または特定の命令の実行を指令するものである。これと類似の動作（操作）として，この実施例では表示画面上の仮想物体（表示された位置，領域，対象）を選択して，あたかもボタンを押すかのように指先で押す動作を行うことをクリック動作という（後述するところから明らかになるように，クリック動作の検出のためには，特定の仮想物体を選択することは必ずしも必須の要件ではない）。
　ＨＭＤ２０を装着したユーザは，その表示画面に表示され，あたかも前方の空中に存在するかのように見える複数のボタン３１のうちの１つを選択して，自分の指先でそのボタンを押す（クリックする）動作を行う。ＨＭＤ２０にはカメラ１１が設けられ，その前方，すなわち仮想物体３１が存在する付近を撮像し，撮像により得られる動画像信号を出力する。ユーザの指先（指，手）はカメラ１１で撮影され，カメラ１１から出力される動画像信号についての画像処理により，ユーザが特定のボタンを選択したこと，およびそのボタンをクリックしたこと（押したこと）が検出される。
　第２図はこの実施例のクリック動作検出装置の電気的構成を示すものである。
　処理装置１０はたとえばコンピュータにより実現され，機能的には，後に詳述する画像メモリ１３，表示制御部１４，手領域抽出部１５，差分算出部１６，クリック可能状態検出部１７，クリック動作検出部１８を備えている。カメラ１１は，たとえば上述したＨＭＤ２０に設けられたものであり，クリック動作を行うユーザの手または指付近を撮影するようにその視野が位置決めされる。表示装置１２は，一例として，上述したＨＭＤ２０に装備されている表示装置（ディスプレイ）である。
　クリック動作検出装置はさらに入力装置２１，出力装置２２，記憶装置１９等を必要に応じて備える。入力装置２１はクリック動作検出プログラム，パラメータ，指令等を入力するもので，キーボード，表示画面とマウス，通信装置，媒体の読取装置等により実現される。出力装置２２は，クリック動作により入力されたデータ等を出力するもので，表示装置（表示装置１２と兼用できる），通信装置，媒体の書込装置等により実現される。記憶装置１９は，クリック動作検出プログラム，パラメータ，入力データ（クリック動作により入力されたデータを含む）等を記憶する。
　処理装置１０の画像メモリ１３は，カメラ１１により撮像され，カメラ１１から出力される動画像（信号，データ）の少なくとも複数フレーム分の静止画像（信号，データ）を記憶する。これらの画像データは手領域抽出部１５，差分算出部１６において利用される。
　表示制御部１４は，あらかじめ作成された仮想物体（ボタン３１等）を表示するための画像データを保存しており，それに基づいて表示装置１２の表示画面上に仮想物体を表示する。また，画像メモリ１３に記憶されている撮像画像を仮想物体の画像上に重ね合わせて（合成して），表示装置１２の表示画面上に表示する。
　たとえば第３図は表示装置１２の表示画面に表示される仮想物体３１の画像３０Ａを示している。画面の左上角がＸＹ座標の原点である。仮想物体３１の位置，領域はこのＸＹ座標上にあらかじめ定められている。第４図はカメラ１１が撮像した動画像のうちの１フレーム分の画像３０Ｂを示している。ユーザの手領域４０や人差し指４１が表わされている。この画像３０ＢのＸＹ座標の原点も左上の角に定められている。
　第５図は第３図に示す仮想物体３１を含む表示装置１２の表示画像３０Ａとカメラ１１が撮像した１フレーム分の画像３０Ｂとを，ＸＹ座標原点を一致させて重ね合わせて（合成して）得られる画像３０Ｂを示している。最終的にはこの合成画像３０が表示制御部１４の制御の下に表示装置１２の表示画面に表示されることになる。合成画像の作成において，上の説明では，両画像３０Ａ，３０ＢのＸＹ座標原点を一致させているが，必ずしも原点でなくてもよい。画像３０Ａと画像３０Ｂの特定の点を一致させればよい。また，両画像３０Ａ，３０Ｂの一方または両方を，拡大したり，縮小したりした上で合成してもよい。
　手領域抽出部１５は，画像メモリ１３に記憶されている撮影画像の各フレーム（画像）上で手領域を算出し，手領域を特定する。手領域の特定は，あらかじめＨＳＶ（Ｈｕｅ，Ｓａｔｕｒａｔｉｏｎ　Ｖａｌｕｅ）表色系などで定義される色の範囲に基づいて，肌色の画素を抜き出すことなどによって行うことができる。例えば，ＨＳＶ色空間上で代表的な肌色の位置を与え，その位置から色空間内で一定距離内にある領域の色を肌色として定義する。
　手以外にも肌色を持つ領域が画像上に含まれていることが考えられるので，領域の大きさによって手領域を絞り込む。肌色であると判定された領域をラベリングによって確定し，それらのうちで最大の面積を持つ領域を手領域であると推定する。
　最初のフレームで抽出した肌色領域の画素のＨＳＶ色空間における分布を学習データとして，次のフレーム以降の肌色領域の抽出に反映させることも可能である。ＨＳＶ色空間における抽出した肌色領域の画素の平均値と分散共分散行列を求める。求めた平均値と分散共分散行列を用いて判定対象となる画素の色とのマハラノビス距離を算出し，それをしきい値と比較することで次以降のフレームの肌色領域を抽出する。
　第７図は肌色領域と判定された領域について膨張縮退処理（クロージング，またはオープニング）を行って得られた手領域４０を示している。この手領域４０にはクリック動作するときに用いられる指（たとえば人差し指）４１が含まれており，その先端は最小のＹ座標を持つ画素であると考えられる。指先領域（指先の内側（爪の反対側）のふくらんだ部分）を確定するために，上記の先端画素を中心とする半径Ｒ（あらかじめ設定）の円の範囲内において，手（指）領域の境界の画素に距離値０を与え，手（指）領域の内側に向って境界から離れる画素ほど大きな距離値を与える。距離値が最大を示す画素（符号４２で示すＸの点）を指先とし，距離値が所定のしきい値以上の範囲を指先領域とする。
　上記のようにして決定した指先の画素のｘ，ｙ座標が指先の画面上の位置である。指先の奥行方向（ｚ方向）の位置は，一例として，指の太さをパラメータとして定めることができる。たとえば，指先（または指の先端）の画素を中心とする所定の半径の円（半径Ｒの円でもよい）を仮想し，この円と手（指）領域の境界との交点（２つある）間の距離を指の太さｗを示すものとする。指がカメラ１１から離れれば指の太さは細くなるので，ｚ方向（奥行方向）の位置のパラメータとして使うことができる。以上のようにして，指先の３次元位置（のパラメータ）が定まる。
　差分算出部１６は画像メモリ１３に記憶されている複数フレーム分の静止画像を用いて，時間的に隣接するフレーム間での指先の位置変化（速度），速度の変化（加速度）を求めるものである。時間的に隣接するフレーム間（２つのフレーム画像間）における指先位置（ｘ，ｙ，ｗ）の変化分をｄｘ，ｄｙ，ｄｗとすると，ａ（ｄｘ）^２＋ｂ（ｄｙ）^２＋ｃ（ｄｗ）^２（ａ，ｂ，ｃは適当な定数）の平方根を求めて，これを位置変化すなわち速度とする。さらに速度の変化（時間的に隣接する３フレームの画像を用いる）を求めて加速度とする。
　ユーザが指先で仮想物体３１をクリックするときには，まず指先をクリックしようとする仮想物体３１に重ねることが多い（必ずしも重ねないときもある）。クリック可能状態検出部１７は，指先が仮想物体３１に重なったことを検出するものである。画像３０Ａ上の各仮想物体（の像）３１は画像３０Ａ上における範囲（領域）を持っている。上述のようにして求められたユーザの指先の位置のＸ，Ｙ座標（ｘ，ｙ）が特定の仮想物体３１の領域の範囲内にあり，この状態が所定の時間（たとえば数フレームの時間，一例として４フレームの時間）継続していることを検知して，クリック可能状態検出とする。クリック可能状態が検出されると，たとえば，第６図に示すように，指先が重なった特定の仮想物体３１（Ｃの文字が表わされたボタン）を拡大して表示するというように，仮想物体３１の表示態様を変化させる。これにより，ユーザは目的とする仮想物体３１上に自分の指先があり，クリック動作が可能となったことを視覚的に認識することができる。仮想物体と指先の重なりは，２次元平面上のみならず，３次元的に，たとえば３次元空間内のある方向に沿う重なりとして検出することもできる。なお，クリック可能状態の検出部１７は無くてもよい。
　指先で仮想物体をクリックする動作を観察した結果，指先の動きには第８図に示す遷移があることが分った。
　停止状態は，所定時間（数フレームの時間，少なくとも１フレーム時間間隔）以上にわたって指先の位置が変化しない（変化範囲が微小な値以下）静止状態を意味する。低速状態と高速状態は指先が動いている状態を示し，その移動速度が相対的に小さい（速度が０または０に近い値以上で所定の第１のしきい値以下）状態を低速状態，相対的に大きい状態（速度が上記の第１のしきい値を超えている）（加速度が所定のしきい値以上という条件を加えてもよい）を高速状態という。急減速とは，指先の動きが急速に速度を落とす状態，すなわち急激に減速する状態である。第８図の矢印は遷移の方向を示している。停止状態と低速状態との間，停止状態と高速状態との間，低速状態と高速状態との間の遷移はそれぞれ双方向である。急減速については，高速状態から急減速に進み，最終的に停止する。この急減速を含む指先の動きがクリック動作に特徴的なものである。急減速は加速度が負であり，かつ所定のしきい値以下の状態をいう。また，急減速の後の停止は，速度が０またはその近傍（小さなしきい値以下）であればよい。すなわち，少なくとも１フレーム時間間隔の間，指先が実質的に動いていなければよい。
　仮想物体をクリックする動作を分析的にみると，第９Ａ図に示すように，停止状態から低速状態に移り，さらに高速状態となって急減速して停止するという動作と，第９Ｂ図に示すように，停止状態から高速状態になり，急減速して停止状態になるという動作があることが分る。第９Ｂ図の高速状態は必ずしも高速でなくてもよい。最初の停止状態は動く前の静止状態と考えればよい。
　クリック動作検出部１８は，これらの第９Ａ図または第９Ｂ図に示す動作の状態遷移が生じたことを検出してクリック動作があったことを検出する。クリック動作において特徴的なことは，動いている状態から急減速して停止することである。最終的な停止位置では，クリックされる仮想物体の領域内に指先位置が存在し，かつ所定時間以上，その状態を保っていることが好ましいが，１フレーム時間間隔以上であればよい。動きの開始前は停止しているという意味では，動いている状態の前に，最初の停止状態が存在するが，この最初の停止状態では指先位置が必ずしも仮想物体の領域内に存在しなくてもよい。これとは対照的に，最初の停止状態の検出とクリック可能状態検出（指先位置が仮想物体の領域内に存在する）とを兼ねてもよい。
　クリック動作検出部１８はクリック動作を検出すると，クリックされた仮想物体の表示の態様を変える。たとえば，第６図に示す拡大された仮想物体の色を変化させるなどである。特定の音を発生させてもよい。これにより，ユーザはクリック動作を完遂したことを認識することができる。クリックされた仮想物体によって表わされるデータ（第６図の例ではＣというキャラクタ）または命令がクリック動作によって入力されたことを表示画面上の入力データ欄（領域），命令実行領域に表示するようにするとよい。
　第１０図は処理装置（コンピュータ）１０がそのクリック動作検出プログラムにしたがって行う処理の流れを示している。
　カメラ１１が撮像し画像メモリ１３に記憶されている最初のフレームの静止画を取出して（Ｓ１１），その画像データに基づいて手領域を抽出し，指先位置を算出する（手領域抽出部１５）（Ｓ１２）。算出された指先位置に基づいてその差分（速度，加速度）を算出する（差分算出部１６）（Ｓ１２）。速度の算出には２フレーム分の画像データが必要であり，加速度の算出には３フレーム分の画像データが必要であるから，第２，第３フレームの静止画像データを取得したときに，速度，加速度の算出が可能となる。
　得られた差分データに基づいてクリック可能状態の検出処理を行い（クリック可能状態検出部１７）（Ｓ１４），クリック可能状態であれば（Ｓ１５でＹＥＳ），仮想物体の表示態様の変化，特定の音の発生等により，クリック可能状態を通知する（Ｓ１６）。
　続いて，移動状態，動作状態の時間的遷移を調べることによりクリック動作検出処理を行う（クリック動作検出部１８）（Ｓ１７）。クリック動作を検出すると（Ｓ１８），仮想物体の表示態様を変化させる，特定の音を発生させるなどによりクリック動作状態を通知する（Ｓ１９）。
　画像メモリ１３に記憶された最終フレームでない限り（Ｓ２０），次のフレームを取得して（Ｓ２１），Ｓ１２からの処理を繰返す。最終フレームになれば，クリック検出処理は終了する。
　第１１図はクリック動作検出を携帯端末装置に適用した使用例（応用例）を示している。携帯端末装置５０の表示画面５１が設けられている面とは反対側の面にカメラ（カメラの視野を鎖線５２で示す）が設けられている。ユーザはカメラの視野内で指先でクリック動作を行う。すなわち，ユーザの手の指をカメラの視野内に置くと，それがカメラで撮影され，表示画面５１に表示される。表示画面５１には仮想物体（アイコン，ボタンなど）が表示されており，ユーザは表示画面５１上においてそこに表示された仮想物体をそこに表示された指先でクリックする。携帯端末装置の表示画面は小さく，かつ人間の実際の指先によってタッチできる領域の大きさには制約がある。表示画面５１に表示するユーザの指先を小さくすれば，多くの小さなアイコン等（仮想物体）を表示画面に配置し，かつ所望のものをクリックすることができるようになる。
　第１２図は逆に大型のディスプレイにクリック動作検出を適用した例を示している。大型の表示装置６０の上部にカメラ６２が設置され，その視野は表示装置６０の表示画面６１の前方の領域に設定されている。表示画面６１が大きすぎてユーザの手が届かないところがあるが，ユーザの手（指）がカメラ６２の視野内にありさえすれば，表示画面６１に表示された仮想物体をカメラ６２が撮影し，表示画面６１に表示された手または指（一般的には左右反転されよう）でクリックすることができる。
　さらにノートパソコンやディスクトップ端末にクリック動作検出を応用することができる。一般的には，これらの端末装置の表示装置の上部にカメラが設けられ，表示画面前方を撮影することになるであろう（第１２図と同じ配置）。もちろん，第１１図の形態と同じように表示画面の反対側をカメラで撮影してもよい。キーボードやマウスのほかに，タッチパネルを備えた端末装置が一般に販売されているが，環境によってはこれらの入力装置が不十分であることがある。たとえば，調理場，風呂場，船上，手術室などでは水や油，血液などで汚れるのでキーボード，マウス，タッチパネルに触れることが難しい。上述したクリック動作検出は非接触でボタン操作を実現するので，これらの状況に対しても端末の利用を可能にする。
　表示画面に表示される仮想物体は平面的（二次元的）な配置のみならず，三次元的に配置された状態の表示であってもよい。
　表示装置の表示画面上に必ずしも仮想物体を表示しなくてもよい。ユーザの手（指）がカメラで撮影されていれば，その動画像信号に基づいてクリック動作を検出することができる。この場合に，クリック可能状態の検出は，表示画面に表示されたユーザの手（指）が表示画面内の所定の領域内で所定時間以上停止状態にあることを検出すればよい。さらに表示装置も必ずしも必要ではない。カメラで手（指）の動きを撮像すれば，その出力画像信号に基づいてクリック動作の検出が可能である。表示装置がない場合には，複数の仮想物体のうちの一つを選択したことがユーザに分りにくい。仮想物体が一つの場合や，仮想物体そのものを想定せずに，何らかの合図としてクリック動作を検出する応用などの場合には必ずしも表示装置は必要ないであろう。 An example of its use will be described with reference to FIG. 1 in order to facilitate an understanding of the outline of the click motion detection device.
A user wears a head mounted display (hereinafter referred to as HMD) 20 on the head. The HMD has a display device, and the user views an image displayed on the display device. It appears as if a virtual object appearing in the image exists at a position (in the air) at a suitable distance in front of the user's eyes. In the example shown in FIG. 1, the virtual object is a plurality of arranged buttons 31, and these buttons 31 are displayed in the display screen 30A of the display device.
The conventional click operation on the computer is to select or specify the target by positioning the cursor in a specific position (target) or area (target) on the display screen and pressing a button on the mouse. Command to execute the instruction. As an operation (operation) similar to this, in this embodiment, a virtual object (displayed position, area, target) on the display screen is selected, and an operation of pressing with a fingertip is performed as if a button is pressed. Is called a click operation (as will be apparent from the following description, selecting a specific virtual object is not necessarily an essential requirement for detecting a click operation).
The user wearing the HMD 20 selects one of a plurality of buttons 31 that appear on the display screen as if they existed in the air in front, and presses the button with his / her fingertip (click Do). The HMD 20 is provided with a camera 11, and images the front, that is, the vicinity where the virtual object 31 exists, and outputs a moving image signal obtained by the imaging. The user's fingertip (finger, hand) is photographed by the camera 11, and the user has selected a specific button and clicked (pressed) the image by processing the moving image signal output from the camera 11. Detected).
FIG. 2 shows the electrical configuration of the click motion detection apparatus of this embodiment.
The processing device 10 is realized by a computer, for example, and functionally includes an image memory 13, a display control unit 14, a hand region extraction unit 15, a difference calculation unit 16, a clickable state detection unit 17, and a click operation detection unit, which will be described in detail later. 18 is provided. The camera 11 is provided in the HMD 20 described above, for example, and its visual field is positioned so as to photograph the vicinity of the hand or finger of the user who performs the click operation. The display device 12 is, for example, a display device (display) equipped in the HMD 20 described above.
The click motion detection device further includes an input device 21, an output device 22, a storage device 19 and the like as necessary. The input device 21 inputs a click operation detection program, parameters, commands, and the like, and is realized by a keyboard, a display screen and a mouse, a communication device, a medium reader, and the like. The output device 22 outputs data input by a click operation, and is realized by a display device (which can also be used as the display device 12), a communication device, a medium writing device, or the like. The storage device 19 stores a click operation detection program, parameters, input data (including data input by the click operation), and the like.
The image memory 13 of the processing device 10 stores still images (signals and data) for at least a plurality of frames of moving images (signals and data) captured by the camera 11 and output from the camera 11. These image data are used in the hand region extraction unit 15 and the difference calculation unit 16.
The display control unit 14 stores image data for displaying a virtual object (button 31 or the like) created in advance, and displays the virtual object on the display screen of the display device 12 based on the image data. In addition, the captured image stored in the image memory 13 is superimposed (synthesized) on the image of the virtual object and displayed on the display screen of the display device 12.
For example, FIG. 3 shows an image 30 </ b> A of the virtual object 31 displayed on the display screen of the display device 12. The upper left corner of the screen is the origin of the XY coordinates. The position and area of the virtual object 31 are predetermined on the XY coordinates. FIG. 4 shows an image 30B for one frame of the moving images captured by the camera 11. A user's hand area 40 and index finger 41 are shown. The origin of the XY coordinates of this image 30B is also determined at the upper left corner.
FIG. 5 shows the display image 30A of the display device 12 including the virtual object 31 shown in FIG. 3 and the image 30B of one frame imaged by the camera 11 with the XY coordinate origins being overlapped (synthesized). ) The obtained image 30B is shown. Eventually, the composite image 30 is displayed on the display screen of the display device 12 under the control of the display control unit 14. In creating a composite image, in the above description, the origins of the XY coordinates of both

images

30A and 30B are made to coincide with each other. A specific point in the image 30A and the image 30B may be matched. Alternatively, one or both of the

images

30A and 30B may be combined after being enlarged or reduced.
The hand area extraction unit 15 calculates a hand area on each frame (image) of the captured image stored in the image memory 13 and specifies the hand area. The hand region can be identified by extracting skin color pixels based on a color range defined in advance by an HSV (Hue, Saturation Value) color system or the like. For example, a representative skin color position is given in the HSV color space, and the color of a region within a certain distance in the color space from the position is defined as the skin color.
Since it is conceivable that an area having a skin color other than the hand is included in the image, the hand area is narrowed down according to the size of the area. A region determined to be a skin color is determined by labeling, and a region having the largest area is estimated as a hand region.
The distribution in the HSV color space of the pixels of the skin color area extracted in the first frame can be reflected in the extraction of the skin color area after the next frame as learning data. An average value and a variance-covariance matrix of the extracted skin color region pixels in the HSV color space are obtained. The Mahalanobis distance between the obtained average value and the variance-covariance matrix is calculated with the pixel color to be determined, and the skin color region of the next and subsequent frames is extracted by comparing it with a threshold value.
FIG. 7 shows a hand region 40 obtained by performing expansion / reduction processing (closing or opening) on a region determined to be a skin color region. This hand region 40 includes a finger (for example, an index finger) 41 used when a click operation is performed, and the tip thereof is considered to be a pixel having the minimum Y coordinate. In order to determine the fingertip area (the swollen part inside the fingertip (opposite the nail)), within the circle of radius R (preset) centered on the tip pixel, the hand (finger) area A distance value 0 is given to the pixels at the boundary, and a larger distance value is given to the pixels that move away from the boundary toward the inside of the hand (finger) region. A pixel having the maximum distance value (point X indicated by reference numeral 42) is defined as a fingertip, and a range having a distance value equal to or greater than a predetermined threshold is defined as a fingertip region.
The x and y coordinates of the pixel of the fingertip determined as described above are the positions on the screen of the fingertip. As an example, the position of the fingertip in the depth direction (z direction) can be determined by using the thickness of the finger as a parameter. For example, a circle with a predetermined radius (or a circle with a radius R) centered on the pixel of the fingertip (or the tip of the finger) is assumed, and the intersection (there are two) of this circle and the boundary of the hand (finger) region The distance between them indicates the thickness w of the finger. If the finger is separated from the camera 11, the thickness of the finger becomes thin, and can be used as a parameter for the position in the z direction (depth direction). As described above, the three-dimensional position (parameter) of the fingertip is determined.
The difference calculating unit 16 uses a plurality of frames of still images stored in the image memory 13 to obtain a fingertip position change (speed) and speed change (acceleration) between temporally adjacent frames. is there. Assuming that changes in fingertip position (x, y, w) between temporally adjacent frames (between two frame images) are dx, dy, dw, a (dx) ² + b (dy) ² + c (dw) ² Find the square root of (where a, b, c are appropriate constants) and use this as the position change, that is, the velocity. Furthermore, a change in speed (using images of three frames that are temporally adjacent) is obtained and used as an acceleration.
When the user clicks on the virtual object 31 with the fingertip, the user often first superimposes the virtual object 31 on which the user wants to click the fingertip (there may not always overlap). The clickable state detection unit 17 detects that the fingertip has overlapped the virtual object 31. Each virtual object (image) 31 on the image 30A has a range (area) on the image 30A. The X and Y coordinates (x, y) of the position of the fingertip of the user determined as described above are within the range of the area of the specific virtual object 31, and this state is determined for a predetermined time (for example, a time of several frames, As an example, it is detected that it is continuing for a period of 4 frames, and the clickable state is detected. When the clickable state is detected, for example, as shown in FIG. 6, a specific virtual object 31 (a button on which a letter C is displayed) with a fingertip overlapped is enlarged and displayed. The display mode of the object 31 is changed. As a result, the user can visually recognize that his / her fingertip is on the target virtual object 31 and the click operation is possible. The overlap between the virtual object and the fingertip can be detected not only on the two-dimensional plane but also three-dimensionally, for example, as an overlap along a certain direction in the three-dimensional space. Note that the clickable state detection unit 17 may be omitted.
As a result of observing the action of clicking on the virtual object with the fingertip, it was found that there is a transition shown in FIG. 8 in the movement of the fingertip.
The stop state means a stationary state in which the position of the fingertip does not change (the change range is a minute value or less) for a predetermined time (several frames, at least one frame time interval) or more. The low-speed state and the high-speed state indicate the state where the fingertip is moving, and the movement speed is relatively low (the speed is 0 or close to 0 and below a predetermined first threshold value). State (in which the speed exceeds the first threshold value) (the condition that acceleration is equal to or greater than a predetermined threshold value may be added) is referred to as a high speed state. Sudden deceleration is a state where the movement of the fingertip rapidly decreases in speed, that is, a state where it decelerates rapidly. The arrows in FIG. 8 indicate the direction of transition. Transitions between the stop state and the low speed state, between the stop state and the high speed state, and between the low speed state and the high speed state are bidirectional. With regard to sudden deceleration, it proceeds from the high speed state to sudden deceleration and finally stops. The movement of the fingertip including this sudden deceleration is characteristic of the click operation. Rapid deceleration refers to a state in which acceleration is negative and below a predetermined threshold. Further, the stop after the rapid deceleration may be performed when the speed is 0 or in the vicinity thereof (below a small threshold value). That is, it is sufficient that the fingertip does not move substantially for at least one frame time interval.
Analyzing the action of clicking on the virtual object analytically, as shown in FIG. 9A, it moves from the stop state to the low speed state, and further becomes a high speed state and suddenly decelerates and stops, as shown in FIG. 9B. In this way, it can be seen that there is an operation of changing from a stopped state to a high-speed state, suddenly decelerating and stopping. The high speed state in FIG. 9B does not necessarily have to be high speed. The first stop state may be considered as a stationary state before moving.
The click operation detection unit 18 detects that a click operation has occurred by detecting that the state transition of the operation shown in FIG. 9A or 9B has occurred. What is characteristic of the click operation is that it suddenly decelerates from the moving state and stops. At the final stop position, it is preferable that the fingertip position exists in the area of the virtual object to be clicked and kept in the state for a predetermined time or longer, but it may be at least one frame time interval. In the sense of stopping before the start of movement, the first stop state exists before the moving state. In this first stop state, the fingertip position does not necessarily exist within the region of the virtual object. Also good. In contrast, the detection of the first stop state and the clickable state detection (the fingertip position exists in the region of the virtual object) may be combined.
When the click operation detecting unit 18 detects the click operation, the click operation detecting unit 18 changes the display mode of the clicked virtual object. For example, the color of the enlarged virtual object shown in FIG. 6 is changed. A specific sound may be generated. As a result, the user can recognize that the click operation has been completed. The data represented by the clicked virtual object (character “C” in the example of FIG. 6) or the input of the instruction by the click operation is displayed in the input data field (area) on the display screen and the instruction execution area. Good.
FIG. 10 shows the flow of processing performed by the processing device (computer) 10 in accordance with the click operation detection program.
A still image of the first frame captured by the camera 11 and stored in the image memory 13 is extracted (S11), a hand area is extracted based on the image data, and a fingertip position is calculated (hand area extraction unit 15). (S12). The difference (speed, acceleration) is calculated based on the calculated fingertip position (difference calculation unit 16) (S12). Since the image data for two frames is necessary for calculating the speed and the image data for three frames is necessary for calculating the acceleration, when the still image data of the second and third frames are acquired, the speed is calculated. , Acceleration can be calculated.
A clickable state detection process is performed based on the obtained difference data (clickable state detection unit 17) (S14). If the clickable state is detected (YES in S15), a change in the display mode of the virtual object, a specific A clickable state is notified by the generation of sound or the like (S16).
Subsequently, a click motion detection process is performed by examining temporal transitions of the movement state and the motion state (click motion detection unit 18) (S17). When the click operation is detected (S18), the state of the click operation is notified by changing the display mode of the virtual object or generating a specific sound (S19).
Unless it is the last frame stored in the image memory 13 (S20), the next frame is acquired (S21), and the processing from S12 is repeated. When the last frame is reached, the click detection process ends.
FIG. 11 shows a usage example (application example) in which the click motion detection is applied to a mobile terminal device. A camera (the camera's field of view is indicated by a chain line 52) is provided on the surface opposite to the surface on which the display screen 51 of the mobile terminal device 50 is provided. The user performs a click operation with a fingertip within the field of view of the camera. In other words, when the user's finger is placed within the field of view of the camera, it is photographed by the camera and displayed on the display screen 51. A virtual object (icon, button, etc.) is displayed on the display screen 51, and the user clicks the virtual object displayed there on the display screen 51 with the fingertip displayed there. The display screen of the portable terminal device is small, and the size of the area that can be touched by a human fingertip is limited. If the user's fingertip displayed on the display screen 51 is reduced, many small icons (virtual objects) can be arranged on the display screen and a desired one can be clicked.
FIG. 12 shows an example in which click motion detection is applied to a large display. A camera 62 is installed on the upper portion of the large display device 60, and its field of view is set in a region in front of the display screen 61 of the display device 60. Although the display screen 61 is too large to reach the user's hand, the camera 62 captures the virtual object displayed on the display screen 61 as long as the user's hand (finger) is within the field of view of the camera 62. , It can be clicked with a hand or a finger (generally reversed horizontally) displayed on the display screen 61.
Furthermore, click motion detection can be applied to notebook computers and desktop terminals. Generally, a camera is provided on the upper part of the display device of these terminal devices, and the front of the display screen will be photographed (the same arrangement as in FIG. 12). Of course, the opposite side of the display screen may be photographed with a camera as in the embodiment of FIG. In addition to keyboards and mice, terminal devices with a touch panel are generally sold, but these input devices may be insufficient depending on the environment. For example, it is difficult to touch the keyboard, mouse, and touch panel in kitchens, bathrooms, boats, operating rooms, etc. because they are contaminated with water, oil, blood, etc. The click operation detection described above realizes button operation without contact, so that the terminal can be used even in these situations.
The virtual object displayed on the display screen may be not only a planar (two-dimensional) arrangement but also a display in a three-dimensional arrangement.
The virtual object does not necessarily have to be displayed on the display screen of the display device. If the user's hand (finger) is captured by the camera, the click operation can be detected based on the moving image signal. In this case, the clickable state may be detected by detecting that the user's hand (finger) displayed on the display screen is in a stopped state for a predetermined time or more within a predetermined area in the display screen. Further, a display device is not always necessary. If the movement of the hand (finger) is imaged by the camera, the click operation can be detected based on the output image signal. When there is no display device, it is difficult for the user to know that one of the plurality of virtual objects has been selected. In the case of a single virtual object or an application that detects a click operation as a cue without assuming the virtual object itself, a display device is not necessarily required.

　仮想物体のクリック動作検出装置，方法，プログラムは，ヘッドマウントディスプレイ，携帯端末装置，通常のパーソナルコンピュータ，大型ディスプレイ等に応用することができる。 The virtual object click motion detection device, method, and program can be applied to head mounted displays, portable terminal devices, ordinary personal computers, large displays, and the like.

Claims

あらかじめ作成された仮想物体の画像と撮像装置から入力される動画像とを合成し，表示装置に表示する表示制御手段，
　前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出する手領域抽出手段，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求める差分算出手段，および
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するクリック動作検出手段，
　を備えるクリック動作検出装置。 Display control means for synthesizing an image of a virtual object created in advance and a moving image input from an imaging device and displaying the synthesized image on a display device;
Hand region extracting means for extracting at least part of a hand region appearing in each frame image constituting the moving image;
A difference calculating means for obtaining a difference amount regarding a specific portion of the extracted hand region between temporally adjacent frames, and a movement of the specific portion of the hand region by examining a time transition of a state represented by the difference amount Click action detecting means for detecting the click action represented by
A click motion detection device comprising:
前記表示装置に表示されている仮想物体の画像の所定領域内で，前記抽出された手領域の前記特定部分が，所定時間以上，動きを停止していることを検出してクリック可能状態と判断するクリック可能状態検出手段をさらに備える，請求の範囲第１項に記載のクリック動作検出装置。 In the predetermined area of the image of the virtual object displayed on the display device, it is determined that the specific part of the extracted hand area has stopped moving for a predetermined time or more and is determined to be clickable. The click motion detection device according to claim 1, further comprising a clickable state detection means for performing the operation.
前記クリック可能状態検出手段がクリック可能状態と判断したときに，前記表示制御手段は前記仮想物体の画像の所定領域に関連する部分の表示態様を変化させる，請求の範囲第２項に記載のクリック動作検出装置。 The click according to claim 2, wherein the display control means changes a display mode of a portion related to a predetermined area of the image of the virtual object when the clickable state detecting means determines that the clickable state is detected. Motion detection device.
前記クリック動作検出手段がクリック動作を検出したときにその旨を報知する報知手段をさらに備える，請求の範囲第１項から第３項のいずれか一項に記載のクリック動作検出装置。 The click motion detection device according to any one of claims 1 to 3, further comprising notification means for notifying that when the click motion detection means detects a click motion.
前記差分算出手段は，抽出した手領域の特定部分の速度情報および加速度情報を算出するものである，請求の範囲第１項に記載のクリック動作検出装置。 2. The click motion detection device according to claim 1, wherein the difference calculation means calculates speed information and acceleration information of a specific part of the extracted hand region.
前記クリック動作検出手段は，前記手領域の前記特定部分の動きが，動いている状態から，急減速して停止したことを検出してクリック動作と判断するものである，請求の範囲第１項に記載のクリック動作検出装置。 The click operation detecting means detects that the movement of the specific part of the hand region has been suddenly decelerated and stopped from a moving state, and determines that it is a click operation. Click motion detection device described in 1.
前記クリック動作検出手段は，前記手領域の前記特定部分の動きに停止状態，運動状態，急減速，そして停止状態の遷移があったときにクリック動作と判断するものである，請求の範囲第１項に記載のクリック動作検出装置。 The click operation detecting means determines a click operation when the movement of the specific portion of the hand region includes a stop state, a motion state, a rapid deceleration, and a stop state transition. The click motion detection device according to the item.
前記クリック動作検出手段は，急減速から，表示されている仮想物体の画像の所定領域内で停止したときにクリック動作と判断する，請求の範囲第６項または第７項に記載のクリック動作検出装置。 The click operation detection unit according to claim 6 or 7, wherein the click operation detection means determines that the click operation is performed when the stop is performed within a predetermined area of the image of the displayed virtual object from sudden deceleration. apparatus.
撮像装置から入力される動画像を表示装置に表示する表示制御手段，
　前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出する手領域抽出手段，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求める差分算出手段，および
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するクリック動作検出手段，
　を備えるクリック動作検出装置。 Display control means for displaying a moving image input from the imaging device on a display device;
Hand region extracting means for extracting at least part of a hand region appearing in each frame image constituting the moving image;
A difference calculating means for obtaining a difference amount regarding a specific portion of the extracted hand region between temporally adjacent frames, and a movement of the specific portion of the hand region by examining a time transition of a state represented by the difference amount Click action detecting means for detecting the click action represented by
A click motion detection device comprising:
前記表示装置の表示画面内の所定領域内で，前記抽出された手領域の前記特定部分が所定時間以上，動きを停止していることを検出してクリック可動状態と判断するクリック可能状態検出手段をさらに備える，請求の範囲第９項に記載のクリック動作検出装置。 Clickable state detecting means for detecting that the specific portion of the extracted hand region has stopped moving for a predetermined time or more in a predetermined region in the display screen of the display device, and determining that it is in a click movable state The click motion detection device according to claim 9, further comprising:
撮像装置から入力される動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出する手領域抽出手段，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求める差分算出手段，および
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するクリック動作検出手段，
　を備えるクリック動作検出装置。 Hand region extraction means for extracting at least a part of a hand region appearing in each frame image constituting a moving image input from the imaging device;
A difference calculating means for obtaining a difference amount regarding a specific portion of the extracted hand region between temporally adjacent frames, and a movement of the specific portion of the hand region by examining a time transition of a state represented by the difference amount Click action detecting means for detecting the click action represented by
A click motion detection device comprising:
あらかじめ作成された仮想物体の画像と撮像装置から入力される動画像とを合成して表示装置に表示し，
　前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求め，そして
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出する，
　クリック動作検出方法。 The image of the virtual object created in advance and the moving image input from the imaging device are combined and displayed on the display device,
Extracting at least a part of a hand region appearing in each frame image constituting the moving image;
Represented by the movement of the specific part of the hand region by determining the amount of difference for the specific part of the extracted hand region between temporally adjacent frames and examining the time transition of the state represented by the difference amount Detect click activity,
Click motion detection method.
撮像装置から入力される動画像を表示装置に表示し，
　前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求め，そして
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出する，
　クリック動作検出方法。 Display the moving image input from the imaging device on the display device,
Extracting at least a part of a hand region appearing in each frame image constituting the moving image;
Represented by the movement of the specific part of the hand region by determining the amount of difference for the specific part of the extracted hand region between temporally adjacent frames and examining the time transition of the state represented by the difference amount Detect click activity,
Click motion detection method.
撮像装置から入力される動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求め，そして
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出する，
　クリック動作検出方法。 Extracting at least a part of a hand region appearing in each frame image constituting a moving image input from the imaging device;
Represented by the movement of the specific part of the hand region by determining the amount of difference for the specific part of the extracted hand region between temporally adjacent frames and examining the time transition of the state represented by the difference amount Detect click activity,
Click motion detection method.
あらかじめ作成された仮想物体の画像と撮像装置から入力される動画像とを合成し，表示装置に表示させ，
　前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求め，そして
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するようにコンピュータを制御する，
　クリック動作検出のためのプログラム。 The virtual object image created in advance and the moving image input from the imaging device are combined and displayed on the display device.
Extracting at least a part of a hand region appearing in each frame image constituting the moving image;
Represented by the movement of the specific part of the hand region by determining the amount of difference for the specific part of the extracted hand region between temporally adjacent frames and examining the time transition of the state represented by the difference amount Control the computer to detect click activity,
A program for detecting click motion.
撮像装置から入力される動画像を表示装置に表示させ，
　前記動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求め，そして
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するようにコンピュータを制御する，
　を備えるクリック動作検出のためのプログラム。 Display the moving image input from the imaging device on the display device,
Extracting at least a part of a hand region appearing in each frame image constituting the moving image;
Represented by the movement of the specific part of the hand region by determining the amount of difference for the specific part of the extracted hand region between temporally adjacent frames and examining the time transition of the state represented by the difference amount Control the computer to detect click activity,
A program for detecting a click action.
撮像装置から入力される動画像を構成する各フレーム画像内に現われる手領域の少なくとも一部を抽出し，
　時間的に隣接するフレーム間の，抽出した手領域の特定部分に関する差分量を求め，そして
　前記差分量によって表わされる状態の時間遷移を調べることによって，前記手領域の前記特定部分の動きによって表わされるクリック動作を検出するようにコンピュータを制御する，
　クリック動作検出のためのプログラム。 Extracting at least a part of a hand region appearing in each frame image constituting a moving image input from the imaging device;
Represented by the movement of the specific part of the hand region by determining the amount of difference for the specific part of the extracted hand region between temporally adjacent frames and examining the time transition of the state represented by the difference amount Control the computer to detect click activity,
A program for detecting click motion.
請求の範囲第１５項，第１６項および第１７項のいずれか一項に記載のプログラムを記録したコンピュータ読取可能な記憶媒体。 A computer-readable storage medium storing the program according to any one of claims 15, 16, and 17.