JP2004272515A

JP2004272515A - Interface method, device, and program

Info

Publication number: JP2004272515A
Application number: JP2003061227A
Authority: JP
Inventors: Hidenori Sato; 秀則佐藤; Hidekazu Hosoya; 英一細谷; Yoshinori Kitahashi; 美紀北端; Ikuo Harada; 育生原田; Hisao Nojima; 久雄野島; Akira Onozawa; 晃小野澤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-03-07
Filing date: 2003-03-07
Publication date: 2004-09-30
Anticipated expiration: 2023-03-07
Also published as: JP3860550B2

Abstract

PROBLEM TO BE SOLVED: To provide a non-wearable interface device that acquires depth information precisely and realizes three-dimensional pointing into a real space.
SOLUTION: A distance information generating part 12 generates distance information from a reference camera 11_1 to a photographed object by stereo image processing of a reference image I_1 and an image I_2 obtained by photographing a real space. A voxel space generating part 13 generates a voxel space while hierarchically dividing it according to the distance information. A start/end point two-dimensional coordinate calculating part 14 calculates the two-dimensional coordinates of two sites (start point and end point) of the user's body by using the reference image I_1, and a start/end point three-dimensional coordinate calculating part 15 calculates the three-dimensional coordinates of the start point and end point from the distance information and the two-dimensional coordinates. A three-dimensional pointing position detecting part 16 detects the three-dimensional position in the photographed real space indicated by the user. A display information reversing part 18 reverses the reference image, the voxel space, and the three-dimensional information left to right. An information display part 19 displays the reversed data on a display 10, superimposed on one another.
COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ユーザーが体の部位（指や腕等）を用いて指し示した撮影空間内の物体を検出するインタフェース装置に関する。
【０００２】
【従来の技術】
これまで、コンピュータと人間とのインタフェースに関し、人間の動作に基づくインタフェース方法や装置としては、以下に挙げるような手法がある。
【０００３】
第１の従来手法として、体に手や指の動作計測可能なセンサを装着し、センサ情報からユーザーの動きを検出する装置がある。例えば、磁気センサを用いた装置（ＡＳＣＥＮＳＩＯＮ社の「ＭｏｔｉｏｎＳｔａｒ」等）や、機械式センサを用いた装置（スパイス社の「Ｇｙｐｓｙ」、Ｉｍｍｅｒｓｉｏｎ社の「ＣｙｂｅｒＧｌｏｖｅ」等）等の市販製品がある。また、非特許文献１に記載された方法がある。これは加速度センサ等を取り付けたグローブを手に装着して、ユーザーの動作を認識するものである。
【０００４】
第２の従来手法として、非特許文献２に記載された手法がある。本手法は、腕の制約条件とユーザーの実際の腕の長さを利用して、１台のカメラの入力画像から、腕領域を抽出し、その長さの時間的変化から腕の動きの３次元的変化を推定するものである。
【０００５】
第３の従来手法として、非特許文献３に記載された手法がある。本手法は、体の中に求めた座標位置（「仮想投射中心」）と、検出した指先の座標位置を結び、延長した指示軸線がスクリーン（入力画像を表示している表示装置の画面）と交差する点をカーソル位置（指示位置）とする方法である。仮想投射中心位置は、スクリーンの４つの角位置から指先へ延長した直線の交点から求めているので、指示できる位置はスクリーン上のみである。指先の位置の抽出は、２台のカメラを用い、１台を上部から撮影する位置に設置することにより、スクリーンに最も近い物体を検出することで実現している。
【０００６】
第４の従来手法として、非特許文献４に記載された手法がある。本手法は、多眼ステレオカメラを用いて生成した距離情報を用いて、スクリーンに最も近い物体を指先として検出し、また色情報と距離情報を用いて眉間（目）の位置を検出し、これらを結んだ延長線がスクリーンと交差する点をカーソル位置（指示位置）とする方法である。
【０００７】
【非特許文献１】
塚田ら，“Ｕｂｉ−Ｆｉｎｇｅｒ：モバイル指向ジェスチャ入力デバイスの試作”，インタラクティブとソフトウェアに関するワークショップ（ＷＩＳＳ２００１），ｐｐ．１１９−１２４，２００１
【非特許文献２】
安部ら，“オプティカルフローと色情報を用いた腕の動作の３次元追跡”，画像の認識・理解シンポジウム（ＭＩＲＵ２００２），ｐｐ．Ｉ２６７−Ｉ２７２，２００２
【非特許文献３】
福本ら，“動画像処理による非接触ハンドリーダ”，第７回ヒューマン・インタフェース・シンポジウム論文集，ｐｐ．４２７−４３２，１９９１
【非特許文献４】
金次ら，“指さしポインターにおけるカーソル位置の特定法”，電子情報通信学会画像工学研究会，２００２．１
【０００８】
【発明が解決しようとする課題】
しかしながら、上述した従来の手法では、以下に示す問題があった。
【０００９】
第１の従来手法では、手または指の動作を認識できるが、認識したい部位に常に何らかのセンサを装着する必要があるため、実用的なインタフェース装置としての利便性に欠ける。
【００１０】
第２の従来手法では、体に何も装着せずに腕の動作を認識できるが、カメラ１台のみの情報を使っているので奥行き方向の情報が直接得られないため、３次元的なユーザーの腕の動きを精度良く抽出できない。
【００１１】
第３、第４の従来手法では、ユーザーが非装着かつ非接触に、３次元的な動作を認識して、スクリーン画面上の位置を指示することができるポインティング手法であるが、指し示せるのはスクリーン画面内に限定されており、それ以外の方向にある実物体や位置を指し示すことはできない。
【００１２】
本発明の目的は、これら従来手法の、装着型のため利便性に欠ける問題、カメラ１台利用の方法では奥行き精度が悪い問題、を解決し、かつ実空間との位置の対応付けが簡単であり、ユーザーが視覚的に実物体の位置、情報を登録することもできるし、基準カメラから直接見えない位置にある実物体情報を登録することもできる、インタフェース方法、装置、およびプログラムを提供することにある。
【００１３】
【課題を解決するための手段】
上記目的を達成するために、本発明のインタフェース装置は、
基準カメラを含む複数のビデオカメラを用いて、実空間を撮影した画像を入力する手段と、
基準カメラから撮影物体までの距離情報を前記撮影画像を用いたステレオ法により生成する手段と、
基準カメラで撮影した画像である基準画像からの距離情報を用いてボクセル空間を生成する手段と、
基準カメラからの撮影画像上で、ユーザーの体の予め定めた部位２箇所を始点と終点として検出し、該画像上でのそれらの２次元座標を算出する手段と、
距離情報と、始点・終点の画像上での２次元座標とから、実空間上における始点・終点の３次元座標を算出し、ユーザーが指し示す実空間上での方向情報を算出する手段と、
生成されたボクセル空間上の当該３次元位置に、実空間上に存在する実物体の情報、すなわち一部もしくは全体に渡る３次元位置情報およびその付加情報を登録する手段と、
ユーザーの指し示す実空間上での方向情報と、登録されている物体の３次元位置情報とを照合し、操作者が指し示す３次元指示位置情報である、ユーザーが指し示す方向の延長線と、登録された物体との交点に関する情報を検出する手段と、
生成されたボクセル空間、登録過程および結果、３次元指示方向情報、および指示された物体の情報と撮影画像とを重畳表示する手段を有している。
【００１４】
ここで、主な用語について説明する。
１．ボクセル空間
立方体形状を持つ単位３次元空間である“ボクセル”の集合である。図１５に、ボクセルとボクセル空間との関係、それの階層分割のイメージを示す。
２．ステレオ法
ステレオ視（三角測量）の原理を用いて、基準カメラからの画像を含む２枚以上の入力画像から、基準カメラから見た撮影物体までの距離を測定する手法である。
３．基準カメラ
ステレオ法実行時に、距離情報生成のための基準となるカメラである。基準カメラで撮影した画像を基準画像と呼ぶ。
【００１５】
本発明によれば、非装着であり、奥行き情報を精度良く得られ、実空間への３次元的なポインティングを実現できる。また、基準カメラから見たボクセル空間を生成し、それを実画像と重畳して視覚的に表示することにより、実物体情報の登録も簡単にできるし、基準画像から見えない位置にある実物体の情報も登録できる。
【００１６】
本発明の実施態様では、インタフェース装置は、基準画像を左右反転した画像を生成する手段を有し、ボクセル空間も前記反転画像に合わせて、左右反転させた状態で生成し、重畳表示手段における各種表示情報も、該反転画像上に重畳表示する。
【００１７】
そのため、前記の利点に加え、自己画像が写っている鏡を見ながらポインティング操作を行っているような、より直接性、直感性が高まったインタフェースとすることができる。
【００１８】
本発明の他の実施態様では、インタフェース装置は、前記距離情報を生成する手段の代りに、ビデオカメラと投光装置を用いた能動的なステレオ法により距離情報を生成する手段を有している。
【００１９】
そのため、同様に、前記問題を解決でき、かつ物体情報登録に関する利便性、簡便性も高いインタフェースとすることができる。
【００２０】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して説明する。
【００２１】
［第１の実施形態］
図１は本発明の第１の実施形態のインタフェース装置の構成図、図２はその処理の流れを示すフローチャートである。
【００２２】
本実施形態のインタフェース装置は、複数のカメラで撮影された画像を入力画像とし、ユーザーが体の部位（指や腕等）を用いて指し示した実空間内の位置（もしくは物体）を検出するインタフェース装置において、ユーザーの直接的で直感的な３次元的指示動作に基づき、３次元空間上での指示位置を認識することができる装置で、かつその操作時にユーザーが自己画像を見ながらインタフェース動作を行える装置である。
【００２３】
本インタフェース装置は画像入力部１１_１，１１_２と距離情報生成部１２とボクセル空間生成部１３と始終点２次元座標算出部１４と始終点３次元座標算出部１５と３次元指示位置検出部１６と空間情報登録部１７と反転情報生成部１８と情報表示部１９と空間情報データ２０から構成される。
【００２４】
画像入力部１１_１，１１_２としては、図１のように２台（もしくは３台以上）のビデオカメラを用いる。カメラは一般に用いられるビデオカメラやＣＣＤカメラでよく、白黒でもカラーでもよい。ただし、後述する色情報を使用した方法を用いる場合はカラーカメラが必要である。
【００２５】
距離情報生成部１２は、基準画像（基準画像のイメージを図３に示す）を含む２枚以上の入力画像から、ステレオ法による画像処理を用いて基準カメラから撮影物体までの距離情報を生成する（ステップ２１）。距離情報を生成する具体的な画像処理方法の例としては、市販の製品、ＰｏｉｎｔＧｒｅｙＲｅｓｅａｒｃｈ社のＤｉｇｉｃｌｏｐｓ（３眼カメラ式）やＢｕｍｂｌｅｂｅｅ（２眼カメラ式）等を用いる方法がある。これらは各々、３台もしくは２台のカメラが内蔵された画像入力機器であり、出力として距離情報を生成できるものである。また、ステレオ法を用いる手法は、画像処理分野において一般的である（発表文献多数）ので、任意の２台以上のカメラを用いて自作することも可能である。図４に、図３の基準画像上の一本の画素のライン上ｘ軸方向に沿って得られる距離情報のイメージを表す。図で、距離軸方向が基準カメラからの距離を表しており、離れていればより大きな値をとり、すなわち基準カメラからより遠い位置にあることを意味している。
【００２６】
ボクセル空間生成部１３は、距離情報生成部１２において求められた距離情報に合わせて、ボクセル空間を、階層分割しながら作成する（ステップ２２）。例えば、図５では、得られた距離情報に対し、４×３×３個のボクセルからなる初期ボクセル空間を生成している。ここでは、初期ボクセル空間を指定階層まで階層的に分割し、各階層のボクセルに、距離情報をもとに、実物体が存在し得るかどうかのフラグを立てていく。以下、説明を簡単にするため、図４の距離情報を表すグラフ（矩形グラフ）を用いて、ボクセルへのフラグの立て方、および２次元的な階層分割法を説明する。図６に分割例を示す。各ボクセルには、白、灰色、黒、の三種類のうち、いずれかひとつのフラグが与えられる。すなわち、図６に示したように、距離情報を表すグラフが着目ボクセルと交差する場合は灰色、完全に上側を通る場合は黒、完全に下側を通る場合は白、とそのボクセルのフラグとする。すなわち、距離グラフと各ボクセルとが交差するか否かを判定し、その結果に従い、フラグを決定する。このうち、灰色と判定されたボクセルだけを、図６に示すようにさらに４等分に再分割し（３次元のボクセルの場合は、図１５に示すように、８等分に再分割することとなる）、再びフラグを与えていく。以上をあらかじめ与えたボクセルの大きさになるか、あらかじめ決められた分割数に到達するまで繰り返し行う。このようにして再帰分割を行うと、距離グラフと交差するボクセルが再帰的に分割されていき、それ以外の距離的に意味をなさない領域が白または黒のフラグをとる分割されないボクセルで表されることとなる。このようにして生成されたボクセル空間は、図７に示したようなオクトリー構造をとることにより、効率的、高速に、親ボクセルや子ボクセルとの関係を引き出すことができる。なお、ボクセルのフラグについては第４のものとして、登録フラグも存在する。これについては後述する。
【００２７】
始終点２次元座標算出部１４は、距離情報生成部１２で用いた入力画像のうち、基準画像を用いて、ユーザーの体の予め定める部位２箇所（始点と終点）の画面上での２次元座標を算出する（ステップ２３）。始点と終点は、例えば、ユーザーの肩の位置を始点とし、手の位置を終点とすることが考えられる。これにより、この場合は腕を伸ばした手の先の方向が後述する３次元指示方向となる。右肩と右手の位置を始点・終点とした場合の画面上での具体的な検出方法について、以下に示す。
【００２８】
右手の位置を画像処理により検出する方法としては、例えば、入力画像をカラー画像とした場合、カラー画像中の色情報を用いて、肌色領域（肌色の取り得る範囲を任意に幅を持たせた色の値の範囲で指定する。）を抽出する方法がある。得られた複数の肌色領域の中から、大きさや位置等の制約情報（例えば、手の大きさから推測される可能性のある肌色領域の面積範囲を指定したり、画面上の天井付近や床付近等手が存在する可能性の低いところを除外したりする等の制約。または距離情報を用いることも考えられる。）を利用して、手の肌色領域の候補を選択する。さらに、１）通常ユーザーは衣服を着ていると考えることができ、肌色領域の候補となる可能性が高い肌色領域は両手と顔と考えられる、２）最も面積の大きい領域は顔と考えられる、といった性質を用いて、２番目と３番目に面積が大きい肌色領域を手の候補とする。右手は、顔の右側にある領域であり、その重心位置を右手位置とする。左手位置を求めたい場合は、左右逆に考えればよく、両手の場合は、ふたつの場合を組合わせて考えればよい。
【００２９】
また、右肩の位置を画像処理により検出する方法としては、例えば、上記の手段で顔の位置を抽出してから肩の位置を算出する方法がある。具体的には、まず、前記の肌色抽出処理を行った結果から、最も面積が大きい肌色領域は顔の可能性が高いので、その肌色領域を顔と判断し、その重心を求める。次に、右肩の位置は顔の重心位置から、下へある程度の距離、右へある程度の距離ずらしたものと仮定することができるので、予めそのずらす距離を決めておいて（個人差があるのでユーザーによって値を変えてもよい）、顔の重心位置から右肩の位置を算出することができる。これらにより、始点・終点の画面上での２次元座標を求めることができる。また、ここでは肩の位置を始点としているが、前記により求められる顔の重心位置をそのまま始点としてもよい。その場合、顔の位置と手の位置を結ぶ延長線がユーザーの指示方向となる。
【００３０】
始終点３次元座標算出部１５は、生成された距離情報と、始点・終点の２次元座標から、始点・終点の３次元座標値を求める（ステップ２４）。具体的な方法としては、例えば、基準画像上で、始点の画像上での２次元座標に相当する距離値を始点の距離値とする。終点も同様である。３次元の実空間上において、画像の２次元座標系と３次元座標系の変換式は、一般に予め容易に算出しておけるので、それに基づいて得られた画面上での始点および終点の２次元座標値とその各距離値から、始点と終点の３次元空間上での３次元座標値を求めることができる。さらに、得られた始点・終点の２つの３次元座標値から、２点を結ぶ３次元直線を求めることにより、ユーザーの指示方向を求めることができる。
【００３１】
３次元指示位置検出部１６は、ユーザーが指し示した撮影実空間中の３次元位置を検出する（ステップ２５）。具体的な方法としては、まず始終点３次元座標算出部１５で求められた、始点と終点を結ぶ３次元直線を手（終点）方向に延長していく。このとき、ボクセル空間を用いて、該延長線が、予め登録されている空間中の物体等の３次元位置情報が登録されているボクセルと交差するか否かを検出する。そのようなボクセルには、後述するように登録フラグが立っており、容易に判別できる。ここでも、図４の距離情報の例をもとに、２次元的に交差ボクセルを求める方法を述べる。今、図８中の黒いボクセルが、物体情報が登録されたボクセル（登録ボクセル）で、ユーザーがその方向を指しているものとする。この時、該ボクセルが指示方向の延長線と交わるか否かを判定するには、該延長線が登録ボクセルを通過するか否かを上位層のボクセルから計算していく。通過する登録ボクセルが見つかった場合、それの下位層のボクセルについても、登録ボクセルを通過するか否かを計算する。この通過判定を、情報が実際に登録されたボクセルが見つかるまで繰り返し行う。該延長線がボクセルを通過するか否かの判定は、ボクセルが規則的な立方体形状をとるため極めて簡単に計算でき、さらに上位層のボクセルから下位層に向かって判定していくため、計算の早い段階で情報が登録されていない大きな領域を除外することができ、極めて高速に、登録情報が存在する３次元指示位置情報を探し出すことができる。ここで、空間中の物体等の情報については、空間情報登録部１７の説明において説明する。得られた３次元指示位置情報は空間情報データ２０として蓄積され、情報表示部１９において、ディスプレイ１０上に表示される。
【００３２】
空間情報登録部１７は、ユーザーが指示する可能性のある撮影実空間中の物体等の情報を空間情報データ２０に登録する（ステップ２６）。実空間中の物体等としては、例えば、ユーザーが部屋の中にいる場合には、部屋の中にある家電機器等（テレビ、エアコン、コンピュータ、時計、窓、棚、椅子、机、引出し、書類、オーディオ機器、照明機器等）の物体や、また部屋自体の壁、床、天井、窓等、任意のものが対象として考えられる。ここでは、情報表示部１９を使って、ボクセル空間を基準画像に重畳表示させながら、会話処理により、実物体情報を３次元的に登録していく。図９に、基準画像上に初期ボクセル空間を重畳表示したイメージ図を示す。この結果に対し、情報を登録したい位置を表わすボクセルを選択し、それに情報を登録していくことを行う。この際には、図１０に示したように、ある着目ボクセルのみを表示させたり、マウスの移動操作により、それの隣接ボクセルを表示させたり、マウスのクリック操作により下位層のボクセルを表示させたりしながら、目的ボクセルを絞り込んでいくことが考えられる。なお、この表示法やボクセルの選択、移動法は本方法に限ったものではなく、例えば、実空間ではなく、ボクセル空間を指示対象とした本発明におけるポインティング操作を行うことも考えられる。また、同じ情報を複数のボクセルに登録することも考えられる。また、本会話処理時に、白フラグや黒フラグのボクセルも強制的に分割できる操作を加えれば、実物体がない空間や、基準画像から隠れた場所にも、情報を登録することができる。最後に、情報を登録したボクセルとそれが属する最上位レベルから最下位レベルまでの全ボクセルについては、第４のフラグとして、登録フラグを立てていく。
【００３３】
これらの情報（３次元位置の座標情報やその他物体に関する情報等）は、実座標を利用して、予め空間情報データ２０に登録・保存しておき、それをボクセルに自動で割り当てることも考えられる。また、情報の登録に関しては、予め固定の３次元位置座標としておくのではなく、対象とする実物体毎に位置認識可能なセンサ（市販されている磁気センサ、超音波センサ、赤外線タグ、無線タグ等）を取り付けておくことにより、各々の物体の位置をリアルタイムに認識することができるので、それらにより得られた３次元位置情報から該物体情報を生成し、常時その物体の３次元位置座標等の情報を更新していくことも可能である。この場合、物体を移動させても３次元位置情報等をリアルタイムに更新させることができる。
【００３４】
表示情報反転部１８は、基準画像を左右反転するとともに、ボクセル空間や３次元位置情報も合わせて左右反転する（ステップ２７）。基準画像とそれを左右反転させた画像のイメージを図１１に示す。この場合は、ユーザーが、鏡を見ながらポインティング動作を行うのに近い像が得られる。基準画像の左右反転は、コンピュータ内へ取り込んだ入力画像に対し市販の汎用画像処理ソフトウェア（例：ＨＡＬＣＯＮ）により、リアルタイムに実行することができる。または、入力画像を入力し反転画像をリアルタイムに生成する市販の機器（例：（株）朋栄の画面左右反転装置ＵＰＩ−１００ＬＲＦ、またはカメラ一体型でＳＯＮＹのＥＶＩ−Ｄ１００）でも実現できる。また、本情報反転部１８が無い実施形態も考えられる。その場合には、そのままの座標で表示される。
【００３５】
情報表示部１９は、例えば、ポインティング動作中においては、１）左右反転した基準画像、２）同ボクセル空間、３）該３次元指示方向、および位置を表すＣＧをディスプレイ１０に重畳表示する（ステップ２８）。その場合の各画像の例を図１２に、重畳表示した結果を図１に示す。ディスプレイ１０は、コンピュータ用に使われる汎用のディスプレイでよく、コンピュータの画面とカメラ画像を表示できるものであればよい。なお、表示方法は本例に限るものではなく、例えばボクセル空間の表示を、一部着目しているボクセルのみに限ったり、全く行わなかったりする、ということも考えられる。
【００３６】
［第２の実施形態］
図１３は本発明の第２の実施形態のインタフェース装置の構成図、図１４はその処理の流れを示すフローチャートである。
【００３７】
本実施形態のインタフェース装置は、第２の実施形態において、２個以上のカメラからの入力画像から生成する受動的なステレオ法を用いる代わりに、カメラ１１と投光装置３１を用いた能動的なステレオ法により距離情報を生成するものである。両者を混在させることも可能である。
【００３８】
２個以上の画像から生成する受動的なステレオ法とは、例えば視線方向の近い２個のカメラの入力画像同士間で、対応する点を探し（対応点探索を行い）、その座標値のずれの大きさ（視差）からその点の距離を求める手法である。距離の計算には、三角測量の原理を用いている。この手法は、対応点探索が難しく精度良い距離情報が得られにくい問題があるが、光を照射するなどの能動的な動作や装置は必要なく、撮影環境等に影響されない利点を持っている。例えば、市販の製品で、ＰｏｉｎｔＧｒｅｙＲｅｓｅａｒｃｈ社のＤｉｇｉｃｌｏｐｓ（３眼カメラ式）やＢｕｍｂｌｅｂｅｅ（２眼カメラ式）等がある。
【００３９】
これに対し、投光装置を用いた能動的なステレオ法とは、２個のカメラのうち１台を、光を投射する光源に置き換え、対応点探索のための手がかりとなる情報を対象物に直接投射する手法である。光は、スリット光、スポット光、多種に変化するパターン光など、各種の光を用いる方法もしくは製品が提案もしくは市販されている。この手法は、光を投射する複雑な装置が必要であり、また撮影環境にも影響される問題があるが、対応点探索は安定して行えるので、精度良く距離画像を求めることができる利点を持っている。例えば、市販の製品で、ＮＥＣエンジニアリング社のＤａｎａｅ−Ｒ（非接触型３次元形状計測用レンジファインダ）等がある。
【００４０】
これら２つのステレオ法はいずれも距離情報を求めることができるので、互いに置き換えることが可能である。よって、２台以上のカメラだけを使うのではなく、１台以上のカメラと投光装置を用いた能動的なステレオ法も利用可能とすることにより、利用できる手法も市販機器も広くなり汎用性を高めることができるとともに、応用先を広げることができる。
【００４１】
なお、本発明は専用のハードウェアにより実現されるもの以外に、その機能を実現するためのプログラムを、コンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行するものであってもよい。コンピュータ読み取り可能な記録媒体とは、フロッピーディスク、光磁気ディスク、ＣＤ−ＲＯＭ等の記録媒体、コンピュータシステムに内蔵されるハードディスク装置等の記憶装置を指す。さらに、コンピュータ読み取り可能な記録媒体は、インターネットを介してプログラムを送信する場合のように、短時間の間、動的にプログラムを保持するもの（伝送媒体もしくは伝送波）、その場合のサーバとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含む。
【００４２】
【発明の効果】
以上説明したように、本発明は、下記の効果がある。
【００４３】
請求項１、４、７の発明は、非装着なインタフェースであるため、ユーザーの利便性が向上する。また、ステレオ法で得られる距離情報を利用したボクセル空間を利用するため、実物体情報の登録や、ユーザーの３次元的なポインティングの指示位置検出も効率的に行うことができる。さらに、３次元的なポインティングの指示先として、画面上だけでなく実空間上の位置もポインティング可能であり、応用先を広げることができる。
【００４４】
請求項２、５、７の発明は、請求項１の効果に加え、鏡のメタファを用いたインタフェース動作を行えるため、ユーザーの利便性をより向上させることができる。
【００４５】
請求項３、６、７の発明は、請求項１、２の効果に加え、２台以上のカメラのみでなく、１台以上のカメラと投光装置を用いた能動的なステレオ法を利用した手法もしくは市販機器も使うことができるため、汎用性を高めるとともに、応用先を広げることができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態のインタフェース装置の構成図である。
【図２】第１の実施形態のインタフェース装置の処理の流れを示すフローチャートである。
【図３】基準画像のイメージを表す図である。
【図４】図３の基準画像上のひとつのライン上に沿って得られる距離情報のイメージを表す図である。
【図５】距離情報をもとに生成したボクセル空間のイメージを表す図である。
【図６】ボクセル空間の階層分割とフラグの立て方のイメージを表す図である。
【図７】生成したボクセル空間をオクトリー構造として保持した場合のイメージを表す図である。
【図８】指示方向から情報を登録したボクセルを探すアルゴリズムのイメージを表す図である。
【図９】基準画像上に初期ボクセル空間を重畳表示したイメージを表す図である。
【図１０】空間情報登録部において、ひとつのボクセルを強調表示した状態から、隣接ボクセルを強調表示させたり、下の階層のボクセルを強調表示させたりする場合のイメージを表す図である。
【図１１】図３の基準画像を左右反転させた画像のイメージを表す図である。
【図１２】第１の実施形態における表示結果に対する構成画像のイメージを示す図である。
【図１３】本発明の第２の実施形態のインタフェース装置の構成図である。
【図１４】第２の実施形態のインタフェース装置の処理の流れを示すフローチャートである。
【図１５】ボクセルとボクセル空間との関係、それの階層分割の概念を表す図である。
【符号の説明】
１０ディスプレイ
１１，１１_１，１１_２画像入力部
１２距離情報生成部
１３ボクセル空間生成部
１４始終点２次元座標算出部
１５始終点３次元座標算出部
１６３次元指示位置検出部
１７空間情報登録部
１８表示情報反転部
１９情報表示部
２０空間情報データ
２１〜２８ステップ
３１投光部
Ｉ，Ｉ_１，Ｉ_２入力画像
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an interface device that detects an object in a photographed space that a user points to with a body part (a finger, an arm, or the like).
[0002]
[Prior art]
Interface methods and devices between computers and humans that are based on human motion have so far included the approaches listed below.
[0003]
As a first conventional method, there are devices in which sensors capable of measuring hand or finger motion are attached to the body and the user's movement is detected from the sensor information. Commercially available examples include devices using magnetic sensors (such as "MotionStar" from ASCENSION) and devices using mechanical sensors (such as "Gypsy" from Spice and "CyberGlove" from Immersion). There is also the method described in Non-Patent Document 1, in which the user wears a glove fitted with an acceleration sensor or the like and the user's motion is recognized.
[0004]
As a second conventional method, there is the method described in Non-Patent Document 2. This method uses constraints on the arm and the user's actual arm length to extract the arm region from the input image of a single camera, and estimates the three-dimensional change in the arm's movement from the temporal change in the length of that region.
[0005]
As a third conventional method, there is the method described in Non-Patent Document 3. In this method, a coordinate position determined inside the body (the "virtual projection center") is connected to the coordinate position of the detected fingertip, and the point at which the extended pointing axis intersects the screen (the screen of the display device showing the input image) is taken as the cursor position (pointed position). Because the virtual projection center is obtained from the intersection of straight lines extended from the four corners of the screen to the fingertip, only positions on the screen can be pointed to. The fingertip position is extracted by using two cameras, one of which is installed so as to shoot from above, and detecting the object closest to the screen.
[0006]
As a fourth conventional method, there is the method described in Non-Patent Document 4. This method uses distance information generated with a multi-view stereo camera to detect the object closest to the screen as the fingertip, uses color information and distance information to detect the position between the eyebrows (the eyes), and takes the point at which the extension of the line connecting the two intersects the screen as the cursor position (pointed position).
[0007]
[Non-Patent Document 1]
Tsukada et al., "Ubi-Finger: Prototype of Mobile Oriented Gesture Input Device", Workshop on Interactive and Software (WISS2001), pp. 119-124, 2001
[Non-Patent Document 2]
Abe et al., "Three-dimensional tracking of arm movements using optical flow and color information", Image Recognition and Understanding Symposium (MIRU2002), pp. I267-I272, 2002
[Non-Patent Document 3]
Fukumoto et al., "Non-contact hand reader using moving image processing", Proceedings of the 7th Human Interface Symposium, pp. 427-432, 1991
[Non-Patent Document 4]
Kinji et al., "Method for specifying cursor position in pointing pointer", IEICE Technical Committee on Image Engineering, 2002.1
[0008]
[Problems to be solved by the invention]
However, the conventional methods described above have the following problems.
[0009]
The first conventional method can recognize the movement of a hand or fingers, but a sensor of some kind must always be worn on the part to be recognized, so it lacks convenience as a practical interface device.
[0010]
The second conventional method can recognize arm movement without anything being worn on the body, but because it uses information from only one camera, depth information cannot be obtained directly, and the user's three-dimensional arm movement cannot be extracted accurately.
[0011]
The third and fourth conventional methods are pointing methods that recognize three-dimensional motion without the user wearing or touching anything and allow a position on the screen to be indicated. However, what can be pointed to is limited to the screen, and real objects or positions in any other direction cannot be pointed to.
[0012]
An object of the present invention is to provide an interface method, device, and program that solve the problems of these conventional methods, namely the lack of convenience of wearable devices and the poor depth accuracy of single-camera methods, and that in addition make it simple to associate positions with the real space, allow the user to register the position and information of real objects visually, and allow information on real objects at positions not directly visible from the reference camera to be registered.
[0013]
[Means for Solving the Problems]
In order to achieve the above object, the interface device of the present invention comprises:
means for inputting images of a real space captured using a plurality of video cameras including a reference camera;
means for generating distance information from the reference camera to a photographed object by a stereo method using the captured images;
means for generating a voxel space using the distance information from the reference image, which is the image captured by the reference camera;
means for detecting two predetermined parts of the user's body as a start point and an end point on the image captured by the reference camera, and calculating their two-dimensional coordinates on that image;
means for calculating the three-dimensional coordinates of the start point and the end point in the real space from the distance information and the two-dimensional coordinates of the start point and the end point on the image, and calculating information on the direction in the real space in which the user is pointing;
means for registering, at the corresponding three-dimensional position in the generated voxel space, information on a real object existing in the real space, that is, three-dimensional position information covering part or all of the object and its additional information;
means for comparing the direction information in the real space indicated by the user with the registered three-dimensional position information of objects, and detecting information on the intersection between the extension of the direction in which the user points and a registered object, which is the three-dimensional designated position information indicated by the operator; and
means for superimposing and displaying the generated voxel space, the registration process and its results, the three-dimensional designated direction information, and the information on the designated object on the captured image.
[0014]
Here, the main terms will be described.
1. Voxel space
A set of "voxels", which are unit three-dimensional spaces with a cubic shape. FIG. 15 shows the relationship between voxels and a voxel space, and an illustration of its hierarchical division.
2. Stereo method
A method that uses the principle of stereo vision (triangulation) to measure, from two or more input images including the image from the reference camera, the distance to a photographed object as seen from the reference camera.
3. Reference camera
The camera that serves as the reference for generating distance information when the stereo method is executed. An image captured by the reference camera is called a reference image.
[0015]
According to the present invention, the interface requires nothing to be worn, obtains depth information accurately, and realizes three-dimensional pointing into the real space. In addition, by generating a voxel space as seen from the reference camera and displaying it visually superimposed on the real image, real object information can be registered easily, and information on real objects at positions not visible in the reference image can also be registered.
[0016]
In an embodiment of the present invention, the interface device has means for generating an image in which the reference image is inverted left to right. The voxel space is also generated in a left-right inverted state to match the inverted image, and the various display information in the superimposed display means is also superimposed on the inverted image.
[0017]
Therefore, in addition to the above advantages, it is possible to provide an interface that is more direct and intuitive, such as performing a pointing operation while looking at a mirror in which a self-image is shown.
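The left-right inversion described above can be illustrated with a minimal sketch. The representation of an image as a list of pixel rows and the function names `mirror_image` and `mirror_x` are assumptions made only for this illustration; they do not appear in the specification. The point is that the same column mapping applied to the image must also be applied to any overlaid coordinates (for example, voxel indices along the x axis) so that the overlay stays aligned with the mirrored image.

```python
def mirror_image(image):
    """Return a left-right flipped copy of an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def mirror_x(x, width):
    """Map a column index x to its column index in the mirrored image.

    The same mapping can be applied to the x index of a voxel or of a
    pointed position so that overlays match the mirrored image.
    """
    return width - 1 - x


# Hypothetical 2 x 3 image used only for illustration.
image = [[1, 2, 3],
         [4, 5, 6]]
flipped = mirror_image(image)
```

A pixel at column `x` of the original row appears at column `mirror_x(x, width)` of the flipped row, which is what makes the display behave like a mirror.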
[0018]
In another embodiment of the present invention, the interface device has, in place of the above means for generating distance information, means for generating distance information by an active stereo method using a video camera and a light projecting device.
[0019]
Therefore, similarly, the above-mentioned problem can be solved, and an interface with high convenience and simplicity regarding object information registration can be provided.
[0020]
BEST MODE FOR CARRYING OUT THE INVENTION
Next, embodiments of the present invention will be described with reference to the drawings.
[0021]
[First Embodiment]
FIG. 1 is a configuration diagram of the interface device according to the first embodiment of the present invention, and FIG. 2 is a flowchart showing the flow of the processing.
[0022]
The interface device of the present embodiment takes images captured by a plurality of cameras as input images and detects a position (or an object) in the real space that the user points to with a body part (a finger, an arm, or the like). It can recognize the pointed position in three-dimensional space based on the user's direct and intuitive three-dimensional pointing motion, and it allows the user to perform the interface operation while viewing his or her own image.
[0023]
This interface device consists of image input units 11_1 and 11_2, a distance information generation unit 12, a voxel space generation unit 13, a start-end point two-dimensional coordinate calculation unit 14, a start-end point three-dimensional coordinate calculation unit 15, a three-dimensional designated position detection unit 16, a space information registration unit 17, an inversion information generation unit 18, an information display unit 19, and space information data 20.
[0024]
Two video cameras (or three or more) are used as the image input units 11_1 and 11_2, as shown in FIG. 1. The cameras may be commonly used video cameras or CCD cameras, and may be monochrome or color. However, color cameras are required when the color-based method described later is used.
[0025]
The distance information generation unit 12 generates distance information from the reference camera to the photographed object from two or more input images including the reference image (an illustration of the reference image is shown in FIG. 3), using image processing based on the stereo method (step 21). One example of a specific way to generate the distance information is to use a commercially available product such as Digiclops (three-camera type) or Bumblebee (two-camera type) from Point Grey Research. These are image input devices with three or two built-in cameras, respectively, and can output distance information. Since stereo methods are common in the image processing field (with many published papers), it is also possible to build one's own implementation using any two or more cameras. FIG. 4 illustrates the distance information obtained along one pixel line in the x-axis direction on the reference image of FIG. 3. In the figure, the distance axis represents the distance from the reference camera: a larger value means that the point is farther from the reference camera.
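As a rough illustration of how a stereo method turns image measurements into distance, the following sketch uses the standard triangulation relation for a rectified, parallel two-camera pair, Z = f·B/d. This camera arrangement, the parameter names (`focal_px`, `baseline_m`), and the numeric values are assumptions for illustration only; the specification does not prescribe a particular stereo algorithm or product interface.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Distance from the reference camera for one pixel, by triangulation.

    Assumes a rectified, parallel stereo pair: focal length in pixels,
    baseline in meters, disparity in pixels. A non-positive disparity is
    treated as "no match / infinitely far".
    """
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px


def depth_profile(disparities, focal_px, baseline_m):
    """Distance values along one pixel line, comparable to the graph of FIG. 4."""
    return [depth_from_disparity(d, focal_px, baseline_m) for d in disparities]


# Hypothetical disparities along one image row.
profile = depth_profile([50, 50, 25, 25, 0], focal_px=500, baseline_m=0.1)
```

Smaller disparities give larger distances, so the profile rises where the scene recedes from the reference camera.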
[0026]
The voxel space generation unit 13 creates the voxel space, dividing it hierarchically, in accordance with the distance information obtained by the distance information generation unit 12 (step 22). For example, in FIG. 5, an initial voxel space consisting of 4 × 3 × 3 voxels is generated for the obtained distance information. The initial voxel space is divided hierarchically down to a specified level, and each voxel at each level is given a flag, based on the distance information, indicating whether a real object can exist there. Hereinafter, to simplify the explanation, the way flags are set on voxels and the hierarchical division are described in two dimensions using the graph (rectangular graph) of distance information in FIG. 4. FIG. 6 shows an example of the division. Each voxel is given exactly one of three flags: white, gray, or black. As shown in FIG. 6, the flag of a voxel is gray if the graph representing the distance information intersects the voxel, black if the graph passes completely above it, and white if the graph passes completely below it. In other words, whether the distance graph intersects each voxel is determined, and the flag is decided according to the result. Only the voxels judged to be gray are further subdivided into four equal parts as shown in FIG. 6 (three-dimensional voxels are subdivided into eight equal parts, as shown in FIG. 15), and flags are assigned again. This is repeated until the voxels reach a size given in advance or a predetermined number of divisions is reached. With this recursive division, the voxels that intersect the distance graph are divided recursively, and the remaining regions, which carry no meaning in terms of distance, are represented by undivided voxels with white or black flags. By holding the voxel space generated in this way in an octree structure as shown in FIG. 7, the relationships between a voxel and its parent and child voxels can be retrieved efficiently and quickly. There is also a fourth kind of voxel flag, the registration flag, which is described later.
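The two-dimensional flagging and recursive division described for FIG. 6 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the distance graph is a list `depth` with one distance value per pixel column, each cell is a square covering `size` columns and `size` distance units, and the names `classify` and `build_cells` are hypothetical. In three dimensions each gray voxel would be split into eight children instead of four.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"


def classify(depth, x0, d0, size):
    """Flag the cell covering columns x0..x0+size-1 and distances d0..d0+size."""
    column_depths = depth[x0:x0 + size]
    if all(d >= d0 + size for d in column_depths):
        return BLACK  # the distance graph passes completely above the cell
    if all(d <= d0 for d in column_depths):
        return WHITE  # the distance graph passes completely below the cell
    return GRAY       # the distance graph intersects the cell


def build_cells(depth, x0, d0, size, min_size):
    """Recursively subdivide gray cells into four until min_size is reached."""
    flag = classify(depth, x0, d0, size)
    node = {"x0": x0, "d0": d0, "size": size, "flag": flag, "children": []}
    if flag == GRAY and size > min_size:
        half = size // 2
        for dx in (0, half):
            for dd in (0, half):
                node["children"].append(
                    build_cells(depth, x0 + dx, d0 + dd, half, min_size))
    return node


# Hypothetical rectangular distance graph over four pixel columns.
root = build_cells([1, 1, 3, 3], x0=0, d0=0, size=4, min_size=2)
child_flags = [c["flag"] for c in root["children"]]
```

Only gray cells gain children, so large white or black regions stay as single undivided cells, which is what keeps the resulting tree compact.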
[0027]
The start-end point two-dimensional coordinate calculation unit 14 uses the reference image, among the input images used by the distance information generation unit 12, to calculate the two-dimensional on-screen coordinates of two predetermined parts of the user's body (the start point and the end point) (step 23). For example, the user's shoulder position may be taken as the start point and the hand position as the end point. In this case, the direction toward the tip of the hand of the extended arm becomes the three-dimensional pointing direction described later. A specific method of detecting these on the screen when the right shoulder and right hand are the start point and end point is described below.
[0028]
One method of detecting the position of the right hand by image processing, when the input images are color images, is to extract skin color regions using the color information in the image (the range of values skin color can take is specified as a range of color values with an arbitrary margin). From the multiple skin color regions obtained, candidate skin color regions for the hand are selected using constraints such as size and position (for example, specifying the range of region areas that can be expected from the size of a hand, or excluding places on the screen where a hand is unlikely to be, such as near the ceiling or the floor; distance information may also be used). Further, using the properties that 1) the user can normally be assumed to be wearing clothes, so the skin color regions most likely to remain as candidates are the two hands and the face, and 2) the region with the largest area can be assumed to be the face, the skin color regions with the second and third largest areas are taken as hand candidates. The right hand is the region to the right of the face, and its centroid is taken as the right-hand position. To obtain the left-hand position, left and right are simply swapped, and for both hands the two cases are combined.
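The selection rule in this paragraph (largest region is the face, the next two are hand candidates, and the right hand lies to the right of the face) can be sketched as below. The region format (a dict with `area` and `centroid`), the function name `find_face_and_right_hand`, the `min_area` threshold, and the convention that "right" means a larger image x coordinate are all assumptions made for this illustration; skin color extraction itself is assumed to have been done already.

```python
def find_face_and_right_hand(regions, min_area=0):
    """Pick the face and right-hand centroids from skin color regions.

    regions: list of dicts with "area" (pixels) and "centroid" (x, y).
    Returns (face_centroid, right_hand_centroid); the second item is None
    if no hand candidate lies to the right of the face.
    Assumes "right" means larger x in the image used for detection.
    """
    candidates = sorted(
        (r for r in regions if r["area"] >= min_area),
        key=lambda r: r["area"],
        reverse=True,
    )
    if not candidates:
        return None, None
    face = candidates[0]      # largest skin color region
    hands = candidates[1:3]   # second and third largest regions
    right = [h for h in hands if h["centroid"][0] > face["centroid"][0]]
    if not right:
        return face["centroid"], None
    return face["centroid"], right[0]["centroid"]


# Hypothetical regions: a face, two hands, and a small noise blob.
regions = [
    {"area": 300, "centroid": (160, 120)},
    {"area": 900, "centroid": (100, 50)},
    {"area": 280, "centroid": (40, 125)},
    {"area": 20, "centroid": (5, 5)},
]
face_xy, right_hand_xy = find_face_and_right_hand(regions, min_area=100)
```

Swapping the comparison to `<` would select the left hand instead, mirroring the remark that left and right are simply exchanged.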
[0029]
One method of detecting the position of the right shoulder by image processing is to extract the position of the face by the above means and then calculate the position of the shoulder. Specifically, since the skin color region with the largest area in the result of the skin color extraction above is likely to be the face, that region is judged to be the face and its centroid is obtained. Next, the right shoulder can be assumed to lie a certain distance below and a certain distance to the right of the face centroid, so these offsets are determined in advance (since there are individual differences, the values may be changed for each user) and the right shoulder position is calculated from the face centroid. In this way, the two-dimensional on-screen coordinates of the start point and the end point can be obtained. Although the shoulder position is used as the start point here, the face centroid obtained above may be used as the start point as is. In that case, the extension of the line connecting the face position and the hand position becomes the user's pointing direction.
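The shoulder estimate amounts to adding fixed offsets to the face centroid. The following minimal sketch assumes image coordinates in which x grows to the right and y grows downward; the function name and the default offsets of 40 and 60 pixels are hypothetical values chosen only for illustration and would in practice be set per user.

```python
def estimate_right_shoulder(face_centroid, offset_right=40, offset_down=60):
    """Estimate the right-shoulder pixel position from the face centroid.

    Assumes image coordinates with x to the right and y downward.
    The offsets are illustrative defaults, not values from the specification.
    """
    x, y = face_centroid
    return (x + offset_right, y + offset_down)


shoulder_xy = estimate_right_shoulder((100, 50))
```

Passing zero for both offsets corresponds to the variant in which the face centroid itself is used as the start point.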
[0030]
The start-end point three-dimensional coordinate calculation unit 15 obtains the three-dimensional coordinate values of the start point and the end point from the generated distance information and the two-dimensional coordinates of the start point and the end point (step 24). As a specific method, for example, the distance value corresponding to the two-dimensional coordinates of the start point on the reference image is taken as the distance value of the start point. The same applies to the end point. The conversion between the two-dimensional image coordinate system and the three-dimensional coordinate system of the real space can generally be calculated easily in advance, so the three-dimensional coordinate values of the start point and the end point in three-dimensional space can be obtained from their two-dimensional on-screen coordinate values and their respective distance values. Further, the user's pointing direction can be obtained by finding the three-dimensional straight line connecting the two three-dimensional coordinate values of the start point and the end point.
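One common form of the conversion mentioned here is back-projection through a pinhole camera model. The sketch below assumes that model, treats the distance value as depth along the optical axis, and uses hypothetical intrinsic parameters (`fx`, `fy`, `cx`, `cy`); the specification itself only states that the conversion can be calculated in advance and does not fix a camera model.

```python
import math


def pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a depth value into camera coordinates.

    Assumes a pinhole camera: focal lengths fx, fy and principal point
    (cx, cy) in pixels, depth measured along the optical axis.
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def pointing_ray(start, end):
    """Return (origin, unit direction) of the line from start through end."""
    delta = tuple(e - s for s, e in zip(start, end))
    norm = math.sqrt(sum(c * c for c in delta))
    if norm == 0:
        raise ValueError("start and end points coincide")
    return start, tuple(c / norm for c in delta)


# Hypothetical intrinsics and measurements for shoulder (start) and hand (end).
start_3d = pixel_to_3d(320, 240, 2.0, fx=500, fy=500, cx=320, cy=240)
end_3d = pixel_to_3d(420, 240, 1.5, fx=500, fy=500, cx=320, cy=240)
origin, direction = pointing_ray(start_3d, end_3d)
```

The returned direction points from the start point toward the end point, so extending it beyond the hand gives the line along which pointed objects are searched.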
[0031]
The three-dimensional designated position detecting unit 16 detects the three-dimensional position in the actual shooting space pointed to by the user (step 25). As a specific method, first, the three-dimensional line connecting the start point and the end point obtained by the start-end point three-dimensional coordinate calculation unit 15 is extended in the direction of the hand (end point). Then, using the voxel space, it is detected whether or not this extension line intersects a voxel in which three-dimensional position information of an object or the like in the space has been registered in advance. Such a voxel carries a registration flag, as described later, and can be easily identified. Here again, the method of finding the intersected voxel is described in two dimensions, based on the example of distance information in FIG. 4. Assume that the black voxel in FIG. 8 is a voxel in which object information is registered (a registered voxel) and that the user is pointing in its direction. To determine whether the extension line in the designated direction intersects that voxel, it is first calculated, starting from the voxels of the upper layer, whether the extension line passes through a voxel whose registration flag is set. When such a voxel is found, it is then calculated whether the extension line also passes through a flagged voxel in the layer below it. This passage determination is repeated until the voxel in which the information is actually registered is found. Whether the extension line passes through a voxel can be determined very easily because each voxel is a regular cube. Moreover, because the determination proceeds from the upper layer toward the lower layer, large regions in which no information is registered can be excluded at an early stage, and the three-dimensional designated position at which registered information exists can be found very quickly.
The information on the objects and the like in the space is explained in the description of the space information registration unit 17. The obtained three-dimensional designated position information is accumulated as spatial information data 20 and displayed on the display 10 by the information display unit 19.
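The top-down passage determination described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the `Voxel` class, its field names, and the use of a slab test for the line-cube intersection are all choices made for this example. The ray is assumed to start at the end point (hand) and to run in the direction from the start point to the end point.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Voxel:
    lo: Vec3                      # minimum corner of the cube
    size: float                   # edge length
    registered: bool = False      # registration flag (also set on ancestors)
    info: Optional[str] = None    # object information, only where registered
    children: List["Voxel"] = field(default_factory=list)


def ray_entry(origin: Vec3, direction: Vec3, v: Voxel) -> Optional[float]:
    """Slab test: ray parameter t >= 0 where the ray enters the cube,
    or None if the ray misses it."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo in zip(origin, direction, v.lo):
        hi = lo + v.size
        if d == 0.0:
            if o < lo or o > hi:
                return None       # parallel to this slab and outside it
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near


def find_pointed_voxel(origin: Vec3, direction: Vec3,
                       root: Voxel) -> Optional[Voxel]:
    """Descend from the upper layer, visiting only flagged voxels that the
    ray passes through, and return the nearest voxel holding information."""
    best: Optional[Tuple[float, Voxel]] = None
    stack = [root]
    while stack:
        v = stack.pop()
        if not v.registered:
            continue              # unregistered regions are pruned early
        t = ray_entry(origin, direction, v)
        if t is None:
            continue
        if v.info is not None and (best is None or t < best[0]):
            best = (t, v)
        stack.extend(v.children)
    return best[1] if best else None
```

Because an unflagged upper-layer voxel is skipped together with its whole subtree, the cost depends mainly on the number of flagged voxels along the pointing line rather than on the total number of voxels.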
[0032]
The space information registration unit 17 registers, in the space information data 20, information on objects and the like in the actual shooting space that the user may point to (step 26). Objects in the real space include, for example, when the user is in a room, home appliances and other items in the room (TV, air conditioner, computer, clock, window, shelf, chair, desk, drawer, documents, audio equipment, lighting equipment, etc.) and any other objects such as the walls, floor, ceiling, and windows of the room itself. Here, the real object information is registered three-dimensionally by interactive processing while the voxel space is superimposed on the reference image and displayed using the information display unit 19. FIG. 9 shows an image in which the initial voxel space is superimposed on the reference image. Looking at this display, the user selects the voxel indicating the position where information is to be registered and registers the information in it. For this purpose, as shown in FIG. 10, only one voxel of interest may be displayed, with a mouse movement operation displaying an adjacent voxel and a mouse click operation displaying a voxel of the lower layer, so that the target voxel is narrowed down step by step. Note that the display method and the voxel selection and movement methods are not limited to this method. For example, the pointing operation according to the present invention may be performed with the voxel space, instead of the real space, as the pointing target. It is also conceivable to register the same information in a plurality of voxels. In addition, if an operation for forcibly dividing voxels with the white flag or the black flag is added to this interactive processing, information can also be registered in a space where there is no real object or in a place hidden from the reference image.
Finally, a registration flag is set as a fourth flag on the voxel in which the information has been registered and on all voxels, from the highest layer to the lowest layer, to which that voxel belongs.
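The setting of the registration flag on a voxel and on every upper-layer voxel that contains it can be sketched as follows. The `VoxelNode` class with a parent reference and the function name `register_info` are hypothetical and serve only to illustrate the flag propagation.

```python
from typing import Optional


class VoxelNode:
    """Minimal voxel record for this sketch: parent link, flag, and info."""

    def __init__(self, name: str, parent: Optional["VoxelNode"] = None):
        self.name = name
        self.parent = parent
        self.registered = False          # registration flag
        self.info: Optional[str] = None  # registered object information


def register_info(voxel: VoxelNode, info: str) -> None:
    """Store information in a voxel and set the registration flag on it
    and on all of its ancestors up to the highest layer."""
    voxel.info = info
    node: Optional[VoxelNode] = voxel
    while node is not None:
        node.registered = True
        node = node.parent
```

Setting the flag on the ancestors is what allows a search from the upper layer to discard unflagged regions without visiting their lower layers.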
[0033]
It is also conceivable that such information (coordinate information of the three-dimensional position and other information relating to the object) is registered and stored in advance in the spatial information data 20 using real coordinates and is automatically assigned to voxels. In addition, instead of using three-dimensional position coordinates fixed in advance, a sensor capable of recognizing its position (a commercially available magnetic sensor, ultrasonic sensor, infrared tag, wireless tag, or the like) may be attached to each target real object. The position of each object can then be recognized in real time, so that the object information is generated from the three-dimensional position information obtained in this way and the three-dimensional position coordinates and other information of the object are constantly updated. In this case, even if an object is moved, its three-dimensional position information and the like can be updated in real time.
[0034]
The display information inverting unit 18 inverts the reference image horizontally, and also inverts the voxel space and the three-dimensional position information (step 27). FIG. 11 shows a reference image and the image obtained by inverting it. The result is an image like the one a user sees when performing a pointing operation while looking into a mirror. The left-right inversion of the reference image can be executed in real time on the input image taken into the computer by commercially available general-purpose image processing software (e.g., HALCON). Alternatively, it can be realized by a commercially available device that takes an input image and generates an inverted image in real time (for example, the screen left/right reversing device UPI-100LRF of FOR-A Co., Ltd., or the camera-integrated Sony EVI-D100). Further, an embodiment without the display information inverting unit 18 is also conceivable. In that case, the image and coordinates are displayed as they are, without inversion.
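The left-right inversion can be illustrated with a short sketch. It assumes an image held as a list of pixel rows and integer pixel coordinates; the function names are hypothetical, and a real system would use an image processing library or a dedicated device as described above.

```python
from typing import List, Tuple


def flip_image(image: List[List[int]]) -> List[List[int]]:
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def flip_pixel(x: int, y: int, width: int) -> Tuple[int, int]:
    """Map a pixel coordinate of the original image into the mirrored image.

    The same column mapping can be applied to any overlay drawn on the
    image, such as projected voxel outlines or the designated position.
    """
    return (width - 1 - x, y)
```

Applying the same mapping to the image and to every overlay coordinate keeps the voxel space and the designated position aligned with the mirrored reference image.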
[0035]
For example, during the pointing operation, the information display unit 19 superimposes and displays on the display 10: 1) the left-right inverted reference image, 2) the likewise inverted voxel space, and 3) CG representing the three-dimensional designated direction and position (step 28). FIG. 12 shows an example of each image in that case, and FIG. 1 shows the result of the superimposed display. The display 10 may be a general-purpose display used with a computer, as long as it can display a computer screen and a camera image. Note that the display method is not limited to this example. For example, the display of the voxel space may be limited to only the voxels of interest, or may be omitted entirely.
[0036]
[Second embodiment]
FIG. 13 is a configuration diagram of the interface device according to the second embodiment of the present invention, and FIG. 14 is a flowchart showing the flow of the processing.
[0037]
The interface device of the present embodiment differs from the first embodiment in that the distance information is generated by an active stereo method using a camera 11 and a light projecting device 31, instead of the passive stereo method that generates it from the input images of two or more cameras. It is also possible to combine the two methods.
[0038]
The passive stereo method of generating distance information from two or more images is, for example, a method of searching for corresponding points between the input images of two cameras with similar viewing directions and calculating the distance to each point from the magnitude of the displacement of its coordinate values (the parallax). The distance calculation uses the principle of triangulation. This method has the problem that corresponding points are difficult to find, so accurate distance information is difficult to obtain, but it has the advantage that it requires no active operation or apparatus such as light projection and is not readily affected by the imaging environment. For example, commercially available products such as the Digiclops (three-lens camera type) and Bumblebee (two-lens camera type) of Point Grey Research Inc. are available.
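The triangulation step can be illustrated for the simplest case. The following sketch assumes two rectified cameras with parallel optical axes, for which the standard relation is depth = focal length × baseline / disparity. The specification does not state this formula, and the function and parameter names are assumptions of this example.

```python
def depth_from_disparity(disparity_px: float, focal_px: float,
                         baseline_m: float) -> float:
    """Depth of a point from its parallax between two rectified cameras.

    disparity_px: horizontal displacement of the corresponding points (pixels)
    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centers (meters)
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px
```

Since depth is inversely proportional to disparity, an error of one pixel in the corresponding point search has a much larger effect on distant points than on near ones, which is one reason accurate distance information is hard to obtain with this method.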
[0039]
On the other hand, the active stereo method using a light projecting device is a method in which one of the two cameras is replaced with a light source that projects light, so that information serving as a clue for the corresponding point search is projected directly onto the object. Methods and products using various kinds of light, such as slit light, spot light, and variously changing pattern light, have been proposed or marketed. This method requires a complicated device for projecting light and has the problem of being affected by the shooting environment. However, since the corresponding point search can be performed stably, it has the advantage that a distance image can be obtained with high accuracy. For example, commercially available products such as Danae-R (a non-contact three-dimensional shape measurement range finder) manufactured by NEC Engineering are available.
[0040]
Since both of these stereo methods yield distance information, they are interchangeable. Therefore, instead of using only two or more cameras, an active stereo method using one or more cameras and a light projecting device can be used, which increases versatility and expands the range of applications.
[0041]
In addition to being realized by dedicated hardware, the present invention may be realized by recording a program for realizing its functions on a computer-readable recording medium and causing a computer system to read and execute the program recorded on the recording medium. The computer-readable recording medium refers to a recording medium such as a floppy disk, a magneto-optical disk, or a CD-ROM, or to a storage device such as a hard disk device built into a computer system. The computer-readable recording medium further includes a medium that holds the program dynamically for a short time (a transmission medium or transmission wave), as when the program is transmitted via the Internet, and a medium that holds the program for a certain period of time, such as the volatile memory inside the computer system that serves as the server in that case.
[0042]
[Effect of the invention]
As described above, the present invention has the following effects.
[0043]
According to the first, fourth, and seventh aspects of the present invention, the non-wearable interface improves user convenience. Further, since the voxel space using the distance information obtained by the stereo method is used, the registration of the real object information and the detection of the three-dimensional pointing position of the user can be efficiently performed. Furthermore, not only the position on the screen but also the position in the real space can be pointed as a three-dimensional pointing instruction destination, and the application destination can be expanded.
[0044]
According to the second, fifth and seventh aspects of the invention, in addition to the effect of the first aspect, since the interface operation using the metaphor of the mirror can be performed, the convenience for the user can be further improved.
[0045]
According to the inventions of claims 3, 6, and 7, in addition to the effects of the first and second inventions, not only a method using two or more cameras but also an active stereo method using one or more cameras and a light projecting device, or a commercially available device based on it, can be used. Versatility can therefore be improved and the range of applications expanded.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of an interface device according to a first embodiment of the present invention.
FIG. 2 is a flowchart illustrating a flow of processing of the interface device according to the first embodiment.
FIG. 3 is a diagram illustrating an image of a reference image.
FIG. 4 is a diagram showing an image of distance information obtained along one line on the reference image of FIG. 3.
FIG. 5 is a diagram illustrating an image of a voxel space generated based on distance information.
FIG. 6 is a diagram illustrating an image of hierarchical division of a voxel space and how to set a flag.
FIG. 7 is a diagram illustrating an image when the generated voxel space is held as an octree structure.
FIG. 8 is a diagram illustrating an image of an algorithm for searching for a voxel in which information is registered from a designated direction.
FIG. 9 is a diagram illustrating an image in which an initial voxel space is superimposed and displayed on a reference image.
FIG. 10 is a diagram illustrating an image in a case where one voxel is highlighted in a space information registration unit, adjacent voxels are highlighted, and voxels in a lower hierarchy are highlighted.
FIG. 11 is a diagram showing an image obtained by inverting the reference image of FIG. 3 from side to side.
FIG. 12 is a diagram illustrating an image of a constituent image with respect to a display result in the first embodiment.
FIG. 13 is a configuration diagram of an interface device according to a second embodiment of the present invention.
FIG. 14 is a flowchart illustrating a processing flow of the interface device according to the second embodiment.
FIG. 15 is a diagram illustrating the relationship between voxels and voxel spaces and the concept of hierarchical division thereof.
[Explanation of symbols]
10 Display
11, 11₁, 11₂ Image input unit
12 Distance information generation unit
13 Voxel space generation unit
14 Start-end point two-dimensional coordinate calculation unit
15 Start-end point three-dimensional coordinate calculation unit
16 Three-dimensional designated position detection unit
17 Space information registration unit
18 Display information inverting unit
19 Information display unit
20 Spatial information data
21 to 28 Steps
31 Light projecting device
I, I₁, I₂ Input image

Claims

An interface method for photographing a user with a video camera and recognizing a position in the photographed space, or a real object, at which the user points using a body part, the method comprising:
a step of inputting images of a real space captured using a plurality of video cameras including a reference camera;
a step of generating distance information from the reference camera to a photographed object by a stereo method using the captured images;
a step of generating a voxel space using the distance information from a reference image, which is an image captured by the reference camera;
a step of detecting, on the image captured by the reference camera, two predetermined parts of the user's body as a start point and an end point, and calculating their two-dimensional coordinates on the image;
a step of calculating three-dimensional coordinates of the start point and the end point in the real space from the distance information and the two-dimensional coordinates of the start point and the end point on the image, and calculating direction information in the real space indicated by the user;
a step of registering, at an arbitrary three-dimensional position in the generated voxel space, information on a real object existing in the real space, namely three-dimensional position information covering part or all of the object and additional information thereof;
a step of collating the direction information in the real space indicated by the user with the registered three-dimensional position information of the object, and detecting three-dimensional designated position information indicated by the operator, namely information on the intersection between an extension line in the direction indicated by the user and the registered object; and
a step of superimposing and displaying the generated voxel space, the registration process and its result, the three-dimensional designated direction information, and the information on the designated object on the captured image.

The interface method according to claim 1, further comprising a step of generating an image obtained by horizontally inverting the reference image,
wherein the step of generating the voxel space generates the voxel space in a horizontally inverted state matching the inverted image, and the step of superimposing and displaying also superimposes the various display information on the inverted image.

The interface method according to claim 1 or 2, comprising, instead of the step of generating the distance information, a step of generating distance information by an active stereo method using a video camera and a light projecting device.

An interface device for photographing a user with a video camera and recognizing a position in the photographed space, or a real object, at which the user points using a body part, the device comprising:
means for inputting images of a real space captured using a plurality of video cameras including a reference camera;
means for generating distance information from the reference camera to a photographed object by a stereo method using the captured images;
means for generating a voxel space using the distance information from a reference image, which is an image captured by the reference camera;
means for detecting, on the image captured by the reference camera, two predetermined parts of the user's body as a start point and an end point, and calculating their two-dimensional coordinates on the image;
means for calculating three-dimensional coordinates of the start point and the end point in the real space from the distance information and the two-dimensional coordinates of the start point and the end point on the image, and calculating direction information in the real space indicated by the user;
means for registering, at an arbitrary three-dimensional position in the generated voxel space, information on a real object existing in the real space, namely three-dimensional position information covering part or all of the object and additional information thereof;
means for collating the direction information in the real space indicated by the user with the registered three-dimensional position information of the object, and detecting three-dimensional designated position information indicated by the operator, namely information on the intersection between an extension line in the direction indicated by the user and the registered object; and
means for superimposing and displaying the generated voxel space, the registration process and its result, the three-dimensional designated direction information, and the information on the designated object on the captured image.

The interface device according to claim 4, further comprising means for generating an image obtained by horizontally inverting the reference image,
wherein the means for generating the voxel space generates the voxel space in a horizontally inverted state matching the inverted image, and the means for superimposing and displaying also superimposes the various display information on the inverted image.

The interface device according to claim 4 or 5, comprising, instead of the means for generating the distance information, means for generating distance information by an active stereo method using a video camera and a light projecting device.

An interface program for causing a computer to execute the interface method according to any one of claims 1 to 3.