JP3668168B2 - Moving image processing device

Info

Publication number: JP3668168B2
Application number: JP2001280637A
Authority: JP
Inventors: Kyoichi Okamoto; Roberto Cipolla; Yoshinori Kuno
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-09-14
Filing date: 2001-09-14
Publication date: 2005-07-06
Anticipated expiration: 2020-07-06
Also published as: JP2002150292A

Description

【０００１】
【発明の属する技術分野】
この発明は、複数の画像情報を入力し、特徴点の位置の変化から対象物体の動きおよび構造を検出する動画像処理装置に関する。
【０００２】
【従来の技術】
複数の画像に撮影された物体の構造を検出する方式としては、既に幾つかの方式が提案されている。
【０００３】
例えば、S.Ullmanは、The interpretaion of visual motion.MIT Press Cambridge,USA,1919には、３枚以上の平行投影した画像であり、剛体である物体の同一平面上にない４点の対応が決まっている場合に、４点の構造および動きを完全に求める方法が紹介されている。
【０００４】
また、H.C.Longuest-HigginsはA computer algorithm for reconstructing a scene from two projections Nature,293:133-135,1981には、透視変換した２枚の画像上で８つの対応点がある場合に、構造および動きを検出する線形計算方式が開示されている。
【０００５】
他に、O.D.FaugerasとS.J.MaybankはMotion from point matches:multiplicity of solutions,IEEE Workshop on Motion 248-255 1989には、中心投影した２画像に５つの対応点があれば、それらの対応を満たす構造および動きは有限になることが記載されている。
【０００６】
また、特開平３−６７８０号には、２枚の画像上の対応点から、まず３次元の回転運動を求め、次に、その回転運動情報から対応点の一つを基準とする３次元の位置関係を求める方式が開示されている。
【０００７】
これらの方式は、すべて、物体の３次元座標とこの物体が中心投影で投影された画像上の座標との間に方程式を立て、その方程式を解いて答を求める方式である。
【０００８】
また、Jan J KoenderinkとAndrea J.van DoolrnのAffine structure from motion, Journal of Optiical Society of America pp. 377-385 vol.8, No.2 1991に開示されているように、物体の運動をアフィン(affine)変換（１次変換）で表わし、そこから物体の構造を検出する方式も計算されている。この方式では、動画像の２枚のフレームから物体のおよその構造を計算することができる。この方式により計算した物体の構造は、奥行き方向の情報がカメラから物体までの距離に比例する未知の係数を掛け合わせることによって得られる構造となる。
【０００９】
【発明が解決しようとする課題】
上述した中心投影の方程式を解く方法は、撮影対象となる物体が撮影装置に非常に近く、大きく写っている場合には、効率良く物体の運動および構造を計算することができるが、実際の処理画像で起きるように、画像中で撮影対象となる物体が写っている面積が小さい場合や、撮影装置から対象物体までの距離が遠い場合には、中心投影による画像の変形が小さくなり、その変形をもとに物体の運動を計算するため、計算結果が不安定になってしまうという欠点があった。例えば、視線方向に垂直な向きへの平行移動と、その移動方向に垂直な軸の周りの回転とを区別することが難しくなったり、それ以外にも、中心投影による効果が小さいと、深さ方向の曖昧性が発生し、近くにある彫りの浅い物体か、遠くにある彫りの深い物体かの判別が難しくなったり、観察者の近くで小さな物体が運動しているのか、遠くで大きな物体が運動しているのか判別が難しくなるようなことが起きた。
【００１０】
また、Koendoerinkの方法は、検出した物体の構造に未知の係数を含んでいるので、ここから物体の運動を計算することは難しかった。
【００１１】
本発明は、このような問題点を解決するためになされたものであり、観察者の動きによる画像の変形をアフィン変換（１次変換）で近似して表現し、かつ、ある特徴点の実際の運動によって移動した位置と周囲の特徴の運動によるアフィン変形によって移動した位置との差である仮想視差を計算し、仮想視差情報から物体の運動を直接計算することにより、中心投影の曖昧性に影響されることなく、精度良く物体の運動パラメータを算出する動画像処理装置を提供することを目的とする。
【００１２】
【課題を解決するための手段】
本発明は、画像に対応する画像情報を入力する画像入力手段と、前記画像から同一平面上の少なくとも３つの第１の特徴点と該同一平面上にない１つの第２の特徴点を抽出するため前記画像情報に対して特徴点抽出処理を行う特徴点抽出手段と、前記第２の特徴点と同じ座標にあって、前記第１の特徴点の移動速度のアフィン変換で決まる移動速度を持つ仮想点を求め、この仮想点と前記第２の特徴点との差に基づいて仮想視差を求め、この仮想視差の方向から並進運動方向を求める並進フロー計算部と、上記並進フロー計算部で計算した並進運動方向を用いて異なる運動をしている領域を分割する独立運動分割部とを具備する動画像処理装置を提供する。
【００１５】
【発明の実施の形態】
以下、本発明による実施例を図に基づいて説明する。
【００１６】
図１に示される数値表現された物体モデルの運動姿勢の指示に用いた簡単な一実施例によると、画像入力部１は特徴点抽出部２を介してポイティング情報生成部４に接続される。このポイティング情報生成部４には、アフィンフロー解析部３およびジェスチャパタンマッチング部６が接続される。ジェスチャパタンマッチング部６はジェスチャパタン記憶部７に接続され、更にポインタ表示部８とともにステータス切換部５に接続される。ポインタ表示部８は３Ｄ（３次元）モデル記憶部９、画像合成部１０およびポインタモデル記憶部１２に接続される。画像合成部１０は画像入力部１に接続されるとともに画像表示部１１に接続される。
【００１７】
画像入力部１は、運動している物体をテレビカメラなどで撮影することによって得られる時系列画像情報を入力し、これを特徴点抽出部２に転送する。この画像は運動している物体を撮影しやすいように画像表示部１１の前に座った人間を天井から撮影した画像、画像表示部１１のディスプレイの枠状のカメラで撮影した画像および中心投影の効果が大きく出ないように長い焦点距離で撮像範囲を狭くした複数のカメラを並べ、それらのカメラから入力した画像を繋ぎ合わせた画像などである。また、ここで、画像として入力する物体は、人間の手等の体の一部、もしくは、図２のように後の特徴点抽出部２にて処理しやすいように他の部分と容易に区別できるように、例えば一部に色を塗ったり、色のついた物をつけたりして特徴をつけた手袋などをした手などの体の一部、あるいは、人間が手に持って動かすことができ、画像処理で他の部分と区別できる特徴を持った器具などである。図３はそのような器具の一例で、同じ大きさの球を４つ、他の球と区別できるようにそれぞれ異なる色に塗り、３次元空間内で同一平面上にない位置関係で接続したものである。図４は別の器具の一例であり、箱の表面にＬＥＤ等の発光素子を埋め込んだ器具であり、内蔵する電池による電力で発光する。この器具を掴んだときに指で押させる位置にスイッチがあり、スイッチを押している間は発光を止めてその間には運動を指示することなく器具を動かすことができる。図２の手袋上の特徴、図４の発光素子などは、３次元的なあらゆる視線方向から見て、近傍にない特徴が４点以上見えるように、配置されている。
【００１８】
特徴点抽出部２は、画像入力部１から時系列画像情報を入力し、画像処理によって、他の領域と容易に弁別できて物体上の同一の点が投影されたと特定できる、複数の特徴点を抽出し、追跡し、その座標値をポインティング情報生成部４に出力する。撮影した物体が手などであれば、画像中の近傍の画素との明度差が大きいような点を特徴点として抽出する。図２や図３のように色を塗った部分を特徴とする場合は、どのような色を特徴とするか、色情報を記憶するためのメモリを用意しておき、そこに記憶された色情報と同じ色を持った領域を画像情報から抽出し、その重心座標をポインティング情報生成部４に出力する。また、この場合のように、色を持った領域の大きさが判る場合には、領域の大きさも補助情報として、ポインティング情報生成部４に出力する。このメモリに記憶させる色情報はあらかじめメモリに書き込んでおくか、この実施例による装置を起動した後、色情報を学習するための学習手段を起動し、入力した画像情報を画像表示部に表示し、ユーザに画像中のどの部分の色を特徴とするかをカーソルによって領域を選択させるか、あるいは入力した動画像情報とウインドウを重ねて画像表示部１１に表示し、ユーザに特徴となる色がウインドウ内に入るように手や器具を操作させ、キー入力などのタイミングでウインドウによって指定した領域の画像を取り込むなどして、特徴となる色を含んだ部分画像を得てそこから色を学習し、メモリに記憶する。色の学習は、例えば、Ｉ（明度）Ｈ（色相）Ｓ（彩度）の３成分で表現された画像であれば、次のような２次方程式による表現を用意し、指定した部分画像の画素値に最小２乗推定を行なってパラメータを推定するなどすれば、色の学習を行なうことができる。
【００１９】
Ｈ＝ｈ0＋ｈ1Ｉ＋ｈ2Ｉ2
Ｓ＝ｓ0＋ｓ1Ｉ＋ｓ2Ｉ2
図４の器具のように発光素子を使う場合は、適当な閾値を設けて画像を２値化し、閾値より明るい領域のそれぞれの重心を取って特徴の座標値を計算し、ポインティング情報生成部４に出力する。
【００２０】
ポインティング情報生成部４は特徴点抽出部２から特徴点座標の時系列データを入力し、そこからある２つの時点での複数の特徴点の座標を選んで、アフィンフロー解析部３で運動パラメータを解析し、その結果を用いてポインタを動かすために必要な情報を生成する。
【００２１】
このアフィンフロー解析部３で行なっている処理について以下に詳しく説明する。
【００２２】
アフィンフロー解析部３では、２枚の画像中での４つの特徴点の座標を入力し、４つの特徴点により構成される物体の画像を撮影したカメラ（観察者）の２枚の画像間での運動を計算する。この場合、図５のような中心投影で撮影している撮影系のモデルを考える。ここで、３次元空間上の座標（Ｘ，Ｙ，Ｚ）にある点が焦点距離ｆにある画像平面上の点（ｘ，ｙ）に投影されている。この状態で、観察者側が速度｛Ｕ₁，Ｕ₂，Ｕ₃｝で並進運動し、｛Ω₁，Ω₂，Ω₃｝で回転運動をしているとする。特徴点抽出部２から入力した特徴点（ｘ，ｙ）が視線方向に十分近いものとして、特徴点（ｘ，ｙ）の画像平面上での移動速度Ｖ（ｘ，ｙ）の成分を（ｕ，ｖ）で表す。
【００２３】
この速度成分を観察者側の運動パラメータで表現してみる。３次元座標（Ｘ，Ｙ，Ｚ）と速度パラメータとの関係は、
【数１】

であるので、移動後の座標Ｘ^１、Ｙ^１、Ｚ^１として
【００２４】
【数２】

が得られる。これを、３次元座標と画像平面上の点の投影関係、
【００２５】
【数３】

の微分、
【００２６】
【数４】

に代入すれば、
【００２７】
【数５】

を得る。
【００２８】
ｖについても同様に計算を行なって、
【数６】

【００２９】
と、表すことができる。この画像の移動速度は、並進運動に依存するシーンの情報（Ｚ）を含んだ成分と回転運動に依存する成分とに分けることができる。回転運動に依存する成分は、画素の位置によって変化するだけで対象物の場所や形には依存しないので、従来例でも述べたように、数式的に解く方法はあっても、中心投影による効果が出ないと、画像の上に現れる変化が小さくなるので、実際に回転運動のパラメータを求めることは難しい。そのため、回転運動による変化が並進運動に誤差となって加わり、並進運動の計算精度も悪くなる。その結果として、物体の形や運動を精度良く計算することは難しかった。
【００３０】
しかし、２つの特徴点が画像上の同じ場所に投影されていたと仮定し、その２つの特徴点の移動速度の差（Δu，Δv）、（以後、これを運動視差と呼ぶ）を考えると、運動視差の大きさは、
【数７】

【００３１】
である。但し、Ｚ_１，Ｚ_２は運動視差の計算に用いる２つの特徴点のＺ座標である。この運動視差は、物体までの距離と観察者の並進運動だけに依存し、観察者の回転運動には、依存しない。また、この式１２から、
【数８】

【００３２】
のように、ｘ，ｙ，Ｕ₃が十分に小さければ、運動視差から並進運動の方向が求められることが判る。Ｕ₃が小さくない場合は、座標値（ｘ，ｙ）の異なる複数の点の運動視差をこの式に代入して解けば，Ｕ₁とＵ₂の比から並進運動の方向を求めることができる。
【００３３】
一方、観察者の運動が十分滑らかで、また、撮影している物体の表面も十分滑らかであれば、式（１１）の画像の速度場は、ある小さな領域の中で、線形方程式で近似することができる。つまり、画像上のある座標（ｘ，ｙ）の近傍での画像速度場は、アフィン変換（一次変換）を使って、
【数９】

【００３４】
で表すことができる。このうち、０（ｘ²，ｘｙ，ｙ²）は２次の非線形成分を表すが、この部分は十分小さいものと考えて以後の計算では無視する。
【００３５】
最初の項の［ｕ0，ｙ0］は、画像の平行移動を表し、２番目の項の２×２テンソルは、画像の形の変形を表す。第２項の添字は、添字で示したパラメータで偏微分したことを示す。この第２項の２×２テンソルは、図６にあるような、いくつかの幾何学的に意味のある成分に分割される。向きの変化を示す画像平面上での回転（Curl）curlＶ、スケールの変化（Divergence）を示す等方的変化divＶ、画像の変形（Deformation）（面積を一定に保ったまま、ある軸の方向に引き延ばし、それと垂直な軸の方向に縮める変化）の大きさを示すdefＶ、および画像変形の拡張する方向を示す変形の主軸μなどである。これらの特徴量は、ある座標（ｘ，ｙ）での画像の速度をＶ（ｘ，ｙ）としたときに、
【数１０】

【００３６】
で表される。これらの特徴量のうち、divＶ，curlＶ，defＶの値は、画像中で座標系をどのように取っても、取り方による変化のない不変特徴量である。変形の主軸μは、座標系の軸の向きだけに依存する特徴量である。
【００３７】
図７のように、ある画像平面上にある特徴点Ｐと、その近傍にある３つの特徴点を考える。既に示したように、十分小さい領域の中では画像平面上の速度は、３つの特徴点から求められるアフィン変換（一次変換）で近似できる。点Ｐと同じ座標にあって、他の３点の移動速度のアフィン変換で決まる移動速度を持つ仮想的な点Ｐ′を考える。実際の点Ｐと仮想的な点Ｐ′の運動視差は、点Ｐと同じ座標に投影されるが、観察者までの距離の違う点との運動の違いである。この点ＰとＰ′との運動視差を以後、仮想視差と呼ぶことにする。運動視差は、式（１２）に示したように、観察者の回転運動には影響されず、並進運動と距離だけに依存するので、ここから、安定した運動パラメータと、３次元構造情報を計算することができる。
【００３８】
この仮想視差からどのような情報が得られるか、画像の移動速度（式１１）を、
アフイン変換式（１４）にあてはめて求める。撮影している物体は観察者から十分遠いと仮定しているので、観察者から物体までの距離に比べて物体表面の３次元座標間の距離の変化は非常は小さい。そこで、ｆが１のときの画像中心から物体までの距離をλとし、物体表面までの距離の変化を
【数１１】

【００３９】
で表すことにより、Ｚを深さを表す変数λで正規化しておく。これによって、観察者の並進運動の成分と、アフイン変換の各パラメータは、
【００４０】
【数１２】

のように表される。
【００４１】
この結果を式（１５）から（１８）までの不変特徴量を表す式に代入すれれば、
【数１３】

【００４２】
となる。式を見てわかるように、これらのパラメータは、観察者の動き、深さ、表面の向きに依存している。これを、２つのベクトルＡとＦを使って、座標系に依存しないように書き換えることができる。Ａは、下の式のような、深さλで正規化された、画像平面に平行な並進速度ベクトルである。Ｕは並進運動ベクトル、Ｑは視線方向の単位ベクトルである。
【００４３】
【数１４】

【００４４】
Ｆは、やはり深さλで正規化した、物体表面の最大勾配の方向を示す２次元ベクトルである。
【００４５】
【数１５】

【００４６】
このＦは、図８にあるように、大きさが、物体表面の傾斜σのtangent（視線方向と物体表面の法線のなす角のtangent）を表す。またＦの方向は、tangent平面とｘ軸とのなす角τを表す。
【００４７】
【数１６】

【００４８】
以上のような性質を持つ、ベクトルＡとＦを使って、上記の不変特徴量の表現を書き換えると、
【数１７】

と表される。画像変形の主軸を示す角μは、ＡとＦの中点を通る角度で表される。
【００４９】
【数１８】

【００５０】
この式（３４）から式（３７）を使って得られる情報は、中心投影をweak-perspective投影で近似したため、曖昧性を含んだものになっている。例えば、物体の運動速度は、近くを運動している小さな物体と遠くを運動している大きな物体との判別ができなく、大きさと速度の曖昧性があるので、速度の代わりに、現在の運動速度で運動した時に物体に衝突するまでの距離ｔc
【数１９】

【００５１】
で表すことになる。式（３６）は近くにある彫りの浅い物体か、遠くにある彫りの深い物体かの判別ができなく、深さの曖昧性を含んでおり、この式の値からは、画像の変形が、大きく動いた（｜Ａ｜が大きい）表面の傾きの小さい（｜Ｆ｜が小さい）物体か、小さく動いた表面の傾きの大きい物体かの区別はできなくなっている。このように曖昧性が存在する部分を明らかにしておくことにより、残りの必要な情報をノイズの影響を受けずに精度良く求めることができる。
【００５２】
次に、アフィンフロー解析部３で行なっている処理を図９のフローチャートに従って説明する。
【００５３】
まず、入力した４つの特徴点から３点を抽出して組み合わせたときに、３点を結んで構成される領域の面積が最大となる３つの特徴点を選び、選んだ３点を参照点、残りの１点を基準点とする（ステップＳＴ１０１）。
【００５４】
３つの参照点の運動速度を代入して式（１４）を解き、一次近似したアフィン変換パラメータｕ0，ｖ0，ｕx，ｕy，ｖx，ｖyを求める。物体の運動が小さくて滑らかな場合には、参照点の３フレーム以上の画像での位置情報を使って最小２乗法を使ってアフイン変換パラメータを求める（ステップＳＴ１０２）。
【００５５】
次に、基準点とアフイン変換で補間した仮想点の運動速度の仮想視差を求める。物体がカメラから十分遠く、視線方向の並進運動Ｕ3が大きくないと仮定できるときには、この仮想視差の方向が、Ａの方向θAを表わす。そうでないときには、複数の点の仮想視差を式（１３）を代入してθA＝Δｕ／Δｖを求める（ステップＳＴ１０３）。
【００５６】
３つの参照点から式（１５），（１６），（１７）および（１８）を使って、curl,div,defの各不変特徴量を求める。これらの不変特徴量は、物体の並進運動や回転運動によって起きる変化に物体表面の向きと画像平面上での動きによって起きる変化が加わったものである（ステップＳＴ１０４）。
【００５７】
変形の主軸μと並進運動の画像平面への投影θAから、式（３７）を使って、参照点の３点で定められる平面の傾きτを求める（ステップＳＴ１０５）。
【００５８】
式（３５）から、表面方向と画像平面上での動きの関係による形の伸縮を差し引く。
【００５９】
これまでに判った値を用いて、式（３５）からＦ・Ａ＝｜ｄｅｆｖ｜ｃｏｓ（τ−θA）を引く。残った成分は、視線方向に沿った物体の動きによる画像のスケールの変化を示し、ここから、衝突までの時間ｔc が求められる（ステップＳＴ１０６）。
【００６０】
式（３４）から表面の方向と画像平面上での動きの影響を差し引く。これまでに判った値を用いて、式（３４）からＦ×Ａ＝｜ｄｅｆｖ｜ｓｉｎ（τ−θA）を引くと、残った成分は、物体と撮影者間の視線方向の周りの回転によるものだけになる（ステップＳＴ１０７）。
【００６１】
アフィンフロー解析部３は、このようにして、並進運動方向θA、スケールの変化ｔｃ、視線方向の周りの回転Ω・Ｕ、など、画像情報から安定して計算することのできる観察者の運動パラメータを計算し、ポインティング情報生成部４に出力する（ステップＳＴ１０８）。
【００６２】
先に述べたように、ポインティング情報生成部４は特徴点抽出部２から特徴点座標の時系列データを入力し、そこから適当な２つの時点の特徴点座標を選んでアフィンフロー解析部３で運動パラメータを計算し、３次元の空間を指示するために必要な情報を生成する。以下、フローチャート１０に従って、この処理を説明する。
【００６３】
まず、特徴点抽出部２から特徴点座標（補助情報がある場合は、補助情報も）を入力する。入力した特徴点の数をｎとし、座標を（ｘi，ｙi）とする（ステップＳＴ２０１）。
【００６４】
撮影対象の物体は動いているので、特徴点が他の部分に隠されて見えなくなったり、隠されていた特徴点が出現したりする場合がある。特徴点が４点より少ない場合は何もせず、特徴点が４点以上になった場合は、前回ポインティング情報を生成した時に抽出した特徴点と今回抽出した特徴点の共通集合から、物体上に均等に位置されるような特徴点を４点選択する（ステップＳＴ２０２）。
【００６５】
選択した特徴点について、前回ポインティングに使った時の座標値（ｌｘi，ｌｙi）からの移動距離（（ｘi−ｌｘi）²＋（ｙi−ｌｙi）²）を計算し、この距離を一定の閾値と比較する。特徴の大きさなど、補助情報がある場合には、その値を使って閾値を決める。選んだ特徴点を以前にポインティングに使ったことがなければ、ｌｘi，ｌｙiに、ｘi，ｙiを代入する。４点の移動距離で、閾値以上の距離を持つ点が１つでもあれば、以降の処理を行ない、全て閾値以下であれば、ステップＳＴ２０１に戻る（ステップＳＴ２０３）。
【００６６】
このようにして求めた、４点の過去の座標値（ｌｘi，ｌｙi）と現在の座標値（ｘi，ｙi）をアフィンフロー解析部３に入力して運動パラメータを計算する（ステップＳＴ２０４）。
【００６７】
アフィンフロー解析部３で計算した運動パラメータは物体が静止し、観察者（カメラ）が運動していると仮定した時のパラメータである。これを、物体の運動を表す値に置き換えると、重心の動きは物体のＸ，Ｙ方向への並進運動、スケールの変化を表すｔｃはＺ方向への並進運動、Ω・Ｕは、Ｚ軸周りの回転運動、Ａは物体のＸ軸周りの回転運動とＹ軸周りの回転運動の比を表す。これらパラメータそれぞれについて、閾値と比較し、閾値より大きな動きがあれば、そのパラメータの示す物体の運動を一定の大きさだけ起こすようなポインティングイベントを発生する（ステップＳＴ２０５）。その際、画面上に見るポインタの動きと、人間が、自分の手や、ポインティングに利用する器具を見た時の運動の方向を一致させるようにポインティングイベントでの運動方向の符合を決める。
【００６８】
ここで発生したポインティングイベントは、ポインタ表示部８およびジェスチャパタンマッチング部６に送られる。アフィンフロー解析部３で求めた運動パラメータは、中心投影を仮定しないと計算できないパラメータは深さλを使った相対的な表現をしているが、ポインティング情報生成部４では物体の絶対的な動きを必要とする時のために、中心投影の方程式（１１）にλを使って相対値で表現したパラメータを代入して位置と姿勢を計算し、この情報もポインタ表示部６に出力する（ステップＳＴ２０６）。
【００６９】
ポインタ表示部８は、後で述べるステータス切替部５からの指示によって、ポインタモデル記憶部１２に記憶されている。例えば図１１のように容易に３次元的に向きのわかるポインタの３Ｄモデルか、３Ｄモデル記憶部９に記憶されているモデルのうち、ステータス切替部５によって指定された３Ｄモデルかを選択し、選択した３Ｄモデルの現在の位置と姿勢から、入力したポインティングイベントに従って並進、回転運動させたグラフィクス画像情報を生成し、出力する。
【００７０】
ポインタモデル記憶部１２には、前述したようにポインタの３Ｄモデルと現在の位置と姿勢が記憶されており、３Ｄモデル記憶部９には、現在、画像に表示されている３Ｄモデルとモデルの位置と姿勢が記憶されている。
【００７１】
ジェスチャパタンマッチング部６では、ポインティング情報生成部４から入力した最新のポインティングイベントの時系列のリストで、ユーザからのキーボード入力などで途切れていないパタンと、ジェスチャパタン記憶部７に記憶されたジェスチャパタンとを比較して、ユーザによるポインタの操作が、あらかじめ登録された何かの意味を持った動きかどうかを判定する。合致したジェスチャパタンがあれば、そのパタンと一緒に記憶されているオペレーションを実行する。
【００７２】
ジェスチャパタン記憶部７の中では、ジェスチャパタンは、図１２に示すようなリスト構造の並んだ表で記憶されている。１つのジェスチャは、ジェスチャのパタンと、それが起きた時に呼び出されるオペレーションを示す文字列から構成されている。１つのジェスチャパタンは、ポインティングイベントのリストで表現されており、１つのポインティングイベントは並進運動｛Ｕ₁，Ｕ₂，Ｕ₃｝と、回転運動｛Ω₁，Ω₂，Ω₃｝の６つのパラメータについて、正負の方向への運動があることを示す＋か−、あるいは運動がないことを示す０の３種類のシンボルで表されている。図１２で、ジェスチャパタンのリストにリストの次の要素を示す２つのsucessorがあり、自分で閉ループを構成しているものがあるが、これは、この閉ループで同じポインティングイベントの繰り返しを許容する仕組みである。sucessorの横の変数ｎは４つの閉ループがみな同じ回数だけ繰り返すことを示す。図１２の例では、ジェスチャパタンは、ｘｙ平面上での任意の大きさの正方形を示し、このジェスチャによって／ｕｓｒ／ｂｉｎ／Ｘ１１／ｋｔというオペレーションが起動されることを示している。
【００７３】
ステータス切替部５は、ディスプレイに表示された３次元空間内を自由にポインタを動かして操作する、ポインタ操作状態か、表示されたモデルのうちの一つをポインタによって指定した後、モデルの位置や姿勢を変更するモデル把握状態の、どちらかの現在のポインタの状態を記憶し、また、ユーザからの指示か、ポインタ表示部からの指示によって、ポインタの状態を切替え、それにともなって他の部分の設定変更を行なう。
【００７４】
ポインタ操作状態の時には、ポインタ表示部８に、ポインタモデル記憶部に記憶されたモデルを使うように指示し、発生したポインティングイベントに従ってポインタモデルの位置と姿勢を変更する。ポインティングイベントはジェスチャパタンマッチング部にも入力され、ジェスチャの認識が行なわれ、イベント列にジェスチャが含まれると認識されれば、そのジェスチャに対応したオペレーションが実行される。ユーザのキーボード入力やジェスチャなどによる指示か、ポインタの３次元位置が３Ｄモデル記憶部に記憶してある３Ｄモデルの１つの位置と一致した時に、ポインタの状態はモデル把握状態に切り替わる。
【００７５】
モデル把握状態では、３Ｄモデル記憶部９に記憶されたモデルを位置姿勢を変更させて表示する。まず、モデル把握状態に入ると、指示された３Ｄモデルを３Ｄモデル記憶部９から取り出して、ポインタ表示部８に送り、これを他のモデルと区別できるよう色などを変えて表示するように指示する。次に、画像合成部１０に、モデルの位置や形などを入力して、３Ｄモデルの画像情報と入力した画像情報から、手でモデルを掴んでいたり、ポインティング用の器具に３Ｄモデルがはめ込まれていたりするように見える画像を合成し、画像表示部１１で表示する。モデルの移動や回転には、ポインタ操作状態とは異なり、ポインティングイベントではなく、中心投影に当てはめて計算した位置や姿勢の情報を用いる。
【００７６】
画像合成部１０では、まず、画像入力部１から入力した画像情報から特徴点を囲む閉領域を切り出すことによって、手や器具の写っている部分を取り出す。次に、手や器具の運動が、実際の運動と画像上での運動の方向がユーザから見て一致するように、取り出した画像の左右を反転する。入力した３Ｄモデルの位置や形などの情報と特徴点の座標を元に、手や器具の画像に平行移動、スケールの変更などの処理を行なって、特徴点の位置を３Ｄモデルのグラフィクス画像の頂点などに合わせる。その後、モデルのグラフィクス画像と手や器具の画像を半透明で重ね表示を行なって、図１３のように、モデルを掴んでいるように見える画像を合成し、画像表示部１１に出力する。
【００７７】
図１４を参照して本発明の他の実施例を説明する。
【００７８】
動画像入力部２１は、カメラ（観察者）が３次元空間内を未知の運動をしながら画像を撮影し、撮影した動画像に対応する画像情報を特徴点抽出部２２に転送する。この動画像入力部２１が撮影している撮影環境は、基本的に静止環境であるが、運動物体が含まれていても構わない。
【００７９】
特徴点抽出部２２は、動画像入力部２１からの時系列画像情報を受け、画像処理によって、明度や色が近傍の領域と急激に変化し、２枚の画像で物体の同一の点が投影されたと特定できる多数の特徴点を抽出し、抽出した特徴点を並進フロー計算部２３に入力する。
【００８０】
並進フロー計算部２３は、入力された特徴点のそれぞれの座標を比較し、最近傍の４点を結んだネットワークを構成し、最近傍の４点の組合わせの全てに対してアフィンフロー解析部３が行う処理と同様な処理を行なって、仮想視差を求め、観察者の運動パラメータを計算する。計算した運動パラメータのうち、並進運動方向θAを見ると、この値は、カメラが撮影環境に対して並進運動し得ている方向を示すものであるから、動画像入力部２１が撮影した画像が静止環境であれば、どの４点の特徴点の組合わせを取っても同じ値を示している。実際には、中心投影を狭い視野範囲に限定してアフィン変換で近似しているので、互いに近傍にある特徴点の組合わせで同じ値を示す。従って、画像全体に分布する特徴点の組合わせから並進運動方向θAだけを抜き出して、図１５のような分布図を作成し、出力する。尚、図１５の矢印は、２つの物体の各々の複数の点の動きを示している。
【００８１】
独立運動分割部２４は、並進フロー計算部２３で計算した並進運動方向のフロー図において、近傍にある特徴点の組合わせの並進運動方向θAを比較し、その差が一定閾値より大きいところで領域分割を行なう。これによって、図１５の実線で囲まれた領域のように、動いている背景画像の中から、異なる動きをしている物体を示す領域を取り出すことができる。この後、異なる動きをしている領域を取り除き、残った、背景画像を示す領域からθA以外の運動パラメータも計算し、カメラの運動を求めて出力する。
【００８２】
図１６を参照して本発明に係る他の実施例を説明する。
【００８３】
画像入力部３１は、１つの物体を複数の方向から撮影した画像に対応する画像情報を入力する。ここで入力する画像情報の画像は、時間的に連続していなくても構わない。また、物体を撮影した時の観察者の位置関係も不明である。
【００８４】
特徴点抽出部３２は、画像入力部３１より入力した画像情報に、各点抽出処理などの画像処理を行ない、明度や色が近傍の領域と急激に変化している多数の特徴点に対応する特徴点情報を抽出し、対応特徴探索部３３に出力する。抽出した特徴点情報は、入力した画像情報に重ね合わせて画像表示部３６に表示される。
【００８５】
初期対応探索部３３は複数の画像間において特徴点情報を比較し、撮影対象となった物体上の同一の点が投影されたものかどうか調べる。まず、特徴点抽出部３２から入力した全ての特徴点情報に未対応を示すフラグをつける。次に、画像間において、特徴点を中心とする小領域の相関マッチングを行ない、相関係数が一定閾値より高い特徴点情報同士を対応させ、対応した特徴点情報には、対応したことを示すフラグをつけ、特徴点の集合情報を対応修正部、即ち対応特徴更新部３７に出力する。画像表示部３６に、対応した特徴点情報を、色を変えるなどして未対応の特徴点情報と区別できるように入力画像情報と重ね合わせて表示し、また、複数の画像情報間においてどの特徴点情報とどの特徴点情報が対応しているかが分かるように表示する。
【００８６】
インタフェイス部３４は、初期対応探索部３３から対応のついた特徴点を含む特徴点集合情報を入力し、特徴点の対応関係の修正作業を行なう。初期対応探索部３２によって作成した対応関係が、十分正確で誤りが少なければ、対応点の修正は行なわずに後で処理を行なうことも可能である。
【００８７】
初期対応探索の結果、画像表示部３６に表示している特徴点を重ね表示した入力画像に、マウスなどのポインティングデバイスで制御されたカーソルを表示し、ユーザが特徴点を選択できるようにする。既に対応フラグのついた特徴点を選んだ場合には、その特徴点とそれに対応する特徴点のフラグを未対応につけかえ、対応関係を取り消す。入力した複数の画像情報において未対応フラグのついた特徴点情報を１つづつ連続して選んだ場合には、それらの特徴点情報に対応が確定したことを示すフラグをつけ、それらの特徴点情報間に対応関係を設定する。また、特徴点のない領域でも、複数の画像において、１つづつ連続して選んだ画素があれば、その画素の座標に対応が確定した特徴点を生成して対応関係を設定する。また、画像と特徴点以外に、対応点の更新および物体の構造抽出並びに画像の合成を呼び出すボタンを表示し、ポインティングデバイスによってこれらを選択できるようにしておく。ユーザが対応点の更新ボタンを選んだ場合には、対応特徴更新部３７に特徴点の集合情報を渡して、特徴点の対応関係を更新する。物体の構造抽出のボタンを選んだ場合には、構造抽出部に特徴点の集合を渡して、撮影した物体の３次元構造を抽出する。画像の合成のボタンを選んだ場合には、さらに、どこから見た画像を合成するかを質問して、画像を合成する視点までの並進、回転の運動ベクトルを入力し、画像合成部３５に入力した画像、特徴点の集合、構造抽出部（？）で抽出した物体の３次元構造、運動ベクトルを渡して、画像を合成し、画像表示部に表示する。
【００８８】
画像表示部３６は、複数の入力画像、特徴点、画像から抽出した物体構造の３Ｄモデル、視線方向を変えて合成した画像などを表示し、そこにカーソルを重ね表示して、画像上の座標や特徴点を指示できるようにする。
【００８９】
対応特徴更新部３７は、インタフェイス部３４から特徴点の集合情報を受け、新たな基準に従って、未対応の特徴点の対応づけを行なう。まず、入力した特徴点集合情報の中から、対応が確定し、対応フラグのついた特徴点（これを点Ａ0とする）を選択し、特徴点Ａ0の近傍にある未対応フラグのついた特徴点Ｂを選択する。特徴点Ａ0と対応する、他の画像中の特徴点Ａ0′の近傍にある特徴点Ｂ′と特徴点Ｂを比較し、対応するかどうかを判定する。ＢとＢ′の比較は、両方の近傍に、対応のついた特徴点が２点以下しかなければ、初期対応探索と同様に特徴点を中心とする小領域の相関マッチングを行ない、初期対応探索部３３で使った閾値より低い閾値で対応するかどうかを判定する。図１７のように、ＢとＢ′の両方の近傍に対応のついた特徴点が３点以上あれば、画像の変形を考慮したマッチングを行なう。まず、三角形Ａ0，Ａ1，Ａ2を三角形Ａ0′，Ａ1′，Ａ2′に変形するアフィン変換を計算する。Ｂを中心とする小領域を、計算したアフィン変換で変形し、変形した小領域とＢ′を中心とする小領域との間で相関マッチングを行ない、近傍に対応済みの点が２点以下しかない場合と同じ閾値で判定する。このようにして見つけた特徴点には、初期対応探索部３３と同様に特徴点に対応を示すフラグをつけ、特徴点間を対応関係で結ぶ。この処理を、新たに対応する特徴点がなくなるまで、繰り返してインタフェイス部３４に復帰する。
【００９０】
構造抽出部３８は、形状が不明の物体を撮影して得られる複数の画像情報とこれらの画像情報間で対応をつけた特徴点の集合情報を受け、これら情報から物体の３次元形状モデルと、物体表面のテクスチャパタンを抽出し、出力する。まず、特徴点の集合情報から近傍にある４点の組合わせを取り出し、その４点にアフィンフロー解析部３による処理を行なって、運動パラメータを計算する。この４点を、図７のように、３角形を構成する３点と、それ以外の１点Ｐに分けて考える。点Ｐ′は一方の画像で点Ｐと同じ座標にあって、もう一方の画像では、他の３点の移動で表現されるアフィン変換で移動する仮想的な点である。この点Ｐ′は、３次元空間では３点で決まる３次元平面上にあって、画像平面には点Ｐと同じ座標に投影される点を示している。この仮想視差ＰＰ′の大きさは、式（１２）より、
【００９１】
【数２０】

【００９２】
で表される。但し、Ｚ_ｐは点ｐの座標、Ｚ_ｐ１は点Ｐ′のＺ座標である。ここに、この４点の近傍にある別の特徴点Ｑを追加した図１８の状況を想定する。点Ｑ′は点Ｐ′と同様に、ほかの３点の移動で決まるアフィン変換で移動する仮想的な点であって、やはり３次元空間では３点で決まる３次元平面上の点である。２つの仮想視差ＰＰ′とＱＱ′を考えると、画像中で近傍にあって、また、観察者から物体までの距離が十分遠いことから、この２つの仮想視差の長さの比は、（ＺP−ＺP′）／（ＺQ−ＺQ′）となり、これからＰとＱのある３次元平面に対する深さ方向の座標の比が求められる。この処理を全ての近傍にある特徴点の組合わせに対して行なって、物体の深さ方向の長さが、ある長さλに対する比で表現された、物体の３Ｄモデルを算出する。この後、あるλに対応する物体の３Ｄモデルがグラフィクス表示され、ユーザが、この３Ｄモデルを見ながらλの値を調整するか、３Ｄモデルの表面で、平面が交差している部分を見つけ、交差角が直角になるように、λの大きさを推定する、などして、完全な物体の３Ｄモデルが計算される。また、λの大きさが計算されると、式（３１）より、画像上で、近傍にある３点の特徴点で囲まれた３角形の物体表面の勾配が求められ、これから、物体表面のテクスチャ情報も取り出すことができる。
【００９３】
構造抽出部３８は、このようにして計算した物体の３Ｄモデルとテクスチャ情報を出力する。
【００９４】
画像合成部３５は、入力した物体の３次元構造と視点までの運動ベクトルを使って、入力画像を変形し、新たな視点から物体を見た時の画像を合成する。観察者が動いた時の画像の変形は、式（３４）から式（３６）によって、観察者の並進運動ベクトルと回転運動ベクトル、物体表面の勾配と、画像平面上の並進運動で表現される。観察者の運動と画像平面上での並進運動は、視点までの運動ベクトルから計算され、物体表面の勾配は、物体の３次元構造から得られるので、これから、視点を変更したことによる、画像の変形を表現するアフィン変換行列を計算することができる。まず、入力画像において対象物体が写っている領域を、領域内にある特徴点を結んだ直線で区切られた３角パッチに分割する。各３角パッチの画像に対して、上記のアフィン変換を適用して、新たな３角パッチ画像を作成する。作成したパッチ画像をつなぎ合わせたものが新たな視線方向から見た時の物体の画像であり、これを画像表示部に表示する。
【００９５】
【発明の効果】
本発明によれば、動いている背景画像の中から、異なる動きをしている物体を示す領域を取り出すことができ、異なる動きをしている領域を取り除き、背景画像を示す残りの領域から並進運動の画像平面への投影θA以外の運動パラメータも計算しカメラの運動を求めることができる。
【図面の簡単な説明】
【図１】本発明の一実施例であり、ＣＡＤシステムでのモデルの操作などのヒューマンインタフェイスに用いる動画像処理装置のブロック図。
【図２】ポインティングに用いる第１の例の器具の図。
【図３】ポインティングに用いる第２の例の器具の図。
【図４】ポインティングに用いる第３の例の器具の図。
【図５】中心投影による撮像形での３次元座標と画像座標の関係を示す図。
【図６】画像から抽出できる不変特徴量を示す図。
【図７】運動視差を説明する図。
【図８】物体表面の傾斜の表現方法を説明する図。
【図９】アフィンフロー解析部の動作を説明するフローチャート図。
【図１０】ポインティング情報生成部の動作を説明するフローチャート図。
【図１１】ポインタの３Ｄ（３次元）モデルを示す図。
【図１２】ジェスチャパタン記憶部に記憶されたジェスチャパタンのデータ構造を示す図。
【図１３】３Ｄモデルと入力した画像の一部を重ね合わせて合成した画像を示す図。
【図１４】本発明の他の実施例であり、独立下運動を弁別するシステムのブロック図。
【図１５】並進運動方向の分布図。
【図１６】本発明の他の実施例であり、複数画像から３Ｄモデルを獲得し、物体を他の方向から見た時の画像を合成するシステムに適用した動画像処理装置の構成図。
【図１７】アフィン変換による画像変形を考慮したマッチングを示す図。
【図１８】２つの仮想視差の関係を示す図。
【符号の説明】
１…画像入力部
２…特徴点抽出部
３…アフィンフロー解析部
４…ポインティング情報生成部
５…ステータス切替部
６…ジェスチャパタンマッチング部
７…ジェスチャパタン記憶部
８…ポインタ表示部
９…３Ｄモデル記憶部
１０…画像合成部
１１…画像表示部
１２…ポインタモデル記憶部
２１…動画像入力部
２２…特徴点抽出部
２３…並進フロー計算部
２４…独立運動分割部
３１…画像入力部
３２…特徴点抽出部
３３…初期対応探索部
３４…インタフェイス部
３５…画像合成部
３６…画像表示部
３７…対応特徴更新部
３８…構造抽出部

[0001]
[Technical field of the invention]
The present invention relates to a moving image processing apparatus that receives a plurality of pieces of image information and detects the motion and structure of a target object from changes in the positions of feature points.
[0002]
[Prior art]
Several methods have already been proposed for detecting the structure of an object photographed in a plurality of images.
[0003]
For example, S. Ullman, The Interpretation of Visual Motion, MIT Press, Cambridge, USA, 1979, introduces a method that completely determines the structure and motion of four points, given three or more parallel-projected images of a rigid object in which the correspondences of four non-coplanar points are known.
[0004]
In addition, H. C. Longuet-Higgins, "A computer algorithm for reconstructing a scene from two projections," Nature, 293:133-135, 1981, discloses a linear calculation method that detects structure and motion when eight corresponding points are available on two perspective-projected images.
[0005]
In addition, O. D. Faugeras and S. J. Maybank, "Motion from point matches: multiplicity of solutions," IEEE Workshop on Motion, pp. 248-255, 1989, describe that if two centrally projected images contain five corresponding points, the number of structures and motions satisfying those correspondences is finite.
[0006]
Japanese Patent Laid-Open No. 3-6780 discloses a method that first obtains the three-dimensional rotational motion from corresponding points on two images and then, from that rotational motion information, obtains the three-dimensional positional relationship relative to one of the corresponding points.
[0007]
In all of these methods, an equation is established between the three-dimensional coordinates of the object and the coordinates on the image on which the object is projected by central projection, and the answer is obtained by solving the equation.
[0008]
Also, as disclosed in Jan J. Koenderink and Andrea J. van Doorn, "Affine structure from motion," Journal of the Optical Society of America A, vol. 8, no. 2, pp. 377-385, 1991, a method has been worked out that represents the motion of an object by an affine transformation (linear transformation) and detects the structure of the object from it. This method can compute the approximate structure of an object from two frames of a moving image. In the structure computed by this method, the depth information is known only up to multiplication by an unknown coefficient proportional to the distance from the camera to the object.
[0009]
[Problems to be solved by the invention]
The methods described above, which solve the central projection equations, can compute the motion and structure of an object efficiently when the object is very close to the imaging device and appears large in the image. In images that are actually processed, however, the object often occupies only a small area of the image or is far from the imaging device. The image deformation caused by central projection then becomes small, and because the motion of the object is computed from that deformation, the result becomes unstable. For example, it becomes difficult to distinguish a translation perpendicular to the line of sight from a rotation about an axis perpendicular to that direction of translation. In addition, when the effect of central projection is small, a depth ambiguity arises: it becomes difficult to tell a nearby object with shallow relief from a distant object with deep relief, or a small object moving near the observer from a large object moving far away.
[0010]
Also, because Koenderink's method leaves an unknown coefficient in the detected structure of the object, it was difficult to calculate the motion of the object from that structure.
[0011]
The present invention has been made to solve these problems. Its object is to provide a moving image processing apparatus that approximates the image deformation caused by the observer's motion with an affine transformation (linear transformation), computes the virtual parallax, which is the difference between the position to which a feature point actually moved and the position to which it would have moved under the affine deformation determined by the motion of the surrounding features, and computes the motion of the object directly from the virtual parallax information, thereby calculating the motion parameters of the object accurately without being affected by the ambiguities of central projection.
[0012]
[Means for Solving the Problems]
The present invention provides a moving image processing apparatus comprising: image input means for inputting image information corresponding to images; feature point extraction means for performing feature point extraction on the image information in order to extract, from the images, at least three first feature points lying on the same plane and one second feature point not lying on that plane; a translational flow calculation unit that obtains a virtual point located at the same coordinates as the second feature point and having the velocity determined by the affine transformation of the velocities of the first feature points, obtains a virtual parallax from the difference between this virtual point and the second feature point, and obtains the direction of translational motion from the direction of the virtual parallax; and an independent motion segmentation unit that segments regions moving differently by using the translational motion directions calculated by the translational flow calculation unit.
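The virtual parallax computation named in this paragraph can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`fit_affine_flow`, `virtual_parallax`) are hypothetical, velocities are taken as coordinate differences between two frames, and the sign convention (actual velocity minus virtual velocity) is an assumption.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(m, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("reference points are collinear")
    solution = []
    for col in range(3):
        mc = [list(row) for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        solution.append(det3(mc) / d)
    return solution

def fit_affine_flow(points, velocities):
    """Fit u = u0 + ux*x + uy*y and v = v0 + vx*x + vy*y to three points."""
    m = [(1.0, x, y) for x, y in points]
    u_params = solve3(m, [u for u, _ in velocities])
    v_params = solve3(m, [v for _, v in velocities])
    return u_params, v_params

def virtual_parallax(ref_points, ref_velocities, p, p_velocity):
    """Return (du, dv, angle): the actual velocity of p minus the velocity
    of the virtual point predicted by the affine flow of the references."""
    (u0, ux, uy), (v0, vx, vy) = fit_affine_flow(ref_points, ref_velocities)
    x, y = p
    du = p_velocity[0] - (u0 + ux * x + uy * y)
    dv = p_velocity[1] - (v0 + vx * x + vy * y)
    return du, dv, math.atan2(dv, du)
```

The three reference points must not be collinear, otherwise the affine parameters are not determined; the sketch raises an error in that case.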
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0016]
In a simple embodiment shown in FIG. 1, which is used to specify the motion and pose of a numerically represented object model, the image input unit 1 is connected to the pointing information generation unit 4 via the feature point extraction unit 2. The affine flow analysis unit 3 and the gesture pattern matching unit 6 are connected to the pointing information generation unit 4. The gesture pattern matching unit 6 is connected to the gesture pattern storage unit 7 and, together with the pointer display unit 8, to the status switching unit 5. The pointer display unit 8 is connected to the 3D (three-dimensional) model storage unit 9, the image composition unit 10, and the pointer model storage unit 12. The image composition unit 10 is connected to the image input unit 1 and to the image display unit 11.
[0017]
The image input unit 1 receives time-series image information obtained by photographing a moving object with a television camera or the like and transfers it to the feature point extraction unit 2. To make the moving object easy to photograph, the image may be, for example, an image of a person sitting in front of the image display unit 11 taken from the ceiling, an image taken by a camera on the frame of the display of the image display unit 11, or an image stitched together from several cameras placed side by side, each with a long focal length and a narrow field of view so that the effect of central projection does not become large. The object input as an image is a part of the body such as a human hand; or, as in FIG. 2, a part of the body such as a hand wearing a glove that has been given features, for example by painting parts of it or attaching colored objects, so that the feature point extraction unit 2 can easily distinguish it from other parts; or an instrument that a person can hold and move and that has features distinguishable from other parts by image processing. FIG. 3 shows an example of such an instrument: four spheres of the same size, each painted a different color so that it can be distinguished from the others, connected in an arrangement that is not coplanar in three-dimensional space. FIG. 4 shows another instrument, a box with light emitting elements such as LEDs embedded in its surface, which emit light using power from a built-in battery. A switch is located where a finger presses when the instrument is gripped; while the switch is pressed, light emission stops, so the instrument can be moved without indicating a motion. The features on the glove in FIG. 2 and the light emitting elements in FIG. 4 are arranged so that, from any three-dimensional viewing direction, four or more features that are not close to one another are visible.
[0018]
The feature point extraction unit 2 receives the time-series image information from the image input unit 1 and, by image processing, extracts and tracks a plurality of feature points that can easily be discriminated from other regions and identified as projections of the same point on the object, and outputs their coordinates to the pointing information generation unit 4. If the photographed object is a hand or the like, points whose brightness differs greatly from that of neighboring pixels are extracted as feature points. When colored parts serve as features, as in FIGS. 2 and 3, a memory is provided that stores color information specifying which colors are used as features; regions having the same color as the stored color information are extracted from the image information, and their centroid coordinates are output to the pointing information generation unit 4. When the size of a colored region is known, as in this case, the region size is also output to the pointing information generation unit 4 as auxiliary information. The color information is either written into the memory in advance or learned after the apparatus is started by activating a learning means. In the latter case, either the input image information is displayed on the image display unit and the user selects with a cursor the region whose color is to be used as a feature, or the input moving image is displayed on the image display unit 11 with a window superimposed on it, the user moves the hand or instrument so that the feature color falls inside the window, and the image of the region designated by the window is captured at the moment of a key input or the like. The color is then learned from the partial image containing the feature color and stored in the memory. For example, for an image represented by the three components I (lightness), H (hue), and S (saturation), colors can be learned by preparing the following quadratic expressions and estimating their parameters by least-squares estimation on the pixel values of the designated partial image.
[0019]
$H = h_0 + h_1 I + h_2 I^2$
$S = s_0 + s_1 I + s_2 I^2$
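The least-squares estimation of these quadratic color models can be sketched as follows. This is only an illustration under stated assumptions: the function names `fit_quadratic` and `learn_color` are hypothetical, pixels are given as (I, H, S) tuples, and hue is treated as an ordinary number (wrap-around of hue is ignored).

```python
def fit_quadratic(intensity, values):
    """Least-squares fit of values ~ c0 + c1*I + c2*I**2; returns (c0, c1, c2)."""
    # Power sums for the 3x3 normal equations.
    s = [sum(i ** k for i in intensity) for k in range(5)]
    t = [sum(v * i ** k for i, v in zip(intensity, values)) for k in range(3)]
    a = [[s[r + c] for c in range(3)] + [t[r]] for r in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("need at least three distinct intensity values")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return tuple(a[r][3] / a[r][r] for r in range(3))

def learn_color(pixels):
    """pixels: iterable of (I, H, S). Returns ((h0, h1, h2), (s0, s1, s2))."""
    pixels = list(pixels)
    intensity = [p[0] for p in pixels]
    h_params = fit_quadratic(intensity, [p[1] for p in pixels])
    s_params = fit_quadratic(intensity, [p[2] for p in pixels])
    return h_params, s_params
```

A pixel could then be classified as belonging to the learned color when its H and S lie close to the values predicted from its I; the tolerance would be a tuning parameter that the source does not specify.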
When light emitting elements are used, as in the instrument of FIG. 4, the image is binarized with a suitable threshold, the centroid of each region brighter than the threshold is taken as the coordinates of a feature, and these coordinates are output to the pointing information generation unit 4.
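The binarization and centroid step can be sketched as follows. The function name `bright_centroids` is hypothetical, the image is assumed to be a list of rows of brightness values, and 4-connectivity is an assumption; the source does not say how bright regions are delimited.

```python
def bright_centroids(image, threshold):
    """Return the (x, y) centroid of each 4-connected region whose pixels
    are brighter than threshold, in row-major order of first encounter."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    centroids = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or image[y0][x0] <= threshold:
                continue
            # Flood-fill one bright region, accumulating coordinate sums.
            stack = [(x0, y0)]
            seen[y0][x0] = True
            sum_x = sum_y = count = 0
            while stack:
                x, y = stack.pop()
                sum_x += x
                sum_y += y
                count += 1
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            centroids.append((sum_x / count, sum_y / count))
    return centroids
```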
[0020]
The pointing information generation unit 4 receives the time-series feature point coordinates from the feature point extraction unit 2, selects the coordinates of a plurality of feature points at two points in time, has the affine flow analysis unit 3 analyze the motion parameters, and uses the result to generate the information needed to move the pointer.
[0021]
Processing performed by the affine flow analysis unit 3 will be described in detail below.
[0022]
The affine flow analysis unit 3 receives the coordinates of four feature points in two images and computes the motion, between the two images, of the camera (observer) that photographed the object formed by the four feature points. Consider the model of an imaging system using central projection shown in FIG. 5. A point at coordinates (X, Y, Z) in three-dimensional space is projected onto the point (x, y) on the image plane at focal length f. In this state, the observer is assumed to translate with velocity $\{U_1, U_2, U_3\}$ and to rotate with $\{\Omega_1, \Omega_2, \Omega_3\}$. Assuming that a feature point (x, y) received from the feature point extraction unit 2 is sufficiently close to the line-of-sight direction, the components of its velocity V(x, y) on the image plane are denoted (u, v).
[0023]
We now express this velocity in terms of the observer's motion parameters. The relationship between the three-dimensional coordinates (X, Y, Z) and the velocity parameters is
[Equation 1]

so the coordinates after the movement, X¹, Y¹, Z¹, are obtained as
[0024]
[Equation 2]

The projection relationship between the three-dimensional coordinates and the point on the image plane is
[0025]
[Equation 3]

and its derivative is
[0026]
[Equation 4]

Substituting the coordinates after the movement into this derivative gives
[0027]
[Equation 5]

[0028]
The same calculation for v gives
[Equation 6]

[0029]
The image velocity can be expressed in this way. It can be divided into a component that depends on translational motion and contains scene information (Z), and a component that depends on rotational motion. The rotational component varies only with pixel position and does not depend on the location or shape of the object. Therefore, as noted for the prior art, even though the equations can be solved mathematically, the change that appears in the image is small when the effect of central projection does not appear, so it is difficult in practice to determine the rotational motion parameters. The change due to rotation is then added to the translational motion as an error, and the accuracy of the computed translational motion also deteriorates. As a result, it has been difficult to calculate the shape and motion of an object accurately.
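The equation images for this derivation are not reproduced in this text. As a reading aid only, the following sketch gives the standard form of the derivation under one common sign convention (a stationary scene point seen from an observer translating with $U$ and rotating with $\Omega$); the signs and the grouping of terms in the patent's own Equations (1) to (11) may differ.

```latex
\begin{aligned}
% motion of a scene point relative to the moving observer
\dot{X} &= -U_1 - \Omega_2 Z + \Omega_3 Y, &
\dot{Y} &= -U_2 - \Omega_3 X + \Omega_1 Z, &
\dot{Z} &= -U_3 - \Omega_1 Y + \Omega_2 X \\
% central projection and its time derivative
x &= \frac{fX}{Z}, \quad y = \frac{fY}{Z}, &
u &= \frac{f\dot{X}}{Z} - \frac{x\dot{Z}}{Z}, &
v &= \frac{f\dot{Y}}{Z} - \frac{y\dot{Z}}{Z} \\
% image velocity: translational part (depends on Z) plus rotational part
u &= \frac{-fU_1 + xU_3}{Z}
     + \frac{xy}{f}\,\Omega_1 - \Bigl(f + \frac{x^2}{f}\Bigr)\Omega_2 + y\,\Omega_3 \\
v &= \frac{-fU_2 + yU_3}{Z}
     + \Bigl(f + \frac{y^2}{f}\Bigr)\Omega_1 - \frac{xy}{f}\,\Omega_2 - x\,\Omega_3
\end{aligned}
```

In this form only the first term of each velocity component contains the depth $Z$, which is the split into translation-dependent and rotation-dependent parts described in the text.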
[0030]
However, suppose that two feature points are projected onto the same location in the image, and consider the difference (Δu, Δv) between their velocities (hereinafter referred to as motion parallax). The magnitude of the motion parallax is
[Expression 7]

[0031]
where Z₁ and Z₂ are the Z coordinates of the two feature points used in the motion parallax calculation. This motion parallax depends only on the distance to the object and the translational motion of the observer, and does not depend on the rotational motion of the observer. From Equation (12),
[Equation 8]

[0032]
so if x, y, and U₃ are sufficiently small, the direction of the translational motion is obtained directly from the motion parallax. If U₃ is not small, the direction of the translational motion can be obtained from the ratio of U₁ to U₂ by substituting the motion parallax of a plurality of points with different coordinate values (x, y) into this equation.
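The claim that motion parallax is independent of the observer's rotation can be checked numerically. The sketch below is illustrative only: it assumes f = 1 and the standard rigid-motion image velocity formula (a scene point moving as -U - Ω × X relative to the observer), which may differ in sign convention from the patent's missing equations. The function names are hypothetical.

```python
def image_velocity(x, y, Z, U, Omega):
    """Image velocity (u, v) at image point (x, y) with depth Z, assuming f = 1.

    Assumed model: the point moves relative to the observer as -U - Omega x X.
    """
    U1, U2, U3 = U
    W1, W2, W3 = Omega
    u = (-U1 + x * U3) / Z + W1 * x * y - W2 * (1 + x * x) + W3 * y
    v = (-U2 + y * U3) / Z + W1 * (1 + y * y) - W2 * x * y - W3 * x
    return u, v


def motion_parallax(x, y, Z1, Z2, U, Omega):
    """Velocity difference of two points at the same image position, depths Z1 and Z2."""
    u1, v1 = image_velocity(x, y, Z1, U, Omega)
    u2, v2 = image_velocity(x, y, Z2, U, Omega)
    return u1 - u2, v1 - v2


# Same translation, two different rotations: the parallax should be unchanged.
U = (0.3, -0.1, 0.05)
parallax_no_rotation = motion_parallax(0.02, -0.01, 4.0, 5.0, U, (0.0, 0.0, 0.0))
parallax_with_rotation = motion_parallax(0.02, -0.01, 4.0, 5.0, U, (0.2, -0.4, 0.1))
```

The rotational terms are identical for both points because they depend only on (x, y), so they cancel in the difference, leaving a vector proportional to (-U₁ + xU₃, -U₂ + yU₃).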
[0033]
On the other hand, if the observer's motion is sufficiently smooth and the surface of the object being photographed is also sufficiently smooth, the image velocity field of Equation (11) can be approximated by a linear equation within a small region. In other words, the image velocity field in the vicinity of a coordinate (x, y) on the image can be expressed using an affine transformation (linear transformation) as
[Equation 9]

[0034]
Here, O(x², xy, y²) represents the second-order nonlinear components, which are assumed to be sufficiently small and are ignored in the subsequent calculations.
[0035]
The first term [u0, v0] represents the translation of the image, and the 2 × 2 tensor of the second term represents the deformation of the image shape. A subscript in the second term indicates partial differentiation with respect to the variable named by the subscript. This 2 × 2 tensor is divided into several geometrically meaningful components, as in FIG. These are the rotation (curl) curlV on the image plane, which indicates a change in orientation; the isotropic change (divergence) divV, which indicates a change in scale; the magnitude of deformation defV, which indicates stretching along one axis and contraction along the perpendicular axis while the area is kept constant; and the principal axis of deformation μ, which indicates the direction in which the image stretches. When the image velocity at a coordinate (x, y) is V(x, y), these feature quantities are given by
[Expression 10]

[0036]
Among these feature quantities, divV, curlV, and defV are invariants that do not change with the choice of coordinate system in the image. The principal axis μ of deformation depends only on the orientation of the coordinate axes.
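As an illustration of how the four feature quantities follow from the 2 × 2 tensor, the sketch below uses the usual decomposition of a velocity gradient into divergence, curl, and deformation. The exact form of the patent's Equations (15) to (18) is not visible in this text, so the formulas and the function name `affine_invariants` are assumptions.

```python
import math


def affine_invariants(ux, uy, vx, vy):
    """Return (div, curl, def, mu) for the velocity gradient [[ux, uy], [vx, vy]].

    Assumed definitions (standard decomposition, not confirmed by the source):
      div  = ux + vy
      curl = vx - uy
      def * cos(2 mu) = ux - vy
      def * sin(2 mu) = uy + vx
    """
    div = ux + vy
    curl = vx - uy
    def_cos = ux - vy
    def_sin = uy + vx
    deformation = math.hypot(def_cos, def_sin)
    mu = 0.5 * math.atan2(def_sin, def_cos)  # axis direction, defined modulo pi
    return div, curl, deformation, mu


# Pure image-plane rotation u = -0.5 y, v = 0.5 x: only curl is non-zero.
rotation_only = affine_invariants(0.0, -0.5, 0.5, 0.0)
# Stretch along x with equal contraction along y: only def is non-zero, axis at 0.
stretch_only = affine_invariants(0.3, 0.0, 0.0, -0.3)
```

With these definitions a pure rotation contributes only to curl and an area-preserving stretch contributes only to def, which is the separation the text describes.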
[0037]
As shown in FIG. 7, consider a feature point P on the image plane and three feature points in its vicinity. As already indicated, within a sufficiently small region the velocity on the image plane can be approximated by the affine transformation (linear transformation) obtained from three feature points. Consider a virtual point P′ that has the same coordinates as the point P and whose velocity is determined by the affine transformation of the velocities of the other three points. The motion parallax between the actual point P and the virtual point P′ is the difference in motion between two points that are projected onto the same image coordinates but lie at different distances from the observer. Hereinafter, the motion parallax between the points P and P′ is referred to as virtual parallax. As shown in Equation (12), motion parallax is not affected by the rotational motion of the observer and depends only on the translational motion and the distance, so stable motion parameters and three-dimensional structure information can be calculated from it.
[0038]
The information that can be obtained from this virtual parallax is found by substituting the image velocity (Equation 11) into the affine transformation equation (14). Since the object being photographed is assumed to be sufficiently far from the observer, the variation in depth among the three-dimensional points on the object surface is very small compared with the distance from the observer to the object. Therefore, with f set to 1, let λ be the distance from the image center to the object, and express the variation in distance to the object surface as
[Expression 11]

[0039]
so that Z is normalized by the variable λ representing depth. As a result, the translational component of the observer's motion and the parameters of the affine transformation are expressed as
[0040]
[Expression 12]

[0041]
Substituting this result into the expressions for the invariant feature quantities, Equations (15) to (18), gives
[Formula 13]

[0042]
As can be seen from these equations, the parameters depend on the observer's motion, the depth, and the surface orientation. They can be rewritten using two vectors A and F so that they do not depend on the coordinate system. A is the translational velocity vector parallel to the image plane, normalized by the depth λ, as in the equation below. U is the translational motion vector, and Q is the unit vector in the line-of-sight direction.
[0043]
[Expression 14]

[0044]
F is a two-dimensional vector indicating the direction of the maximum gradient of the object surface, also normalized by the depth λ.
[0045]
[Expression 15]

[0046]
As shown in FIG. 8, the magnitude of F represents the tangent of the inclination σ of the object surface (the tangent of the angle formed by the line-of-sight direction and the normal of the object surface). The direction of F represents the angle τ formed by the tangent plane and the x axis.
[0047]
[Expression 16]

[0048]
Rewriting the invariant feature expressions above using the vectors A and F, which have the properties described, gives
[Expression 17]

The angle μ indicating the principal axis of image deformation is the angle that bisects the directions of A and F.
[0049]
[Expression 18]

[0050]
The information obtained from Equations (34) to (37) contains ambiguity because central projection is approximated by weak-perspective projection. For example, the speed of an object is ambiguous in size and speed: a small object moving nearby cannot be distinguished from a large object moving far away. Motion is therefore expressed by tc, which indicates when the observer would collide with the object if it continued moving at the current velocity:
[Equation 19]

[0051]
Equation (36) contains a depth ambiguity: it cannot distinguish a nearby object with shallow relief from a distant object with deep relief. From the value of this equation alone, an image deformation caused by a large motion (large |A|) over a surface with a small inclination (small |F|) cannot be distinguished from one caused by a small motion over a surface with a large inclination. By making clear in this way where the ambiguity lies, the remaining necessary information can be obtained accurately without being affected by noise.
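Since the equation images for (34) to (37) and for tc are missing, the block below gives one reconstruction that is consistent with the surrounding text (def V = |F||A|, μ bisecting A and F, F · A and F × A as used in steps ST106 and ST107). The factors of 2 and the sign of the rotation term follow the commonly published form of the affine flow relations and are assumptions, not a transcription of the patent.

```latex
% Reconstruction under stated assumptions; \angle denotes the direction angle of a 2-D vector.
\begin{align}
\operatorname{curl} V &= -2\,\boldsymbol{\Omega}\cdot\mathbf{Q} + |\mathbf{F}\times\mathbf{A}| , \tag{34}\\
\operatorname{div} V  &= \frac{2\,\mathbf{U}\cdot\mathbf{Q}}{\lambda} + \mathbf{F}\cdot\mathbf{A} , \tag{35}\\
\operatorname{def} V  &= |\mathbf{F}|\,|\mathbf{A}| , \tag{36}\\
\mu &= \frac{\angle\mathbf{A} + \angle\mathbf{F}}{2} = \frac{\theta_A + \tau}{2} , \tag{37}\\
t_c &= \frac{\lambda}{\mathbf{U}\cdot\mathbf{Q}} .
\end{align}
```

Read this way, Equation (36) shows the product ambiguity directly: only |F||A| is observable, so a large |A| with a small |F| gives the same value as the reverse.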
[0052]
Next, processing performed by the affine flow analysis unit 3 will be described with reference to the flowchart of FIG.
[0053]
First, of the combinations of three points that can be taken from the four input feature points, the three feature points that maximize the area of the triangle formed by connecting them are selected as reference points, and the remaining point is set as the base point (step ST101).
[0054]
Equation (14) is solved by substituting the velocities of the three reference points, giving the first-order affine transformation parameters u0, v0, ux, uy, vx, vy. When the motion of the object is small and smooth, the affine transformation parameters are obtained by the least squares method using the image positions of the reference points in three or more frames (step ST102).
[0055]
Next, the virtual parallax between the velocity of the base point and the velocity of the virtual point interpolated at the same position by the affine transformation is obtained. When it can be assumed that the object is sufficiently far from the camera and the translational motion U3 in the viewing direction is not large, the direction of this virtual parallax represents the direction θA of A. Otherwise, θA = Δu / Δv is obtained by substituting the virtual parallaxes of a plurality of points into Equation (13) (step ST103).
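Steps ST101 to ST103 can be sketched as follows for the two-frame case. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the affine parameters are solved exactly from three points with Cramer's rule (no least squares over several frames), and the parallax is taken as actual minus interpolated velocity, which is an assumed sign convention.

```python
import math
from itertools import combinations


def triangle_area(p, q, r):
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2.0


def split_points(points):
    """ST101: indices of the three points spanning the largest triangle, and the fourth."""
    best = max(combinations(range(4), 3),
               key=lambda t: triangle_area(*(points[i] for i in t)))
    rest = next(i for i in range(4) if i not in best)
    return best, rest


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(rows, rhs):
    """Solve a 3 x 3 linear system by Cramer's rule."""
    d = det3(rows)
    solution = []
    for k in range(3):
        m = [list(r) for r in rows]
        for i in range(3):
            m[i][k] = rhs[i]
        solution.append(det3(m) / d)
    return solution


def fit_affine(points, velocities):
    """ST102: solve u = u0 + ux*x + uy*y and v = v0 + vx*x + vy*y from three points."""
    rows = [(1.0, x, y) for x, y in points]
    u_params = solve3(rows, [u for u, _ in velocities])
    v_params = solve3(rows, [v for _, v in velocities])
    return u_params, v_params


def virtual_parallax(points, velocities):
    """ST103: velocity of the fourth point minus the affine-interpolated velocity there."""
    ref, rest = split_points(points)
    (u0, ux, uy), (v0, vx, vy) = fit_affine([points[i] for i in ref],
                                            [velocities[i] for i in ref])
    x, y = points[rest]
    du = velocities[rest][0] - (u0 + ux * x + uy * y)
    dv = velocities[rest][1] - (v0 + vx * x + vy * y)
    return du, dv, math.atan2(dv, du)
```

The returned angle is the direction of the parallax vector; whether it points along A or opposite to it depends on whether the fourth point is nearer or farther than the reference plane, so in practice the direction is only determined up to that sign.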
[0056]
Using Equations (15), (16), (17), and (18), the invariant feature quantities curl, div, and def are obtained from the three reference points. Each of these invariant feature quantities is the sum of the change caused by the translational or rotational motion of the object and the change caused by the object surface orientation together with the motion on the image plane (step ST104).
[0057]
From the deformation main axis μ and the projection θA of the translational motion onto the image plane, the inclination τ of the plane determined by the three reference points is obtained using equation (37) (step ST105).
[0058]
From Equation (35), the expansion and contraction of the shape caused by the relationship between the surface orientation and the motion on the image plane is subtracted.
[0059]
Using the values obtained so far, F · A = |defV| cos(τ − θA) is subtracted from Equation (35). The remaining component indicates the change in the scale of the image due to the motion of the object along the line-of-sight direction. From this, the time tc until collision is obtained (step ST106).
[0060]
Similarly, the influence of the surface orientation and the motion on the image plane is subtracted from Equation (34). When F × A = |defV| sin(τ − θA) is subtracted from Equation (34) using the values obtained so far, the remaining component is due only to the rotation around the line-of-sight direction between the object and the camera (step ST107).
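Steps ST105 to ST107 can be sketched as below. The text gives F · A = |defV| cos(τ − θA) and F × A = |defV| sin(τ − θA); everything else here is an assumption, in particular that μ = (θA + τ)/2, that the remainder of div equals 2 U·Q / λ, and that the remainder of curl equals −2 Ω·Q. The function name is hypothetical.

```python
import math


def surface_and_motion(div, curl, deformation, mu, theta_a):
    """Recover tilt, time to collision, and line-of-sight rotation from the invariants.

    Assumed relations (not confirmed by the source):
      mu   = (theta_a + tau) / 2
      div  = 2 * (U.Q / lambda) + F.A
      curl = -2 * (Omega.Q) + F x A
    """
    tau = 2.0 * mu - theta_a                               # ST105
    f_dot_a = deformation * math.cos(tau - theta_a)
    f_cross_a = deformation * math.sin(tau - theta_a)
    scale_rate = div - f_dot_a                             # ST106: 2 U.Q / lambda
    tc = 2.0 / scale_rate if scale_rate != 0.0 else math.inf
    omega_q = -(curl - f_cross_a) / 2.0                    # ST107: rotation about the line of sight
    return tau, tc, omega_q
```

Because μ is an axis direction defined only modulo π, the recovered τ carries a corresponding ambiguity; the sketch does not try to resolve it.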
[0061]
In this way, the affine flow analysis unit 3 calculates the motion parameters of the observer that can be computed stably from the image information, such as the translational motion direction θA, the scale change tc, and the rotation Ω · U around the line-of-sight direction, and outputs them to the pointing information generation unit 4 (step ST108).
[0062]
As described above, the pointing information generation unit 4 receives time-series data of feature point coordinates from the feature point extraction unit 2, selects the feature point coordinates at two appropriate points in time, has the affine flow analysis unit 3 calculate the motion parameters, and generates the information necessary for pointing in three-dimensional space. This process is described below with reference to the flowchart of FIG. 10.
[0063]
First, feature point coordinates (if there is auxiliary information, also auxiliary information) are input from the feature point extraction unit 2. The number of input feature points is n, and the coordinates are (xi, yi) (step ST201).
[0064]
Since the object being imaged is moving, feature points may be hidden by other portions and disappear, or hidden feature points may appear. If the number of feature points is less than 4, nothing is done. If the number of feature points is 4 or more, four evenly positioned feature points are selected from among those common to the feature points extracted when the previous pointing information was generated and the feature points extracted this time (step ST202).
[0065]
For each selected feature point, the movement distance ((xi − lxi)² + (yi − lyi)²) from the coordinate value (lxi, lyi) at the time the point was last used for pointing is calculated and compared with a threshold. If auxiliary information such as the size of the feature is available, the threshold is determined using that value. If the selected feature point has not been used for pointing before, xi and yi are substituted into lxi and lyi. If at least one of the four movement distances is equal to or greater than the threshold, the subsequent processing is performed; if all are below the threshold, the process returns to step ST201 (step ST203).
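The check in step ST203 can be sketched in a few lines. The function name and the use of `None` to mark a point never used for pointing are illustrative assumptions; the squared distance is compared with the threshold directly, as written in the text.

```python
def should_generate_pointing(current, last, threshold):
    """ST203 sketch: decide whether any selected point has moved far enough.

    current: list of (xi, yi) for the selected feature points.
    last:    list of (lxi, lyi), or None for a point not used for pointing before.
    Returns (proceed, filled_last), where filled_last has None replaced by the
    current coordinates as described in the text.
    """
    filled = [prev if prev is not None else cur for cur, prev in zip(current, last)]
    distances = [(x - lx) ** 2 + (y - ly) ** 2
                 for (x, y), (lx, ly) in zip(current, filled)]
    proceed = any(d >= threshold for d in distances)
    return proceed, filled
```

A point seen for the first time contributes a distance of zero, so it can never trigger a pointing event on its own in the frame where it first appears.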
[0066]
The four past coordinate values (lxi, lyi) and the current coordinate values (xi, yi) obtained in this way are input to the affine flow analysis unit 3 to calculate motion parameters (step ST204).
[0067]
The motion parameters calculated by the affine flow analysis unit 3 are parameters for the case where the object is assumed to be stationary and the observer (camera) is moving. When these are converted into values representing the motion of the object, the motion of the center of gravity represents the translational motion of the object in the X and Y directions, tc, which represents the change in scale, represents the translational motion in the Z direction, Ω · U represents the rotational motion around the Z axis, and θA represents the ratio of the rotational motion around the X axis to the rotational motion around the Y axis. Each of these parameters is compared with a threshold, and if there is a motion larger than the threshold, a pointing event that moves the object by a fixed amount in the motion indicated by that parameter is generated (step ST205). The direction of the pointing event is determined so that the motion of the pointer seen on the screen matches the direction of motion the user sees when looking at his or her own hand or the instrument used for pointing.
[0068]
The pointing event generated here is sent to the pointer display unit 8 and the gesture pattern matching unit 6. The affine flow analysis unit 3 expresses the parameters that cannot be calculated without assuming central projection as relative values using the depth λ. To obtain the absolute motion of the object, the pointing information generation unit 4 calculates the position and orientation by substituting the parameters expressed as relative values using λ into the central projection equation (11), and also outputs this information to the pointer display unit 8 (step ST206).
[0069]
In accordance with an instruction from the status switching unit 5 described later, the pointer display unit 8 selects either a 3D model of a pointer stored in the pointer model storage unit 12, for example one whose orientation in three dimensions is easy to recognize as shown in FIG. 11, or the 3D model designated by the status switching unit 5 from among the models stored in the 3D model storage unit 9. It then generates and outputs graphics image information in which the selected 3D model has been translated and rotated from its current position and orientation according to the input pointing event.
[0070]
As described above, the pointer model storage unit 12 stores the 3D model of the pointer together with its current position and orientation. The 3D model storage unit 9 stores the 3D models currently displayed in the image together with their positions and orientations.
[0071]
The gesture pattern matching unit 6 compares the time-series list of the most recent pointing events input from the pointing information generation unit 4, taken as a pattern not interrupted by keyboard input from the user, with the gesture patterns stored in the gesture pattern storage unit 7, and determines whether the user's operation of the pointer is a movement with a registered meaning. If there is a matching gesture pattern, the operation stored with that pattern is executed.
[0072]
In the gesture pattern storage unit 7, the gesture patterns are stored in a table with a list structure as shown in FIG. One gesture consists of a gesture pattern and a character string indicating the operation to be invoked when the gesture occurs. One gesture pattern is represented by a list of pointing events, and one pointing event represents each of the translational motion parameters {U₁, U₂, U₃} and the rotational motion parameters {Ω₁, Ω₂, Ω₃} by one of three symbols: + or − indicating motion in the positive or negative direction, or 0 indicating no motion. In FIG. 12, each element of the gesture pattern list has two successors indicating the next element of the list, one of which forms a closed loop back to the element itself. This closed loop is the mechanism that allows the same pointing event to be repeated. The variable n next to the successor indicates that all four closed loops repeat the same number of times. In the example of FIG. 12, the gesture pattern represents a square of arbitrary size on the xy plane, and the operation "/usr/bin/X11/kt" is invoked by this gesture.
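The symbol encoding and the repeat-n matching described above can be sketched as follows. This is an illustrative simplification: the closed-loop list structure of FIG. 12 is replaced by a flat list of events that must each repeat the same number of times, and the names `quantize`, `matches_repeated`, `recognize`, `SQUARE_XY_PATTERN`, and `GESTURES` are hypothetical. The command string is taken from the text and is only returned, never executed.

```python
def quantize(params, threshold):
    """Map six motion parameters (U1, U2, U3, W1, W2, W3) to '+', '-', or '0'."""
    return tuple('+' if p > threshold else '-' if p < -threshold else '0'
                 for p in params)


def matches_repeated(events, pattern):
    """True if events equal each pattern element repeated the same n >= 1 times, in order."""
    if not pattern or not events or len(events) % len(pattern) != 0:
        return False
    n = len(events) // len(pattern)
    expected = [event for event in pattern for _ in range(n)]
    return list(events) == expected


# Square of arbitrary size on the xy plane: +x, +y, -x, -y, each repeated n times.
SQUARE_XY_PATTERN = [
    ('+', '0', '0', '0', '0', '0'),
    ('0', '+', '0', '0', '0', '0'),
    ('-', '0', '0', '0', '0', '0'),
    ('0', '-', '0', '0', '0', '0'),
]

GESTURES = [(SQUARE_XY_PATTERN, "/usr/bin/X11/kt")]


def recognize(events, gestures):
    """Return the operation string of the first matching gesture, or None."""
    for pattern, operation in gestures:
        if matches_repeated(events, pattern):
            return operation
    return None
```

Requiring the same n for every side is what makes the pattern a square rather than an arbitrary rectangle.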
[0073]
The status switching unit 5 stores the current pointer state, which is one of two states: the pointer operation state, in which the pointer is moved freely in the three-dimensional space shown on the display, and the model grasping state, in which one of the displayed models has been designated by the pointer and its position or orientation is changed. It switches the pointer state according to an instruction from the user or from the pointer display unit, and changes the settings accordingly.
[0074]
In the pointer operation state, the pointer display unit 8 is instructed to use the model stored in the pointer model storage unit, and the position and orientation of the pointer model are changed according to the generated pointing events. The pointing events are also input to the gesture pattern matching unit for gesture recognition, and if a gesture is recognized in the event sequence, the operation corresponding to that gesture is executed. The pointer state is switched to the model grasping state when instructed by the user's keyboard input or gesture, or when the three-dimensional position of the pointer coincides with the position of one of the 3D models stored in the 3D model storage unit.
[0075]
In the model grasping state, the model stored in the 3D model storage unit 9 is displayed with its position and orientation changed. First, on entering the model grasping state, the designated 3D model is taken out of the 3D model storage unit 9 and sent to the pointer display unit 8, which is instructed to display it in a different color so that it can be distinguished from the other models. Next, the position and shape of the model are input to the image composition unit 10, which synthesizes, from the image information of the 3D model and the input image information, an image in which the model appears to be held in the hand or attached to the pointing instrument, and displays it on the image display unit 11. Unlike in the pointer operation state, the movement and rotation of the model use the position and orientation information calculated by applying the central projection, rather than the pointing events.
[0076]
The image composition unit 10 first cuts out a closed region surrounding the feature points from the image information input from the image input unit 1, thereby extracting the portion showing the hand or instrument. Next, the extracted image is flipped left to right so that, as seen by the user, the direction of motion on the image matches the actual motion of the hand or instrument. Based on the input information, such as the position and shape of the 3D model and the coordinates of the feature points, processing such as translation and scale change is applied to the image of the hand or instrument so that the positions of the feature points are fitted to the vertices of the model. Thereafter, the graphics image of the model and the image of the hand or instrument are superimposed semi-transparently, and an image in which the model appears to be held is synthesized and output to the image display unit 11, as shown in FIG.
[0077]
Another embodiment of the present invention will be described with reference to FIG.
[0078]
The moving image input unit 21 captures an image while the camera (observer) performs an unknown motion in the three-dimensional space, and transfers image information corresponding to the captured moving image to the feature point extraction unit 22. The shooting environment in which the moving image input unit 21 is shooting is basically a stationary environment, but may include a moving object.
[0079]
The feature point extraction unit 22 receives the time-series image information from the moving image input unit 21 and, by image processing, extracts a large number of feature points at which brightness or color changes sharply relative to the neighboring region and which can be identified in two images as projections of the same point of the object. It inputs the extracted feature points to the translation flow calculation unit 23.
[0080]
The translation flow calculation unit 23 compares the coordinates of the input feature points to form a network connecting each set of four nearest points, and performs processing similar to that of the affine flow analysis unit 3 on every combination of four nearest points to obtain the virtual parallax and calculate the motion parameters of the observer. Among the calculated motion parameters, the translational motion direction θA indicates the direction in which the camera is translating relative to the imaging environment. In a static environment, every combination of four feature points therefore shows the same value. In practice, because central projection is approximated by an affine transformation only within a narrow field of view, the same value is shown only by combinations of feature points that are close to one another. Therefore, only the translational motion direction θA is extracted from the combinations of feature points distributed over the entire image, and a distribution diagram as shown in FIG. 15 is created and output. The arrows in FIG. 15 show the motion of several points on each of two objects.
[0081]
The independent motion dividing unit 24 compares the translational motion directions θA of neighboring combinations of feature points in the translational motion direction flow diagram calculated by the translation flow calculation unit 23, and divides the region wherever the difference is larger than a certain threshold. As a result, a region showing an object that moves differently from the moving background image can be extracted, such as the region surrounded by a solid line in FIG. Thereafter, the regions that move differently are removed, the motion parameters other than θA are calculated from the remaining region showing the background image, and the motion of the camera is obtained and output.
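The region division by θA can be sketched as a connected-component labeling in which neighboring cells are merged only when their directions differ by no more than the threshold. The data layout (a dict of directions keyed by cell id and a symmetric neighbor dict) and the function names are assumptions made for illustration; the patent does not specify how the regions are represented.

```python
import math
from collections import deque


def angle_diff(a, b):
    """Absolute difference between two angles in radians, wrapped to [0, pi]."""
    return abs((a - b + math.pi) % (2.0 * math.pi) - math.pi)


def segment_by_direction(theta, neighbors, threshold):
    """Label cells so that neighbors with similar theta_A share a label.

    theta:     dict cell -> translational motion direction theta_A (radians).
    neighbors: dict cell -> iterable of adjacent cells (assumed symmetric).
    """
    labels = {}
    next_label = 0
    for start in theta:
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbors.get(node, ()):
                if other not in labels and angle_diff(theta[node], theta[other]) <= threshold:
                    labels[other] = next_label
                    queue.append(other)
        next_label += 1
    return labels
```

Which label corresponds to the background is not decided here; one plausible choice, not stated in the source, is to treat the largest region as background and the rest as independently moving objects.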
[0082]
Another embodiment of the present invention will be described with reference to FIG.
[0083]
The image input unit 31 inputs image information corresponding to images obtained by photographing one object from a plurality of directions. The image information input here need not be temporally continuous. In addition, the positional relationship of the observer at the times the object was photographed is unknown.
[0084]
The feature point extraction unit 32 performs image processing such as point extraction on the image information input from the image input unit 31, extracts feature point information for a large number of feature points at which brightness or color changes sharply relative to the neighboring region, and outputs it to the initial correspondence search unit 33. The extracted feature point information is superimposed on the input image information and displayed on the image display unit 36.
[0085]
The initial correspondence search unit 33 compares the feature point information among a plurality of images and checks whether the same point on the object being imaged is projected. First, a flag indicating non-correspondence is attached to all feature point information input from the feature point extraction unit 32. Next, correlation matching of small regions centered on the feature points is performed between images, feature point information whose correlation coefficient is higher than a certain threshold is associated with each other, and a flag indicating correspondence is attached to the associated feature point information. The feature point set information is then output to the correspondence correction unit, that is, the correspondence feature update unit 37. The corresponding feature point information is superimposed on the input image information on the image display unit 36, in a different color so that it can be distinguished from non-corresponding feature point information, and in such a way that it is clear which feature point information corresponds to which.
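The correlation matching of small regions can be sketched as below, using the normalized correlation coefficient of two square patches. Images are plain lists of rows; the function names and the rule of keeping only the best candidate above the threshold are assumptions, since the text states only that points whose correlation exceeds a threshold are associated.

```python
import math


def patch(image, cx, cy, half):
    """Square patch of side 2*half+1 centered on (cx, cy); image is a list of rows."""
    return [row[cx - half:cx + half + 1] for row in image[cy - half:cy + half + 1]]


def correlation(patch_a, patch_b):
    """Normalized correlation coefficient of two equally sized patches."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = math.sqrt(sum((p - mean_a) ** 2 for p in a) * sum((q - mean_b) ** 2 for q in b))
    return num / den if den else 0.0


def match_features(img_a, pts_a, img_b, pts_b, half, threshold):
    """Map each index in pts_a to the best-correlated index in pts_b above threshold."""
    matches = {}
    for i, (xa, ya) in enumerate(pts_a):
        best_j, best_c = None, threshold
        for j, (xb, yb) in enumerate(pts_b):
            c = correlation(patch(img_a, xa, ya, half), patch(img_b, xb, yb, half))
            if c > best_c:
                best_j, best_c = j, c
        if best_j is not None:
            matches[i] = best_j
    return matches
```

Points missing from the returned dict keep the non-correspondence flag in the terms of the text. The sketch assumes every feature point lies at least `half` pixels from the image border.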
[0086]
The interface unit 34 receives the feature point set information, including the feature points with established correspondence, from the initial correspondence search unit 33, and corrects the correspondence between the feature points. If the correspondence created by the initial correspondence search unit 33 is sufficiently accurate and has few errors, the later processing can be performed without correcting the corresponding points.
[0087]
On the input images displayed on the image display unit 36, on which the feature points resulting from the initial correspondence search are shown, a cursor controlled by a pointing device such as a mouse is displayed so that the user can select feature points. When a feature point that already has a corresponding flag is selected, the flags of that feature point and of the feature point corresponding to it are changed to non-corresponding, and the correspondence is canceled. When feature points with a non-corresponding flag are selected one at a time in each of a plurality of input images, a flag indicating that correspondence has been established is attached to them, and a correspondence is set between those feature points. Even in an area without feature points, if pixels are selected one at a time in a plurality of images in succession, feature points with established correspondence are generated at the coordinates of those pixels and the correspondence is set. In addition to the images and feature points, buttons for updating corresponding points, extracting the structure of the object, and synthesizing an image are displayed so that they can be selected with the pointing device. When the user selects the corresponding point update button, the feature point set information is passed to the corresponding feature update unit 37 to update the feature point correspondences. When the object structure extraction button is selected, the set of feature points is passed to the structure extraction unit to extract the three-dimensional structure of the photographed object. When the image composition button is selected, the user is further asked from which viewpoint the image is to be synthesized, and the translational and rotational motion vectors to that viewpoint are input. The input images, the set of feature points, the three-dimensional structure of the object extracted by the structure extraction unit 38, and the motion vectors are then passed to the image composition unit 35, which synthesizes the image and displays it on the image display unit.
[0088]
The image display unit 36 displays the plurality of input images, the feature points, the 3D model of the object structure extracted from the images, the image synthesized by changing the line-of-sight direction, and so on, and allows feature points on them to be specified.
[0089]
The corresponding feature update unit 37 receives the feature point set information from the interface unit 34 and associates the non-corresponding feature points according to a new criterion. First, from the input feature point set information, a feature point whose correspondence has been determined and which carries a corresponding flag is selected (this is designated as point A0), and a feature point B with a non-corresponding flag in the vicinity of A0 is selected. A feature point B′ in the vicinity of the feature point A0′, which corresponds to A0 in another image, is compared with the feature point B to determine whether they correspond. In comparing B and B′, if there are two or fewer corresponding feature points in the vicinity of both, correlation matching of small regions centered on the feature points is performed as in the initial correspondence search, and correspondence is judged using a threshold lower than that used in the initial correspondence search unit 33. As shown in FIG. 17, if there are three or more corresponding feature points in the vicinity of both B and B′, matching is performed taking image deformation into account. First, the affine transformation that maps the triangle A0, A1, A2 onto the triangle A0′, A1′, A2′ is calculated. The small region centered on B is deformed by the calculated affine transformation, correlation matching is performed between the deformed small region and the small region centered on B′, and the judgment is made with the same threshold as in the case where there are two or fewer corresponding points in the vicinity. As in the initial correspondence search unit 33, a flag indicating correspondence is attached to the feature points found in this way, and the feature points are linked by a correspondence relationship. This process is repeated until no new corresponding feature points are found, and control then returns to the interface unit 34.
[0090]
The structure extraction unit 38 receives a plurality of pieces of image information obtained by photographing an object of unknown shape, together with the set of feature points associated across them, and from this information extracts and outputs a three-dimensional shape model of the object and the texture pattern of the object surface. First, a combination of four neighboring points is extracted from the feature point set information, and the four points are processed by the affine flow analysis unit 3 to calculate the motion parameters. As shown in FIG. 7, these four points are divided into three points forming a triangle and one other point P. The point P′ is a virtual point that has the same coordinates as the point P in one image and moves, in the other image, according to the affine transformation expressed by the motion of the other three points. In three-dimensional space, this point P′ lies on the plane determined by the three points and is projected onto the same image coordinates as the point P. The magnitude of the virtual parallax PP′ is given by Equation (12):
[0091]
[Expression 20]

[0092]
Here, Z_P is the Z coordinate of the point P and Z_P′ is the Z coordinate of the point P′. Now consider the situation of FIG. 18, in which another feature point Q in the vicinity of the four points is added. Like the point P′, the point Q′ is a virtual point that moves according to the affine transformation determined by the motion of the other three points, and it also lies on the three-dimensional plane determined by those three points. Considering the two virtual parallaxes PP′ and QQ′, since they are close together in the image and the distance from the observer to the object is sufficiently large, the ratio of their lengths is (Z_P − Z_P′) / (Z_Q − Z_Q′), from which the ratio of the depth offsets of P and Q from this three-dimensional plane can be obtained. This process is performed for all combinations of neighboring feature points to calculate a 3D model of the object in which the length in the depth direction is expressed as a ratio to a certain length λ. After this, a complete 3D model of the object is obtained, for example by displaying the 3D model corresponding to a given λ graphically and having the user adjust the value of λ while viewing it, or by finding a part where planes intersect on the surface of the 3D model and estimating the magnitude of λ so that the intersection angle is a right angle. Once the magnitude of λ has been calculated, the gradient of each triangular object surface enclosed by three neighboring feature points on the image is obtained from Equation (31). From this, the texture information of the object surface can also be extracted.
[0093]
The structure extraction unit 38 outputs the 3D model and texture information of the object calculated in this way.
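The right-angle criterion for estimating λ can be illustrated with a small sketch. It rests on assumptions not stated in the text: each planar face is modeled as Z = λ(p·x + q·y + c), where (p, q) is the gradient of the relative depth recovered before λ is known, and the projection is treated as approximately parallel. Under that model the face normals are (−λp, −λq, 1), and requiring two normals to be perpendicular gives λ²(p₁p₂ + q₁q₂) + 1 = 0. The function name is hypothetical.

```python
import math

def lambda_from_right_angle(grad1, grad2):
    """Solve lambda^2 * (p1*p2 + q1*q2) + 1 = 0 for a positive lambda.

    grad1, grad2: (p, q) gradients of relative depth for two intersecting faces.
    """
    dot = grad1[0] * grad2[0] + grad1[1] * grad2[1]
    if dot >= 0:
        # Under this model no positive lambda can make the faces perpendicular.
        raise ValueError("gradients do not admit a right-angle solution")
    return math.sqrt(-1.0 / dot)

# Two faces whose relative depths slope in opposite directions along x.
lam = lambda_from_right_angle((2.0, 0.0), (-2.0, 0.0))
```

In this example the scaled faces become Z = x and Z = −x, which do meet at a right angle.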
[0094]
The image composition unit 35 deforms the input image using the three-dimensional structure of the object and the motion vector to the new viewpoint, and synthesizes an image of the object as seen from that viewpoint. The deformation of the image when the observer moves is expressed by Expressions (34) to (36) in terms of the translational and rotational motion vectors of the observer, the gradient of the object surface, and the translational motion on the image plane. The observer's motion and the translational motion on the image plane are calculated from the motion vector to the viewpoint, and the gradient of the object surface is obtained from the three-dimensional structure of the object, so an affine transformation matrix representing the deformation can be calculated. First, the area of the input image in which the target object appears is divided into triangular patches bounded by straight lines connecting the feature points in that area. The above affine transformation is then applied to the image of each triangular patch to create a new triangular patch image. The created patch images are joined to form an image of the object as viewed from the new line-of-sight direction, and this image is displayed on the image display unit 36.
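The per-patch warping step can be sketched as follows. This is an assumption-laden illustration: the affine transformation is taken from the correspondence between the source and destination triangle vertices instead of being derived from Expressions (34) to (36), sampling is nearest-neighbor, and the function names are hypothetical.

```python
import numpy as np

def affine_from_triangles(tri_from, tri_to):
    """Return the 2x3 matrix [A | b] mapping the vertices of tri_from onto tri_to."""
    M = np.hstack([np.asarray(tri_from, dtype=float), np.ones((3, 1))])
    return np.linalg.solve(M, np.asarray(tri_to, dtype=float)).T

def warp_triangle(src_img, tri_src, tri_dst, out_img):
    """Fill the destination triangle in out_img with pixels pulled from src_img."""
    T_inv = affine_from_triangles(tri_dst, tri_src)      # destination -> source
    h, w = out_img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xs, ys = xs.ravel(), ys.ravel()
    pts = np.stack([xs, ys, np.ones(h * w)])             # homogeneous pixel coords, 3xN
    # Barycentric coordinates with respect to the destination triangle.
    B = np.vstack([np.asarray(tri_dst, dtype=float).T, np.ones(3)])
    bary = np.linalg.solve(B, pts)
    inside = np.all(bary >= -1e-9, axis=0)               # pixels inside or on the edge
    src_xy = T_inv @ pts[:, inside]
    sx = np.clip(np.rint(src_xy[0]).astype(int), 0, src_img.shape[1] - 1)
    sy = np.clip(np.rint(src_xy[1]).astype(int), 0, src_img.shape[0] - 1)
    out_img[ys[inside], xs[inside]] = src_img[sy, sx]
    return out_img

# Illustrative data (assumed): shift one triangular patch by (1, 1) pixels.
src = np.arange(16).reshape(4, 4)
out = warp_triangle(src,
                    [(0, 0), (3, 0), (0, 3)],
                    [(1, 1), (4, 1), (1, 4)],
                    np.full((5, 5), -1))
```

Calling `warp_triangle` once per patch with a shared output image joins the patches into the synthesized view. Pulling pixels from the source through the inverse mapping avoids leaving holes inside a patch.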
[0095]
[Effect of the invention]
According to the present invention, it is possible to extract, from a moving background image, a region showing an object that is moving differently from the background, remove that region, and calculate from the remaining region showing the background image the motion parameters other than the projection θA of the translational motion onto the image plane, thereby determining the camera motion.
[Brief description of the drawings]
FIG. 1 is a block diagram of a moving image processing apparatus which is an embodiment of the present invention and is used for a human interface such as a model operation in a CAD system.
FIG. 2 is a diagram of a first example instrument for use in pointing.
FIG. 3 is a diagram of a second example instrument for use in pointing.
FIG. 4 is a diagram of a third example instrument used for pointing.
FIG. 5 is a diagram showing a relationship between three-dimensional coordinates and image coordinates in an imaging form by center projection.
FIG. 6 is a diagram illustrating invariant feature amounts that can be extracted from an image.
FIG. 7 is a diagram illustrating motion parallax.
FIG. 8 is a view for explaining a method of expressing the inclination of the object surface.
FIG. 9 is a flowchart illustrating the operation of an affine flow analysis unit.
FIG. 10 is a flowchart illustrating an operation of a pointing information generation unit.
FIG. 11 is a diagram showing a 3D (three-dimensional) model of a pointer.
FIG. 12 is a diagram showing a data structure of gesture patterns stored in a gesture pattern storage unit.
FIG. 13 is a view showing an image obtained by superimposing a part of an input image with a 3D model.
FIG. 14 is a block diagram of a system for discriminating independent movement according to another embodiment of the present invention.
FIG. 15 is a distribution map of translational motion directions.
FIG. 16 is a configuration diagram of a moving image processing apparatus according to another embodiment of the present invention, which is applied to a system that acquires a 3D model from a plurality of images and synthesizes an image when an object is viewed from another direction.
FIG. 17 is a diagram illustrating matching in consideration of image deformation by affine transformation.
FIG. 18 is a diagram illustrating a relationship between two virtual parallaxes.
[Explanation of symbols]
1 ... Image input unit
2 ... Feature point extraction unit
3 ... Affine flow analysis unit
4 ... Pointing information generation unit
5 ... Status switching unit
6 ... Gesture pattern matching unit
7 ... Gesture pattern storage unit
8 ... Pointer display unit
9 ... 3D model storage unit
10 ... Image composition unit
11 ... Image display unit
12 ... Pointer model storage unit
21 ... Moving image input unit
22 ... Feature point extraction unit
23 ... Translational flow calculation unit
24 ... Independent motion dividing unit
31 ... Image input unit
32 ... Feature point extraction unit
33 ... Initial correspondence search unit
34 ... Interface unit
35 ... Image composition unit
36 ... Image display unit
37 ... Corresponding feature update unit
38 ... Structure extraction unit

Claims

A moving image processing apparatus comprising: image input means for inputting image information corresponding to an image; feature point extraction means for performing feature point extraction processing on the image information in order to extract, from the image, at least three first feature points lying on the same plane and one second feature point not lying on that plane; a translational flow calculation unit that obtains a virtual point located at the same coordinates as the second feature point and having a moving velocity determined by the affine transformation of the moving velocities of the first feature points, obtains a virtual parallax based on the difference between the virtual point and the second feature point, and obtains a translational motion direction from the direction of the virtual parallax; and an independent motion dividing unit that divides regions undergoing different motions by using the translational motion direction calculated by the translational flow calculation unit.