JP3788969B2

JP3788969B2 - Real-time facial expression tracking device

Info

Publication number: JP3788969B2
Application number: JP2002311660A
Authority: JP
Inventors: 昭二田中; 聡田中
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2002-10-25
Filing date: 2002-10-25
Publication date: 2006-06-21
Anticipated expiration: 2021-09-28
Also published as: JP2003178311A

Description

【０００１】
【発明の属する技術分野】
本発明は、本人の顔を送信する代わりにＣＧキャラクタの映像を相手に送信することによって人物映像を互いに通信するテレビ電話など通信システムに適用され、特にカメラによって撮像された顔の映像から頭部の３次元的な姿勢情報と顔の表情を計測し、この計測結果に基づいてＣＧキャラクタの動きを制御する代理応答によるリアルタイム表情追跡装置に関するものである。
【０００２】
【従来の技術】
例えば、図３０は、特許文献１に示された従来の仮想変身装置（第１の従来技術）を示すものであり、この仮想変身装置は、顔画像を入力するビデオカメラと、ビデオカメラを回転させる電動雲台と、ビデオカメラから入力された顔画像から顔の軸の回転、あるいは顔の軸周りの回転と視線方向を検出し、両目および口の形状変化を検出する顔画像認識装置と、この計測結果に基づいてＣＧ（コンピュータグラフィックス）で構築された仮想空間のキャラクタを制御する仮想環境合成装置とを備えている。
【０００３】
この第１の従来技術では、ビデオカメラから入力された顔画像を、予め設定したＲＧＢ空間上に構築された肌色モデルに従って肌色を１、肌色以外を０とする２値化処理を行う。次に、２値化した顔領域の重心を求め、重心が画像の中心になるように電動雲台装置を制御し、カメラのアングルを修正する。次に、重心位置に基づき顔領域内に存在する穴を両目および口として検出する。次に、予め設定したテンプレートを用いたテンプレートマッチングにより目領域を追跡し、黒目の位置から視線方向を求める。また、両目を結んだ直線と画像の水平軸との角度を計測し、さらに、両目間の距離から、顔の軸周りの回転を検出する。そして、両目および口の周囲画像を離散コサイン変換したときの各周波数帯域での電力変化を捉えることで、両目および口の形状変化を計測する。以上の計測結果に基づいてＣＧで構築された仮想空間のキャラクタの頭部および表情を制御する。
【０００４】
また、特許文献２の表情検出装置（第２の従来技術）では、連続する各フレームの画像において、選択した複数の特徴点を追跡し、各フレーム毎に前記複数の特徴点を頂点とするドロネー網を構成し、このドロネー網を用いて表情筋モデルを特徴点の移動に基づき変位させることにより、表情筋モデルの変化を求めるようにしている。
【０００５】
また、特許文献３（第３の従来技術）においては、大きさが固定のウィンドウマスクを画像全体に走査し、マスク内の輝度分散を正規化することにより、照明条件が変化しても安定して対象物の特徴量を抽出可能とした対象物検出装置に関する発明が開示されている。
【０００６】
【特許文献１】
特開２０００−３３１１９０号公報
【特許文献２】
特開２０００−２５９８３１号公報
【特許文献３】
特開平１１−３０６３４８号公報
【０００７】
【発明が解決しようとする課題】
第１の従来技術では、カメラで撮影した顔画像を肌色モデルに基づいて２値化し、顔領域内の穴を見つけ、顔領域の重心位置からそれらを目および口に対応させている。しかしながら、本来、顔の凹凸から生じる影やハイライトの影響があるので、第１の従来技術では、照明条件を慎重に設定しなければ目および口のみを穴として検出するのは非常に困難である。また、この第１の従来技術は、頭部の３軸（Ｘ軸、Ｙ軸、Ｚ軸）周りの回転を同時に計測することができず、さらに、顔の軸周りの回転を、両目間の距離により求めているため、例えば顔がカメラから遠ざかるあるいは近づくと、必然的に両目間の距離が変化することから、実際には回転させていないのにも関わらず、回転しているとみなされるなど問題があった。
【０００８】
また、第２の従来技術では、３次元姿勢を計測するために顔画像中の多数の特徴点を追跡する必要があるため、計算能力の低いハードウェアではリアルタイム処理が困難である問題があった。
【０００９】
また、第３の従来技術では、大きさが固定されたマスク領域を用いることから、個人差や撮影距離によって顔領域の大きさが変化することへの対応処理が困難である。
【００１０】
この発明は上記に鑑みてなされたもので、計算能力が低いハードウェアでも実時間で頭部の３次元的な動きを計測し、かつ両目および口の開閉状態を計測し、その結果を用いてＣＧキャラクタの頭部の動きおよび表情をリアルタイム制御するリアルタイム表情追跡装置を得ることを目的としている。
【００１１】
【課題を解決するための手段】
上記目的を達成するため、この発明にかかるリアルタイム表情追跡装置は、順次所定のフレームレートで入力される映像をキャプチャする映像入力手段と、前記キャプチャした画像から頭部画像を抽出する頭部領域検出手段と、前記抽出した頭部領域から両目および口を含む各部位の候補領域を抽出する部位領域候補抽出手段と、抽出した候補領域の中から各部位の位置を検出する部位検出追跡手段と、前記検出した両目、口の検出位置に基づいて頭部の３次元姿勢を計測するとともに、両目および口の開閉状態を計測する頭部３次元姿勢・表情計測手段とを備え、前記計測した頭部の３次元姿勢および両目および口の開閉状態に基づいてＣＧキャラクタの動きを制御するリアルタイム表情追跡装置であって、前記頭部３次元姿勢・表情計測手段は、最初に検出した両目および口の位置から３次元空間上の仮想平面を設定するアフィン基底設定手段と、前記検出した両目および口位置から頭部の左右および上下方向の回転量を推定する頭部回転量推定手段と、前記検出した両目および口位置から得た４点の座標を結ぶ矩形を前記推定した頭部の左右および上下方向の回転量を用いて歪ませ、該歪ませた矩形の４点の座標を用いて頭部の３次元姿勢を推測する姿勢計測手段と、頭部の動きに応じて両目および口の開閉状態を推測する開閉状態計測手段とを備えることを特徴とする。
【００１２】
この発明によれば、最初に検出した両目および口の位置から３次元空間上の仮想平面を設定し、検出した両目および口位置から頭部の左右および上下方向の回転量を推定し、前記検出した両目および口位置から得た４点の座標を結ぶ矩形を前記推定した頭部の左右および上下方向の回転量を用いて歪ませ、該歪ませた矩形の４点の座標を用いて頭部の３次元姿勢を推測し、さらに頭部の動きに応じて両目および口の開閉状態を推測するようにしている。
【００１３】
つぎの発明にかかるリアルタイム表情追跡装置は、上記の発明において、前記頭部回転量推定手段は、両目領域を結ぶ直線をＸ軸とし、Ｘ軸に垂直で口領域の中心位置を通る直線をＹ軸とした頭部のローカル座標系を設定し、このローカル座標系において求めた頭部領域の外接矩形の左右の辺と片目との相対距離と、外接矩形の上下の辺と口領域との相対距離から頭部の左右、上下方向の回転量をそれぞれ推定することを特徴とする。
【００１４】
この発明によれば、両目領域を結ぶ直線をＸ軸とし、Ｘ軸に垂直で口領域の中心位置を通る直線をＹ軸とした頭部のローカル座標系を設定し、このローカル座標系において求めた頭部領域の外接矩形の左右の辺と片目との相対距離と、外接矩形の上下の辺と口領域との相対距離から頭部の左右、上下方向の回転量をそれぞれ推定するようにしている。
【００１５】
つぎの発明にかかるリアルタイム表情追跡装置は、上記の発明において、前記開閉状態計測手段は、前記姿勢計測手段によって推定した頭部の３次元姿勢情報を用いて対象人物が正面を向いたときの両目および口領域を再現することに基づき両目および口の開閉状態を計測することを特徴とする。
【００１６】
この発明によれば、推定した頭部の３次元姿勢情報を用いて対象人物が正面を向いたときの両目および口領域を再現することに基づき両目および口の開閉状態を計測するようにしている。
【００１７】
つぎの発明にかかるリアルタイム表情追跡装置は、順次所定のフレームレートで入力される映像をキャプチャする映像入力手段と、前記キャプチャした画像から頭部画像を抽出する頭部領域検出手段と、前記抽出した頭部領域から両目および口を含む各部位の候補領域を抽出する部位領域候補抽出手段と、抽出した候補領域の中から各部位の位置を検出する部位検出追跡手段と、前記検出した両目、口の検出位置に基づいて頭部の３次元姿勢を計測するとともに、両目および口の開閉状態を計測する頭部３次元姿勢・表情計測手段とを備え、前記計測した頭部の３次元姿勢および両目および口の開閉状態に基づいてＣＧキャラクタの動きを制御するリアルタイム表情追跡装置であって、前記部位検出追跡手段は、前記部位領域候補抽出手段によって検出された両目、口の候補領域から両目および口の領域を夫々特定する部位検出手段と、前記部位検出手段によって、両目および口の領域が特定できない場合に、現フレームで特定された部位領域の位置と、この特定された部位領域の前フレームでの位置とを用いて移動ベクトルを求め、この移動ベクトルを用いて前記特定できなかった部位の位置を特定する部位追跡手段とを備えることを特徴とする。
【００１８】
この発明によれば、部位検出手段によって両目および口の領域が特定できない場合には、現フレームで特定された部位領域の位置と、この特定された部位領域の前フレームでの位置とを用いて移動ベクトルを求め、この移動ベクトルを用いて特定できなかった部位の位置を特定している。
【００１９】
つぎの発明にかかるリアルタイム表情追跡装置は、上記の発明において、前記部位検出手段は、前フレームについての部位領域の中心座標を中心に一定の大きさの矩形領域を設定し、その矩形領域中に存在する現フレーム候補領域を求め、求めた候補領域夫々について、判別式Ｅ＝｜ＳＰ−ＳＣ｜＋ＯＰ＋Ｄを用いて評価値Ｅを夫々取得し、
ＳＰ：前フレームおける部位領域の画素数、
ＳＣ：現フレームにおける候補領域の画素数、
ＯＰ：現フレームにおける候補領域のみを非肌色に対応する論理値レベル
とした部位領域マスク画像と、前フレームにおける部位領域のみを非肌色に対応する論理値レベルとした部位領域マスク画像との排他的論理和を求めたときに、画素値が非肌色に対応する論理値レベルとなる画素数、
Ｄ：前フレームにおける部位領域の中心と候補領域の中心との距離、
評価値Ｅが最も小さな候補領域を各部位領域として特定することを特徴としている。
【００２０】
この発明によれば、前フレームについての部位領域の中心座標を中心に一定の大きさの矩形領域を設定し、その矩形領域中に存在する現フレーム候補領域を求め、求めた候補領域夫々について、判別式Ｅ＝｜ＳＰ−ＳＣ｜＋ＯＰ＋Ｄを用いて評価値Ｅを夫々取得し、評価値Ｅが最も小さな候補領域を各部位領域として特定する。
【００２１】
つぎの発明にかかるリアルタイム表情追跡装置は、順次所定のフレームレートで入力される映像をキャプチャする映像入力手段と、前記キャプチャした画像から頭部画像を抽出する頭部領域検出手段と、前記抽出した頭部領域から両目および口を含む各部位の候補領域を抽出する部位領域候補抽出手段と、抽出した候補領域の中から各部位の位置を検出する部位検出追跡手段と、前記検出した両目、口の検出位置に基づいて頭部の３次元姿勢を計測するとともに、両目および口の開閉状態を計測する頭部３次元姿勢・表情計測手段とを備え、前記計測した頭部の３次元姿勢および両目および口の開閉状態に基づいてＣＧキャラクタの動きを制御するリアルタイム表情追跡装置であって、前記部位領域候補抽出手段は、前記頭部領域検出手段によって抽出された頭部領域の輝度を平均化・正規化する頭部領域輝度平均化手段と、この輝度平均化・正規化後の画像を用いて頭部領域中の両目および口の候補領域を抽出する画素選別手段とを備えることを特徴としている。
【００２２】
この発明によれば、頭部領域検出手段によって抽出された頭部領域の輝度を平均化・正規化し、該輝度平均化・正規化された画像を用いて頭部領域中の両目および口の候補領域を抽出するようにしている。
【００２３】
つぎの発明にかかるリアルタイム表情追跡装置は、上記の発明において、前記頭部領域輝度平均化手段は、頭部領域を複数の小領域に分割し、各小領域毎にヒストグラム平均化処理を行うことを特徴とする。
【００２４】
この発明によれば、頭部領域を複数の小領域に分割し、各小領域毎にヒストグラム平均化処理を行うようにしている。
【００２５】
つぎの発明にかかるリアルタイム表情追跡装置は、上記の発明において、前記ヒストグラム平均化処理では、所定の閾値を越えた頻度をもつ画素値の頻度を他の画素値に分散させる処理を加えることを特徴とする。
【００２６】
この発明によれば、ヒストグラム平均化処理では、所定の閾値を越えた頻度をもつ画素値の頻度を他の画素値に分散させるようにしている。
【００２７】
【発明の実施の形態】
以下に添付図面を参照して、この発明にかかるリアルタイム表情追跡装置の好適な実施の形態を詳細に説明する。このリアルタイム表情追跡装置は、本人の顔を送信する代わりにＣＧキャラクタの映像を相手に送信することによって人物映像を互いに通信するテレビ電話など通信システムに適用される。
【００２８】
以下、本発明の実施の形態を図１〜図２０を用いて説明する。図１は、本実施の形態のリアルタイム表情追跡装置の概念的構成を示すものである。
【００２９】
この図１に示すリアルタイム表情追跡装置は、例えばパーソナルコンピュータ、ワークステーションに実行させるプログラムの機能構成を示すものである。この図１に示すリアルタイム表情追跡装置は、ビデオカメラ８０などの映像取込手段から入力された映像を取り込むための映像入力手段１と、映像入力手段１を介して入力された人物映像から頭部領域を検出する頭部領域検出手段２と、頭部領域検出手段２で抽出された頭部領域から両目および口となる候補領域を抽出する部位領域候補抽出手段３と、部位領域候補抽出手段３で抽出した候補領域から両目、口領域を検出し、毎時変化する位置を追跡し、さらに各部位の開閉状態を計測する部位検出追跡手段４と、部位検出追跡手段４で検出した両目および口位置から頭部の３次元姿勢および表情を計測する頭部３次元姿勢・表情計測手段５とを備えている。
【００３０】
さらに、頭部領域検出手段２は、撮影される環境下（照明環境下など）で人物の肌色をサンプリングする肌色サンプリング手段６と、肌色サンプリング手段６でサンプリングした肌色情報に基づいて肌色抽出パラメータを調整する肌色抽出パラメータ調整手段７と、肌色抽出パラメータ調整手段７で調整された肌色抽出パラメータに基づいて入力映像から肌色画素を抽出し、抽出した画素を塊（領域）ごとに分類する肌色領域抽出手段８と、抽出した肌色領域の中から頭部領域を選択し頭部領域中の穴、裂け目などの小領域などを全て埋める（肌色に置換する）ことにより人物の頭部に関わる全ての画素を領域として抽出する頭部領域抽出手段９とを備えている。
【００３１】
部位領域候補抽出手段３は、頭部領域の輝度値を平均化する頭部領域輝度平均化手段１０と、両目および口の候補領域を抽出する画素選別手段１１とを備えている。
【００３２】
部位検出追跡手段４は、部位領域候補抽出手段３で抽出された両目および口の候補領域からそれぞれに対応する領域を特定する部位検出手段１２と、部位検出手段１２で検出した両目および口の初期位置を記憶する初期位置設定手段１３と、前フレームまでに記憶した各部位の位置から現フレームにおける位置を検出する部位追跡手段１４とを備えている。
【００３３】
頭部３次元姿勢・表情計測手段５は、初期位置設定手段１３で設定された各部位の初期位置に基づき頭部３次元姿勢を求めるための基準となるアフィン基底を設定するアフィン基底設定手段１５と、頭部の水平軸および垂直軸周りの暫定的な回転量を求める頭部回転量推定手段１６と、部位検出手段１２で検出した各部位の位置からアフィン基底設定手段１５で設定した仮想３次元空間上の点に対応する映像中の２次元の点から頭部の３次元姿勢を計測する姿勢計測手段１７と、各部位（両目、口）の開閉状態を計測することで表情を追跡する開閉状態計測手段１８とを備えている。
【００３４】
キャラクタ制御装置９０は、頭部３次元姿勢・表情計測手段５から入力された頭部の３次元姿勢および各部位（両目、口）の開閉状態を用いて三次元のＣＧキャラクタを制御することで、ビデオカメラ８０で撮像した利用者の動き、表情に追従させてＣＧキャラクタの動き、表情をリアルタイムに変化させる。
【００３５】
図２は、図１のリアルタイム表情追跡装置のキャリブレーションフェーズの動作の概要を説明するためのフローチャートである。図３は、図１のリアルタイム表情追跡装置のトラッキングフェーズの動作の概要を説明するためのフローチャートである。これら図２および図３を用いてリアルタイム表情追跡装置の動作の概略を説明する。
【００３６】
リアルタイム表情追跡装置で行われる動作手順には、頭部の動きを追跡するための情報として両目および口の位置および無表情時の状態等を取得するキャリブレーションフェーズと、実際に頭部の動きおよび両目および口を追跡し、頭部姿勢と両目および口の開閉状態つまり表情を計測するトラッキングフェーズがある。
【００３７】
キャリブレーションフェーズでは、まず、映像入力手段１によってビデオカメラ８０からの映像をキャプチャする（ステップＳ１００）。なお、人物の映像をビデオカメラ８０で撮像する際に、ユーザに対して「カメラに対して正面を向き、両目を開け、口を閉じる」ように指示することで、無表情時の人物映像を得る。つぎに、頭部領域検出手段２において、撮影環境下におけるユーザの肌色をサンプリングし（ステップＳ１１０）、このサンプリングデータを用いて予め設定した肌色抽出パラメータの調整を行う（ステップＳ１２０）。そして、調整した肌色抽出パラメータを用いて実際に肌色領域を抽出し（ステップＳ１３０）、抽出した領域の中から頭部領域を検出する（ステップＳ１４０）。次に、部位領域候補抽出手段３において、抽出した頭部領域から両目、口の候補領域を抽出し（ステップＳ１５０）、部位検出追跡手段４において両目領域および口領域をそれぞれ検出し（ステップＳ１６０）、検出した両目および口領域から各部位の位置、大きさ、テンプレートの初期値を記憶する（ステップＳ１７０）。最後に、頭部３次元姿勢・表情計測手段５において、求めた両目および口の位置に基づき、トラッキングフェーズにおいて頭部の３次元的姿勢情報を求めるためのアフィン基底（３次元空間上の仮想点）を設定する（ステップＳ１８０）。
【００３８】
トラッキングフェーズでは、映像入力手段１によってビデオカメラ８０からの映像をキャプチャする（ステップＳ２００）。頭部領域検出手段２においては、キャリブレーションフェーズで設定した肌色抽出パラメータを用いてキャプチャした映像中から肌色を抽出し（ステップＳ２１０）、抽出した領域から頭部領域を検出する（ステップＳ２２０）。次に、部位領域候補抽出手段３において、両目および口の候補領域を抽出する（ステップＳ２３０）。つぎに、部位検出追跡手段４は、前フレームで検出した両目および口位置に基づき、部位領域候補抽出手段３で抽出した候補領域の中から現フレームにおける両目および口領域を検出する（ステップＳ２４０）。次に、頭部３次元姿勢・表情計測手段５において、部位検出追跡手段４で検出した両目および口位置（２次元画像点）と予め設定した３次元空間上の仮想点から頭部の３次元的姿勢情報を計測し（ステップＳ２５０）、その計測情報に基づいて両目および口の開閉状態を計測する（ステップＳ２６０）。最後に、計測した両目および口の開閉状態情報及び頭部の姿勢情報はキャラクタ制御装置９０に入力され、キャラクタ制御装置９０によってＣＧキャラクタの頭部の動きおよび表情が制御される（ステップＳ２７０）。
【００３９】
［キャリブレーションフェーズ］
次に、図１のリアルタイム表情追跡装置のキャリブレーションフェーズにおける動作を図４〜図１７を用いて詳細に説明する。
【００４０】
（ａ）頭部領域検出手段２での処理
まず、図４〜図１０を用いて頭部領域検出手段２が行う図２のステップＳ１１０〜Ｓ１４０の処理の詳細について説明する。
【００４１】
図４は、頭部領域検出手段２における肌色サンプリング手段６の動作を説明するための図である。図５は、肌色サンプリング手段６および肌色抽出パラメータ調整手段７の動作を説明するためのフローチャートである。
【００４２】
まず、使用する照明環境下におけるユーザの肌色をサンプリングするために、図４に示すように、キャプチャ映像１９に重ねて、サンプリング領域を指定するためのサンプリングウィンドウ２０を表示する（ステップＳ３００）。次に、ユーザは、マウスあるいはその他のポインティングデバイスやキーボード等を用いて、サンプリングウィンドウ２０を頬あるいは額などの肌色のみ抽出可能な位置に移動させ、サンプリング可能であることをシステムに伝える（ステップＳ３１０）。なお、最初に表示したサンプリングウィンドウ２０の位置に合わせてユーザ自身が頭を動かして位置を調整しても良い。
【００４３】
次に、サンプリングウィンドウ２０内の全ての画素の色を肌色抽出のための色空間（肌色モデル空間）に写像し（ステップＳ３２０）、写像画素の写像空間での最大値および最小値を用いて予め設定した肌色抽出パラメータを調整する（ステップＳ３３０）。
【００４４】
ここで、肌色抽出空間は、例えば、輝度変化に比較的ロバストな色空間を新たに構築するとか、画素の色データ空間（Ｒ、Ｇ、Ｂ空間）上で構築するなどの方法を用いる。ここでは、下記のような、輝度変化に比較的ロバストな色空間を用いることにする。
【００４５】
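Let R (red), G (green), and B (blue) be the three primary color components of each pixel. The color is first normalized by the following equations.
$c_1 = \arctan\left(R / \max(G, B)\right)$ ......Equation (1)
$c_2 = \arctan\left(G / \max(R, B)\right)$ ......Equation (2)
$c_3 = \arctan\left(B / \max(R, G)\right)$ ......Equation (3)
【００４６】
The color normalized by the above equations is then converted by the following equations.
$C_1 = c_2 / c_1$ ......Equation (4)
$C_2 = c_3 / c_2$ ......Equation (5)
【００４７】
The skin color region extraction means 8 extracts the skin color region from the input image by judging whether the color converted from RGB space to C1-C2 space by Equations (4) and (5) falls within the skin color range defined by the following Equations (6) and (7).
$th_1 < C_1 < th_2$ ......Equation (6)
$th_3 < C_2 < th_4$ ......Equation (7)
【００４８】
The skin color extraction parameter adjusting means 7 varies the skin color extraction parameters (thresholds) th1 to th4 used in this extraction, using the data sampled by the skin color sampling means 6, so as to adapt to different illumination conditions and to differences in skin color between individuals. Specifically, the skin color extraction parameter adjusting means 7 maps the RGB data of the pixels sampled by the skin color sampling means 6 to C1-C2 space, obtains the maximum and minimum values of C1 and of C2, and updates threshold th1 with the minimum of C1, threshold th2 with the maximum of C1, threshold th3 with the minimum of C2, and threshold th4 with the maximum of C2.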
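The color conversion of Equations (1) to (5) and the threshold calibration of Equations (6) and (7) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the sample pixel values, and the small `EPS` guard against division by zero are assumptions that do not appear in the source.

```python
import math

EPS = 1e-6  # assumption: avoids division by zero for black pixels

def to_c1c2(r, g, b):
    """Map an RGB pixel to (C1, C2) following Equations (1)-(5)."""
    c1 = math.atan(r / max(g, b, EPS))
    c2 = math.atan(g / max(r, b, EPS))
    c3 = math.atan(b / max(r, g, EPS))
    return c2 / max(c1, EPS), c3 / max(c2, EPS)

def calibrate_thresholds(samples):
    """Return (th1, th2, th3, th4) as the min/max of C1 and C2 over the samples."""
    points = [to_c1c2(*pixel) for pixel in samples]
    c1_values = [p[0] for p in points]
    c2_values = [p[1] for p in points]
    return min(c1_values), max(c1_values), min(c2_values), max(c2_values)

def is_skin(pixel, thresholds):
    """Apply the range test of Equations (6) and (7) (strict inequalities)."""
    th1, th2, th3, th4 = thresholds
    big_c1, big_c2 = to_c1c2(*pixel)
    return th1 < big_c1 < th2 and th3 < big_c2 < th4

# Hypothetical pixels taken from a sampling window on the cheek.
window = [(200, 150, 120), (220, 170, 140), (180, 130, 100)]
thresholds = calibrate_thresholds(window)
```

Because C1 and C2 depend only on ratios of the color components, a pixel with the same hue at half the brightness, such as (100, 75, 60), maps to the same point as (200, 150, 120). That is the sense in which the space is comparatively robust to luminance changes.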
【００４９】
以上のように、使用する照明環境下において利用者の肌色をサンプリングすることにより肌色抽出性能を向上させることができ、また、照明の輝度変化に頑強な色空間を用いることにより簡易なパラメータ調整でも肌色抽出性能をさらに向上させることが可能となる。
【００５０】
次に図６〜図１０を用いて肌色領域抽出手段８および頭部領域抽出手段９の動作を説明する。図６は、肌色領域抽出手段８と頭部領域抽出手段９の動作を説明するためのフローチャートである。
【００５１】
肌色抽出パラメータ調整手段７で調整した肌色抽出パラメータを用いてもなお照明環境によっては顔の一部にハイライトが発生したり、皺や影などにより頭部領域を肌色抽出のみで正確に抽出することは困難である。そのため、肌色領域抽出手段８で抽出された肌色領域の中で最も大きい領域を頭部領域として判定し、抽出漏れによる穴や裂け目などの目、鼻、口などの部位以外の小領域を頭部領域から除去する頭部領域修復処理を行うことにより頭部全体を適切に抽出可能とする。
【００５２】
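The skin color region extraction means 8 maps the color data of every pixel in the captured image to the skin color model space (step S400) and extracts the pixels that fall within the thresholds th1 to th4 defined by Equations (6) and (7) (step S410). It then performs labeling (grouping connected pixels and numbering each group), merging the extracted pixels by 4-connectivity or 8-connectivity, to divide them into individual block regions (blobs) (step S420). From the block regions obtained by labeling, the region with the largest area (number of pixels) is selected and taken as the head region (step S430).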
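Steps S420 and S430 (labeling followed by selection of the largest block) can be sketched as below. This is a hedged illustration using a breadth-first flood fill; the source does not say which labeling algorithm is used, and the names `label_regions` and `largest_region` are hypothetical.

```python
from collections import deque

def label_regions(mask, eight_connected=False):
    """Group pixels of value 1 into connected regions; returns a list of pixel lists."""
    height, width = len(mask), len(mask[0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if eight_connected:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    visited = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or visited[y][x]:
                continue
            visited[y][x] = True
            pixels, queue = [], deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy, dx in steps:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] == 1 and not visited[ny][nx]):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions

def largest_region(mask, eight_connected=False):
    """Return the region with the most pixels (the head region candidate)."""
    regions = label_regions(mask, eight_connected)
    return max(regions, key=len) if regions else []

skin_mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
head = largest_region(skin_mask)
```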
【００５３】
図７に、このようにして選択された頭部領域を含む画像を示す。この時点では、ハイライトや影、両目、口、鼻などの暗い部分が抽出されていないため、頭部領域には、図７に示すように、穴や裂け目などの小領域２１が発生している場合が多い。
【００５４】
そこで、頭部領域抽出手段９は、まず裂け目部分を修復する。裂け目部分の修復は、肌色領域抽出後の肌色画素を１、それ以外を０とした２値画像に対して、膨張収縮処理を行うことで達成する。膨張収縮処理は、図８に示すような膨張マスク２２および収縮マスク２３を設定し、以下の膨張処理と収縮処理を繰り返し行うことで、前述の裂け目や小さい穴などを埋めるものである。膨張処理は、注目画素の近傍の画素値を膨張マスク２２で設定した画素値に置き換えることにより領域を膨張させるものである。収縮処理は、注目画素の近傍画素の内、収縮マスク２３で設定した０でない画素の画素値が収縮マスク２３の画素値と同値である場合に注目画素を残し、同値で無い場合に注目画素の値を０とすることにより領域を収縮するものである。上記膨張収縮処理により、図９（ａ）に示すような裂け目２４が修復され、図９（ｂ）のようになる。また、この処理により微小の穴も埋めることが可能である。
【００５５】
膨張収縮処理により頭部領域に発生した裂け目が修復されたことにより、後は頭部領域内の全ての穴に対応する小領域を埋めることにより頭部全体を一領域として抽出することが可能となる。この穴埋め処理には、図１０に示すような、論理演算処理が用いられる。
【００５６】
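First, the exclusive OR of the head region image 26 obtained by the tear repair processing and a mask 27 whose pixel values are all 1 is computed. This yields the background region and the holes inside the head region. Next, the regions touching the outer edge of the image (the background region) are removed from the resulting image 28, and the logical OR of the resulting image 29 and the original head region image 26 is computed, so that the whole head can be extracted as a single region (30 is the image after the logical OR; step S440). Because the head region is extracted by such simple logical operations, high-speed processing is possible.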
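The hole-filling logic of step S440 can be sketched as follows: XOR with an all-ones mask, removal of every region that touches the image border, then OR with the original head image. The function name `fill_holes` and the use of a 4-connected flood fill to remove the border-touching background are assumptions made for this sketch.

```python
def fill_holes(head):
    """Fill interior holes of a binary head mask (list of rows of 0/1)."""
    height, width = len(head), len(head[0])
    # XOR with a mask of all ones: background and holes become 1.
    inverted = [[head[y][x] ^ 1 for x in range(width)] for y in range(height)]
    # Remove every region that touches the outer edge of the image (the background).
    stack = [(y, x) for y in range(height) for x in range(width)
             if (y in (0, height - 1) or x in (0, width - 1)) and inverted[y][x] == 1]
    while stack:
        y, x = stack.pop()
        if inverted[y][x] != 1:
            continue
        inverted[y][x] = 0
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < height and 0 <= nx < width and inverted[ny][nx] == 1:
                stack.append((ny, nx))
    # OR the remaining holes back into the original head image.
    return [[head[y][x] | inverted[y][x] for x in range(width)] for y in range(height)]

ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_holes(ring)
```

A hole that is connected to the image border (a tear) would be treated as background here, which is why the source repairs tears by dilation and erosion before this step.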
【００５７】
（ｂ）部位領域候補抽出手段３での処理
つぎに、図１１および図１２を用いて部位領域候補抽出手段３が行う図２のステップＳ１５０の処理の詳細について説明する。図１１は、部位領域候補抽出手段３の動作を説明するためのフローチャートである。
【００５８】
部位領域候補抽出手段３では、照明条件が変化することに応じた輝度変化に頑強に対応可能とするために、頭部領域検出手段２によって抽出された頭部領域に対して適応型ヒストグラム平均化法を用いて頭部領域のコントラストを一定に保つ処理を行う。まず、頭部領域輝度平均化手段１０は、頭部領域の外接矩形を求め、その外接矩形領域を例えば８×８の小領域に分割する（ステップＳ５００）。つぎに、頭部領域輝度平均化手段１０は、各小領域毎にヒストグラム平均化処理を行う（ステップＳ５１０）。
【００５９】
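The histogram equalization processing is performed as follows. First, a histogram showing the relationship between pixel value and frequency is obtained for each small region. Next, the cumulative frequency (the running total of the frequency up to each class, that is, each pixel value) is obtained, and each cumulative frequency is divided by the maximum cumulative frequency to obtain its ratio. Each ratio is then multiplied by the maximum pixel value in the small region and rounded to the nearest integer. The value obtained is the pixel value after equalization. Finally, the frequency of each pixel value after equalization is obtained from the frequencies before equalization.
【００６０】
For example, if the pixel values in a small region lie in the range 0 to 7 with the frequencies shown in FIG. 13, the frequencies of the pixel values after equalization are as shown in FIG. 14. For instance, the equalized pixel value 4 corresponds to the pixel values 2 and 3 before equalization, so its frequency is 9 + 2 = 11.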
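The equalization procedure described above can be sketched as follows. The histogram used here is hypothetical: the actual frequencies of FIG. 13 are not reproduced in this text, so the values were chosen only so that pixel values 2 and 3 (frequencies 9 and 2) both map to 4, as in the example. The function name is also an assumption, and the input is assumed to contain at least one pixel.

```python
import math

def equalize_histogram(hist):
    """hist[v] is the frequency of pixel value v in one small region.

    Returns (mapping, new_hist): mapping[v] is the equalized value of v and
    new_hist holds the frequencies after equalization.
    """
    total = sum(hist)                                        # maximum cumulative frequency
    max_value = max(v for v, n in enumerate(hist) if n > 0)  # largest pixel value present
    mapping, cumulative = [], 0
    for n in hist:
        cumulative += n
        # ratio * max pixel value, rounded half up
        mapping.append(math.floor(cumulative / total * max_value + 0.5))
    new_hist = [0] * len(hist)
    for old_value, n in enumerate(hist):
        new_hist[mapping[old_value]] += n
    return mapping, new_hist

# Hypothetical frequencies for pixel values 0..7.
mapping, new_hist = equalize_histogram([5, 10, 9, 2, 6, 6, 4, 2])
# mapping  -> [1, 2, 4, 4, 5, 6, 7, 7]
# new_hist -> [0, 5, 10, 0, 11, 6, 6, 6]
```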
【００６１】
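In the adaptive histogram equalization method described above, most of the pixel values in a small region are assigned to the peak of the histogram, particularly in low-contrast regions, so a large amount of noise may be generated. Therefore, when there is a pixel value 31 whose frequency exceeds a certain threshold, as shown in FIG. 12(a), its frequency is distributed to the other pixel values, as shown in FIG. 12(b). This suppresses the generation of noise.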
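One way to realize this redistribution is sketched below. The source states only that frequencies above a threshold are distributed to other pixel values; spreading the excess evenly over the bins that are at or below the limit is an assumption of this sketch, as are the function name and the sample numbers.

```python
def clip_and_redistribute(hist, limit):
    """Clip bins above `limit` and spread the excess over the remaining bins."""
    excess = sum(n - limit for n in hist if n > limit)
    others = [i for i, n in enumerate(hist) if n <= limit]
    if excess == 0 or not others:
        return list(hist)  # nothing to clip, or no bin left to receive the excess
    clipped = [min(n, limit) for n in hist]
    share, remainder = divmod(excess, len(others))
    for k, i in enumerate(others):
        # The first `remainder` bins take one extra count so the total is preserved.
        clipped[i] += share + (1 if k < remainder else 0)
    return clipped

flattened = clip_and_redistribute([1, 1, 20, 1, 1], limit=8)
# flattened -> [4, 4, 8, 4, 4]
```

The total number of pixels is unchanged, so the clipped histogram can be passed directly to the equalization step.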
【００６２】
以上の処理により、常に一定のコントラストを得られることから、画素選別手段１１では、一定の閾値ｔｈａを用い、頭部領域内の輝度値が閾値ｔｈａ以下の画素（暗い画素）を論理レベル１とし、それ以外を論理レベル０とし（ステップＳ５２０）、さらに、画素値が１の画素を４連結あるいは８連結で結合し領域分割する（ステップＳ５３０）。最後に、微小領域を除去することにより、各部位（両目と口と鼻）の候補領域を抽出できる（ステップＳ５４０）。
【００６３】
以上のように、頭部全体を一領域として抽出し、その頭部領域のコントラストを常に一定にする処理を施すことにより、両目や口の部位領域の抽出処理を固定の閾値ｔｈａを用いて実行することができる。したがって、高速処理が可能となり、かつ輝度変化に頑強なシステムを構築することができる。
【００６４】
（ｃ）部位検出追跡手段４での処理
次に、図１５および図１６を用いて部位検出追跡手段４がキャリブレーションフェーズにおいて行う図２のステップＳ１６０およびＳ１７０の動作を説明する。図１５は、キャリブレーションフェーズにおける部位検出追跡手段４の動作を説明するためのフローチャートである。
【００６５】
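First, the part detection means 12 obtains the centroid of the head region extracted by the head region detection means 2 (step S600). This centroid position is obtained using, for example, the well-known distance transform.
【００６６】
The distance transform replaces the value of each pixel of an object in an image with the shortest distance from that pixel position to the background region. The simplest distance measures, the city-block distance (4-connected distance) and the chessboard distance (8-connected distance), are commonly used. The algorithm using the city-block distance is described here.
【００６７】
Step 1. Let $f_{i,j}$ be each pixel of the binarized input image and $D'_{i,j}$ the initialized multi-valued data. The initialization is as follows: pixels inside the head region, whose value is 1, are replaced by ∞ (in practice a large value such as 100), and background pixels, whose value is 0, are replaced by 0.
$$D'_{i,j} = \begin{cases} \infty & (f_{i,j} = 1) \\ 0 & (f_{i,j} = 0) \end{cases}$$ ......Equation (8)
Step 2. The initialized image is scanned from the upper left to the lower right, and $D''_{i,j}$ is updated sequentially by the following rule.
$D''_{i,j} = \min(D'_{i,j},\ D''_{i-1,j} + 1,\ D''_{i,j-1} + 1)$ ......Equation (9)
Step 3. The $D''_{i,j}$ obtained in Step 2 is scanned from the lower right to the upper left, and $D_{i,j}$ is updated sequentially by the following rule.
$D_{i,j} = \min(D''_{i,j},\ D_{i+1,j} + 1,\ D_{i,j+1} + 1)$ ......Equation (10)
【００６８】
The $D_{i,j}$ obtained by Equation (10) is the pixel data of the distance image. The pixel with the largest distance value in this distance image is found and taken as the centroid of the head region.
【００６９】
A feature of the distance transform is that it yields a stable centroid position even when the shape of the region changes. Alternatively, the centroid may be obtained as the mean of the pixel coordinates without using the distance transform.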
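The two-pass city-block distance transform of Steps 1 to 3 can be sketched as follows. The function names and the choice of the first maximum when several pixels tie are assumptions; the update is done in place, which corresponds to the sequential updates of Equations (9) and (10).

```python
def city_block_distance(mask, big=10**6):
    """Two-pass city-block distance transform of a binary mask (rows of 0/1)."""
    height, width = len(mask), len(mask[0])
    # Step 1: region pixels start at a large value, background at 0.
    d = [[big if mask[y][x] else 0 for x in range(width)] for y in range(height)]
    # Step 2: forward scan, upper left to lower right (Equation (9)).
    for y in range(height):
        for x in range(width):
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    # Step 3: backward scan, lower right to upper left (Equation (10)).
    for y in range(height - 1, -1, -1):
        for x in range(width - 1, -1, -1):
            if y < height - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < width - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d

def region_center(mask):
    """Return (row, col) of the pixel farthest from the background."""
    d = city_block_distance(mask)
    best = (0, 0)
    for y, row in enumerate(d):
        for x, value in enumerate(row):
            if value > d[best[0]][best[1]]:
                best = (y, x)
    return best

block = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
center = region_center(block)
```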
【００７０】
部位検出手段１２は、部位領域候補抽出手段３で抽出された両目、口、鼻についての候補領域の中から、先のステップＳ６００で求めた頭部領域の重心に最も近い候補領域を鼻領域とみなす（ステップＳ６１０）。
【００７１】
つぎに、部位検出手段１２は、図１６に示すように、上記特定した鼻領域から一定の方向と距離に頭部領域の大きさに比例した大きさの左目マスク３３、右目マスク３４、口マスク３５を設定する。
【００７２】
設定したマスク領域の中からそれぞれ重心位置に最も近い領域をそれぞれ右目、左目、口領域とする（ステップＳ６２０）。
【００７３】
次に初期位置設定手段１３において、各部位領域の中心位置と両目の外側の端点３６ａ、３７ａの位置を記憶する（ステップＳ６３０）。最後に、右目、左目および鼻に関する検出領域のうち、右目、左目、口領域内の画素値を１とし、それ以外を０とした部位領域マスク画像を各部位について夫々生成し、これらの部位領域マスク画像を記憶する。（ステップＳ６４０）。この部位領域マスク画像は、トラッキングフェーズでの第１番目のフレームについての部位追跡処理に用いられる。また、部位検出手段１２は、各部位領域（左目、右目、口）の、中心位置における画像垂直方向（Ｙ方向）の長さを測定し、これら測定値を初期位置設定手段１３に記憶する。この記憶された各部位領域（左目、右目、口）の画像垂直方向（Ｙ方向）の長さは、その後のトラッキングフェーズで、各部位の開閉状態情報を得るために利用される。
【００７４】
（ｄ）頭部３次元姿勢・表情計測手段５での処理
次に、図１７〜図１９を用いて頭部３次元姿勢・表情計測手段５がキャリブレーションフェーズにおいて行う図２のステップＳ１８０の動作を説明する。図１７は、キャリブレーションフェーズにおける頭部３次元姿勢・表情計測手段５の動作を説明するためのフローチャートである。
【００７５】
アフィン基底設定手段１５は、図１８に示すように、部位検出追跡手段４で求めた両目の外側の端点３６ａ，３７ａを結ぶ直線３８を求める（ステップＳ７００）。次に、左目あるいは右目どちらかの端点を基準に直線３８が水平になるように画像を回転させる（ステップＳ７１０）。そして、口の中心位置を通り、求めた直線に平行でかつ同じ長さの直線３９を求める（ステップＳ７２０）。この２つの直線３８，３９の両端点、すなわち４点３６ａ，３７ａ，３６ｂ，３７ｂでできる矩形の中心座標４０を求める（ステップＳ７３０）。さらに、矩形３９の中心４０を基準に、矩形の４頂点の相対座標を求め、これらを３次元空間上の仮想点として記憶する（ステップＳ７４０）。
【００７６】
この３次元空間上の仮想点は、トラッキングフェーズにおける頭部３次元姿勢計測のための基準点となる。
【００７７】
次に、頭部回転量推定手段１６は、図１９に示すように、両目の端点３６ａ，３７ａを結ぶ直線をＸ軸、口の中心を通りＸ軸に垂直な直線をＹ軸として座標系を規定し、頭部領域に外接する外接矩形のＸ軸方向の長さを１としたときに、左目あるいは右目の内側の端点と外接矩形の左右の辺との距離Ｌａ，Ｌｂを求める（ステップＳ７５０）。同様に、外接矩形のＹ軸方向の長さを１としたときに、口の中心位置から外接矩形の上下の辺までの距離Ｌｃ、Ｌｄを求める（ステップＳ７６０）。
【００７８】
この相対位置がトラッキングフェーズにおける頭部の上下左右方向の回転量を予測するための基準となる。
以上がキャリブレーションフェーズにおけるリアルタイム表情追跡装置の動作である。
【００７９】
［トラッキングフェーズ］
次に、図１のリアルタイム表情追跡装置のトラッキングフェーズにおける動作を図２０〜図２９を用いて詳細に説明する。
（ａ）′頭部領域検出手段２での処理
頭部領域検出手段２では、肌色領域抽出手段８と頭部領域抽出手段９を動作させることで、映像入力手段１を介して所定のフレームレートで順次入力される現フレームの映像に対し、キャリブレーションフェーズ同様の処理を行い、肌色領域を抽出し、頭部領域を抽出する（図３ステップＳ２００〜Ｓ２２０）。ただし、このトラッキングフェーズでは、肌色サンプリング手段６による肌色サンプリングおよび肌色抽出パラメータ調整手段７による肌色パラメータの調整は行わない。
【００８０】
（ｂ）′部位領域候補抽出手段３での処理
部位領域候補抽出手段３では、キャリブレーションフェーズと同様の処理を実行することにより、現フレームの映像から部位（目、口、鼻）領域候補を抽出する（図３ステップＳ２３０）。すなわち、頭部領域検出手段２によって抽出された頭部領域に対して適応型ヒストグラム平均化法を用いて頭部領域のコントラストを一定に保つ処理を行い、さらに、一定の閾値ｔｈａを用い、頭部領域内の輝度値が閾値ｔｈａ以下の画素（暗い画素）を１、それ以外を０とし、さらに、画素値が１の画素を４連結あるいは８連結で結合して領域分割し、最後に、微小領域を除去することにより、各部位（両目と口と鼻）の候補領域を抽出する。
【００８１】
（ｃ）′部位検出追跡手段４での処理
図２０〜図２３を用いて部位検出追跡手段４のトラッキングフェーズにおける動作を詳細に説明する。図２０および図２１は、トラッキングフェーズにおける部位検出追跡手段４の動作を説明するためのフローチャートである。
【００８２】
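The part tracking means 14 sets a rectangular region of fixed size centered on the stored center coordinates of the part region in the previous frame, and finds the candidate regions of the current frame that lie within that rectangular region (step S820). Next, an evaluation value E is obtained for each candidate region using the following discriminant (11).
$E = |SP - SC| + OP + D$ ......Equation (11)
【００８３】
Here, E is the evaluation value, SP is the number of pixels of the part region in the previous frame, and SC is the number of pixels of the candidate region in the current frame. OP is the number of pixels whose value is 1 in the exclusive OR of the mask image of the candidate region in the current frame (an image in which only the pixels of the candidate region are 1 and all others are 0) and the mask image of the part region in the previous frame (an image in which only the pixels of the part region are 1 and all others are 0). D is the distance between the center of the part region in the previous frame and the center of the candidate region.
【００８４】
The candidate with the smallest value E from Equation (11) is selected as the target region. In this way the target region is identified from among the candidate regions of the current frame that lie within a fixed range around the position of the part region in the previous frame (step S830). Even if a small noise region 47 such as that shown in FIG. 22 is completely contained in the part region of the previous frame, |SP − SC| and OP in Equation (11) become large in that case, so such a noise region can be rejected.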
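Equation (11) and the selection of step S830 can be sketched as follows. Regions are represented as sets of (x, y) pixel coordinates, so the exclusive OR of two mask images becomes a set symmetric difference. Taking the region center as the mean of the pixel coordinates and D as the Euclidean distance are assumptions, since the source does not specify either; the function names are hypothetical.

```python
import math

def center_of(region):
    """Mean pixel coordinate of a non-empty region (a set of (x, y) tuples)."""
    n = len(region)
    return (sum(x for x, _ in region) / n, sum(y for _, y in region) / n)

def evaluation(previous_region, candidate):
    """E = |SP - SC| + OP + D from Equation (11)."""
    sp, sc = len(previous_region), len(candidate)
    op = len(previous_region ^ candidate)  # pixels set in exactly one of the two masks
    (px, py), (cx, cy) = center_of(previous_region), center_of(candidate)
    d = math.hypot(px - cx, py - cy)
    return abs(sp - sc) + op + d

def select_region(previous_region, candidates):
    """Pick the candidate with the smallest evaluation value."""
    return min(candidates, key=lambda c: evaluation(previous_region, c))

previous_eye = {(x, y) for x in range(3) for y in range(3)}
shifted_eye = {(x + 1, y) for x in range(3) for y in range(3)}
noise = {(1, 1)}  # small blob fully inside the previous region
chosen = select_region(previous_eye, [noise, shifted_eye])
```

The noise blob sits exactly at the previous center, so D alone would favor it. The size term and the overlap term give it E = 8 + 8 + 0 = 16, against E = 0 + 6 + 1 = 7 for the shifted eye region.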
【００８５】
このような処理を、左目、右目、口の領域について夫々実行する（ステップＳ８１０〜Ｓ８４０）。
【００８６】
以上の処理により全ての部位を検出できた場合は、部位領域マスク画像を、現在のフレームのもので更新し、かつ各部位（左目、右目、口）についての検出領域の中心位置を求め、これを記憶する（ステップＳ８５０およびＳ８６０）。
【００８７】
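If some part is not found (step S870), the position in the current frame of the undetected part is predicted from the movement vector of a part that was detected in the current frame. For example, as shown in FIG. 23, when there is a part (target part) 54 that could not be detected in the current frame, the inter-frame movement vector 50 is obtained from the position of another part 48 detected in the current frame and the position 49 of that part in the previous frame. The movement vector 50 obtained from the other part is then added to the position 51 of the target part 54 in the previous frame to obtain its estimated position in the current frame (step S890). Then, the pixels in a predetermined rectangular region 53 (for example, 16×16) containing the obtained position are examined, and the processing of steps S820 and S830 described above is applied to them to detect the target part 54 (step S900).
【００８８】
If no candidate region exists in the rectangular region 53, the part is assumed to be hidden, for example because of the tilt of the face. In that case the position estimated in step S890 is taken as the position of the target part in the current frame, and the rectangular region 53 itself is stored as the part region (steps S910, S920).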
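The prediction of step S890 and the search rectangle of step S900 can be sketched as follows. The function names and coordinates are hypothetical, and centering the 16×16 rectangle on the predicted position is an assumption: the source requires only that the rectangle contain that position.

```python
def predict_position(target_previous, other_previous, other_current):
    """Add the movement vector of a detected part to the lost part's previous position."""
    move_x = other_current[0] - other_previous[0]
    move_y = other_current[1] - other_previous[1]
    return (target_previous[0] + move_x, target_previous[1] + move_y)

def search_rectangle(position, size=16):
    """Return (left, top, right, bottom), right and bottom exclusive, around the position."""
    half = size // 2
    left, top = position[0] - half, position[1] - half
    return (left, top, left + size, top + size)

# The right eye was found at (146, 79), previously (140, 82); the mouth was lost.
mouth_estimate = predict_position((100, 80), (140, 82), (146, 79))
mouth_window = search_rectangle(mouth_estimate)
# If no candidate region lies inside mouth_window, the estimate and the window
# themselves are stored as the part position and part region for this frame.
```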
【００８９】
なお、ステップＳ８７０で、現フレームの部位領域を全く検出できなかった場合は、部位検出手段１２によって図１５のステップＳ６００〜Ｓ６４０の処理を再度行い、部位領域を再検出する（ステップＳ８８０）。
【００９０】
このように、部位を１つ検出できれば、他の部位を検出漏れしても、検出した部位の移動ベクトルから検出漏れした部位の現フレームでの位置を予測しているので、頑強な部位追跡が行える。さらに、隠れなどにより映像中に対象となる部位が現れない場合でも暫定的な部位領域を設定することから、隠れた部位が出現したときにその部位を即座に追跡可能となり、つまりは、頭部の各部位の滑らかな動きを再現可能となる。
【００９１】
（ｄ）′頭部３次元姿勢・表情計測手段５での処理
次に、図２４〜図２９を用いて頭部３次元姿勢・表情計測手段５のトラッキングフェーズにおける動作を詳細に説明する。図２４および図２７は、トラッキングフェーズにおける頭部３次元姿勢・表情計測手段５の動作を説明するためのフローチャートである。
【００９２】
まず、頭部回転量推定手段１６においては、図２５に示すように、部位検出追跡手段４で求められた現フレームの両目領域から、両目の外側の端点７０，７１を求め、これら端点７０，７１を結ぶ直線５５を求める（ステップＳ１０００）。また、直線５５に直交し、口の中心位置５９を通る直線５６を求める（ステップＳ１０１０）。求めた直線５５をＸ軸とし、直線５６をＹ軸とするローカル座標系を設定し、Ｘ軸５５およびＹ軸５６のそれぞれに平行な辺を持ち、抽出された頭部領域に外接する外接矩形５７を求める（ステップＳ１０２０）。外接矩形５７のＸ軸方向の辺の長さを１とし、キャリブレーションフェーズで計測した方の目の内側の端点５８とＹ軸に並行な２辺７２，７３までの相対距離Ｌａ′，Ｌｂ′を夫々求める（ステップＳ１０３０）。同様に、外接矩形のＹ軸方向の長さを１とし、口の中心５９とＸ軸に平行な２辺７４，７５までの相対距離Ｌｃ′，Ｌｄ′を夫々求める（ステップＳ１０４０）。
【００９３】
次に、両目の外側の端点７０，７１と、端点７０，７１を通りＹ軸に平行な直線と口の中心を通りＸ軸に平行な直線との交点（２点）７６，７７とでできる矩形６０を求める（ステップＳ１０５０）。
【００９４】
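Here, taking rightward as the positive direction of the X axis and upward as the positive direction of the Y axis, the amount of left-right rotation of the head is obtained by the following Equation (12) from the relative distance dec (= Lb′) of one eye in the positive X direction and the relative distance dei (= Lb) in the positive X direction stored in the calibration phase.
$Rf_E = dec / dei$ ......Equation (12)
Here, $Rf_E$ is the amount of left-right rotation, dec is the relative distance of the eye in the positive X direction in the current frame, and dei is the relative distance of the eye in the positive X direction stored in the calibration phase.
【００９５】
If the rotation amount $Rf_E$ is greater than 1, the head is rotated to the left. Conversely, if $Rf_E$ is less than 1, the head is rotated to the right.
【００９６】
Similarly, the amount of up-down rotation of the head is obtained by the following Equation (13) from the relative distance dmc (= Ld′) of the mouth in the positive Y direction and the relative distance dmi (= Ld) in the positive Y direction stored in the calibration phase.
$Rf_m = dmc / dmi$ ......Equation (13)
Here, $Rf_m$ is the amount of up-down rotation, dmc is the relative distance of the mouth in the positive Y direction in the current frame, and dmi is the relative distance of the mouth in the positive Y direction stored in the calibration phase.
【００９７】
If the rotation amount $Rf_m$ is greater than 1, the head is rotated downward. Conversely, if it is less than 1, the head is rotated upward.
【００９８】
Next, the rectangle 60 is distorted as follows based on the rotation amounts $Rf_E$ and $Rf_m$ obtained by Equations (12) and (13) (step S1060).
【００９９】
When $Rf_E > 1$:
The length of the left side of the rectangle (the side parallel to the Y axis on the negative side of the X axis) is shortened using the following Equation (14).
$l = w \cdot Rf_E \cdot ol$ ......Equation (14)
Here, l is the computed length, ol is the original length, and w is a weighting coefficient.
【０１００】
When $Rf_E < 1$:
The length of the right side of the rectangle (the side parallel to the Y axis on the positive side of the X axis) is shortened using Equation (14).
【０１０１】
When $Rf_m > 1$:
The length of the lower side of the rectangle (the side parallel to the X axis on the negative side of the Y axis) is shortened using the following Equation (15).
$l = w \cdot Rf_m \cdot ol$ ......Equation (15)
Here, l is the computed length, ol is the original length, and w is a weighting coefficient.
【０１０２】
When $Rf_m < 1$:
The length of the upper side of the rectangle (the side parallel to the X axis on the positive side of the Y axis) is shortened using Equation (15).
【０１０３】
For example, when the head is rotated to the left, as shown in FIG. 26(a), the left side of the rectangle 60 becomes shorter, and when the head is rotated upward, as shown in FIG. 26(b), the upper side of the rectangle 60 becomes shorter. The vertex coordinates of the rectangle deformed in this way are then obtained relative to the center coordinates of the rectangle 60 before deformation.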
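The rotation ratios of Equations (12) and (13) and the side-length rule of Equations (14) and (15) can be sketched as follows. The function names and the numeric values are hypothetical. The source does not give a value for the weighting coefficient w; this sketch takes it as an input and evaluates the formula as written.

```python
def rotation_ratio(current_distance, initial_distance):
    """Rf = current relative distance / calibration relative distance (Eq. (12), (13))."""
    return current_distance / initial_distance

def sides_to_shorten(rf_e, rf_m):
    """Return which sides of the rectangle are shortened for the given ratios."""
    sides = []
    if rf_e > 1:
        sides.append("left")    # head rotated to the left
    elif rf_e < 1:
        sides.append("right")   # head rotated to the right
    if rf_m > 1:
        sides.append("bottom")  # head rotated downward
    elif rf_m < 1:
        sides.append("top")     # head rotated upward
    return sides

def adjusted_length(w, rf, original_length):
    """l = w * Rf * ol (Equations (14) and (15)); w is a tuning weight."""
    return w * rf * original_length

rf_e = rotation_ratio(0.36, 0.30)  # Lb' / Lb, hypothetical values
rf_m = rotation_ratio(0.27, 0.30)  # Ld' / Ld, hypothetical values
sides = sides_to_shorten(rf_e, rf_m)
```

For the side to become shorter, w must be chosen so that w · Rf is less than 1 for the ratio in question.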
【０１０４】
つぎに、姿勢計測手段１７は、以上のようにして求めた４つの頂点座標（２次元座標）と、それらに対応するアフィン基底設定手段１５で設定された３次元空間上の仮想点を基に、頭部の３次元姿勢計測を行う。ここでは、つぎのような手法を用いて、３次元姿勢計測を行う。
【０１０５】
カメラで撮影された画像と３次元空間上のオブジェクトとの関係は図２８のようになっている。図２８において６３は、アフィン基底設定手段１５で設定した３次元空間上の平面、６４はカメラ画像平面、６５はカメラ座標系である。
【０１０６】
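A point $(X_f, Y_f, Z_f)$ in the coordinate system of the plane 63 in three-dimensional space and the corresponding point $(X_c, Y_c, Z_c)$ in the camera coordinate system 65 are related by the following Equation (16).
$$\begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = R \begin{pmatrix} X_f \\ Y_f \\ Z_f \end{pmatrix} + T$$ ......Equation (16)
In Equation (16), R represents the rotation component and T the translation component, and together they are equivalent to the three-dimensional posture information of the head.
【０１０７】
On the other hand, a point $(X_c, Y_c, Z_c)$ in three-dimensional space in the camera coordinate system 65 and the two-dimensional point $(dX_c, dY_c)$ on the camera image plane 64 are related by the following Equation (17).
【数４】

Here, the matrix containing P is the perspective projection matrix of the video camera 80 in use, and it can be obtained in advance using well-known camera calibration techniques.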
【０１０８】
さて、頭部回転量推定手段１６で得られた矩形（カメラ画像平面６４）は、３次元空間上では上下と左右の辺は平行している。この二組の平行した辺から矩形の３次元空間上の上下方向と左右方向の方向ベクトル（Ｘ軸、Ｙ軸）を求めることができる。
【０１０９】
平行する辺のカメラ画像平面６４上における直線の方程式を
ａ₁ｘ＋ｂ₁ｙ＋ｃ₁＝０ ......式（１８）
ａ₂ｘ＋ｂ₂ｙ＋ｃ₂＝０ ......式（１９）
とすると、カメラ座標系６５におけるこれらの各直線を含む３次元の平面の方程式は次式（２０）（２１）であらわすことができる。
【０１１０】

これら２つの平面の法線ベクトル（Ｘ、Ｙ、Ｚの係数）の外積を求めると上記方向ベクトル（Ｘ軸、Ｙ軸）を求めることができる。
【０１１１】
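The direction vectors corresponding to the X axis and Y axis of the rectangle in the camera coordinate system 65 can be obtained as described above. However, because of errors in the information obtained from the image, the obtained direction vectors may not be orthogonal, as shown in FIG. 29. Therefore, letting the obtained direction vectors be S1 and S2, orthogonal vectors V1 and V2 are derived from S1 and S2. The vector in the Z-axis direction is obtained as the cross product of V1 and V2. These three direction vectors form the rotation component R in Equation (16).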
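The chain from image lines to a rotation matrix can be sketched as follows, under assumptions that go beyond the source. Equations (20) and (21) are not reproduced in this text, so the sketch assumes an ideal pinhole camera with focal length `f` and the principal point at the image origin, for which the plane through the camera center containing the image line ax + by + c = 0 has normal (af, bf, c). The source also does not say how V1 and V2 are derived from S1 and S2; Gram-Schmidt orthogonalization is used here as one possible choice. All function names are hypothetical.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(c / n for c in v)

def line_through(p, q):
    """Coefficients (a, b, c) of the image line ax + by + c = 0 through p and q."""
    return (p[1] - q[1], q[0] - p[0], p[0] * q[1] - q[0] * p[1])

def plane_normal(line, f):
    """Normal of the plane through the camera center that contains the image line."""
    a, b, c = line
    return (a * f, b * f, c)

def side_direction(side1, side2, f):
    """3D direction shared by two sides that are parallel in space.

    The sign of the result depends on the order of the sides and of their endpoints.
    """
    n1 = plane_normal(line_through(*side1), f)
    n2 = plane_normal(line_through(*side2), f)
    return normalize(cross(n1, n2))

def orthonormal_axes(s1, s2):
    """Gram-Schmidt: keep S1 as V1, remove its component from S2, then Z = V1 x V2."""
    v1 = normalize(s1)
    k = dot(s2, v1)
    v2 = normalize(tuple(b - k * a for a, b in zip(v1, s2)))
    return v1, v2, cross(v1, v2)

# Image of a frontal rectangle (Y axis pointing up), hypothetical coordinates.
tl, tr, bl, br = (-0.2, 0.2), (0.2, 0.2), (-0.2, -0.2), (0.2, -0.2)
s1 = side_direction((tl, tr), (bl, br), 1.0)  # top and bottom sides -> X direction
s2 = side_direction((tl, bl), (tr, br), 1.0)  # left and right sides -> Y direction
v1, v2, v3 = orthonormal_axes(s1, s2)         # the three axes that make up R
```

For this undistorted frontal rectangle the recovered axes are the identity rotation. Gram-Schmidt puts all of the correction on the second vector; a symmetric scheme that adjusts both vectors equally would be an equally plausible reading of the source.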
【０１１２】
回転成分Ｒが分かれば、２次元座標と３次元座標の対応点を式（１６）と式（１７）に代入することにより並進成分Ｔを求めることができる。
【０１１３】
姿勢計測手段１７では、まず頭部回転量推定手段１６で求めた矩形の４頂点の座標から式（１８）に示す各辺の直線パラメータ（方程式）を求め（ステップＳ１１００）、求めた直線パラメータを用いて式（２０）および式（２１）に基づき、アフィン基底設定手段１５で設定した仮想３次元平面のＸ軸、Ｙ軸を求める（ステップＳ１１１０）。そして、前述したように、求めた軸が直交するように修正し、更にこの修正したＸ軸、Ｙ軸からＺ軸を求め、これら３軸（Ｘ軸、Ｙ軸、Ｚ軸）の方向ベクトルから回転行列（回転成分）Ｒを求め（ステップＳ１１２０）、さらにこの回転成分Ｒを用いて得られた２次元座標と３次元座標の対応点を式（１６）（１７）に代入することで、並進行列（並進成分）Ｔを求める（ステップＳ１１３０）。
【０１１４】
以上のようにして求めた投影行列を用いて、実際に３次元空間上の仮想点をカメラ画像平面に投影したときの誤差に応じて投影行列を修正し（ステップＳ１１４０）、誤差が閾値以下になったときの投影行列を頭部の３次元姿勢情報とし（ステップＳ１１５０）、この３次元姿勢情報をキャラクタ制御装置９０に出力することで、ＣＧキャラクタの頭部の３次元姿勢を制御する。
【０１１５】
このように、顔画像から検出する両目および口の３点から３次元空間上の矩形（仮想平面）を規定し、追跡時に両目および口の３点から作成した矩形を頭部の動きに応じて歪ませることにより、３次元平面を２次元に投影したときの歪みを擬似的に再現し、本来４点以上の３次元と２次元の対応点がなければ求めることができない３次元姿勢情報を画像から得られる両目および口の３点のみで推定するようにしている。
【０１１６】
次に開閉状態計測手段１８の動作を説明する。開閉状態計測手段１８では、姿勢計測手段１７で求めた投影行列、すなわち頭部の３次元姿勢情報を用いて、ユーザが正面を向いたときのカメラ画像における両目および口領域を再現し、再現した領域の画像垂直方向（Ｙ方向）の長さと、初期位置設定手段１３に記憶されている初期状態における各部位領域の画像垂直方向の長さとの比率を求める。この比率が、両目および口がどの程度開閉しているかを示す開閉状態情報となる。
【０１１７】
このように３次元姿勢情報を用いてユーザが正面を向いたときのカメラ画像における両目および口領域を推定しているので、例えば頭部が横や上を向いている画像においても正面を向いた場合の画像を推定でき、２次元画像のみから両目および口の開閉状態をより正確に求めることができる。
【０１１８】
このようにして、求められた頭部の３次元姿勢情報および両目および口の開閉状態情報は、キャラクタ制御装置９０に入力される。キャラクタ制御装置９０は、入力された頭部の３次元姿勢情報および両目および口の開閉状態情報を用いてＣＧキャラクタの頭部の動きおよび両目および口の開閉状態を可変制御することで、ビデオカメラ８０で撮像した利用者の動き、表情に追従させてＣＧキャラクタの動き、表情をリアルタイムに変化させる。
【０１１９】
【発明の効果】
以上説明したように、この発明によれば、最初に検出した両目および口の位置から３次元空間上の仮想平面を設定し、検出した両目および口位置から頭部の左右および上下方向の回転量を推定し、前記検出した両目および口位置から得た４点の座標を結ぶ矩形を前記推定した頭部の左右および上下方向の回転量を用いて歪ませることで３次元平面を２次元に投影したときの歪みを擬似的に再現し、該歪ませた矩形の４点の座標を用いて頭部の３次元姿勢を推測しているので、本来４点以上の３次元と２次元の対応点がなければ求めることができない３次元姿勢情報を画像から得られる３点から推定することができ、これにより高速処理が可能となり、計算能力が低いハードウェアを用いてもリアルタイムかつ頑強な処理が可能である。
【０１２０】
つぎの発明によれば、両目領域を結ぶ直線をＸ軸とし、Ｘ軸に垂直で口領域の中心位置を通る直線をＹ軸とした頭部のローカル座標系を設定し、このローカル座標系において求めた頭部領域の外接矩形の左右の辺と片目との相対距離と、外接矩形の上下の辺と口領域との相対距離から頭部の左右、上下方向の回転量をそれぞれ推定するようにしているので、比較的簡単な処理によって頭部の回転量を判定することができる。
【０１２１】
つぎの発明によれば、推定した頭部の３次元姿勢情報を用いて対象人物が正面を向いたときの両目および口領域を再現することに基づき両目および口の開閉状態を計測するようにしているので、例えば頭部が横や上を向いている画像においても頭部が正面を向いた場合の画像を推定でき、２次元画像のみから両目および口の開閉状態をより正確に求めることができる。
【０１２２】
つぎの発明によれば、部位検出手段によって両目および口の領域が特定できない場合には、現フレームで特定された部位領域の位置と、この特定された部位領域の前フレームでの位置とを用いて移動ベクトルを求め、この移動ベクトルを用いて特定できなかった部位の位置を特定しているので、部位を１つ検出できれば、例え他の部位を検出漏れしても、この検出漏れした部位を検出することができ、これにより頑強な部位追跡が行える。さらに、隠れなどにより映像中に対象となる部位が現れない場合でも暫定的な部位領域を設定するようにすれば、隠れた部位が出現したときにその部位を即座に追跡可能となり、頭部の滑らかな動きを再現可能となる。
【０１２３】
つぎの発明によれば、前フレームについての部位領域の中心座標を中心に一定の大きさの矩形領域を設定し、その矩形領域中に存在する現フレーム候補領域を求め、求めた候補領域夫々について、判別式Ｅ＝｜ＳＰ−ＳＣ｜＋ＯＰ＋Ｄを用いて評価値Ｅを夫々取得し、評価値Ｅが最も小さな候補領域を各部位領域として特定するようにしているので、簡単な演算処理で各部位を特定することができ、高速処理が可能となり、計算能力が低いハードウェアを用いてもリアルタイムかつ頑強な処理が可能である。
【０１２４】
つぎの発明によれば、抽出された頭部領域の輝度を平均化・正規化し、該輝度平均化・正規化された画像を用いて頭部領域中の両目および口の候補領域を抽出するようにしているので、グラデーションや影、ハイライトなどの影響を抑えて輝度変化に影響されることなく候補領域の抽出が実現でき、また両目や口の部位領域の抽出を固定の閾値処理で行えることから、高速処理が可能となり、計算能力が低いハードウェアを用いてもリアルタイムかつ頑強な処理が可能である。
【０１２５】
つぎの発明によれば、頭部領域を複数の小領域に分割し、各小領域毎にヒストグラム平均化処理を行うようにしているので、グラデーションや影、ハイライトなどを影響を抑えて輝度変化に影響されることなく候補領域の抽出が実現でき、また両目や口の部位領域の抽出を固定の閾値処理で行えることから、高速処理が可能となり、計算能力が低いハードウェアを用いてもリアルタイムかつ頑強な処理が可能である。
【０１２６】
つぎの発明によれば、ヒストグラム平均化処理では、所定の閾値を越えた頻度をもつ画素値の頻度を他の画素値に分散させるようにしているので、ノイズの発生を抑えることができる。
【図面の簡単な説明】
【図１】この発明にかかるリアルタイム表情追跡装置の実施の形態を示すブロック図である。
【図２】図１のリアルタイム表情追跡装置のキャリブレーションフェーズの動作の概要を説明するためのフローチャートである。
【図３】図１のリアルタイム表情追跡装置のトラッキングフェーズの動作の概要を説明するためのフローチャートである。
【図４】肌色サンプリングを説明するための図である。
【図５】肌色サンプリング手段および肌色抽出パラメータ調整手段の動作を説明するためのフローチャートである。
【図６】肌色領域抽出手段と頭部領域抽出手段９の動作を説明するためのフローチャートである。
【図７】肌色領域抽出手段で肌色領域を抽出した結果の一例を示した図である。
【図８】膨張マスクおよび収縮マスクを例示する図である。
【図９】検出した頭部領域に発生した裂け目を埋める処理を説明するための図である。
【図１０】頭部領域内の全ての穴を埋める論理演算処理を説明するための図である。
【図１１】部位領域候補抽出手段の動作を説明するためのフローチャートである。
【図１２】適応型ヒストグラム平均化法の欠点であるノイズ発生を抑える処理を説明するための図である。
【図１３】適応型ヒストグラム平均化法を説明するための図である。
【図１４】適応型ヒストグラム平均化法を説明するための図である。
【図１５】キャリブレーションフェーズにおける部位検出追跡手段の動作を説明するためのフローチャートである。
【図１６】部位検出手段において両目および口領域を特定する際に用いるマスク領域を示した図である。
【図１７】キャリブレーションフェーズにおける頭部３次元姿勢・表情計測手段５動作を説明するためのフローチャートである。
【図１８】アフィン基底設定手段で設定する３次元空間上の仮想点を示した図である。
【図１９】頭部移動量推定手段で求める両目の端点および口の中心点の頭部領域の外接矩形に対する相対位置を説明するための図である。
【図２０】トラッキングフェーズにおける部位検出追跡手段の動作を説明するためのフローチャートである（その１）。
【図２１】トラッキングフェーズにおける部位検出追跡手段の動作を説明するためのフローチャートである（その２）。
【図２２】部位追跡手段での現フレームにおける部位領域の追跡方法を説明するための図である。
【図２３】検出できなかった部位領域を検出できた部位領域の位置から予測する処理を説明するための図である。
【図２４】トラッキングフェーズにおける頭部３次元姿勢・表情計測手段の動作を説明するためのフローチャートである。
【図２５】頭部回転量推定手段での左右上下方向の頭部回転量を推定する処理を説明するための図である。
【図２６】頭部回転量推定手段において３次元空間上の仮想点（アフィン基底）に対応する対応点を求める処理を説明するための図である。
【図２７】トラッキングフェーズにおける頭部３次元姿勢・表情計測手段の動作を説明するためのフローチャートである。
【図２８】姿勢計測手段での３次元と２次元の対応点から頭部の３次元姿勢情報を求める処理を説明するための図である。
【図２９】姿勢情報を求める際の誤差を補正する処理を説明するための図である。
【図３０】従来技術を示す図である。
【符号の説明】
１映像入力手段、２頭部領域検出手段、３部位領域候補抽出手段、４部位検出追跡手段、５３次元姿勢・表情計測手段、６肌色サンプリング手段、７肌色抽出パラメータ調整手段、８肌色領域抽出手段、９頭部領域抽出手段、１０頭部領域輝度平均化手段、１１画素選別手段、１２部位検出手段、１３初期位置設定手段、１４部位追跡手段、１５アフィン基底設定手段、１６頭部回転量推定手段、１７姿勢計測手段、１８開閉状態計測手段、２０サンプリングウィンドウ、２２膨張マスク、２３収縮マスク、３３左目マスク、３４右目マスク、３５口マスク、５０移動ベクトル、５３矩形領域、５７外接矩形、６４カメラ画像平面、８０ビデオカメラ、９０キャラクタ制御装置。
[0001]
BACKGROUND OF THE INVENTION
The present invention applies to communication systems such as videophones in which the parties exchange images of people, with video of a CG character transmitted to the other party in place of the user's own face. In particular, it relates to a real-time facial expression tracking device for such proxy response that measures the three-dimensional posture of the head and the facial expression from face video captured by a camera and controls the movement of the CG character based on the measurement results.
[0002]
[Prior art]
For example, FIG. 30 shows a conventional virtual transformation device (first prior art) disclosed in Patent Document 1. This device includes a video camera that inputs a face image; an electric pan head that rotates the video camera; a face image recognition device that detects, from the face image input from the video camera, the rotation of the face axis or the rotation around the face axis and the line-of-sight direction, and that detects changes in the shape of both eyes and the mouth; and a virtual environment synthesis device that controls a character in a virtual space constructed by CG (computer graphics) based on the measurement results.
[0003]
In the first prior art, the face image input from the video camera is binarized according to a skin color model constructed in a preset RGB space, with skin color set to 1 and everything else set to 0. Next, the center of gravity of the binarized face region is obtained, the electric pan head is controlled so that the center of gravity comes to the center of the image, and the camera angle is corrected. Next, holes existing in the face region are detected as both eyes and the mouth based on the position of the center of gravity. Next, the eye regions are tracked by template matching using a preset template, and the line-of-sight direction is obtained from the position of the dark part of the eye. Further, the angle between the straight line connecting both eyes and the horizontal axis of the image is measured, and rotation around the face axis is detected from the distance between both eyes. Then, changes in the shape of both eyes and the mouth are measured by capturing the power change in each frequency band when the images surrounding both eyes and the mouth are subjected to a discrete cosine transform. Based on these measurement results, the head and facial expression of the character in the virtual space constructed by CG are controlled.
[0004]
In the facial expression detection apparatus of Patent Document 2 (second prior art), a plurality of selected feature points are tracked through the images of successive frames, a Delaunay network having the feature points as vertices is constructed for each frame, and the facial muscle model is displaced based on the movement of the feature points using this Delaunay network, thereby obtaining the changes in the facial muscle model.
[0005]
Patent Document 3 (third prior art) discloses an object detection apparatus that scans a window mask of fixed size over the entire image and normalizes the luminance variance within the mask, so that feature quantities of an object can be extracted stably even when the illumination conditions change.
[0006]
[Patent Document 1]
JP 2000-331190 A
[Patent Document 2]
JP 2000-259831 A
[Patent Document 3]
JP-A-11-306348
[0007]
[Problems to be solved by the invention]
In the first prior art, a face image photographed by a camera is binarized based on a skin color model, holes in the face region are found, and these are associated with the eyes and the mouth based on the centroid position of the face region. However, because shadows and highlights arise from the unevenness of the face, it is very difficult with the first prior art to detect only the eyes and the mouth as holes unless the illumination conditions are set carefully. In addition, the first prior art cannot simultaneously measure the rotation of the head around the three axes (X axis, Y axis, Z axis). Furthermore, because rotation around the face axis is obtained from the distance between both eyes, and that distance inevitably changes when the face moves away from or toward the camera, the head is judged to be rotating even though it has not actually rotated.
[0008]
In the second prior art, a large number of feature points in the face image must be tracked in order to measure the three-dimensional posture, so real-time processing is difficult on hardware with low computing power.
[0009]
In the third prior art, since a mask area having a fixed size is used, it is difficult to deal with changes in the size of the face area depending on individual differences and shooting distances.
[0010]
The present invention has been made in view of the above, and its object is to obtain a real-time facial expression tracking device that, even on hardware with low computing power, measures the three-dimensional movement of the head and the open/closed state of both eyes and the mouth in real time, and uses the results to control the head movement and facial expression of a CG character in real time.
[0011]
[Means for Solving the Problems]
In order to achieve the above object, a real-time facial expression tracking device according to the present invention comprises: video input means for capturing video input sequentially at a predetermined frame rate; head region detection means for extracting a head image from the captured image; part region candidate extraction means for extracting candidate regions of each part, including both eyes and the mouth, from the extracted head region; part detection and tracking means for detecting the position of each part from among the extracted candidate regions; and head three-dimensional posture / facial expression measuring means for measuring the three-dimensional posture of the head based on the detected positions of both eyes and the mouth and for measuring the open/closed state of both eyes and the mouth, the device controlling the movement of a CG character based on the measured three-dimensional posture of the head and the measured open/closed state of both eyes and the mouth. The head three-dimensional posture / facial expression measuring means comprises: affine basis setting means for setting a virtual plane in three-dimensional space from the initially detected positions of both eyes and the mouth; head rotation amount estimation means for estimating the amounts of left-right and up-down rotation of the head from the detected positions of both eyes and the mouth; posture measuring means for distorting a rectangle connecting the coordinates of four points obtained from the detected eye and mouth positions using the estimated left-right and up-down rotation amounts of the head, and for estimating the three-dimensional posture of the head using the coordinates of the four points of the distorted rectangle; and open/closed state measuring means for estimating the open/closed states of both eyes and the mouth according to the movement of the head.
[0012]
According to this invention, a virtual plane in three-dimensional space is set from the positions of both eyes and the mouth detected first; the amounts of rotation of the head in the left-right and up-down directions are estimated from the detected positions of both eyes and the mouth; a rectangle connecting the coordinates of four points obtained from the detected positions of both eyes and the mouth is distorted using the estimated left-right and up-down rotation amounts of the head; the three-dimensional posture of the head is estimated using the coordinates of the four points of the distorted rectangle; and the open/closed states of both eyes and the mouth are estimated in accordance with the movement of the head.
[0013]
In the real-time facial expression tracking device according to the next invention, in the above invention, the head rotation amount estimation means sets a local coordinate system of the head whose X axis is the straight line connecting both eye regions and whose Y axis is the straight line perpendicular to the X axis and passing through the center position of the mouth region, and estimates the amounts of rotation of the head in the left-right and up-down directions from the relative distance, calculated in this local coordinate system, between the left and right sides of the circumscribed rectangle of the head region and one eye, and from the relative distance between the upper and lower sides of the circumscribed rectangle and the mouth region.
[0014]
According to this invention, a local coordinate system of the head is set with the straight line connecting both eye regions as the X axis and the straight line perpendicular to the X axis and passing through the center position of the mouth region as the Y axis. The amounts of rotation of the head in the left-right and up-down directions are estimated from the relative distance, calculated in this local coordinate system, between the left and right sides of the circumscribed rectangle of the head region and one eye, and from the relative distance between the upper and lower sides of the circumscribed rectangle and the mouth region.
[0015]
In the real-time facial expression tracking device according to the next invention, in the above invention, the open/closed state measuring means uses the three-dimensional posture information of the head estimated by the posture measuring means to reproduce the regions of both eyes and the mouth as they would appear when the target person faces the front, and measures the open/closed states of both eyes and the mouth on that basis.
[0016]
According to this invention, the regions of both eyes and the mouth as they would appear when the target person faces the front are reproduced using the estimated three-dimensional posture information of the head, and the open/closed states of both eyes and the mouth are measured on that basis.
[0017]
The real-time facial expression tracking device according to the next invention comprises: video input means for capturing video sequentially input at a predetermined frame rate; head region detection means for extracting a head image from the captured image; part region candidate extraction means for extracting candidate regions of each part, including both eyes and the mouth, from the extracted head region; part detection tracking means for detecting the position of each part from the extracted candidate regions; and head three-dimensional posture / facial expression measuring means for measuring the three-dimensional posture of the head based on the detected positions of both eyes and the mouth and for measuring the open/closed states of both eyes and the mouth. The device controls the movement of a CG character based on the measured three-dimensional posture of the head and the open/closed states of both eyes and the mouth. The part detection tracking means comprises: part detection means for identifying the regions of both eyes and the mouth from the candidate regions for both eyes and the mouth extracted by the part region candidate extraction means; and part tracking means for, when the part detection means cannot identify the regions of both eyes and the mouth, obtaining a movement vector using the position of a part region identified in the current frame and the position of that part region in the previous frame, and determining the position of the part that could not be identified by using this movement vector.
[0018]
According to this invention, when the part detection means cannot identify the regions of both eyes and the mouth, a movement vector is obtained using the position of a part region identified in the current frame and the position of that part region in the previous frame, and the position of the part that could not be identified is determined using this movement vector.
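The movement-vector idea can be illustrated with a minimal sketch. It assumes that part positions are 2D center coordinates and that the movement vector is the mean displacement of the parts identified in both frames; the averaging rule and the names `estimate_missing_parts`, `prev_pos`, and `curr_pos` are illustrative assumptions, not taken from the specification.

```python
def estimate_missing_parts(prev_pos, curr_pos):
    """Fill in parts that were not identified in the current frame.

    prev_pos: dict mapping part name -> (x, y) in the previous frame.
    curr_pos: dict mapping part name -> (x, y) for the parts that were
              identified in the current frame (may be incomplete).
    """
    found = [p for p in prev_pos if p in curr_pos]
    if not found:
        # No part was identified, so no movement vector can be formed;
        # fall back to the previous positions (an assumption).
        return dict(prev_pos)

    # Movement vector: mean displacement of the identified parts (assumption).
    dx = sum(curr_pos[p][0] - prev_pos[p][0] for p in found) / len(found)
    dy = sum(curr_pos[p][1] - prev_pos[p][1] for p in found) / len(found)

    result = dict(curr_pos)
    for part, (x, y) in prev_pos.items():
        if part not in result:
            # Shift the previous position by the movement vector.
            result[part] = (x + dx, y + dy)
    return result
```

For example, if both eyes are found two pixels to the right and one pixel lower than in the previous frame, a missing mouth is placed at its previous position shifted by the same amount.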
[0019]
In the real-time facial expression tracking device according to the next invention, in the above invention, the part detection means sets a rectangular region of a fixed size centered on the center coordinates of the part region in the previous frame, obtains the candidate regions of the current frame that exist within the rectangular region, and obtains an evaluation value E for each of the obtained candidate regions using the discriminant E = |SP − SC| + OP + D, where
SP: the number of pixels of the part region in the previous frame,
SC: the number of pixels of the candidate region in the current frame,
OP: the number of pixels at the logical value level corresponding to non-skin color in the exclusive OR of a candidate region mask image, in which only the candidate region in the current frame is set to the logical value level corresponding to non-skin color, and a part region mask image, in which only the part region in the previous frame is set to that logical value level,
D: the distance between the center of the part region in the previous frame and the center of the candidate region,
and the candidate region having the smallest evaluation value E is identified as each part region.
[0020]
According to the present invention, a rectangular area having a certain size is set around the center coordinates of the part area for the previous frame, the current frame candidate area existing in the rectangular area is obtained, and for each obtained candidate area, The evaluation value E is acquired using the discriminant E = | SP−SC | + OP + D, respectively, and the candidate area having the smallest evaluation value E is specified as each part area.
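As a minimal sketch of the discriminant E = |SP − SC| + OP + D, the following represents each region as a set of (x, y) pixel coordinates, so that the exclusive OR of the two mask images becomes a set symmetric difference. The use of the Euclidean distance for D, and the helper names `centroid`, `evaluation_value`, and `select_part`, are assumptions made for illustration; the restriction of candidates to the rectangular search region is assumed to have been applied beforehand.

```python
import math


def centroid(pixels):
    """Mean (x, y) of a non-empty set of pixel coordinates."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)


def evaluation_value(prev_region, candidate):
    """E = |SP - SC| + OP + D for one candidate region."""
    sp = len(prev_region)             # SP: pixel count in the previous frame
    sc = len(candidate)               # SC: pixel count of the candidate
    op = len(prev_region ^ candidate) # OP: pixels set in the XOR of both masks
    (px, py), (cx, cy) = centroid(prev_region), centroid(candidate)
    d = math.hypot(px - cx, py - cy)  # D: Euclidean distance (assumption)
    return abs(sp - sc) + op + d


def select_part(prev_region, candidates):
    """Return the candidate with the smallest evaluation value E."""
    return min(candidates, key=lambda c: evaluation_value(prev_region, c))
```

A candidate that overlaps the previous region, has a similar size, and lies close to it yields a small E, so it is preferred over distant or differently sized candidates.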
[0021]
The real-time facial expression tracking device according to the next invention comprises: video input means for capturing video sequentially input at a predetermined frame rate; head region detection means for extracting a head image from the captured image; part region candidate extraction means for extracting candidate regions of each part, including both eyes and the mouth, from the extracted head region; part detection tracking means for detecting the position of each part from the extracted candidate regions; and head three-dimensional posture / facial expression measuring means for measuring the three-dimensional posture of the head based on the detected positions of both eyes and the mouth and for measuring the open/closed states of both eyes and the mouth. The device controls the movement of a CG character based on the measured three-dimensional posture of the head and the open/closed states of both eyes and the mouth. The part region candidate extraction means comprises: head region luminance averaging means for averaging and normalizing the luminance of the head region extracted by the head region detection means; and pixel selection means for extracting candidate regions for both eyes and the mouth in the head region using the luminance-averaged and normalized image.
[0022]
According to this invention, the luminance of the head region extracted by the head region detection means is averaged and normalized, and the candidate regions for both eyes and the mouth in the head region are extracted using the luminance-averaged and normalized image.
[0023]
In the real-time facial expression tracking device according to the next invention, in the above invention, the head region luminance averaging means divides the head region into a plurality of small regions and performs histogram averaging processing for each small region.
[0024]
According to this invention, the head region is divided into a plurality of small regions, and the histogram averaging process is performed for each small region.
[0025]
In the real-time facial expression tracking device according to the next invention, in the above invention, the histogram averaging processing includes a process of distributing the frequency of pixel values whose frequency exceeds a predetermined threshold to other pixel values.
[0026]
According to the present invention, in the histogram averaging process, the frequency of pixel values having a frequency exceeding a predetermined threshold is distributed to other pixel values.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
Exemplary embodiments of a real-time facial expression tracking device according to the present invention will be explained below in detail with reference to the accompanying drawings. This real-time facial expression tracking apparatus is applied to a communication system such as a videophone that communicates human videos with each other by transmitting a video of a CG character to the other party instead of transmitting the person's face.
[0028]
Hereinafter, embodiments of the present invention will be described with reference to FIGS. FIG. 1 shows a conceptual configuration of the real-time facial expression tracking apparatus according to the present embodiment.
[0029]
The real-time facial expression tracking apparatus shown in FIG. 1 represents the functional configuration of a program executed by, for example, a personal computer or a workstation. The apparatus includes: video input means 1 for capturing video input from a video capture device such as a video camera 80; head region detection means 2 for detecting the head region from the person video input via the video input means 1; part region candidate extraction means 3 for extracting candidate regions for both eyes and the mouth from the head region extracted by the head region detection means 2; part detection tracking means 4 for detecting the regions of both eyes and the mouth from the candidate regions extracted by the part region candidate extraction means 3, tracking their positions as they change from moment to moment, and further measuring the open/closed state of each part; and head three-dimensional posture / expression measurement means 5 for measuring the three-dimensional posture and expression of the head from the positions of both eyes and the mouth detected by the part detection tracking means 4.
[0030]
The head region detection means 2 includes: skin color sampling means 6 for sampling the person's skin color under the imaging environment (such as the lighting environment); skin color extraction parameter adjustment means 7 for adjusting the skin color extraction parameters based on the skin color information sampled by the skin color sampling means 6; skin color region extraction means 8 for extracting skin color pixels from the input video based on the skin color extraction parameters adjusted by the skin color extraction parameter adjustment means 7 and grouping the extracted pixels into connected blocks (regions); and head region extraction means 9 for selecting the head region from the extracted skin color regions and filling all small regions such as holes and tears in the head region (replacing them with skin color), thereby extracting all pixels belonging to the person's head as one region.
[0031]
The part region candidate extraction unit 3 includes a head region luminance averaging unit 10 that averages the luminance values of the head region, and a pixel selection unit 11 that extracts candidate regions for both eyes and mouth.
[0032]
The part detection tracking means 4 includes: part detection means 12 for identifying, from the candidate regions for both eyes and the mouth extracted by the part region candidate extraction means 3, the regions that correspond to both eyes and the mouth; initial position setting means 13 for storing the initial positions of both eyes and the mouth detected by the part detection means 12; and part tracking means 14 for detecting the positions in the current frame from the positions of the parts stored up to the previous frame.
[0033]
The head three-dimensional posture / expression measurement means 5 includes: affine base setting means 15 for setting a reference affine basis for obtaining the three-dimensional posture of the head, based on the initial position of each part set by the initial position setting means 13; head rotation amount estimation means 16 for obtaining provisional amounts of rotation of the head about its horizontal and vertical axes; posture measurement means 17 for measuring the three-dimensional posture of the head from the two-dimensional points in the image, obtained from the position of each part detected by the part detection means 12, that correspond to the points in the virtual three-dimensional space set by the affine base setting means 15; and open/closed state measuring means 18 for tracking the facial expression by measuring the open/closed state of each part (both eyes and the mouth).
[0034]
The character control device 90 controls the three-dimensional CG character using the three-dimensional posture of the head input from the head three-dimensional posture / expression measuring means 5 and the open / closed state of each part (both eyes and mouth). Then, the movement and expression of the CG character are changed in real time by following the movement and expression of the user imaged by the video camera 80.
[0035]
FIG. 2 is a flowchart for explaining the outline of the calibration phase operation of the real-time facial expression tracking apparatus of FIG. FIG. 3 is a flowchart for explaining the outline of the operation of the tracking phase of the real-time facial expression tracking apparatus of FIG. The outline of the operation of the real-time facial expression tracking apparatus will be described with reference to FIGS.
[0036]
The operating procedure of the real-time facial expression tracking device consists of a calibration phase, in which the positions of both eyes and the mouth and the expressionless state are acquired as information for tracking the movement of the head, and a tracking phase, in which the actual movements of the head, both eyes, and the mouth are tracked and the head posture and the open/closed states of both eyes and the mouth, that is, the facial expression, are measured.
[0037]
In the calibration phase, first, video from the video camera 80 is captured by the video input means 1 (step S100). When the video of the person is captured by the video camera 80, the user is instructed to face the camera squarely, open both eyes, and close the mouth, so that an image of the user in this state is obtained. Next, the head region detection means 2 samples the user's skin color under the shooting environment (step S110) and adjusts the preset skin color extraction parameters using this sampling data (step S120). The skin color regions are then actually extracted using the adjusted skin color extraction parameters (step S130), and the head region is detected from the extracted regions (step S140). Next, the part region candidate extraction means 3 extracts candidate regions for both eyes and the mouth from the extracted head region (step S150), and the part detection tracking means 4 detects the regions of both eyes and the mouth (step S160). From the detected eye and mouth regions, the positions, sizes, and initial values of the templates are stored (step S170). Finally, based on the obtained positions of both eyes and the mouth, the head three-dimensional posture / expression measurement means 5 sets the affine basis (virtual points in three-dimensional space) used to obtain the three-dimensional posture information of the head in the tracking phase (step S180).
[0038]
In the tracking phase, video from the video camera 80 is captured by the video input means 1 (step S200). The head region detection means 2 extracts skin color from the captured video using the skin color extraction parameters set in the calibration phase (step S210) and detects the head region from the extracted regions (step S220). Next, the part region candidate extraction means 3 extracts candidate regions for both eyes and the mouth (step S230). Next, the part detection tracking means 4 detects the regions of both eyes and the mouth in the current frame from the candidate regions extracted by the part region candidate extraction means 3, based on the positions of both eyes and the mouth detected in the previous frame (step S240). Next, the head three-dimensional posture / expression measurement means 5 measures the three-dimensional posture information of the head from the positions of both eyes and the mouth (two-dimensional image points) detected by the part detection tracking means 4 and the preset virtual points in three-dimensional space (step S250), and measures the open/closed states of both eyes and the mouth based on this measurement information (step S260). Finally, the measured open/closed state information of both eyes and the mouth and the posture information of the head are input to the character control device 90, which controls the head movement and facial expression of the CG character (step S270).
[0039]
[Calibration phase]
Next, the operation in the calibration phase of the real-time facial expression tracking device of FIG. 1 will be described in detail with reference to FIGS.
[0040]
(A) Processing in the head region detection means 2
First, details of the processing of steps S110 to S140 of FIG. 2 performed by the head region detection unit 2 will be described with reference to FIGS.
[0041]
FIG. 4 is a diagram for explaining the operation of the skin color sampling means 6 in the head region detecting means 2. FIG. 5 is a flowchart for explaining the operation of the skin color sampling means 6 and the skin color extraction parameter adjusting means 7.
[0042]
First, in order to sample the user's skin color under the lighting environment in use, a sampling window 20 designating the sampling region is displayed over the captured video image 19, as shown in FIG. 4 (step S300). Next, the user uses a mouse or other pointing device, or a keyboard, to move the sampling window 20 to a position where only skin color can be sampled, such as the cheeks or forehead, and informs the system that sampling may proceed (step S310). Alternatively, the user may adjust the position by moving his or her head to match the position of the sampling window 20 as initially displayed.
[0043]
Next, the colors of all the pixels in the sampling window 20 are mapped to a color space for skin color extraction (skin color model space) (step S320), and the preset skin color extraction parameters are adjusted using the maximum and minimum values of the mapped pixels in that space (step S330).
[0044]
Here, the skin color extraction space may be, for example, a newly constructed color space that is relatively robust to luminance changes, or a space constructed directly on the pixel color data space (R, G, B space). In this embodiment, a color space that is relatively robust to luminance changes, described below, is used.
[0045]
If R (red), G (green), and B (blue) are the components of the three primary colors of each pixel, the colors are first normalized by the following equations.
c1 = arctan(R / max(G, B)) ...... Formula (1)
c2 = arctan(G / max(R, B)) ...... Formula (2)
c3 = arctan(B / max(R, G)) ...... Formula (3)
[0046]
The colors normalized by the above equations are further converted by the following equations.
C1 = c2 / c1 ...... Formula (4)
C2 = c3 / c2 ...... Formula (5)
[0047]
The skin color region extracting means 8 extracts the skin color region from the input image by determining whether the color converted from the RGB space to the C1-C2 space by Formulas (4) and (5) falls within the skin color range defined by Formulas (6) and (7) below.
th1 < C1 < th2 ...... Formula (6)
th3 < C2 < th4 ...... Formula (7)
[0048]
The skin color extraction parameter adjustment means 7 makes the skin color extraction parameters (thresholds) th1 to th4 used in this skin color extraction variable, using the sampling data from the skin color sampling means 6, so that they adapt to differences in lighting conditions and in each person's skin color. Specifically, the skin color extraction parameter adjustment means 7 maps the RGB data of the pixels sampled by the skin color sampling means 6 to the C1-C2 space, obtains the maximum and minimum values of C1 and C2, and sets the threshold th1 to the minimum value of C1, th2 to the maximum value of C1, th3 to the minimum value of C2, and th4 to the maximum value of C2.
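The conversion of Formulas (1) to (5), the range test of Formulas (6) and (7), and the threshold adjustment can be sketched as follows. This is only an illustration: the small constant `EPS` that guards against division by zero and the function names `to_c1c2`, `adjust_thresholds`, and `is_skin` are assumptions not found in the specification.

```python
import math

EPS = 1e-6  # assumption: avoids division by zero for black or saturated pixels


def to_c1c2(r, g, b):
    """Map an RGB pixel to the C1-C2 space of Formulas (1) to (5)."""
    c1 = math.atan(r / max(g, b, EPS))   # Formula (1)
    c2 = math.atan(g / max(r, b, EPS))   # Formula (2)
    c3 = math.atan(b / max(r, g, EPS))   # Formula (3)
    return c2 / max(c1, EPS), c3 / max(c2, EPS)  # Formulas (4) and (5)


def adjust_thresholds(samples):
    """Derive (th1, th2, th3, th4) from the sampled skin pixels."""
    points = [to_c1c2(*rgb) for rgb in samples]
    c1_values = [p[0] for p in points]
    c2_values = [p[1] for p in points]
    return min(c1_values), max(c1_values), min(c2_values), max(c2_values)


def is_skin(rgb, thresholds):
    """Range test of Formulas (6) and (7)."""
    th1, th2, th3, th4 = thresholds
    big_c1, big_c2 = to_c1c2(*rgb)
    return th1 < big_c1 < th2 and th3 < big_c2 < th4
```

Because Formulas (6) and (7) use strict inequalities, the sampled pixels that define the extreme values themselves fall just outside the range; a practical implementation might widen the thresholds by a small margin, which is left out here.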
[0049]
As described above, sampling the user's skin color under the lighting environment actually in use improves the skin color extraction performance, and using a color space that is robust to luminance changes of the lighting allows the parameters to be adjusted simply while further improving the skin color extraction performance.
[0050]
Next, the operations of the skin color area extracting means 8 and the head area extracting means 9 will be described with reference to FIGS. FIG. 6 is a flowchart for explaining the operation of the skin color area extracting means 8 and the head area extracting means 9.
[0051]
Even if the skin color extraction parameters adjusted by the skin color extraction parameter adjusting means 7 are used, it is difficult to extract the head region accurately by skin color extraction alone, because highlights may occur on part of the face depending on the lighting environment, and because of wrinkles and shadows. Therefore, the region with the largest area among the skin color regions extracted by the skin color region extracting means 8 is determined to be the head region, and a head region repair process is performed in which, in addition to parts such as the eyes, nose, and mouth, small regions such as holes and tears caused by extraction omissions are removed from the head region (filled in), so that the entire head can be extracted appropriately.
[0052]
The skin color region extracting means 8 maps the color data of all pixels of the captured image to the skin color model space (step S400), extracts the pixels that fall within the thresholds th1 to th4 defined by Formulas (6) and (7) (step S410), and performs labeling (a process of grouping connected pixels into blocks and numbering each block) by merging the extracted pixels using 4-connectivity or 8-connectivity, thereby dividing them into block regions (step S420). Then, the region with the largest area (number of pixels) is selected from the block regions obtained by the labeling and is taken as the head region (step S430).
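Steps S420 and S430 can be sketched as a breadth-first labeling followed by selection of the largest block. The binary image is assumed to be a list of rows containing 0 or 1; the function names `label_regions` and `largest_region_mask` are illustrative assumptions, and any standard labeling algorithm could be used instead.

```python
from collections import deque


def label_regions(binary, connectivity=4):
    """Number each connected block of 1-pixels; return labels and block sizes."""
    h, w = len(binary), len(binary[0])
    if connectivity == 4:
        nbrs = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    else:  # 8-connectivity
        nbrs = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)]
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 1
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not labels[y][x]:
                # Start a new block and grow it breadth-first.
                labels[y][x] = next_label
                queue = deque([(y, x)])
                count = 0
                while queue:
                    cy, cx = queue.popleft()
                    count += 1
                    for dy, dx in nbrs:
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                sizes[next_label] = count
                next_label += 1
    return labels, sizes


def largest_region_mask(binary, connectivity=4):
    """Keep only the block with the most pixels (taken as the head region)."""
    labels, sizes = label_regions(binary, connectivity)
    if not sizes:
        return [[0] * len(row) for row in binary]
    best = max(sizes, key=sizes.get)
    return [[1 if v == best else 0 for v in row] for row in labels]
```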
[0053]
FIG. 7 shows an image including the head region selected in this way. At this stage, since highlights, shadows, and dark portions such as both eyes, the mouth, and the nose are not extracted, small regions 21 such as holes and tears often occur in the head region.
[0054]
Therefore, the head region extracting means 9 first repairs the tears. The tears are repaired by applying expansion and contraction processing to a binary image in which the skin color pixels after skin color region extraction are 1 and all other pixels are 0. In the expansion/contraction processing, an expansion mask 22 and a contraction mask 23 as shown in FIG. 8 are set, and the expansion process and contraction process described below are repeated to fill the tears and small holes. The expansion process expands the region by replacing the pixel values in the neighborhood of the target pixel with the pixel values set by the expansion mask 22. The contraction process contracts the region as follows: the target pixel is kept when, among its neighboring pixels, those at the non-zero positions of the contraction mask 23 have the same pixel values as the contraction mask 23; otherwise the target pixel is set to 0. Through this expansion and contraction processing, a tear 24 as shown in FIG. 9A is repaired, giving the result shown in FIG. 9B. Minute holes can also be filled by this processing.
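The expansion and contraction processing corresponds to what is commonly called morphological dilation and erosion. The sketch below assumes 3×3 all-ones masks, since the actual masks 22 and 23 of FIG. 8 are not reproduced in the text, and it ignores neighbors that fall outside the image; the function names `dilate`, `erode`, and `close_tears` are illustrative.

```python
def _neighbourhood(img, y, x):
    """Values of the in-bounds pixels in the 3x3 window around (y, x)."""
    h, w = len(img), len(img[0])
    return [img[ny][nx]
            for ny in range(max(0, y - 1), min(h, y + 2))
            for nx in range(max(0, x - 1), min(w, x + 2))]


def dilate(img):
    """Expansion: a pixel becomes 1 if any pixel in its window is 1."""
    return [[1 if any(_neighbourhood(img, y, x)) else 0
             for x in range(len(img[0]))] for y in range(len(img))]


def erode(img):
    """Contraction: a pixel stays 1 only if every pixel in its window is 1."""
    return [[1 if all(_neighbourhood(img, y, x)) else 0
             for x in range(len(img[0]))] for y in range(len(img))]


def close_tears(img, iterations=1):
    """Expand `iterations` times, then contract the same number of times."""
    out = img
    for _ in range(iterations):
        out = dilate(out)
    for _ in range(iterations):
        out = erode(out)
    return out
```

With one iteration, a tear one or two pixels wide is bridged by the expansion and the subsequent contraction restores the outer contour; wider tears would need more iterations or a larger mask.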
[0055]
Once the tears in the head region have been repaired by the expansion and contraction processing, the entire head can be extracted as one region by then filling in all the small regions corresponding to holes in the head region. For this hole filling, logical operations as shown in FIG. 10 are used.
[0056]
First, the exclusive OR of the head region image 26 obtained by the tear repair process and a mask 27 whose pixel values are all 1 is obtained. This yields the background region and the holes in the head region. Next, the region in contact with the outer edge of the image (the background region) is removed from the resulting image 28, and the logical OR of the image 29 after removal and the original head region image 26 is taken, whereby the entire head can be extracted as one region (30 denotes the image obtained by the OR operation; step S440). Since the head region can thus be extracted by simple logical operations, high-speed processing is possible.
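The hole filling by logical operations can be sketched as follows. Removing the region that touches the image border is done here with a 4-connected flood fill from the border pixels, which is one possible way to realize that step and is an assumption; the function name `fill_holes` is illustrative.

```python
from collections import deque


def fill_holes(head):
    """Fill enclosed holes in a binary head-region image (list of 0/1 rows)."""
    h, w = len(head), len(head[0])
    # XOR with an all-ones mask: 1 now marks the background and the holes.
    holes = [[v ^ 1 for v in row] for row in head]

    # Remove everything connected to the image border (the background).
    border = [(y, x) for y in range(h) for x in range(w)
              if (y in (0, h - 1) or x in (0, w - 1)) and holes[y][x]]
    for y, x in border:
        holes[y][x] = 0
    queue = deque(border)
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and holes[ny][nx]:
                holes[ny][nx] = 0
                queue.append((ny, nx))

    # OR of the remaining holes with the original head region.
    return [[a | b for a, b in zip(row_head, row_holes)]
            for row_head, row_holes in zip(head, holes)]
```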
[0057]
(B) Processing in part region candidate extraction means 3
Next, details of the process in step S150 of FIG. 2 performed by the part region candidate extraction means 3 will be described with reference to FIGS. FIG. 11 is a flowchart for explaining the operation of the part region candidate extraction means 3.
[0058]
In order to cope with luminance changes caused by changes in illumination conditions, the part region candidate extraction means 3 applies an adaptive histogram averaging (equalization) method to the head region extracted by the head region detection means 2 so as to keep the contrast of the head region constant. First, the head region luminance averaging means 10 obtains the circumscribed rectangle of the head region and divides the circumscribed rectangular region into, for example, 8 × 8 small regions (step S500). Next, the head region luminance averaging means 10 performs histogram averaging processing for each small region (step S510).
[0059]
The histogram averaging process is performed as follows. First, a histogram representing the relationship between pixel value and frequency is obtained for each small region. Next, the cumulative frequency (the cumulative frequency up to each class, that is, each pixel value) is obtained, and each cumulative frequency is divided by the maximum cumulative frequency to obtain its ratio. The calculated ratio is then multiplied by the maximum pixel value in the small region, and the result is rounded to the nearest integer. The value obtained here is the pixel value after averaging. Finally, the frequency of each pixel value after averaging is obtained from the frequencies before averaging.
[0060]
For example, as shown in FIG. 13, when the pixel values in the small region range from 0 to 7 with the frequencies shown there, the frequency of each pixel value after averaging is obtained as illustrated. For example, when the pixel value after averaging is 4, the pixel values before averaging that correspond to it are 2 and 3, so its frequency is 9 + 2 = 11.
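The histogram averaging procedure for one small region can be sketched as below. The pixel data in the usage note are hypothetical and are not the values of FIG. 13; rounding half up via `int(v + 0.5)` and the function name `average_block` are assumptions made for illustration.

```python
def average_block(pixels):
    """Histogram averaging of one small region given as a flat list of
    non-negative integer pixel values. Returns (new_pixels, mapping)."""
    max_val = max(pixels)

    # Histogram: frequency of each pixel value.
    hist = [0] * (max_val + 1)
    for p in pixels:
        hist[p] += 1

    total = len(pixels)  # equals the maximum cumulative frequency
    mapping = []
    cumulative = 0
    for count in hist:
        cumulative += count
        ratio = cumulative / total
        # Scale by the maximum pixel value in the region and round half up.
        mapping.append(int(ratio * max_val + 0.5))

    return [mapping[p] for p in pixels], mapping
```

For the hypothetical input `[0, 0, 1, 1, 1, 1, 2, 3]`, the cumulative frequencies are 2, 6, 7, 8, so the old values 0, 1, 2, 3 map to 1, 2, 3, 3. The new value 3 therefore collects the frequencies of the old values 2 and 3, in the same way that the new value 4 collects two old values in the example above.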
[0061]
In the adaptive histogram averaging method described above, a large amount of noise may occur, particularly in small regions with low contrast, because most of the pixel values in the region are mapped to the peak of the histogram. Therefore, when there are pixel values 31 whose frequency exceeds a certain threshold, as shown in FIG. 12A, their frequencies are distributed to other pixel values as shown in FIG. 12B, which suppresses the generation of noise.
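The redistribution of excess frequency can be sketched as follows. The specification does not state how the excess is divided, so the sketch assumes it is spread evenly over the pixel values whose frequency did not exceed the threshold; this rule and the function name `clip_histogram` are assumptions.

```python
def clip_histogram(hist, limit):
    """Clip frequencies above `limit` and hand the excess to the other bins."""
    clipped = [min(count, limit) for count in hist]
    excess = sum(hist) - sum(clipped)

    # Bins that were not clipped receive the excess (assumption: evenly).
    recipients = [i for i, count in enumerate(hist) if count <= limit]
    if not recipients or excess == 0:
        return clipped

    share, remainder = divmod(excess, len(recipients))
    out = clipped[:]
    for i in recipients:
        out[i] += share
    for i in recipients[:remainder]:
        out[i] += 1  # hand out the leftover counts one by one
    return out
```

The total frequency is preserved. A receiving bin can end up above the limit after a single pass, so an implementation that needs a strict bound might repeat the clipping.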
[0062]
Since the above processing always yields a constant contrast, the pixel selection means 11 uses a fixed threshold tha, setting the pixels in the head region whose luminance is equal to or less than tha (dark pixels) to logical level 1 and all other pixels to logical level 0 (step S520), and joins the pixels with value 1 by 4-connectivity or 8-connectivity to divide them into regions (step S530). Finally, minute regions are removed, whereby candidate regions for each part (both eyes, the mouth, and the nose) are extracted (step S540).
[0063]
As described above, the entire head is extracted as one area, and the process of making the contrast of the head area always constant is performed, so that the extraction process of the regions of both eyes and mouth is executed using the fixed threshold tha. can do. Therefore, it is possible to construct a system capable of high-speed processing and robust against changes in luminance.
[0064]
(C) Processing by the part detection tracking means 4
Next, the operations of steps S160 and S170 in FIG. 2 performed by the part detection tracking unit 4 in the calibration phase will be described with reference to FIGS. FIG. 15 is a flowchart for explaining the operation of the part detection tracking means 4 in the calibration phase.
[0065]
First, the part detection means 12 obtains the center of gravity of the head region extracted by the head region detection means 2 (step S600). The position of the center of gravity is obtained using a known distance transformation process.
[0066]
The distance transformation is a process that replaces each pixel value of an object in an image with the shortest distance from that pixel position to the background region. As the notion of distance, the simplest ones, the city-block distance (4-connected distance) and the chessboard distance (8-connected distance), are often used. Here, an algorithm using the city-block distance is described.
[0067]
Step 1. First, with each pixel of the binarized input image denoted f_{i,j}, D'_{i,j} is initialized as follows. That is, pixels in the head region, whose pixel value is 1, are replaced with the multi-valued data ∞ (in practice a large value such as 100), and background pixels, whose pixel value is 0, are replaced with 0.
D'_{i,j} = \begin{cases} \infty & (f_{i,j} = 1) \\ 0 & (f_{i,j} = 0) \end{cases} ...... Formula (8)
Step 2. Scan the initialized image from the upper left to the lower right and sequentially update D''_{i,j} according to the following rule.
D''_{i,j} = \min(D'_{i,j},\; D''_{i-1,j} + 1,\; D''_{i,j-1} + 1) ...... Formula (9)
Step 3. Scan D''_{i,j} obtained in Step 2 from the lower right to the upper left and sequentially update D_{i,j} according to the following rule.
D_{i,j} = \min(D''_{i,j},\; D_{i+1,j} + 1,\; D_{i,j+1} + 1) ...... Formula (10)
[0068]
D_{i,j} obtained by Formula (10) above is the pixel data of the distance image. Therefore, the pixel having the maximum distance value is found in the resulting distance image, and this pixel is taken as the center of gravity of the head region.
[0069]
A feature of the distance transformation is that a stable center of gravity position can be obtained even if the shape of the region changes. Note that the center of gravity may instead be obtained by averaging pixel coordinate values without using the distance transformation.
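The two-pass city-block distance transformation of Steps 1 to 3 and the selection of the maximum-distance pixel can be sketched as follows. The sketch updates a single array in place, uses a large constant in place of ∞, and treats only 0-valued pixels inside the image as background; these details and the function names `city_block_distance` and `distance_centre` are assumptions.

```python
INF = 10 ** 9  # stands in for the "infinite" initial value of object pixels


def city_block_distance(binary):
    """Two-pass city-block distance transform of a binary image (0/1 rows)."""
    h, w = len(binary), len(binary[0])
    # Step 1: object pixels start at "infinity", background pixels at 0.
    d = [[INF if v else 0 for v in row] for row in binary]

    # Step 2: forward scan, upper left to lower right.
    for i in range(h):
        for j in range(w):
            if i > 0:
                d[i][j] = min(d[i][j], d[i - 1][j] + 1)
            if j > 0:
                d[i][j] = min(d[i][j], d[i][j - 1] + 1)

    # Step 3: backward scan, lower right to upper left.
    for i in reversed(range(h)):
        for j in reversed(range(w)):
            if i < h - 1:
                d[i][j] = min(d[i][j], d[i + 1][j] + 1)
            if j < w - 1:
                d[i][j] = min(d[i][j], d[i][j + 1] + 1)
    return d


def distance_centre(binary):
    """Return (i, j) of the first pixel with the maximum distance value."""
    d = city_block_distance(binary)
    best, best_pos = -1, None
    for i, row in enumerate(d):
        for j, value in enumerate(row):
            if value > best:
                best, best_pos = value, (i, j)
    return best_pos
```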
[0070]
From the candidate regions for both eyes, the mouth, and the nose extracted by the part region candidate extraction means 3, the part detection means 12 selects the candidate region closest to the center of gravity of the head region obtained in step S600 and regards it as the nose region (step S610).
[0071]
Next, as shown in FIG. 16, the part detection means 12 sets a left eye mask 33, a right eye mask 34, and a mouth mask 35, each of a size proportional to the size of the head region, at fixed directions and distances from the identified nose region.
[0072]
Within each of the set mask areas, the candidate area closest to the center of gravity is taken as the right eye, left eye, and mouth area, respectively (step S620).
[0073]
Next, the initial position setting means 13 stores the center position of each region and the positions of the outer end points 36a and 37a of both eyes (step S630). Finally, for each detected part, a part area mask image is generated in which the pixel values inside the right eye, left eye, and mouth areas are set to 1 and all others are set to 0, and this mask image is stored (step S640). This part region mask image is used for the part tracking process for the first frame in the tracking phase. The part detection unit 12 also measures the length of each part region (left eye, right eye, mouth) in the image vertical direction (Y direction) at its center position, and stores these measured values in the initial position setting unit 13. The stored lengths of the part regions (left eye, right eye, mouth) in the image vertical direction (Y direction) are used to obtain the open / closed state information of each part in the subsequent tracking phase.
[0074]
(D) Processing in the head 3D posture / expression measurement means 5
Next, the operation of step S180 in FIG. 2 performed in the calibration phase by the head 3D posture / expression measurement means 5 will be described with reference to FIGS. FIG. 17 is a flowchart for explaining the operation of the head three-dimensional posture / expression measurement means 5 in the calibration phase.
[0075]
As shown in FIG. 18, the affine base setting means 15 obtains a straight line 38 connecting the outer end points 36a and 37a of both eyes obtained by the part detection tracking means 4 (step S700). Next, the image is rotated about the end point of either the left eye or the right eye so that the straight line 38 becomes horizontal (step S710). A straight line 39 that passes through the center position of the mouth, is parallel to the straight line 38, and has the same length is then obtained (step S720). The center coordinates 40 of the rectangle formed by the end points of the two straight lines 38 and 39, that is, the four points 36a, 37a, 36b, and 37b, are obtained (step S730). Further, the relative coordinates of the four vertices of the rectangle are obtained with reference to the center 40 of the rectangle, and these are stored as virtual points in the three-dimensional space (step S740).
[0076]
This virtual point in the three-dimensional space becomes a reference point for head three-dimensional posture measurement in the tracking phase.
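Steps S700 to S740 can be sketched as below. This is only an illustration under stated assumptions: the function name `affine_base`, the choice of the first eye end point as the rotation pivot, the placement of points 36b and 37b directly below 36a and 37a, and the use of z = 0 for the stored virtual points are all assumptions made here, not details given in the text.

```python
import math


def affine_base(eye_a, eye_b, mouth):
    """Build the reference rectangle from two outer eye end points and the mouth center.

    Returns (center, corners), where corners are (x, y, 0.0) offsets from the center.
    """
    # Angle of the line through the two eye end points (line 38).
    angle = math.atan2(eye_b[1] - eye_a[1], eye_b[0] - eye_a[0])
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)

    def rotate(p):
        # Rotate p about eye_a so that line 38 becomes horizontal.
        dx, dy = p[0] - eye_a[0], p[1] - eye_a[1]
        return (eye_a[0] + dx * cos_a - dy * sin_a,
                eye_a[1] + dx * sin_a + dy * cos_a)

    a, b, m = rotate(eye_a), rotate(eye_b), rotate(mouth)
    # Line 39: same x-extent as line 38, at the height of the mouth center.
    rect = [a, b, (b[0], m[1]), (a[0], m[1])]
    cx = sum(p[0] for p in rect) / 4.0
    cy = sum(p[1] for p in rect) / 4.0
    # Vertex coordinates relative to the rectangle center, placed on the z = 0 plane.
    corners = [(p[0] - cx, p[1] - cy, 0.0) for p in rect]
    return (cx, cy), corners


# Hypothetical pixel coordinates for a frontal face.
center, corners = affine_base((100.0, 120.0), (160.0, 120.0), (130.0, 180.0))
```

Storing the vertices relative to the rectangle center keeps the reference points independent of where the face happened to be in the calibration image.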
[0077]
Next, as shown in FIG. 19, the head rotation amount estimation means 16 sets a coordinate system in which the straight line connecting the end points 36a and 37a of both eyes is the X axis and the straight line passing through the center of the mouth and perpendicular to the X axis is the Y axis. With the length of the rectangle circumscribing the head region in the X-axis direction taken as 1, the distances La and Lb between the inner end point of the left eye or the right eye and the left and right sides of the circumscribed rectangle are obtained (step S750). Similarly, with the length of the circumscribed rectangle in the Y-axis direction taken as 1, the distances Lc and Ld from the center position of the mouth to the upper and lower sides of the circumscribed rectangle are obtained (step S760).
[0078]
This relative position serves as a reference for predicting the amount of rotation of the head in the tracking phase in the vertical and horizontal directions.
The above is the operation of the real-time facial expression tracking apparatus in the calibration phase.
[0079]
[Tracking phase]
Next, the operation in the tracking phase of the real-time facial expression tracking device of FIG. 1 will be described in detail with reference to FIGS.
(A) ′ Processing in the head region detection means 2
In the head region detection unit 2, the skin color region extraction unit 8 and the head region extraction unit 9 apply to the current frame of the video, which is input sequentially at a predetermined frame rate via the video input unit 1, the same processing as in the calibration phase, and extract the skin color region and the head region (steps S200 to S220 in FIG. 3). In this tracking phase, however, the skin color sampling by the skin color sampling means 6 and the skin color parameter adjustment by the skin color extraction parameter adjusting means 7 are not performed.
[0080]
(B) Processing in the part region candidate extraction means 3
The part region candidate extraction unit 3 extracts part (eye, mouth, nose) region candidates from the video of the current frame by executing the same process as in the calibration phase (step S230 in FIG. 3). That is, the head region extracted by the head region detection means 2 is first processed with the adaptive histogram averaging method so that the contrast of the head region is kept constant. Then, using a constant threshold value tha, pixels in the head region whose luminance value is equal to or less than tha (dark pixels) are set to 1, and all others are set to 0. The pixels with value 1 are then grouped into regions by 4-connection or 8-connection, and minute regions are removed, leaving the candidate regions for each part (both eyes, mouth, and nose).
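The thresholding, labeling, and small-region removal described above can be sketched as follows. The sketch assumes the contrast normalization has already been applied to `gray`; the function name, the `min_pixels` parameter, and the sample values are hypothetical, and 4-connection is chosen here although the text allows either connectivity.

```python
from collections import deque


def extract_candidates(gray, mask, tha, min_pixels):
    """Label dark pixels inside the head mask and drop tiny regions.

    gray: rows of luminance values; mask: rows of 0/1 head-region flags.
    Returns a list of regions, each a list of (row, col) pixels.
    """
    h, w = len(gray), len(gray[0])
    # Pixels at or below the threshold inside the head region become 1.
    dark = [[1 if mask[i][j] and gray[i][j] <= tha else 0 for j in range(w)]
            for i in range(h)]
    seen = [[False] * w for _ in range(h)]
    regions = []
    for i in range(h):
        for j in range(w):
            if not dark[i][j] or seen[i][j]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue, region = deque([(i, j)]), []
            seen[i][j] = True
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= nr < h and 0 <= nc < w and dark[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Regions smaller than min_pixels are treated as noise and discarded.
            if len(region) >= min_pixels:
                regions.append(region)
    return regions


gray = [
    [200, 200, 200, 200, 200],
    [200, 30, 40, 200, 20],
    [200, 35, 200, 200, 200],
    [200, 200, 200, 200, 200],
]
mask = [[1] * 5 for _ in range(4)]
regions = extract_candidates(gray, mask, tha=50, min_pixels=2)
```

In this sample the isolated dark pixel at the right edge is dropped as a minute region, and the three connected dark pixels survive as one candidate.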
[0081]
(C) Processing in the part detection / tracking means 4
The operation of the part detection tracking means 4 in the tracking phase will be described in detail with reference to the drawings. FIGS. 20 and 21 are flowcharts for explaining the operation of the part detection tracking means 4 in the tracking phase.
[0082]
The part tracking means 14 sets a rectangular area of a certain size around the stored center coordinates of the part area for the previous frame. A candidate area of the current frame existing in the rectangular area is obtained (step S820). Next, an evaluation value E is obtained for each candidate region using a discriminant (11) as shown below.
[Expression 2]
$E = |SP - SC| + OP + D$ ...... Formula (11)
[0083]
Here, E is the evaluation value, SP is the number of pixels in the part region in the previous frame, and SC is the number of pixels in the candidate region in the current frame. OP is the number of pixels whose value is 1 in the exclusive OR of the mask image of the candidate region in the current frame (an image in which only the pixels in the candidate region are 1 and all others are 0) and the mask image of the part region in the previous frame (an image in which only the pixels in the part region are 1 and all others are 0). D is the distance between the center of the part region in the previous frame and the center of the candidate region.
[0084]
By selecting the candidate region with the smallest value E obtained by equation (11), the target part region is identified from among the candidate regions of the current frame that exist within a certain range around the position of the part region in the previous frame (step S830). That is, even if a small noise region 47 as shown in FIG. 22 is completely contained in the part region of the previous frame, the values of |SP − SC| and OP become large in that case, so such a noise region can be rejected.
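The selection by evaluation value can be sketched as follows. The mask representation (lists of 0/1 rows), the use of region centroids for the "center", the Euclidean distance for D, and the function names are assumptions introduced for this illustration.

```python
import math


def evaluation_value(prev_mask, cand_mask):
    """E = |SP - SC| + OP + D for two 0/1 masks of equal size (lists of rows)."""
    def pixels(mask):
        return [(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v]

    def centroid(pts):
        return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

    prev_pts, cand_pts = pixels(prev_mask), pixels(cand_mask)
    sp, sc = len(prev_pts), len(cand_pts)
    # OP: pixels set in exactly one of the two masks (exclusive OR).
    op = len(set(prev_pts) ^ set(cand_pts))
    # D: distance between region centers (centroids and Euclidean distance assumed).
    cp, cc = centroid(prev_pts), centroid(cand_pts)
    d = math.hypot(cp[0] - cc[0], cp[1] - cc[1])
    return abs(sp - sc) + op + d


def pick_region(prev_mask, candidate_masks):
    """Return the index of the candidate with the smallest evaluation value."""
    scores = [evaluation_value(prev_mask, c) for c in candidate_masks]
    return scores.index(min(scores))


prev = [[0, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 1, 1, 1, 0], [0, 0, 0, 0, 0]]
shifted = [[0, 0, 0, 0, 0], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 0, 0]]
noise = [[0, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
chosen = pick_region(prev, [noise, shifted])
```

Here the one-pixel noise region lies entirely inside the previous part region, yet it scores worse than the shifted region because both the size difference and the exclusive-OR count are large.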
[0085]
Such processing is executed for each of the left eye, right eye, and mouth area (steps S810 to S840).
[0086]
If all parts can be detected by the above processing, the part area mask image is updated with the current frame, and the center position of the detection area of each part (left eye, right eye, mouth) is obtained and stored (steps S850 and S860).
[0087]
If there is a part that could not be detected (step S870), its position in the current frame is predicted from the movement vector of a part that was detected in the current frame. For example, as shown in FIG. 23, when there is a part (target part) 54 that could not be detected in the current frame, a movement vector 50 between frames is obtained from the position 48 of another part detected in the current frame and the position 49 of that part in the previous frame. The movement vector 50 obtained from the other part is then added to the position 51 of the target part 54 in the previous frame to obtain its estimated position in the current frame (step S890). Then, focusing on the pixels in a predetermined rectangular area 53 (for example, 16 × 16) that includes the estimated position, the processing of steps S820 and S830 described above is executed on the pixels in this rectangular area to detect the target part 54 (step S900).
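The prediction in step S890 and the search window of step S900 can be sketched as follows. The dictionary layout, the function names, and the rule of using the first detected part found are assumptions for illustration; the text does not say which detected part supplies the movement vector when several are available.

```python
def predict_missing(prev_positions, curr_positions, target):
    """Estimate the current position of `target` from another part's inter-frame motion.

    prev_positions / curr_positions map part names to (x, y); parts that were not
    detected in the current frame are simply absent from curr_positions.
    """
    for name, curr in curr_positions.items():
        if name != target and name in prev_positions:
            prev = prev_positions[name]
            move = (curr[0] - prev[0], curr[1] - prev[1])  # movement vector 50
            tx, ty = prev_positions[target]
            return (tx + move[0], ty + move[1])
    return None  # nothing was detected, so full re-detection is needed


def search_window(center, size=16):
    """Axis-aligned square of `size` pixels around `center`: (x0, y0, x1, y1)."""
    half = size // 2
    return (center[0] - half, center[1] - half, center[0] + half, center[1] + half)


prev = {"left_eye": (100, 120), "right_eye": (160, 120), "mouth": (130, 180)}
curr = {"left_eye": (104, 118), "mouth": (134, 178)}  # right eye was not detected
estimate = predict_missing(prev, curr, "right_eye")
window = search_window(estimate)
```

Returning `None` when no part was detected mirrors the fallback to full re-detection described for that case.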
[0088]
If there is no candidate area in the rectangular area 53, it is assumed that the part is hidden because of the inclination of the face. The position estimated in step S890 is set as the position of the target part in the current frame, and the rectangular area 53 itself is stored as the part region (steps S910 and S920).
[0089]
If no part region of the current frame has been detected in step S870, the part detection unit 12 performs the processes in steps S600 to S640 in FIG. 15 again to detect the part region again (step S880).
[0090]
Thus, if one part can be detected, the positions in the current frame of the parts that were not detected are predicted from the movement vector of the detected part, so robust part tracking is possible. Furthermore, even if the target part does not appear in the video because it is hidden, a provisional part region is set, so the part can be tracked immediately when it reappears. That is, the smooth movement of the head and of each part can be reproduced.
[0091]
(D) ′ Processing in the head 3D posture / expression measurement means 5
Next, the operation of the head three-dimensional posture / expression measuring means 5 in the tracking phase will be described in detail with reference to the drawings. FIGS. 24 and 27 are flowcharts for explaining the operation of the head three-dimensional posture / expression measuring means 5 in the tracking phase.
[0092]
First, as shown in FIG. 25, the head rotation amount estimating means 16 obtains the outer end points 70 and 71 of both eyes from the eye regions of the current frame obtained by the part detection tracking means 4, and obtains a straight line 55 connecting the end points 70 and 71 (step S1000). Further, a straight line 56 that is orthogonal to the straight line 55 and passes through the center position 59 of the mouth is obtained (step S1010). A local coordinate system is set with the straight line 55 as the X axis and the straight line 56 as the Y axis, and a circumscribed rectangle 57 that has sides parallel to the X axis 55 and the Y axis 56 and circumscribes the extracted head region is obtained (step S1020). With the length of the side of the circumscribed rectangle 57 in the X-axis direction taken as 1, the relative distances La′ and Lb′ between the inner end point 58 of the eye measured in the calibration phase and the two sides 72 and 73 parallel to the Y axis are obtained (step S1030). Similarly, with the length of the circumscribed rectangle in the Y-axis direction taken as 1, the relative distances Lc′ and Ld′ between the mouth center 59 and the two sides 74 and 75 parallel to the X axis are obtained (step S1040).
[0093]
Next, a rectangle 60 is obtained that is formed by the outer end points 70 and 71 of both eyes and the two intersection points 76 and 77 at which the straight lines passing through the end points 70 and 71 parallel to the Y axis meet the straight line passing through the center of the mouth parallel to the X axis (step S1050).
[0094]
Here, taking the rightward direction as positive for the X axis and the upward direction as positive for the Y axis, the amount of rotation of the head in the left-right direction is obtained by the following equation (12) from the relative distance dec (= Lb′) of one eye in the positive X-axis direction and the relative distance dei (= Lb) in the positive X-axis direction stored in the calibration phase.
$Rf_E = dec / dei$ ...... Formula (12)
Here, $Rf_E$ is the amount of rotation in the left-right direction, dec is the relative distance of the eye in the positive X-axis direction in the current frame, and dei is the relative distance of the eye in the positive X-axis direction stored in the calibration phase.
[0095]
If the rotation amount $Rf_E$ is greater than 1, the head is rotating to the left. Conversely, if $Rf_E$ is less than 1, the head is rotating to the right.
[0096]
Similarly, the amount of rotation in the up-down direction is obtained by the following equation (13) from the relative distance dmc (= Ld′) of the mouth in the positive Y-axis direction and the relative distance dmi (= Ld) in the positive Y-axis direction stored in the calibration phase.
$Rf_m = dmc / dmi$ ...... Formula (13)
Here, $Rf_m$ is the amount of rotation in the up-down direction, dmc is the relative distance of the mouth in the positive Y-axis direction in the current frame, and dmi is the relative distance of the mouth in the positive Y-axis direction stored in the calibration phase.
[0097]
If the rotation amount $Rf_m$ is greater than 1, the head is rotating downward. Conversely, if it is less than 1, the head is rotating upward.
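Equations (12) and (13) and their interpretation can be sketched as follows. The function names, the tuple layout of `rect` and `calib`, and the sample numbers are hypothetical; coordinates are assumed to be expressed in the local coordinate system with X to the right and Y upward, as in the text.

```python
def relative_distances(point, low, high):
    """Distances from `point` to both ends of [low, high], with the span scaled to 1."""
    span = float(high - low)
    return (point - low) / span, (high - point) / span


def rotation_ratios(eye_x, mouth_y, rect, calib):
    """Return (Rf_E, Rf_m) for the current frame.

    rect = (x_min, x_max, y_min, y_max) of the circumscribed rectangle in the local
    coordinate system; calib = (dei, dmi) stored during calibration.
    """
    _, dec = relative_distances(eye_x, rect[0], rect[1])    # Lb': toward +X
    _, dmc = relative_distances(mouth_y, rect[2], rect[3])  # Ld': toward +Y
    return dec / calib[0], dmc / calib[1]


def describe(rf_e, rf_m):
    """Map the ratios to the directions stated in the text (1 means no rotation)."""
    horizontal = "left" if rf_e > 1 else "right" if rf_e < 1 else "none"
    vertical = "down" if rf_m > 1 else "up" if rf_m < 1 else "none"
    return horizontal, vertical


# Hypothetical calibration values (dei, dmi) and current-frame measurements.
rf_e, rf_m = rotation_ratios(0.25, 0.44, (0.0, 1.0, 0.0, 1.0), (0.6, 0.7))
direction = describe(rf_e, rf_m)
```

Normalizing by the side length of the circumscribed rectangle is what makes the ratios insensitive to the face moving toward or away from the camera.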
[0098]
Next, based on the left-right and up-down rotation amounts $Rf_E$ and $Rf_m$ obtained by equations (12) and (13), the rectangle 60 is distorted as follows (step S1060).
[0099]
If $Rf_E > 1$:
The length of the left side of the rectangle (the side parallel to the Y axis on the negative X side) is shortened using the following equation (14).
$l = w \cdot Rf_E \cdot ol$ ...... Formula (14)
Here, l is the calculated length, ol is the original length, and w is a weighting factor.
[0100]
If $Rf_E < 1$:
The length of the right side of the rectangle (the side parallel to the Y axis on the positive X side) is shortened using equation (14).
[0101]
If $Rf_m > 1$:
The length of the lower side of the rectangle (the side parallel to the X axis on the negative Y side) is shortened using the following equation (15).
$l = w \cdot Rf_m \cdot ol$ ...... Formula (15)
Here, l is the calculated length, ol is the original length, and w is a weighting factor.
[0102]
If $Rf_m < 1$:
The length of the upper side of the rectangle (the side parallel to the X axis on the positive Y side) is shortened using equation (15).
[0103]
For example, as shown in FIG. 26 (a), when the head is rotated to the left, the left side of the rectangle 60 becomes shorter, and as shown in FIG. 26 (b), when the head is rotated upward, the upper side of the rectangle 60 becomes shorter. The vertex coordinates of the rectangle deformed in this way are then obtained with reference to the center coordinates of the rectangle 60 before deformation.
[0104]
Next, the posture measuring means 17 measures the three-dimensional posture of the head based on the four vertex coordinates (two-dimensional coordinates) obtained as described above and the corresponding virtual points in the three-dimensional space set by the affine base setting means 15. Here, the three-dimensional posture is measured using the following method.
[0105]
The relationship between the image captured by the camera and the object in the three-dimensional space is as shown in FIG. 28. In FIG. 28, 63 is the plane in the three-dimensional space set by the affine base setting means 15, 64 is the camera image plane, and 65 is the camera coordinate system.
[0106]
A point $(X_f, Y_f, Z_f)$ in the coordinate system of the plane 63 in the three-dimensional space and the corresponding point $(X_c, Y_c, Z_c)$ in the camera coordinate system 65 are related by the following equation (16).
[Equation 3]
$\begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = R \begin{pmatrix} X_f \\ Y_f \\ Z_f \end{pmatrix} + T$ ...... Formula (16)
In equation (16), R represents the rotational component and T represents the translational component; together they are equivalent to the three-dimensional posture information of the head.
[0107]
On the other hand, a point $(X_c, Y_c, Z_c)$ in the three-dimensional space and the two-dimensional point $(dX_c, dY_c)$ on the camera image plane 64 are related by the following equation (17).
[Expression 4]

Here, the matrix including P is a perspective projection matrix of the video camera 80 to be used, and can be obtained in advance using a known camera calibration technique.
[0108]
Now, the rectangle obtained by the head rotation amount estimation means 16 (on the camera image plane 64) has vertical sides that are parallel to each other and horizontal sides that are parallel to each other in the three-dimensional space. From these two pairs of parallel sides, the vertical and horizontal direction vectors (X axis, Y axis) of the rectangle in the three-dimensional space can be obtained.
[0109]
Let the equations of the straight lines of a pair of parallel sides on the camera image plane 64 be
$a_1 x + b_1 y + c_1 = 0$ ... Formula (18)
$a_2 x + b_2 y + c_2 = 0$ ... Formula (19)
Then the equations of the three-dimensional planes that contain these straight lines in the camera coordinate system 65 can be expressed by the following equations (20) and (21).
[0110]

Taking the cross product of the normal vectors of these two planes (their X, Y, and Z coefficients) gives the direction vector (X axis, Y axis).
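The cross-product step can be sketched as follows. Because equations (20) and (21) are not reproduced here, the sketch assumes a simple pinhole camera with focal length `f` and image coordinates x = f·X/Z, y = f·Y/Z, under which the plane through the camera center containing the line a·x + b·y + c = 0 is a·f·X + b·f·Y + c·Z = 0. The function names and the pinhole model are assumptions, not the patent's projection matrix.

```python
def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def plane_normal(line, f):
    """Normal of the plane through the camera center that contains an image line.

    line = (a, b, c) for a*x + b*y + c = 0 under the assumed pinhole model.
    """
    a, b, c = line
    return (a * f, b * f, c)


def direction_of_parallel_sides(line1, line2, f):
    """3D direction shared by two sides that are parallel in space."""
    return cross(plane_normal(line1, f), plane_normal(line2, f))


# Two image lines y = 1 and y = -1, which are also parallel in the image.
direction = direction_of_parallel_sides((0, 1, -1), (0, 1, 1), f=1)
```

Both planes contain the common 3D direction of the two sides, so that direction is perpendicular to both normals, which is exactly what the cross product returns (up to scale and sign).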
[0111]
The direction vectors corresponding to the X axis and Y axis of the rectangle in the camera coordinate system 65 can be obtained as described above, but because of errors in the information obtained from the image, the obtained direction vectors may not be orthogonal, as shown in FIG. 29. Therefore, when the obtained direction vectors are S1 and S2, orthogonal vectors V1 and V2 are obtained based on the vectors S1 and S2. The vector in the Z-axis direction is obtained from the cross product of the obtained V1 and V2. These three direction vectors form the rotation component R in equation (16).
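One way to build R from S1 and S2 is sketched below. The text does not state how V1 and V2 are derived from S1 and S2; this sketch swaps in Gram-Schmidt orthogonalization, which keeps S1's direction and adjusts S2, purely as an assumption. The function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def rotation_from_directions(s1, s2):
    """Return the three axis vectors (V1, V2, V1 x V2) that make up R."""
    v1 = normalize(s1)
    # Remove the component of s2 that lies along v1, then rescale (Gram-Schmidt).
    k = sum(a * b for a, b in zip(s2, v1))
    v2 = normalize(tuple(b - k * a for a, b in zip(v1, s2)))
    v3 = cross(v1, v2)
    return v1, v2, v3


# S2 is deliberately not orthogonal to S1.
x_axis, y_axis, z_axis = rotation_from_directions((2.0, 0.0, 0.0), (1.0, 3.0, 0.0))
```

A symmetric correction that moves both vectors by the same amount would be an equally plausible reading; the choice only affects how the measurement error is shared between the two axes.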
[0112]
If the rotation component R is known, the translation component T can be obtained by substituting the corresponding points of the two-dimensional coordinates and the three-dimensional coordinates into the equations (16) and (17).
[0113]
The posture measuring means 17 first obtains the straight-line parameters (equations) of each side, as in equation (18), from the coordinates of the four vertices of the rectangle obtained by the head rotation amount estimating means 16 (step S1100). Using equations (20) and (21), it obtains the X axis and Y axis of the virtual three-dimensional plane set by the affine base setting means 15 (step S1110). Then, as described above, the obtained axes are corrected so as to be orthogonal to each other, the Z axis is obtained from the corrected X and Y axes, and the rotation matrix (rotation component) R is obtained from the direction vectors of these three axes (X axis, Y axis, Z axis) (step S1120). The translation matrix (translation component) T is obtained by substituting the corresponding two-dimensional and three-dimensional points, together with this rotation component R, into equations (16) and (17) (step S1130).
[0114]
Using the projection matrix obtained as described above, the virtual points in the three-dimensional space are actually projected onto the camera image plane, and the projection matrix is corrected according to the resulting error (step S1140). The projection matrix at the point where the error falls below a threshold value is taken as the three-dimensional posture information of the head (step S1150). This three-dimensional posture information is output to the character control device 90, whereby the three-dimensional posture of the head of the CG character is controlled.
[0115]
In this way, a rectangle (virtual plane) in the three-dimensional space is defined from the three points of both eyes and the mouth detected from the face image, and during tracking the rectangle created from these three points is distorted according to the movement of the head, thereby simulating the distortion that arises when a three-dimensional plane is projected in two dimensions. Three-dimensional posture information, which originally cannot be obtained without four or more pairs of corresponding three-dimensional and two-dimensional points, is thus estimated from only the three points of both eyes and the mouth obtained from the image.
[0116]
Next, the operation of the open / closed state measuring means 18 will be described. Using the projection matrix obtained by the posture measuring means 17, that is, the three-dimensional posture information of the head, the open / closed state measuring means 18 reproduces the regions of both eyes and the mouth as they would appear in the camera image if the user were facing the front. It then obtains the ratio between the length of each reproduced region in the image vertical direction (Y direction) and the length in the image vertical direction of the corresponding region in the initial state stored in the initial position setting means 13. This ratio is the open / closed state information indicating how far both eyes and the mouth are open.
[0117]
Thus, since the regions of both eyes and the mouth in the camera image when the user is facing the front are estimated using the three-dimensional posture information, the image for the frontal case can be estimated even from an image in which the head is facing sideways or upward, and the open / closed state of both eyes and the mouth can be obtained more accurately from only the two-dimensional image.
[0118]
The three-dimensional posture information of the head and the open / closed state information of both eyes and the mouth obtained in this way are input to the character control device 90. The character control device 90 variably controls the movement of the head of the CG character and the open / closed states of both eyes and the mouth using the input information, so that the movement and expression of the CG character change in real time, following the movement and expression of the user imaged by the video camera 80.
[0119]
[Effects of the Invention]
As described above, according to the present invention, a virtual plane in the three-dimensional space is set from the positions of both eyes and the mouth detected first; the amounts of rotation of the head in the left-right and up-down directions are estimated from the detected positions of both eyes and the mouth; the rectangle connecting the four points obtained from the detected eye and mouth positions is distorted using the estimated rotation amounts, thereby simulating the distortion that arises when a three-dimensional plane is projected in two dimensions; and the three-dimensional posture of the head is estimated using the coordinates of the four points of the distorted rectangle. Consequently, three-dimensional posture information, which originally cannot be obtained without four or more pairs of corresponding three-dimensional and two-dimensional points, can be estimated from only the three points obtained from the image. This enables high-speed processing, so that real-time and robust processing is possible even on hardware with low computing power.
[0120]
According to the next invention, a local coordinate system of the head is set with the straight line connecting both eye areas as the X axis and the straight line perpendicular to the X axis and passing through the center position of the mouth area as the Y axis. From the relative distance between the left and right sides of the circumscribed rectangle of the head area and one eye, and the relative distance between the upper and lower sides of the circumscribed rectangle and the mouth area, the left and right and vertical rotation amounts of the head area are estimated. Therefore, the amount of rotation of the head can be determined by a relatively simple process.
[0121]
According to the next invention, the regions of both eyes and the mouth as they would appear when the target person faces the front are reproduced based on the estimated three-dimensional posture information of the head, and the open / closed state of both eyes and the mouth is measured from the reproduced regions. Therefore, even from an image in which the head is facing sideways or upward, the image for the frontal case can be estimated, and the open / closed state of both eyes and the mouth can be obtained more accurately from only a two-dimensional image.
[0122]
According to the next invention, when the part detection means cannot identify one of the regions of both eyes and the mouth, a movement vector is obtained from the position of a part region identified in the current frame and the position of that part region in the previous frame, and the position of the part that could not be identified is determined using this movement vector. Therefore, if one part can be detected, the positions of the other parts can be obtained even when they are not detected, which makes robust part tracking possible. Furthermore, even if the target part does not appear in the video because it is hidden, a provisional part region is set, so the part can be tracked immediately when it reappears, and smooth movement can be reproduced.
[0123]
According to the next invention, a rectangular area of a certain size is set around the center coordinates of the part area in the previous frame, the candidate areas of the current frame that exist in the rectangular area are obtained, the evaluation value E is obtained for each of these candidate areas using the discriminant E = | SP−SC | + OP + D, and the candidate area with the smallest evaluation value E is identified as the part area. Therefore, each part can be identified by a simple calculation, high-speed processing is possible, and real-time and robust processing is possible even on hardware with low computing power.
[0124]
According to the next invention, the luminance of the extracted head region is averaged and normalized, and the candidate regions for both eyes and the mouth in the head region are extracted using the luminance-normalized image. Therefore, the influence of gradation, shadows, highlights, and the like is suppressed, candidate areas can be extracted without being affected by changes in brightness, and the eye and mouth part areas can be extracted with fixed threshold processing. High-speed processing is thus possible, and real-time and robust processing is possible even on hardware with low computing power.
[0125]
According to the next invention, the head region is divided into a plurality of small regions, and the histogram averaging process is performed for each small region. Therefore, the influence of gradation, shadows, highlights, and the like is suppressed, candidate regions can be extracted without being affected by changes in luminance, and the eye and mouth part regions can be extracted with fixed threshold processing. High-speed processing is thus possible, and real-time and robust processing is possible even on hardware with low computing power.
[0126]
According to the next invention, in the histogram averaging process, the frequency of pixel values having a frequency exceeding a predetermined threshold value is distributed to other pixel values, so that the occurrence of noise can be suppressed.
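The idea of limiting over-represented pixel values before histogram averaging can be sketched as follows. The sketch processes a single region rather than the per-small-region scheme, and the even redistribution of the clipped counts over all bins, the function name, and the `clip` and `levels` parameters are assumptions; the text only says that the excess frequency is distributed to other pixel values.

```python
def clipped_equalization(pixels, clip, levels=256):
    """Histogram equalization with a clip limit on each bin.

    pixels: flat list of ints in [0, levels). Returns the remapped list.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cut every bin at the clip limit and collect what was removed.
    excess = sum(max(0, h - clip) for h in hist)
    hist = [min(h, clip) for h in hist]
    # Spread the removed counts evenly over all bins (fractional counts are fine).
    share = excess / levels
    hist = [h + share for h in hist]
    # The cumulative distribution gives the new value for each input level.
    total, running, lut = float(len(pixels)), 0.0, []
    for h in hist:
        running += h
        lut.append(min(levels - 1, int(round((levels - 1) * running / total))))
    return [lut[p] for p in pixels]


# A region dominated by one dark value, with a few bright pixels.
region = [10] * 90 + [200] * 10
limited = clipped_equalization(region, clip=20)
unlimited = clipped_equalization(region, clip=len(region))
```

Without the limit, the dominant value is pushed far up the output range, which is the kind of amplification that shows up as noise in flat areas; with the limit, the same value is mapped much lower.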
[Brief description of the drawings]
FIG. 1 is a block diagram showing an embodiment of a real-time facial expression tracking device according to the present invention.
FIG. 2 is a flowchart for explaining an outline of an operation in a calibration phase of the real-time facial expression tracking apparatus of FIG. 1;
FIG. 3 is a flowchart for explaining an outline of an operation in a tracking phase of the real-time facial expression tracking device of FIG. 1;
FIG. 4 is a diagram for explaining skin color sampling;
FIG. 5 is a flowchart for explaining operations of a skin color sampling unit and a skin color extraction parameter adjustment unit.
FIG. 6 is a flowchart for explaining the operation of the skin color area extracting means and the head area extracting means 9;
FIG. 7 is a diagram showing an example of a result of extracting a skin color area by a skin color area extracting unit.
FIG. 8 is a diagram illustrating an expansion mask and a contraction mask.
FIG. 9 is a diagram for explaining a process for filling a tear that has occurred in a detected head region;
FIG. 10 is a diagram for explaining logical operation processing for filling all holes in the head region.
FIG. 11 is a flowchart for explaining the operation of a part region candidate extraction unit;
FIG. 12 is a diagram for explaining processing for suppressing noise generation, which is a drawback of the adaptive histogram averaging method;
FIG. 13 is a diagram for explaining an adaptive histogram averaging method;
FIG. 14 is a diagram for explaining an adaptive histogram averaging method;
FIG. 15 is a flowchart for explaining the operation of the part detection tracking means in the calibration phase.
FIG. 16 is a diagram showing a mask area used when specifying both eyes and a mouth area in the part detecting means;
FIG. 17 is a flowchart for explaining the operation of the head three-dimensional posture / expression measurement means 5 in the calibration phase.
FIG. 18 is a diagram showing virtual points on a three-dimensional space set by affine base setting means.
FIG. 19 is a diagram for describing the relative positions of the end points of both eyes and the center point of the mouth with respect to the circumscribed rectangle of the head region obtained by the head movement amount estimating means;
FIG. 20 is a flowchart for explaining the operation of the part detection tracking means in the tracking phase (part 1);
FIG. 21 is a flowchart for explaining the operation of the part detection / tracking means in the tracking phase (part 2);
FIG. 22 is a diagram for explaining a tracking method of a part region in a current frame by a part tracking unit.
FIG. 23 is a diagram for explaining processing for predicting a part region that could not be detected from the position of the part region that could be detected;
FIG. 24 is a flowchart for explaining the operation of the head three-dimensional posture / expression measurement means in the tracking phase.
FIG. 25 is a diagram for describing processing for estimating the head rotation amount in the left-right and up-down directions by the head rotation amount estimation unit;
FIG. 26 is a diagram for explaining processing for obtaining corresponding points corresponding to virtual points (affine bases) in a three-dimensional space in the head rotation amount estimation unit;
FIG. 27 is a flowchart for explaining the operation of the head three-dimensional posture / expression measuring means in the tracking phase.
FIG. 28 is a diagram for explaining processing for obtaining three-dimensional posture information of a head from three-dimensional and two-dimensional corresponding points in posture measurement means.
FIG. 29 is a diagram for describing processing for correcting an error when obtaining posture information;
FIG. 30 is a diagram showing a conventional technique.
[Explanation of symbols]
1 video input means, 2 head region detection means, 3 part region candidate extraction means, 4 part detection tracking means, 5 head 3D posture / expression measurement means, 6 skin color sampling means, 7 skin color extraction parameter adjustment means, 8 skin color area extraction means, 9 head area extraction means, 10 head area luminance averaging means, 11 pixel selection means, 12 part detection means, 13 initial position setting means, 14 part tracking means, 15 affine base setting means, 16 head rotation amount estimating means, 17 posture measuring means, 18 open / closed state measuring means, 20 sampling window, 22 expansion mask, 23 contraction mask, 33 left eye mask, 34 right eye mask, 35 mouth mask, 50 movement vector, 53 rectangular area, 57 circumscribed rectangle, 64 camera image plane, 80 video camera, 90 character control device.

Claims

A real-time facial expression tracking device comprising:
video input means for capturing video that is sequentially input at a predetermined frame rate;
head region detection means for extracting a head image from the captured image;
part region candidate extraction means for extracting candidate regions for each part, including both eyes and the mouth, from the extracted head region;
part detection tracking means for detecting the position of each part from among the extracted candidate regions; and
head three-dimensional posture / facial expression measuring means for measuring the three-dimensional posture of the head based on the detected positions of both eyes and the mouth, and for measuring the open/closed states of both eyes and the mouth,
the device controlling the movement of a CG character based on the measured three-dimensional posture of the head and the measured open/closed states of both eyes and the mouth,
wherein the head three-dimensional posture / facial expression measuring means comprises:
affine basis setting means for setting a virtual plane in three-dimensional space from the initially detected positions of both eyes and the mouth;
head rotation amount estimation means for estimating the left-right and up-down rotation amounts of the head from the detected positions of both eyes and the mouth;
posture measuring means for distorting a rectangle connecting the coordinates of four points obtained from the detected positions of both eyes and the mouth, using the estimated left-right and up-down rotation amounts of the head, and for estimating the three-dimensional posture of the head using the coordinates of the four points of the distorted rectangle; and
open/closed state measuring means for estimating the open/closed states of both eyes and the mouth in accordance with the movement of the head,
and wherein the head rotation amount estimation means sets a local coordinate system of the head in which the straight line connecting both eye regions is the X axis and the straight line perpendicular to the X axis and passing through the center position of the mouth region is the Y axis, and estimates the left-right and up-down rotation amounts of the head, respectively, from the relative distance between the left and right sides of the circumscribed rectangle of the head region obtained in this local coordinate system and one eye, and from the relative distance between the upper and lower sides of the circumscribed rectangle and the mouth region.
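The local coordinate system and relative distances recited in the claim above can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`local_frame`, `to_local`, `relative_offsets`), the choice of origin (the foot of the perpendicular from the mouth center onto the eye line), and the normalization by the rectangle width and height are all assumptions made for illustration. The claim does not state how the relative distances are converted into rotation angles, so the sketch stops at the normalized distances.

```python
import math

def local_frame(left_eye, right_eye, mouth):
    """Head-local frame: X axis along the eye line, Y axis perpendicular to it
    and passing through the mouth center (origin choice is an assumption)."""
    ex, ey = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    norm = math.hypot(ex, ey)
    x_axis = (ex / norm, ey / norm)
    y_axis = (-x_axis[1], x_axis[0])  # X axis rotated by 90 degrees
    # Project the mouth center onto the eye line to obtain the origin.
    t = (mouth[0] - left_eye[0]) * x_axis[0] + (mouth[1] - left_eye[1]) * x_axis[1]
    origin = (left_eye[0] + t * x_axis[0], left_eye[1] + t * x_axis[1])
    return origin, x_axis, y_axis

def to_local(point, frame):
    """Express an image point in the head-local frame."""
    origin, x_axis, y_axis = frame
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    return (dx * x_axis[0] + dy * x_axis[1], dx * y_axis[0] + dy * y_axis[1])

def relative_offsets(head_pixels, left_eye, right_eye, mouth):
    """Return (horizontal, vertical) relative distances in [0, 1].

    horizontal: position of one eye between the left and right sides of the
                circumscribed rectangle of the head region.
    vertical:   position of the mouth between the upper and lower sides.
    Both are measured in the head-local frame.
    """
    frame = local_frame(left_eye, right_eye, mouth)
    pts = [to_local(p, frame) for p in head_pixels]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    eye_x, _ = to_local(left_eye, frame)
    _, mouth_y = to_local(mouth, frame)
    horizontal = (eye_x - x_min) / (x_max - x_min)
    vertical = (mouth_y - y_min) / (y_max - y_min)
    return horizontal, vertical
```

Because the circumscribed rectangle is computed in the head-local frame rather than in image coordinates, the two ratios do not change when the head rolls in the image plane. A change in either ratio can therefore be attributed to left-right or up-down rotation.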

A real-time facial expression tracking device comprising:
video input means for capturing video that is sequentially input at a predetermined frame rate;
head region detection means for extracting a head image from the captured image;
part region candidate extraction means for extracting candidate regions for each part, including both eyes and the mouth, from the extracted head region;
part detection tracking means for detecting the position of each part from among the extracted candidate regions; and
head three-dimensional posture / facial expression measuring means for measuring the three-dimensional posture of the head based on the detected positions of both eyes and the mouth, and for measuring the open/closed states of both eyes and the mouth,
the device controlling the movement of a CG character based on the measured three-dimensional posture of the head and the measured open/closed states of both eyes and the mouth,
wherein the head three-dimensional posture / facial expression measuring means comprises:
affine basis setting means for setting a virtual plane in three-dimensional space from the initially detected positions of both eyes and the mouth;
head rotation amount estimation means for estimating the left-right and up-down rotation amounts of the head from the detected positions of both eyes and the mouth;
posture measuring means for distorting a rectangle connecting the coordinates of four points obtained from the detected positions of both eyes and the mouth, using the estimated left-right and up-down rotation amounts of the head, and for estimating the three-dimensional posture of the head using the coordinates of the four points of the distorted rectangle; and
open/closed state measuring means for estimating the open/closed states of both eyes and the mouth in accordance with the movement of the head,
and wherein the open/closed state measuring means measures the open/closed states of both eyes and the mouth by reproducing, using the three-dimensional posture information of the head estimated by the posture measuring means, the eye and mouth regions as they appear when the target person faces front.

The real-time facial expression tracking device according to claim 1, wherein the open/closed state measuring means measures the open/closed states of both eyes and the mouth by reproducing, using the three-dimensional posture information of the head estimated by the posture measuring means, the eye and mouth regions as they appear when the target person faces front.