JP4177689B2 - Video feature information generator

Info

Publication number: JP4177689B2
Application number: JP2003073548A
Authority: JP
Inventors: 貴裕望月; 真人藤井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2003-03-18
Filing date: 2003-03-18
Publication date: 2008-11-05
Anticipated expiration: 2023-03-18
Also published as: JP2004280669A

Description

【０００１】
【発明の属する技術分野】
本発明は、映像データから、その映像データを特徴付ける簡易な映像特徴情報を生成する映像特徴情報生成装置に関する。
【０００２】
【従来の技術】
一般に、大量の映像データの中から所望の映像データを検索するには、予め映像データから特徴量を抽出しておき、その特徴量に基づいて映像データの検索を行う映像検索技術が知られている。
【０００３】
従来、映像検索を行うために映像データから特徴量を抽出するには、その映像データ内の被写体及び背景の厳密な形状を切り出して、各形状の特徴量を抽出したり、映像データを構成する各フレーム画像上に配置した多数の特徴点の正確な動きを示すベクトルを特徴量として抽出したりすることにより行っていた。
【０００４】
例えば、映像データから、オブジェクト（被写体）を検出し、そのオブジェクトの位置、形状及び動きを含んだ特徴量と、背景の動きを含んだ特徴量とを、映像データの特徴量データとして生成し、その特徴量データに基づいて、映像検索を行う技術が存在する（特許文献１参照。）。
【０００５】
【特許文献１】
特開２０００−２２２５８４号公報（段落番号２５−３０、第１図）
【０００６】
【発明が解決しようとする課題】
しかし、前記従来の技術では、映像データから詳細な特徴量を抽出するため、その特徴量を算出するための計算時間が長くなってしまうという問題があった。また、従来のように、詳細な特徴量に基づいて、映像データの検索を行うと、検索者の検索意図を反映した検索情報の絞り込みや、映像データ内の検索情報の類似度計算が複雑になるため、検索者の検索意図に沿った映像データを検索するための時間が長くなってしまうという問題があった。
【０００７】
さらに、前記従来の技術では、映像データの物体（被写体）領域、背景領域及びそれらの動きを示すベクトルを、スケッチのような簡単な表現で検索者の検索意図を表したい場合、映像データの特徴量が詳細であるほど、スケッチの表現を複雑にしなければならない。そのため、検索者が思い描いている検索情報をより複雑に表現しなければならず、結果的に検索者の感覚（類似感覚）に適した検索を行うことができないということになる。
【０００８】
本発明は、以上のような問題点に鑑みてなされたものであり、映像データから簡易な情報を特徴量として抽出することで、特徴量を抽出するための計算時間を短縮し、かつ、検索者の類似感覚に適した映像検索により所望の映像データを検索することを可能にした映像特徴情報生成装置を提供することを目的とする。
【００１８】
【課題を解決するための手段】
本発明は、前記目的を達成するために創案されたものであり、まず、請求項１に記載の映像特徴情報生成装置は、映像データのシーン毎に、当該シーンを特徴付ける映像特徴情報を生成する映像特徴情報生成装置であって、フレーム画像サンプリング手段と、節点初期設定手段と、節点追跡手段と、節点分類手段と、クラス画像特徴量生成手段と、特徴情報生成手段とを備える構成とした。
【００１９】
かかる構成によれば、映像特徴情報生成装置は、フレーム画像サンプリング手段によって、映像データから特定のサンプリング間隔でフレーム画像を抽出し、節点初期設定手段によって、フレーム画像サンプリング手段で抽出されたフレーム画像の先頭フレームに、画像の特徴を抽出するための基準となる節点（基準点）を設定する。なお、この節点は、特定の間隔で設定することとしてもよいし、予め配置を決めたテンプレートに基づいて、位置を設定することとしてもよい。
【００２０】
そして、映像特徴情報生成装置は、節点追跡手段によって、近傍画像領域の画像特徴量が類似する節点を探索することで、先頭フレーム以降のフレーム画像に対応する節点を追跡するとともに、節点分類手段によって、フレーム画像毎に、節点追跡手段で追跡した各節点の近傍画像領域の画像特徴量で類似する節点を１つのクラス（クラスタ）にまとめて分類（クラスタリング）する。
【００２１】
また、映像特徴情報生成装置は、クラス画像特徴量生成手段によって、シーンの先頭フレームから最終フレームまでにおいて、節点分類手段で同一のクラス（クラスタ）に分類された節点の近傍画像領域に対応する各フレーム画像の画像特徴量の中で、当該クラス内の平均画像特徴量との距離が最も小さい画像特徴量を、当該クラスを代表するクラス画像特徴量として生成する。このクラス画像特徴量は、同一クラス（クラスタ）の画像の特徴を最もよく表した画像特徴量となる。
【００２２】
そして、映像特徴情報生成装置は、特徴情報生成手段によって、クラス画像特徴量生成手段で生成されたクラス画像特徴量と、当該クラス内の節点を含んだ矩形領域の座標情報と、シーンの先頭フレームからの最終フレームまでの矩形領域の位置重心の動きベクトルである動き情報とを、映像特徴情報として生成する。
また、映像特徴情報生成装置は、特徴情報生成手段が、クラス画像特徴量に最も類似するフレーム画像から代表的なテクスチャ（模様）を抽出する。そして、このテクスチャで、節点分類手段で分類された近傍画像領域毎の矩形領域のそれぞれを塗りつぶす等によって表すことで、矩形領域そのものの特徴を代表的な画像データで表現することができる。これによって、映像データの映像特徴情報を、視覚可能な画像データによって表現することができる。
【００２３】
また、請求項２に記載の映像特徴情報生成装置は、請求項１に記載の映像特徴情報生成装置において、前記映像データをシーン毎に分割するシーン分割手段を備え、前記フレーム画像サンプリング手段が、前記シーン分割手段で分割されたシーン毎に特定のサンプリング間隔でフレーム画像を抽出することを特徴とする。
【００２４】
かかる構成によれば、映像特徴情報生成装置は、シーン分割手段によって、入力された映像データのシーンの切り替わり（シーンチェンジ）を検出し、映像データをシーン毎に分割する。これによって、映像特徴情報生成装置は、節点の特徴の区切りとなるシーン毎に映像データを自動的に分割することができるため、映像特徴情報を生成するための効率を高めることができる。
【００３２】
【発明の実施の形態】
以下、本発明の実施の形態について図面を参照して説明する。
［映像特徴情報生成装置の構成］
図１は、本発明における映像特徴情報生成装置の構成を示したブロック図である。映像特徴情報生成装置１は、映像データを部分映像（以下、シーンという）に分割し、これらのシーンを、被写体や背景の大まかな位置・大きさ、代表的な画像・模様（以下、テクスチャという）、及び、大まかな動き（以下、動きベクトルという）で表現した映像特徴情報（シーン簡易情報）を生成するものである。ここでは、映像特徴情報生成装置１を、シーン分割手段２と、フレーム画像サンプリング手段３と、節点初期設定手段４と、節点追跡手段５と、節点クラスタリング手段６と、テクスチャクラスタリング手段７と、クラスタ情報記憶手段８と、特徴情報生成手段９とを備えて構成した。また、映像特徴情報生成装置１は、映像データを蓄積したハードディスク等の映像蓄積手段２０から、映像データを入力するものとする。
【００３３】
シーン分割手段２は、映像蓄積手段２０から入力した映像データをシーンに分割するものである。このシーン分割手段２におけるシーン分割は、映像データの構成が大きく切り替わる点（シーンチェンジ）を検出して、その切り替わり点毎に映像データを分割する。このシーンチェンジの検出は、既存の手法によって行うことができる。例えば、映像データを構成する時系列に連続するフレーム画像間の差分を計算し、その差分の絶対値和が予め定めた閾値よりも大きい場合は、フレーム画像間に連続性がないとみなすことによりシーンチェンジの検出を行う。
【００３４】
このシーン分割手段２では、映像データを分割したシーン毎にフレーム画像を逐次フレーム画像サンプリング手段３へ出力する。なお、このシーン分割手段２は、後記する特徴情報生成手段９で、１つのシーンの映像特徴情報（シーン簡易情報）を生成した段階で、次のシーンを検出（分割）するものとする。
【００３５】
フレーム画像サンプリング手段３は、シーン分割手段２から入力されるフレーム画像を予め設定したサンプリング間隔で抽出するものである。このフレーム画像サンプリング手段３では、抽出したフレーム画像がシーンの先頭フレームである場合には、フレーム画像（先頭フレーム）を節点初期設定手段４へ出力し、抽出したフレーム画像がシーンの先頭フレーム以外である場合には、フレーム画像（先頭フレーム以降）を節点追跡手段５へ出力する。
【００３６】
節点初期設定手段４は、フレーム画像サンプリング手段３から先頭フレームのフレーム画像を入力されたときに、フレーム画像（先頭フレーム）上に予め定めた間隔で格子状に節点を設定するものである。ここでは、近傍画像特徴量計算部４ａを付加して、節点初期設定手段４を構成した。なお、節点とはフレーム画像の特徴を抽出するための基準となる点を指すものである。また、この節点は、特定の間隔で設定することとしてもよいし、予め配置を決めたテンプレートに基づいて、位置を設定することとしてもよい。
【００３７】
近傍画像特徴量計算部４ａは、フレーム画像（先頭フレーム）上に設定した節点の近傍（近傍画像領域：予め定めた大きさの領域（正方形領域等））から画像特徴量を計算するものである。この画像特徴量としては、画像処理の分野で一般的な特徴量を用いればよい。例えば、ＲＧＢの各色成分の平均値、画像をエッジ化したときのエッジ量の分布、画像の複雑さを示すフラクタル次元等を用いることができる。
【００３８】
ここで、図７を参照（適宜図１参照）して節点について説明する。図７は、節点初期設定手段４が先頭フレームのフレーム画像上において、ある一定間隔で格子状に節点ＰＴを設置した状態を示す図である。ここでは、説明の都合上、フレーム画像上に節点ＰＴを図示しているが、この節点ＰＴはフレーム画像上の格子状の点に対応した位置を示しているだけである。このように、節点初期設定手段４は、横Ｎｐｘ個、縦Ｎｐｙ個（Ｎｐｘ及びＮｐｙは予め設定）で格子状に節点ＰＴを設定する。
【００３９】
さらに、節点初期設定手段４の近傍画像特徴量計算部４ａは、各節点ＰＴを中心とした近傍領域（近傍画像領域：Ｒｆｖ画素×Ｒｆｖ画素の正方形領域）から画像特徴量を計算し、各節点ＰＴに対応付ける。
【００４０】
なお、この節点初期設定手段４では、図６（ａ）に示すような情報（節点情報ＰＴＩ）を節点毎に生成し、その節点情報ＰＴＩをクラスタ情報記憶手段８に記憶する。図６（ａ）に示すように、節点情報ＰＴＩは、節点を設定したフレーム画像の番号（フレーム番号）ＰＴＩｆｎと、節点のフレーム画像上での現在の位置（座標情報：ｘ座標及びｙ座標）ＰＴＩｐｓと、現在の位置における節点の近傍領域から計算した画像特徴量（近傍画像特徴）ＰＴＩｆｖとで構成される。
図１に戻って説明を続ける。
【００４１】
節点追跡手段５は、フレーム画像サンプリング手段３から入力されるフレーム画像において、クラスタ情報記憶手段８に記憶されている前のフレーム画像に対応付けられている節点の近傍画像領域の画像特徴量（近傍画像特徴量）を参照して、その近傍画像特徴量に類似する近傍画像領域を追跡するものである。ここでは、近傍画像特徴量計算部５ａを付加して節点追跡手段５を構成した。なお、ここでは、近傍画像領域の追跡は、対応する節点の位置を追跡することとする。
【００４２】
近傍画像特徴量計算部５ａは、節点を中心とした近傍領域（近傍画像領域：特定の半径の円領域）から画像特徴量を計算するものである。これによって、節点追跡手段５は、前のフレーム画像に対応付けられている節点の近傍画像特徴量と、その節点位置の近傍領域における現在のフレーム画像の画像特徴量とを比較し、最も差の小さくなる画像特徴量を持つ画素を、現在のフレーム画像の節点とする。
【００４３】
なお、近傍画像特徴量計算部５ａが計算する画像特徴量は、近傍画像特徴量計算部４ａと同様、画像処理の分野で一般的な特徴量を計算するものとする。また、ここでは、近傍画像特徴量計算部４ａと近傍画像特徴量計算部５ａとを別の手段として構成したが、１つの手段として構成し、節点初期設定手段４及び節点追跡手段５が共通して動作させることとしてもよい。
【００４４】
ここで、図８を参照（適宜図１参照）して、フレーム画像上における節点の追跡について説明する。図８は、節点追跡手段５がフレーム画像上において節点を追跡した状態を示す図である。ここでは、前フレーム画像（ここでは図７の先頭フレーム）における節点ＰＴ（図中●印）が、現フレーム画像では節点ＰＴ_B（図中×印）に追跡されたことを示している。このように、節点追跡手段５は、時々刻々と変化するフレーム画像の内容に基づいて、画像特徴量の差が最も小さくなる位置に節点を移動させる。
図１に戻って説明を続ける。
【００４５】
節点クラスタリング手段（節点分類手段）６は、節点初期設定手段４又は節点追跡手段５で節点の位置が決まった１つのフレーム画像における節点を、近傍画像特徴量及び座標情報に基づいて分類（クラスタリング）し、その分類情報（クラスタ情報）をクラスタ情報記憶手段８に記憶するものである。
【００４６】
なお、この節点クラスタリング手段６では、図６（ｂ）に示すような情報（節点クラスタ情報ＰＣ）を生成し、その節点クラスタ情報ＰＣをクラスタ情報記憶手段８に記憶（格納）する。図６（ｂ）に示すように、節点クラスタ情報ＰＣは、フレーム画像毎に、節点クラスタに属する全節点ＰＣｐｔ（節点情報ＰＴＩを指し示すポインタとする）と、節点クラスタに属する全節点の近傍画像特徴量ＰＴＩｆｖ（図６（ａ））の平均画像特徴量ＰＣｆｖと、その節点クラスタに割り当てられるクラスタ番号ＰＣｃｎとで構成される。
【００４７】
テクスチャクラスタリング手段（クラス画像特徴量生成手段）７は、節点初期設定手段４又は節点追跡手段５で節点の位置が決まったフレーム画像を逐次入力し、１シーンにおける先頭フレームから最終フレームまでの、各クラスタの代表となる画像特徴量を生成するものである。
【００４８】
なお、このテクスチャクラスタリング手段７では、図６（ｃ）に示すような情報（テクスチャクラスタ情報ＴＣ）を、クラスタ情報記憶手段８に作成し、フレーム画像が入力される度に更新（格納）する。図６（ｃ）に示すように、テクスチャクラスタ情報ＴＣは、テクスチャクラスタに属する全節点ＴＣｐｔ（節点情報ＰＴＩを指し示すポインタとする）と、テクスチャクラスタに属する全節点の近傍画像特徴量ＰＴＩｆｖ（図６（ａ）参照）の平均画像特徴量ＴＣｆｖとで構成される。なお、平均画像特徴量ＴＣｆｖは、画像特徴量がＮ次元のベクトルである場合は、各要素の平均値からなるＮ次元のベクトルとする。
【００４９】
また、テクスチャクラスタリング手段７は、シーンにおける最終フレームが入力された段階で、各テクスチャクラスタを代表する代表テクスチャ構造体を生成する。ここでは、テクスチャクラスタに属する全節点ＴＣｐｔ（図６（ｃ）参照）の中で、クラスタに属する全節点の近傍画像特徴量ＰＴＩｆｖ（図６（ａ）参照）と、テクスチャクラスタに属する全節点の平均特徴量ＴＣｆｖ（図６（ｃ）参照）との距離が最も小さくなる画像特徴量を、テクスチャクラスタを代表する代表画像特徴量とする。また、この代表画像特徴量を持つ節点から予め定めた大きさの矩形領域を抽出し、代表テクスチャ画像とする。
【００５０】
なお、代表テクスチャ構造体は、図６（ｄ）に示すような構造とする。すなわち、代表テクスチャ構造体ＰＴＸは、予め設定されたＴｗ画素×Ｔｈ画素の代表テクスチャ画像ＰＴＸｉｍｇと、代表テクスチャ画像ＰＴＸｉｍｇの中心座標の近傍画像特徴量（代表画像特徴量）ＰＴＸｆｖとで構成される。
【００５１】
クラスタ情報記憶手段（記憶手段）８は、クラスタの情報を記憶しておくもので、一般的なハードディスク等の記憶媒体である。このクラスタ情報記憶手段８には、各フレーム画像のクラスタ情報ＣＩと、各シーン毎に生成される代表テクスチャ構造体ＰＴＸが記憶される。
なお、クラスタ情報には、前記したフレーム画像の節点情報ＰＴＩ（図６（ａ））、節点クラスタ情報ＰＣ（図６（ｂ））、及び、テクスチャクラスタ情報ＴＣ（図６（ｃ）参照）が含まれる。
【００５２】
特徴情報生成手段９は、クラスタ情報記憶手段８を参照して、各シーン毎に、そのシーンを特徴付ける映像特徴情報（シーン簡易情報）を生成するものである。ここでは、特徴情報生成手段９をシーン特徴情報生成部９ａと、初期化部９ｂとで構成した。
【００５３】
シーン特徴情報生成部９ａは、節点クラスタリング手段（節点分類手段）６で分類された節点を含んだ矩形領域の座標情報と、テクスチャクラスタリング手段（クラス画像特徴量生成手段）７で生成されたクラス画像特徴量（代表テクスチャ画像及び代表画像特徴量）と、矩形領域の動き情報とを映像特徴情報として生成するものである。なお、シーン特徴情報生成部９ａは、画像データ及びテキストデータとして映像特徴情報を生成する。
【００５４】
画像データで映像特徴情報を生成する場合は、クラスタの節点を含んだ矩形領域の内部を、代表テクスチャ画像で塗りつぶす（敷き詰める）。そして、先頭フレームから最終フレームまでのクラスタの位置重心の動きをそのクラスタの動きベクトルとして直線で描画する。
【００５５】
図１０にシーン特徴情報生成部９ａで生成した画像データの一例を示す。図１０に示すように、クラスタの矩形領域が、代表テクスチャ画像で塗りつぶされ、クラスタの動きが直線で示されている。これによって、映像データのシーンの簡易情報を視覚化することができる。
【００５６】
また、テキストデータで映像特徴情報を生成する場合は、先頭フレーム及び最終フレームのフレーム番号、クラスタの節点を含んだ矩形領域の座標情報、代表画像特徴量、及び、先頭フレームから最終フレームまでのクラスタの位置重心の動きベクトルをテキストデータとしてテキストファイルに書き込む。
【００５７】
図１１にシーン特徴情報生成部９ａで生成したテキストデータの一例を示す。図１１に示すように、シーン毎に、シーンの先頭フレーム番号Ｎｓ、最終フレーム番号Ｎｅ、矩形領域の座標情報｛（ｘ０，ｙ０）、（ｘ１、ｙ１）、（ｘ２、ｙ２）、（ｘ３、ｙ３）｝、テクスチャ構造体の画像特徴量（代表画像特徴量）｛（ｆ（０）、ｆ（１）、ｆ（２）、…、ｆ（Ｎ−１）｝、及び、動きベクトルのｘ及びｙ成分｛ｖｘ、ｖｙ｝でクラスタ１個分の情報を生成する。
【００５８】
ここで、代表画像特徴量｛（ｆ（０）、ｆ（１）、…｝は、それぞれ、ＲＧＢの各色成分の平均値、画像をエッジ化したときのエッジ量の分布等を示すものとする。また、ここでは、代表画像特徴量を、Ｎ次元の画像特徴ベクトルとして表している。
【００５９】
初期化部９ｂは、フレーム画像がシーンの最終フレームであるかどうかを判定し、最終フレームである場合に、クラスタ情報記憶手段８のクラスタ情報を初期化するものである。これによって、シーンの切り替わり（シーンチェンジ）が発生したときに、クラスタ情報記憶手段８に新しいシーンのクラスタ情報が生成されることになる。
【００６０】
以上、本発明に係る映像特徴情報生成装置１の構成について説明したが、本発明はこれに限定されるものではない。ここでは、映像データをシーンに分割するシーン分割手段２を備えたが、映像データが１シーン毎に入力される場合は、シーン分割手段２を映像特徴情報生成装置１から削除して構成してもよい。また、ここでは、シーン特徴情報生成部９ａにおいて、映像特徴情報を画像データ及びテキストデータで生成したが、いずれか一方を生成するものとしてもよい。
【００６１】
また、映像特徴情報生成装置１は、コンピュータにおいて各手段を各機能プログラムとして実現することも可能であり、各機能プログラムを結合して映像特徴情報生成プログラムとして動作させることも可能である。
【００６２】
［映像特徴情報生成装置の動作］
次に、図２を参照（適宜図１参照）して、映像特徴情報生成装置の動作について説明する。図２は、映像特徴情報生成装置の動作を示すフローチャートである。
【００６３】
まず、映像特徴情報生成装置１は、シーン分割手段２によって、映像蓄積手段２０から映像データを入力し、シーンの切り替わり（シーンチェンジ）を検出することで、映像データをシーンに分割する（ステップＳ１）。そして、フレーム画像サンプリング手段３によって、シーンから予め設定したサンプリング間隔でフレーム画像を抽出する（ステップＳ２）。
【００６４】
ここでフレーム画像サンプリング手段３は、抽出したフレーム画像がシーンの先頭フレームであるかどうかを判定し（ステップＳ３）、先頭フレームである場合（Ｙｅｓ）は、そのフレーム画像を節点初期設定手段４へ出力する。そして、節点初期設定手段４が、先頭フレームのフレーム画像に対して格子状に節点を設定し（ステップＳ４）、近傍画像特徴量計算部４ａが、各節点の近傍から画像特徴量を計算する（ステップＳ５）。
【００６５】
一方、抽出したフレーム画像が先頭フレーム以外の場合（ステップＳ３でＮｏ）は、そのフレーム画像を節点追跡手段５へ出力する。そして、節点追跡手段５が、現在のフレーム画像における節点の周り（特定の半径の円領域）の画像特徴量を計算し（ステップＳ６）、前フレーム画像における節点の近傍の画像特徴量を参照して、現在のフレーム画像における節点の位置を移動（追跡）する（ステップＳ７）。
このように、ステップＳ５又はステップＳ７で、各フレーム画像における節点の位置及び画像特徴量が節点情報ＰＴＩ（図６（ａ）参照）として生成される。
【００６６】
そして、映像特徴情報生成装置１は、節点クラスタリング手段（節点分類手段）６によって、フレーム画像の節点情報に基づいて、節点を分類（クラスタリング）し、クラスタ番号を付与することで、クラスタ情報（節点クラスタ情報ＰＣ：図６（ｂ）参照）を生成する（ステップＳ８）。また、映像特徴情報生成装置１は、テクスチャクラスタリング手段７によって、先頭のフレーム画像から現在入力されたフレーム画像までの同一のクラスタ情報（節点クラスタ情報）から、テクスチャクラスタ情報ＴＣ（図６（ｃ）参照）を生成する（ステップＳ９）。
【００６７】
ここで、映像特徴情報生成装置１は、入力されたフレーム画像がシーンの最終フレームであるかどうかを判定し（ステップＳ１０）、最終フレームでない場合（Ｎｏ）は、ステップＳ２に戻って、次のフレーム画像の分類（クラスタリング）を行う。
【００６８】
一方、入力されたフレーム画像が最終フレームの場合（ステップＳ１０でＹｅｓ）は、テクスチャクラスタリング手段７が、各テクスチャクラスタにおいて代表する代表テクスチャ構造体ＰＴＸ（図６（ｄ）参照）を生成する（ステップＳ１１）。
そして、映像特徴情報生成装置１は、特徴情報生成手段９のシーン特徴情報生成部９ａによって、シーンを特徴付ける映像特徴情報（シーン簡易情報）を生成し出力する（ステップＳ１２）。
【００６９】
以上の動作によって、映像データのあるシーンにおける映像特徴情報が生成されたことになる。なお、映像データの入力が継続する場合は、特徴情報生成手段９の初期化部９ｂによって、クラスタ情報記憶手段８のクラスタ情報及び代表テクスチャ構造体を初期化して、ステップＳ１からの動作を繰り返す。これによって、シーン毎の映像特徴情報（シーン簡易情報）が順次生成されることになる。
【００７０】
（節点クラスタリング手段の詳細動作）
次に、図３を参照（適宜図１参照）して、節点クラスタリング手段６の動作について、さらに詳細に説明を行う。図３は、節点クラスタリング手段６の動作を示すフローチャートである。なお、この動作は、図２におけるステップＳ８の動作に相当するものである。また、以下の説明において、各情報は図６で付した符号を用いることとする。
【００７１】
まず、節点クラスタリング手段６は、入力された１枚のフレーム画像上で、全節点対の近傍画像特徴量ＰＴＩｆｖの距離を計算する（ステップＳ２１）。そして、その距離がある閾値以下の節点同士を、同一のクラスタに含まれるように分類（クラスタリング）する（ステップＳ２２）。そして、同一のクラスタに含まれる全節点ＰＣｐｔの近傍領域の平均画像特徴量ＰＣｆｖを計算する（ステップＳ２３）。そして、節点クラスタリング手段６は、節点クラスタに属する全節点の位置ＰＴＩｐｓが離れた節点集合を異なる節点クラスタとして分離し（ステップＳ２４）、各節点クラスタにクラスタ番号ＰＣｃｎを付ける。
【００７２】
ここで、図９を参照（適宜図１参照）して、この節点クラスタの分離について説明する。図９は、節点クラスタリング手段６が節点クラスタの切り離し処理を行う画面の一例であり、（ａ）は画像特徴量に基づいてクラスタリングした画面、（ｂ）は位置が離れた節点集合同士を異なる節点クラスタにした画面である。図９（ａ）に示すように、画像特徴量に基づいてクラスタリングを行うと、左右の節点クラスタの位置は離れているが両者は同じ節点クラスタになってしまう。そこで、距離の離れた節点集合を切り離すことで、図９（ｂ）に示すように異なる節点クラスタが生成される（図９（ｂ）では節点クラスタ内の斜線方向の違いで異なる節点クラスタを表現している）。
図３のフローチャートに戻って説明を続ける。
【００７３】
節点クラスタリング手段６は、現在処理を行っているフレーム画像が先頭フレームであるかどうかを判定し（ステップＳ２５）、先頭フレームである場合は、ステップＳ２７へ進んでクラスタリング結果を各フレーム画像のクラスタ情報（節点クラスタ情報ＰＣ）としてクラスタ情報記憶手段８に書き込む。
【００７４】
一方、現在処理を行っているフレーム画像が先頭フレームでない場合（ステップＳ２５でＮｏ）は、以下のクラスタ番号付け直し処理を行う（ステップＳ２６）。
【００７５】
すなわち、ステップＳ２４で分離し生成された節点クラスタ（処理対象節点クラスタ）に属する節点が、前のフレーム画像のある節点クラスタに半数以上含まれていた場合は、前のフレーム画像の節点クラスタに付されているクラスタ番号ＰＣｃｎを、処理対象節点クラスタのクラスタ番号ＰＣｃｎとする。なお、処理対象節点クラスタに属する節点が、前のフレーム画像の節点クラスタに半数以上含まれていない場合は、クラスタ番号ＰＣｃｎとして「０」を付しておく。そして、クラスタ番号ＰＣｃｎに「０」が付された全ての節点クラスタに対して、使用されていないクラスタ番号を付すことで、全ての節点クラスタに固有のクラスタ番号を付す。
【００７６】
そして、このクラスタリング結果を各フレーム画像のクラスタ情報（節点クラスタ情報ＰＣ）としてクラスタ情報記憶手段８に書き込む（ステップＳ２７）。以上の動作によって、節点クラスタリング手段６は、フレーム画像毎に節点を分類（クラスタリング）する。
【００７７】
（テクスチャクラスタリング手段の詳細動作）
次に、図４を参照（適宜図１参照）して、テクスチャクラスタリング手段７の動作について、さらに詳細に説明を行う。図４は、テクスチャクラスタリング手段７の動作を示すフローチャートである。なお、この動作は、図２におけるステップＳ９の動作に相当するものである。また、以下の説明において、各情報は図６で付した符号を用いることとする。
【００７８】
まず、テクスチャクラスタリング手段７は、フレーム画像上の１つの節点を入力する（ステップＳ３１）。そして、現在のテクスチャクラスタの数が「０」かどうかを判定し（ステップＳ３２）、「０」の場合（Ｙｅｓ）は、入力された節点のみからなるテクスチャクラスタを生成し（ステップＳ３３）、ステップＳ３８へ進む。なお、このテクスチャクラスタの平均画像特徴量は、入力された節点の近傍における画像特徴量そのものである。
【００７９】
一方、現在のテクスチャクラスタの数が「０」でない場合（ステップＳ３２でＮｏ）は、入力した節点の近傍における画像特徴量（近傍画像特徴量）ＰＴＩｆｖと、各テクスチャクラスタの全節点の平均画像特徴量ＴＣｆｖとの距離を計算する（ステップＳ３４）。ここで、距離が最小でかつ予め定めた閾値以下であるテクスチャクラスタを探索する（ステップＳ３５）。そして、その探索結果を判定し（ステップＳ３６）、テクスチャクラスタが見つからなかった場合（Ｎｇ）は、ステップＳ３３へ進む。一方、テクスチャクラスタが見つかった場合（Ｏｋ）は、そのテクスチャクラスタに属する節点ＴＣｐｔに、入力された節点を追加し、そのテクスチャクラスタにおける全節点の平均画像特徴量ＴＣｆｖを更新する（ステップＳ３７）。
【００８０】
ここで、フレーム画像上の全節点について処理を行ったかどうかを判定し（ステップＳ３８）、全節点について処理を行っていない場合（Ｎｏ）は、ステップＳ３１へ戻って動作を続ける。一方、フレーム画像上の全節点について処理を行った場合（ステップＳ３８でＹｅｓ）は、以下の代表テクスチャ構造体生成処理を行う（ステップＳ３９）。
【００８１】
すなわち、各テクスチャクラスタにおいて、節点の近傍画像特徴量ＰＴＩｆｖと、テクスチャクラスタの平均画像特徴量ＴＣｆｖとの距離が最も近い節点を選び、その節点の近傍画像特徴量ＰＴＩｆｖを、テクスチャクラスタを代表する代表画像特徴量ＰＴＸｆｖとする。そして、その節点のフレーム番号に対応するフレーム画像から、その節点の座標を中心として予め定めた大きさの矩形領域を抽出し、代表テクスチャ画像とする。
【００８２】
そして、代表画像特徴量ＰＴＸｆｖ及び代表テクスチャ画像ＰＴＸｉｍｇを代表テクスチャ構造体ＰＴＸとしてクラスタ情報記憶手段８に書き込む（ステップＳ４０）。
以上の動作によって、テクスチャクラスタリング手段７は、テクスチャクラスタの数と同数の代表テクスチャ構造体を生成する。
【００８３】
（特徴情報生成手段の詳細動作）
次に、図５を参照（適宜図１参照）して、特徴情報生成手段９の動作について、さらに詳細に説明を行う。図５は、特徴情報生成手段９の動作を示すフローチャートである。なお、この動作は、図２におけるステップＳ１２の動作に相当するものである。また、以下の説明において、各情報は図６で付した符号を用いることとする。
【００８４】
まず、特徴情報生成手段９は、出力を行う画像データの初期化（全ての画素を「０」に設定）と、出力を行うテキストデータ書き込み用のテキストファイルをオープンする（ステップＳ４１）。そして、特徴情報生成手段９は、先頭フレームのクラスタの中で、最終フレームまで追跡できた（同じクラスタ番号を持つクラスタが最終フレームに存在する）クラスタを抽出する（ステップＳ４２）。
【００８５】
ここで、特徴情報生成手段９は、抽出したクラスタの全節点の位置ＰＴＩｐｓを解析し、全節点を最も効率良く（面積を小さく）囲む矩形の四隅の座標を設定する（ステップＳ４３）。
また、特徴情報生成手段９は、クラスタに属する全節点の平均画像特徴量ＰＣｆｖに最も近い代表画像特徴量ＰＴＸｆｖを持つ代表テクスチャ構造体ＰＴＸを選択する（ステップＳ４４）。
【００８６】
さらに、特徴情報生成手段９は、クラスタの全節点の位置ＰＴＩｐｓから、先頭フレームのクラスタの位置重心（ｓｘ，ｓｙ）を算出し（ステップＳ４５）、そのクラスタと同一のクラスタ番号を持つ最終フレームのクラスタからクラスタ位置重心（ｅｘ，ｅｙ）を算出する（ステップＳ４６）。そして、位置重心の差（ｅｘ−ｓｘ，ｅｙ−ｓｙ）をクラスタの動きベクトルとして算出する（ステップＳ４７）。
【００８７】
そして、特徴情報生成手段９は、ステップＳ４３で求めた矩形の内部をステップＳ４４で選択した代表テクスチャ構造体ＰＴＸの代表テクスチャ画像ＰＴＸｉｍｇで塗りつぶし、ステップＳ４５で算出した位置重心（ｓｘ，ｓｙ）を始点として、動きベクトル（ｅｘ−ｓｘ，ｅｙ−ｓｙ）を直線で描画した画像データを生成する（ステップＳ４８）。
【００８８】
さらに、特徴情報生成手段９は、先頭フレーム番号、最終フレーム番号、矩形で囲んだ四隅の座標（矩形座標情報）、ステップＳ４４で選択した代表テクスチャ構造体ＰＴＸの代表画像特徴量及び動きベクトル（ｅｘ−ｓｘ，ｅｙ−ｓｙ）の各情報をテキストファイルへ書き込む（ステップＳ４９）。
【００８９】
以上の動作によって、特徴情報生成手段９は、シーンを特徴付ける映像特徴情報（シーン簡易情報）を画像データ及びテキストデータ（テキストファイル）として生成し出力する。
【００９０】
【発明の効果】
以上説明したとおり、本発明に係る映像特徴情報生成方法、映像特徴情報生成装置及び映像特徴情報生成プログラムでは、以下に示す優れた効果を奏する。
【００９１】
請求項１に記載の発明によれば、映像データから、物体（被写体）及び背景の位置、大きさ、動きベクトル等の大まかな特徴を映像特徴情報として生成することができるため、この映像特徴情報を映像データの検索情報として使用することができる。このとき、映像特徴情報は、大まかな特徴を表現しているため、検索者の検索意図の表現が容易となり、検索者の検索意図に適した映像データを検索することが可能になる。
【００９２】
また、本発明によれば、映像データから検索者が検索を行う際に、検索条件（パラメータ）の数を減らすことができるため、検索者にとって検索を行うための操作が簡単になる。さらに、検索条件が簡易化されているため、検索速度を高速化できるという効果も奏する。
また、本発明によれば、シーンを特徴付けるテクスチャで、動きのある領域を塗りつぶした画像データを、映像特徴情報として生成することができるため、視覚的に映像データのシーンの特徴を捉えることができる。これによって、検索者は、この画像データである映像特徴情報から、場面毎の変化を視覚的に認識することができるため、検索者が所望する映像データの場面を素早く検索することが可能になる。
【００９３】
請求項２に記載の発明によれば、映像データのシーンを検出して、そのシーン毎に映像特徴情報を生成するため、映画等のシーンが変化する映像データにおいても、自動で簡易な映像特徴情報を生成することができる。
【図面の簡単な説明】
【図１】本発明における映像特徴情報生成装置の構成を示すブロック図である。
【図２】本発明における映像特徴情報生成装置の動作を示すフローチャートである。
【図３】節点クラスタリング手段の動作を示すフローチャートである。
【図４】テクスチャクラスタリング手段の動作を示すフローチャートである。
【図５】特徴情報生成手段の動作を示すフローチャートである。
【図６】（ａ）節点情報のデータの構造を示すデータ構造図である。
（ｂ）節点クラスタ情報のデータの構造を示すデータ構造図である。
（ｃ）テクスチャクラスタ情報のデータの構造を示すデータ構造図である。
（ｄ）代表テクスチャ構造体のデータの構造を示すデータ構造図である。
【図７】格子状に節点を設定（配置）したフレーム画像の一例を示す図である。
【図８】フレーム画像上で節点を追跡する状態を説明するための説明図である。
【図９】節点クラスタリング手段の節点クラスタの切り離し処理を説明するための説明図である。
【図１０】画像データとして生成した映像特徴情報の画像の一例を示す図である。
【図１１】テキストデータとして生成した映像特徴情報のデータ内容の一例を示す図である。
【符号の説明】
１映像特徴情報生成装置
２シーン分割手段
３フレーム画像サンプリング手段
４節点初期設定手段
４ａ近傍画像特徴量計算部
５節点追跡手段
５ａ近傍画像特徴量計算部
６節点クラスタリング手段（節点分類手段）
７テクスチャクラスタリング手段（クラス画像特徴量生成手段）
８クラスタ情報記憶手段（記憶手段）
９特徴情報生成手段
９ａシーン特徴情報生成部
９ｂ初期化部
２０映像蓄積手段
[0001]
BACKGROUND OF THE INVENTION
  The present invention relates to a video feature information generation apparatus that generates, from video data, simple video feature information characterizing that video data.
[0002]
[Prior art]
In general, a video search technique is known in which, in order to find desired video data in a large amount of video data, feature amounts are extracted from the video data in advance and the video data is searched based on those feature amounts.
[0003]
Conventionally, feature amounts for video search have been extracted either by cutting out the exact shapes of the subject and background in the video data and extracting a feature amount for each shape, or by extracting, as feature amounts, vectors indicating the accurate motion of a large number of feature points placed on each frame image constituting the video data.
[0004]
For example, there is a technique that detects an object (subject) in video data, generates a feature amount including the position, shape, and motion of the object together with a feature amount including the motion of the background as feature amount data of the video data, and performs video search based on that feature amount data (see Patent Document 1).
[0005]
[Patent Document 1]
JP 2000-222584 A (paragraph number 25-30, FIG. 1)
[0006]
[Problems to be solved by the invention]
However, the conventional technique has a problem in that since a detailed feature amount is extracted from video data, a calculation time for calculating the feature amount becomes long. Also, as in the past, when video data is searched based on detailed feature amounts, it is complicated to narrow down search information reflecting the searcher's search intention and to calculate the similarity of search information in video data. Therefore, there is a problem that it takes a long time to search the video data in accordance with the search intention of the searcher.
[0007]
Further, in the conventional technique, when a searcher wishes to express a search intention through a simple representation such as a sketch of the object (subject) region, the background region, and vectors indicating their motion, the more detailed the feature amounts of the video data are, the more complicated the sketch must be. As a result, the searcher must express the envisioned search information in a more complicated way, and a search suited to the searcher's own sense of similarity cannot be performed.
[0008]
  The present invention has been made in view of the above problems. Its object is to provide a video feature information generation apparatus that, by extracting simple information from video data as feature amounts, shortens the calculation time needed to extract the feature amounts and makes it possible to find desired video data through a video search suited to the searcher's sense of similarity.
[0018]
[Means for Solving the Problems]
  The present invention has been devised to achieve the above object. First, the video feature information generation apparatus according to claim 1 is a video feature information generation apparatus that generates, for each scene of video data, video feature information characterizing that scene, and comprises frame image sampling means, node initial setting means, node tracking means, node classification means, class image feature amount generation means, and feature information generation means.
[0019]
According to this configuration, the video feature information generation apparatus uses the frame image sampling means to extract frame images from the video data at a specific sampling interval, and uses the node initial setting means to set nodes (reference points), which serve as references for extracting image features, on the first frame of the frame images extracted by the frame image sampling means. Note that the nodes may be set at specific intervals, or their positions may be set based on a template whose arrangement is determined in advance.
[0020]
  Then, the video feature information generation apparatus uses the node tracking means to track the corresponding nodes in the frame images after the first frame by searching for nodes whose neighborhood image regions have similar image feature amounts, and uses the node classification means to classify (cluster), for each frame image, the nodes tracked by the node tracking means whose neighborhood image regions have similar image feature amounts into a single class (cluster).
[0021]
  In addition, the video feature information generation apparatus uses the class image feature amount generation means to generate, as the class image feature amount representing a class, the image feature amount that has the smallest distance from the average image feature amount in that class, chosen from among the image feature amounts of the frame images corresponding to the neighborhood image regions of the nodes classified into the same class (cluster) by the node classification means, over the span from the first frame to the last frame of the scene. This class image feature amount is the image feature amount that best represents the features of the images of the same class (cluster).
[0022]
  Then, the video feature information generation apparatus uses the feature information generation means to generate, as the video feature information, the class image feature amount generated by the class image feature amount generation means, the coordinate information of a rectangular area containing the nodes in that class, and motion information, which is the motion vector of the position centroid of the rectangular area from the first frame to the last frame of the scene.
  In addition, the feature information generation means extracts a representative texture (pattern) from the frame image most similar to the class image feature amount. By rendering each rectangular area obtained for the neighborhood image regions classified by the node classification means with this texture, for example by filling it in, the characteristics of the rectangular area itself can be expressed with representative image data. In this way, the video feature information of the video data can be expressed as image data that can be viewed.
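The rectangle coordinates and motion information described above can be sketched as follows. This is only an illustration under stated assumptions: the rectangle is taken to be axis-aligned, the centroid is computed over the node positions of one class, and the names `bounding_rect`, `centroid`, and `motion_vector` are hypothetical helpers, not terms from the specification.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_rect(points: List[Point]) -> Tuple[Point, Point, Point, Point]:
    """Axis-aligned rectangle enclosing all points, as four corners (clockwise from top-left)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return (x0, y0), (x1, y0), (x1, y1), (x0, y1)


def centroid(points: List[Point]) -> Point:
    """Mean position of the given node coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def motion_vector(first_pts: List[Point], last_pts: List[Point]) -> Point:
    """Displacement of the centroid between the first and last frame of a scene."""
    sx, sy = centroid(first_pts)
    ex, ey = centroid(last_pts)
    return (ex - sx, ey - sy)
```

For a class whose nodes all shift by (3, 1) between the first and last frame, `motion_vector` returns (3.0, 1.0), while `bounding_rect` gives the four corners recorded as coordinate information.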
[0023]
  The video feature information generation apparatus according to claim 2 is the video feature information generation apparatus according to claim 1, further comprising scene dividing means that divides the video data into scenes, wherein the frame image sampling means extracts frame images at a specific sampling interval for each scene divided by the scene dividing means.
[0024]
According to this configuration, the video feature information generation apparatus detects scene switching (scene change) of the input video data by the scene dividing unit, and divides the video data for each scene. As a result, the video feature information generation apparatus can automatically divide video data for each scene that becomes a break of node features, thereby improving the efficiency for generating video feature information.
[0032]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[Configuration of Video Feature Information Generation Device]
FIG. 1 is a block diagram showing the configuration of the video feature information generation apparatus according to the present invention. The video feature information generation apparatus 1 divides video data into partial videos (hereinafter referred to as scenes) and generates video feature information (simple scene information) that expresses each scene by the rough positions and sizes of the subjects and background, representative images or patterns (hereinafter referred to as textures), and rough motion (hereinafter referred to as motion vectors). Here, the video feature information generation apparatus 1 comprises scene dividing means 2, frame image sampling means 3, node initial setting means 4, node tracking means 5, node clustering means 6, texture clustering means 7, cluster information storage means 8, and feature information generation means 9. The video feature information generation apparatus 1 receives video data from video storage means 20, such as a hard disk, in which video data is stored.
[0033]
The scene dividing means 2 divides the video data input from the video storage means 20 into scenes. It detects points at which the composition of the video data changes substantially (scene changes) and divides the video data at each such point. Scene changes can be detected by an existing method. For example, the difference between temporally consecutive frame images constituting the video data is calculated, and when the sum of the absolute values of the differences is greater than a predetermined threshold, the frame images are regarded as discontinuous and a scene change is detected.
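The frame-difference test described in this paragraph can be sketched as below. It is a minimal sketch, assuming grayscale frames stored as nested lists of intensities; the function names and the threshold value are illustrative and do not come from the specification.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale frame as rows of pixel intensities


def abs_diff_sum(prev: Frame, curr: Frame) -> int:
    """Sum of absolute pixel differences between two same-sized frames."""
    return sum(
        abs(p - c)
        for prev_row, curr_row in zip(prev, curr)
        for p, c in zip(prev_row, curr_row)
    )


def split_into_scenes(frames: List[Frame], threshold: int) -> List[List[int]]:
    """Group frame indices into scenes, cutting wherever the difference exceeds threshold."""
    if not frames:
        return []
    scenes = [[0]]
    for i in range(1, len(frames)):
        if abs_diff_sum(frames[i - 1], frames[i]) > threshold:
            scenes.append([i])  # discontinuity: frame i starts a new scene
        else:
            scenes[-1].append(i)
    return scenes
```

In practice the threshold would need tuning against frame size and content, since the raw sum grows with the number of pixels.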
[0034]
The scene dividing unit 2 sequentially outputs frame images to the frame image sampling unit 3 for each scene obtained by dividing the video data. The scene dividing unit 2 detects (divides) the next scene when the feature information generating unit 9 described later generates video feature information (scene simple information) of one scene.
[0035]
The frame image sampling means 3 extracts frame images from those input from the scene dividing means 2 at a preset sampling interval. When the extracted frame image is the first frame of the scene, the frame image sampling means 3 outputs the frame image (first frame) to the node initial setting means 4; when the extracted frame image is not the first frame of the scene, it outputs the frame image (a frame after the first) to the node tracking means 5.
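The sampling and routing behavior described here might look like the following sketch. The callbacks `init_nodes` and `track_nodes` are hypothetical stand-ins for the node initial setting means 4 and the node tracking means 5, and sampling by simple list slicing is an assumption.

```python
from typing import Callable, List, TypeVar

F = TypeVar("F")


def sample_and_route(
    scene_frames: List[F],
    interval: int,
    init_nodes: Callable[[F], None],
    track_nodes: Callable[[F], None],
) -> int:
    """Sample every interval-th frame of a scene and route it; return the number sampled."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    sampled = scene_frames[::interval]
    for k, frame in enumerate(sampled):
        if k == 0:
            init_nodes(frame)   # first frame of the scene: set up the nodes
        else:
            track_nodes(frame)  # later frames: track the existing nodes
    return len(sampled)
```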
[0036]
When the frame image of the first frame is input from the frame image sampling means 3, the node initial setting means 4 sets nodes in a grid pattern at predetermined intervals on that frame image (first frame). Here, the node initial setting means 4 is configured with an added neighborhood image feature amount calculation unit 4a. A node is a point that serves as a reference for extracting features of a frame image. The nodes may be set at specific intervals, or their positions may be set based on a template whose arrangement is determined in advance.
[0037]
The neighborhood image feature amount calculation unit 4a calculates an image feature amount from the neighborhood of each node set on the frame image (first frame), that is, from the neighborhood image region, which is a region of a predetermined size (such as a square region). Any feature amount that is common in the field of image processing may be used as this image feature amount. For example, the average value of each RGB color component, the distribution of edge amounts obtained when edges are extracted from the image, or the fractal dimension indicating the complexity of the image can be used.
[0038]
Here, the nodes will be described with reference to FIG. 7 (and FIG. 1 as appropriate). FIG. 7 shows a state in which the node initial setting means 4 has placed nodes PT in a grid pattern at a fixed interval on the frame image of the first frame. For convenience of explanation the nodes PT are drawn on the frame image, but each node PT merely indicates a position corresponding to a grid point on the frame image. In this way, the node initial setting means 4 sets the nodes PT in a grid of Npx nodes horizontally by Npy nodes vertically (Npx and Npy are set in advance).
[0039]
  Further, the neighborhood image feature amount calculation unit 4a of the node initial setting means 4 calculates an image feature amount from the neighborhood region centered on each node PT (neighborhood image region: a square region of Rfv pixels × Rfv pixels) and associates it with that node PT.
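Grid placement and the neighborhood feature calculation can be sketched as follows. This is an illustration only: the even spacing formula, the clipping at image borders, and the choice of mean RGB as the feature (one of the examples the description mentions) are assumptions, as are the function names.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # image[y][x] = (R, G, B)


def grid_nodes(width: int, height: int, npx: int, npy: int) -> List[Tuple[int, int]]:
    """Place npx * npy nodes on an evenly spaced grid inside the frame."""
    xs = [(i + 1) * width // (npx + 1) for i in range(npx)]
    ys = [(j + 1) * height // (npy + 1) for j in range(npy)]
    return [(x, y) for y in ys for x in xs]


def mean_rgb(image: Image, cx: int, cy: int, rfv: int) -> Tuple[float, float, float]:
    """Mean R, G, B over the rfv x rfv square centered on (cx, cy), clipped to the image."""
    height, width = len(image), len(image[0])
    half = rfv // 2
    x_lo, x_hi = max(0, cx - half), min(width, cx - half + rfv)
    y_lo, y_hi = max(0, cy - half), min(height, cy - half + rfv)
    pixels = [image[y][x] for y in range(y_lo, y_hi) for x in range(x_lo, x_hi)]
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))
```

Each node returned by `grid_nodes` would then be paired with the result of `mean_rgb` at its position, which corresponds to associating the feature amount with the node PT.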
[0040]
The node initial setting means 4 generates information (node information PTI) as shown in FIG. 6(a) for each node and stores the node information PTI in the cluster information storage means 8. As shown in FIG. 6(a), the node information PTI consists of the number of the frame image on which the node is set (frame number) PTIfn, the current position of the node on the frame image (coordinate information: x coordinate and y coordinate) PTIps, and the image feature amount calculated from the neighborhood region of the node at its current position (neighborhood image feature) PTIfv.
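As a rough illustration of this record, the node information could be held in a small data class. The field names simply lower-case the reference symbols PTIfn, PTIps, and PTIfv; the concrete types are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NodeInfo:
    """One node record (PTI in FIG. 6(a)); field names follow the reference symbols."""
    ptifn: int                  # frame number of the frame image holding the node
    ptips: Tuple[int, int]      # current (x, y) position on the frame image
    ptifv: Tuple[float, ...]    # neighborhood image feature amount at that position


# Example record: a node on frame 0 at (16, 32) with a 3-element mean-RGB feature.
node = NodeInfo(ptifn=0, ptips=(16, 32), ptifv=(120.0, 64.5, 30.0))
```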
Returning to FIG. 1, the description will be continued.
[0041]
The node tracking means 5 refers to the image feature amount (neighborhood image feature amount) of the neighborhood image region of each node associated with the previous frame image, which is stored in the cluster information storage means 8, and, in the frame image input from the frame image sampling means 3, tracks the neighborhood image region similar to that neighborhood image feature amount. Here, the node tracking means 5 is configured with an added neighborhood image feature amount calculation unit 5a. Tracking a neighborhood image region here means tracking the position of the corresponding node.
[0042]
The neighborhood image feature amount calculation unit 5a calculates an image feature amount from a neighborhood region centered on a node (neighborhood image region: a circular region of a specific radius). Using it, the node tracking means 5 compares the neighborhood image feature amount of a node associated with the previous frame image with the image feature amounts of the current frame image in the vicinity of that node position, and takes the pixel whose image feature amount differs least as the node in the current frame image.
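The tracking step can be sketched as a nearest-feature search around the previous node position. Several points here are assumptions rather than details from the specification: Euclidean distance as the difference measure, an exhaustive scan of every pixel in a circular search region, and the hypothetical callback `feature_at(x, y)`, which stands in for the feature calculation on the current frame.

```python
from typing import Callable, Sequence, Tuple

Feature = Sequence[float]


def feature_distance(a: Feature, b: Feature) -> float:
    """Euclidean distance between two feature vectors (the metric is an assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def track_node(
    prev_pos: Tuple[int, int],
    prev_feature: Feature,
    feature_at: Callable[[int, int], Feature],
    radius: int,
    width: int,
    height: int,
) -> Tuple[int, int]:
    """Return the pixel within radius of prev_pos whose feature is closest to prev_feature."""
    px, py = prev_pos
    best_pos, best_dist = prev_pos, float("inf")
    for y in range(max(0, py - radius), min(height, py + radius + 1)):
        for x in range(max(0, px - radius), min(width, px + radius + 1)):
            if (x - px) ** 2 + (y - py) ** 2 > radius ** 2:
                continue  # outside the circular search region
            d = feature_distance(prev_feature, feature_at(x, y))
            if d < best_dist:
                best_pos, best_dist = (x, y), d
    return best_pos
```

A full scan like this is simple but costly; a real implementation would likely restrict or subsample the candidate positions.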
[0043]
Like the neighborhood image feature amount calculation unit 4a, the neighborhood image feature amount calculation unit 5a calculates a feature amount that is common in the field of image processing. Although the neighborhood image feature amount calculation units 4a and 5a are configured here as separate units, they may be configured as a single unit operated in common by the node initial setting means 4 and the node tracking means 5.
[0044]
Here, tracking of nodes on a frame image will be described with reference to FIG. 8 (and FIG. 1 as appropriate). FIG. 8 shows a state in which the node tracking means 5 has tracked a node on the frame image. It shows that the node PT (marked with ● in the figure) in the previous frame image (here, the first frame of FIG. 7) has been tracked to the node PT_B (marked with × in the figure) in the current frame image. In this way, the node tracking means 5 moves each node to the position where the difference in image feature amount is smallest, following the constantly changing content of the frame images.
Returning to FIG. 1, the description will be continued.
[0045]
The node clustering means (node classification means) 6 classifies (clusters) the nodes in one frame image in which the position of the node is determined by the node initial setting means 4 or the node tracking means 5 based on the neighborhood image feature amount and the coordinate information. The classification information (cluster information) is stored in the cluster information storage means 8.
[0046]
The node clustering means 6 generates information (node cluster information PC) as shown in FIG. 6(b) and stores the node cluster information PC in the cluster information storage means 8. As shown in FIG. 6(b), the node cluster information PC consists, for each frame image, of all nodes PCpt belonging to the node cluster (held as pointers to the node information PTI), the average image feature amount PCfv of the neighborhood image feature amounts PTIfv (FIG. 6(a)) of all nodes belonging to the node cluster, and the cluster number PCcn assigned to the node cluster.
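A node cluster record of this kind could be sketched as below. Integer node ids stand in for the pointers to node information PTI, and `make_cluster` and `mean_feature` are hypothetical helpers; how nodes are grouped into clusters is outside this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


def mean_feature(features: Sequence[Sequence[float]]) -> Tuple[float, ...]:
    """Element-wise mean of equal-length feature vectors."""
    n = len(features)
    return tuple(sum(col) / n for col in zip(*features))


@dataclass
class NodeCluster:
    """One node cluster record (PC in FIG. 6(b)) for a single frame image."""
    pcpt: List[int]            # member node ids, standing in for pointers to PTI
    pcfv: Tuple[float, ...]    # average neighborhood feature of the member nodes
    pccn: int                  # cluster number


def make_cluster(
    member_ids: Sequence[int],
    node_features: Dict[int, Sequence[float]],
    cluster_number: int,
) -> NodeCluster:
    """Build a cluster record, computing PCfv from the members' PTIfv values."""
    feats = [node_features[i] for i in member_ids]
    return NodeCluster(list(member_ids), mean_feature(feats), cluster_number)
```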
[0047]
The texture clustering means (class image feature amount generation means) 7 sequentially receives the frame images in which the node positions have been determined by the node initial setting means 4 or the node tracking means 5, and generates an image feature amount representing each cluster over the span from the first frame to the last frame of one scene.
[0048]
The texture clustering means 7 creates information (texture cluster information TC) as shown in FIG. 6C in the cluster information storage means 8 and updates it every time a frame image is input. As shown in FIG. 6C, the texture cluster information TC includes all nodes TCpt belonging to the texture cluster (held as pointers to the node information PTI) and the average image feature amount TCfv of the neighborhood image feature amounts PIfv (see FIG. 6A) of the nodes belonging to the texture cluster. When the image feature amount is an N-dimensional vector, the average image feature amount TCfv is the N-dimensional vector whose elements are the averages of the corresponding elements.
[0049]
Further, the texture clustering means 7 generates a representative texture structure representing each texture cluster at the stage when the final frame in the scene has been input. Here, among the neighborhood image feature amounts PIfv (see FIG. 6A) of all the nodes TCpt belonging to the texture cluster (see FIG. 6C), the one with the smallest distance from the average image feature amount TCfv (see FIG. 6C) of all the nodes belonging to that texture cluster is set as the representative image feature amount of the texture cluster. In addition, a rectangular area of a predetermined size is extracted around the node having the representative image feature amount and used as the representative texture image.
[0050]
The representative texture structure has the structure shown in FIG. 6D. That is, the representative texture structure PTX includes a representative texture image PTXimg of a preset size of Tw pixels × Th pixels, and the neighborhood image feature amount (representative image feature amount) PTXfv at the center coordinates of the representative texture image PTXimg.
[0051]
The cluster information storage means (storage means) 8 stores cluster information and is a general storage medium such as a hard disk. The cluster information storage means 8 stores the cluster information CI of each frame image and the representative texture structure PTX generated for each scene.
The cluster information includes the node information PTI (see FIG. 6A), the node cluster information PC (see FIG. 6B), and the texture cluster information TC (see FIG. 6C) of the frame image.
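The data structures of FIG. 6 described above might be modeled as follows. This is only a minimal sketch: the field names mirror the symbols in the text (PTIps, PIfv, PCpt, PCfv, PCcn, TCpt, TCfv, PTXimg, PTXfv), while the class names, the `frame_no` field name, the helper `mean_vector`, and the Python types are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = List[float]  # N-dimensional image feature vector


@dataclass
class NodeInfo:  # node information PTI (FIG. 6A)
    frame_no: int            # frame in which the node was observed (name assumed)
    PTIps: Tuple[int, int]   # node position (x, y)
    PIfv: Vector             # neighborhood image feature amount


@dataclass
class NodeCluster:  # node cluster information PC (FIG. 6B), per frame image
    PCpt: List[NodeInfo]     # nodes belonging to the cluster (references to PTI)
    PCfv: Vector             # average of the members' PIfv
    PCcn: int                # cluster number


@dataclass
class TextureCluster:  # texture cluster information TC (FIG. 6C)
    TCpt: List[NodeInfo] = field(default_factory=list)
    TCfv: Vector = field(default_factory=list)


@dataclass
class RepresentativeTexture:  # representative texture structure PTX (FIG. 6D)
    PTXimg: List[List[Tuple[int, int, int]]]  # Th rows x Tw columns of RGB pixels
    PTXfv: Vector


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of N-dimensional vectors (the definition of PCfv and TCfv)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

Holding object references in `PCpt` and `TCpt` plays the role of the pointers to the node information PTI mentioned in the text.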
[0052]
The feature information generation means 9 refers to the cluster information storage means 8 and generates video feature information (scene simple information) that characterizes the scene for each scene. Here, the feature information generation means 9 is composed of a scene feature information generation unit 9a and an initialization unit 9b.
[0053]
The scene feature information generation unit 9a generates, as video feature information, the coordinate information of the rectangular area containing the nodes classified by the node clustering means (node classification means) 6, the class image feature amount (representative texture image and representative image feature amount) generated by the texture clustering means (class image feature quantity generating means) 7, and the motion information of the rectangular area. The scene feature information generation unit 9a generates the video feature information as image data and as text data.
[0054]
When generating video feature information with image data, the interior of the rectangular area including the nodes of the cluster is filled with a representative texture image. Then, the motion of the position centroid of the cluster from the first frame to the last frame is drawn in a straight line as the motion vector of the cluster.
[0055]
FIG. 10 shows an example of image data generated by the scene feature information generation unit 9a. As shown in FIG. 10, the rectangular area of the cluster is filled with the representative texture image, and the movement of the cluster is indicated by a straight line. Thereby, the simple information of the scene of the video data can be visualized.
[0056]
When generating video feature information as text data, the frame numbers of the first and last frames, the coordinate information of the rectangular area containing the nodes of the cluster, the representative image feature amount, and the motion vector of the position centroid of the cluster from the first frame to the last frame are written into a text file as text data.
[0057]
FIG. 11 shows an example of text data generated by the scene feature information generation unit 9a. As shown in FIG. 11, for each scene, the information for one cluster consists of the first frame number Ns, the last frame number Ne, the coordinate information of the rectangular area {(x0, y0), (x1, y1), (x2, y2), (x3, y3)}, the image feature amount (representative image feature amount) of the texture structure {f(0), f(1), f(2), ..., f(N−1)}, and the x and y components of the motion vector {vx, vy}.
[0058]
Here, the representative image feature amount {f(0), f(1), ...} represents, for example, the average value of each RGB color component or the distribution of edge amounts obtained when edges are extracted from the image. The representative image feature amount is expressed as an N-dimensional image feature vector.
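As an illustration of the text data described above, the following sketch renders the information for one cluster as a single line. The exact layout of FIG. 11 is not reproduced in the text, so the key names (`Ns=`, `rect=`, `fv=`, `mv=`), the number formatting, and the function name `format_cluster_record` are all assumptions.

```python
def format_cluster_record(ns, ne, corners, feature, motion):
    """Render one cluster of one scene as a text line (the layout is an assumption)."""
    corner_text = ", ".join(f"({x}, {y})" for x, y in corners)
    feature_text = ", ".join(f"{value:.3f}" for value in feature)
    vx, vy = motion
    return (
        f"Ns={ns} Ne={ne} "
        f"rect={{{corner_text}}} "
        f"fv={{{feature_text}}} "
        f"mv={{{vx}, {vy}}}"
    )
```

A record of this kind contains only a handful of values per cluster, which is what keeps the search conditions simple.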
[0059]
The initialization unit 9b determines whether or not the frame image is the final frame of the scene, and initializes the cluster information in the cluster information storage means 8 when the frame image is the final frame. As a result, cluster information for a new scene is generated in the cluster information storage means 8 each time a scene change occurs.
[0060]
The configuration of the video feature information generation device 1 according to the present invention has been described above, but the present invention is not limited to this. Here, the scene dividing means 2 for dividing the video data into scenes is provided; however, when the video data is input scene by scene, the video feature information generation device 1 may be configured without the scene dividing means 2. Also, although the scene feature information generation unit 9a here generates the video feature information as both image data and text data, only one of the two may be generated.
[0061]
In addition, each means of the video feature information generation device 1 can be realized as a function program on a computer, and these function programs can be combined to operate as a video feature information generation program.
[0062]
[Operation of video feature information generation device]
Next, the operation of the video feature information generation apparatus will be described with reference to FIG. 2 (refer to FIG. 1 as appropriate). FIG. 2 is a flowchart showing the operation of the video feature information generation apparatus.
[0063]
First, the video feature information generation device 1 uses the scene dividing means 2 to read video data from the video storage means 20 and detect scene changes, thereby dividing the video data into scenes (step S1). Then, the frame image sampling means 3 extracts frame images from the scene at a preset sampling interval (step S2).
[0064]
Here, the frame image sampling means 3 determines whether or not the extracted frame image is the first frame of the scene (step S3). If it is the first frame (Yes), the frame image is output to the node initial setting means 4. Then, the node initial setting means 4 sets the nodes in a grid pattern on the frame image of the first frame (step S4), and the neighborhood image feature amount calculation unit 4a calculates the image feature amount from the neighborhood of each node (step S5).
[0065]
On the other hand, if the extracted frame image is not the first frame (No in step S3), the frame image is output to the node tracking means 5. Then, the node tracking means 5 calculates the image feature amount around each node (a circular region with a specific radius) in the current frame image (step S6) and, referring to the image feature amount near the node in the previous frame image, moves (tracks) the position of the node in the current frame image (step S7).
Thus, in step S5 or step S7, the position of the node and the image feature amount in each frame image are generated as the node information PTI (see FIG. 6A).
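Steps S6 and S7 could be sketched as below. The text does not fix the feature or the search strategy at this point, so this sketch assumes a grayscale image stored as rows of numbers, uses the mean intensity of the circular neighborhood as a stand-in one-dimensional feature, and searches a small square window around the previous position. The function names and the `radius` and `search` parameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale rows of pixel values (assumption)


def neighborhood_feature(img: Image, cx: int, cy: int, radius: int) -> List[float]:
    """Mean intensity inside a circle around (cx, cy); a stand-in image feature."""
    h, w = len(img), len(img[0])
    values = [
        img[y][x]
        for y in range(max(0, cy - radius), min(h, cy + radius + 1))
        for x in range(max(0, cx - radius), min(w, cx + radius + 1))
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    ]
    return [sum(values) / len(values)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def track_node(prev_feature: Sequence[float], cur_img: Image,
               pos: Tuple[int, int], radius: int = 1, search: int = 2):
    """Move the node to the candidate position whose feature is closest
    to the feature it had in the previous frame image."""
    h, w = len(cur_img), len(cur_img[0])
    px, py = pos
    best_pos, best_fv, best_d = pos, None, float("inf")
    for y in range(max(0, py - search), min(h, py + search + 1)):
        for x in range(max(0, px - search), min(w, px + search + 1)):
            fv = neighborhood_feature(cur_img, x, y, radius)  # step S6
            d = distance(prev_feature, fv)
            if d < best_d:                                    # step S7
                best_pos, best_fv, best_d = (x, y), fv, d
    return best_pos, best_fv
```

The returned position and feature together correspond to the node information PTI recorded for the current frame image.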
[0066]
Then, the video feature information generation device 1 uses the node clustering means (node classification means) 6 to classify (cluster) the nodes based on the node information of the frame image, and generates cluster information with a cluster number assigned (node cluster information PC; see FIG. 6B) (step S8). The video feature information generation device 1 also uses the texture clustering means 7 to generate the texture cluster information TC (see FIG. 6C) from the cluster information (node cluster information) of the frame images from the first frame image to the currently input frame image (step S9).
[0067]
Here, the video feature information generation device 1 determines whether or not the input frame image is the final frame of the scene (step S10). If it is not the final frame (No), the process returns to step S2 and the classification (clustering) of the next frame image is performed.
[0068]
On the other hand, when the input frame image is the final frame (Yes in step S10), the texture clustering means 7 generates a representative texture structure PTX (see FIG. 6D) for each texture cluster (step S11).
Then, the video feature information generation device 1 generates and outputs video feature information (scene simple information) that characterizes the scene by the scene feature information generation unit 9a of the feature information generation unit 9 (step S12).
[0069]
Through the above operation, the video feature information for one scene of the video data is generated. If the input of video data continues, the initialization unit 9b of the feature information generation means 9 initializes the cluster information and the representative texture structure in the cluster information storage means 8, and the operation is repeated from step S1. As a result, the video feature information (scene simple information) for each scene is generated in sequence.
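The control flow of FIG. 2 for a single scene can be summarized by a small driver such as the one below. It is a sketch only: every callable passed in (`set_nodes`, `track_nodes`, `cluster_nodes`, `update_textures`, `finalize`) is a hypothetical stand-in for the corresponding means, and the frames of the scene are assumed to be available as a Python sequence.

```python
def process_scene(frames, interval, set_nodes, track_nodes,
                  cluster_nodes, update_textures, finalize):
    """Run one scene through steps S2 to S12 using caller-supplied stand-ins."""
    state = {"node_clusters": [], "texture_clusters": []}
    nodes = None
    for index, frame in enumerate(frames[::interval]):       # S2: sampling
        if index == 0:                                        # S3: first frame?
            nodes = set_nodes(frame)                          # S4, S5
        else:
            nodes = track_nodes(nodes, frame)                 # S6, S7
        state["node_clusters"].append(cluster_nodes(nodes))   # S8
        update_textures(state["texture_clusters"], nodes)     # S9
    # S10 corresponds to the loop ending at the final sampled frame.
    return finalize(state)                                    # S11, S12
```

Creating a fresh `state` on every call plays the role of the initialization performed by the initialization unit 9b between scenes.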
[0070]
(Detailed operation of node clustering means)
Next, the operation of the node clustering means 6 will be described in more detail with reference to FIG. 3. FIG. 3 is a flowchart showing the operation of the node clustering means 6. This operation corresponds to step S8 in FIG. 2. In the following description, the symbols used in FIG. 6 are used as appropriate.
[0071]
First, the node clustering means 6 calculates the distance between the neighborhood image feature amounts PIfv for every pair of nodes on one input frame image (step S21). Then, nodes whose distance is equal to or less than a certain threshold are classified (clustered) into the same cluster (step S22). Then, the average image feature amount PCfv of the neighborhoods of all nodes PCpt included in the same cluster is calculated (step S23). Then, based on the positions PTIps of all the nodes belonging to each node cluster, the node clustering means 6 separates node sets that lie apart from each other into different node clusters (step S24), and assigns a cluster number PCcn to each node cluster.
[0072]
Here, the separation of the node clusters will be described with reference to FIG. 9 (refer to FIG. 1 as appropriate). FIG. 9 shows an example of screens on which the node clustering means 6 performs the node cluster separation processing, where (a) is the screen after clustering based on image feature amounts, and (b) is the screen after node sets that lie apart have been made into different node clusters. As shown in FIG. 9A, when clustering is performed based only on the image feature amount, the left and right node sets lie apart from each other but belong to the same node cluster. Therefore, by separating node sets that lie apart from each other, different node clusters are generated as shown in FIG. 9B (in FIG. 9B, different node clusters are indicated by different hatching directions).
Returning to the flowchart of FIG. 3, the description will be continued.
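Steps S21 to S24 might be sketched as follows. The text does not state how nodes "within the threshold" are merged or how "separated" node sets are detected, so this sketch assumes single-linkage grouping on feature distance and then splits each group into spatially connected parts, treating two nodes as adjacent when they lie within `gap` pixels on both axes. The names `cluster_nodes`, `feat_thresh`, and `gap` are hypothetical.

```python
def _components(n, linked):
    """Connected components of indices 0..n-1 under the symmetric relation `linked`."""
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if j not in seen and linked(i, j):
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups


def cluster_nodes(positions, features, feat_thresh, gap):
    """Return a list of (member indices, average feature PCfv) per node cluster."""
    def feat_dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(features[i], features[j])) ** 0.5

    def near(i, j):
        (x1, y1), (x2, y2) = positions[i], positions[j]
        return max(abs(x1 - x2), abs(y1 - y2)) <= gap

    clusters = []
    # S21, S22: group nodes whose feature distance is within the threshold.
    for group in _components(len(positions),
                             lambda i, j: feat_dist(i, j) <= feat_thresh):
        # S24: split each feature group into spatially connected node sets.
        for part in _components(len(group),
                                lambda a, b: near(group[a], group[b])):
            members = [group[k] for k in part]
            # S23: average feature, computed here per final cluster for simplicity.
            columns = zip(*(features[i] for i in members))
            mean = [sum(col) / len(members) for col in columns]
            clusters.append((members, mean))
    return clusters
```

With a row of nodes whose left and right ends share one texture and whose middle has another, the two ends come out as separate clusters, which matches the situation illustrated in FIG. 9.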
[0073]
The node clustering means 6 determines whether or not the frame image currently being processed is the first frame (step S25). If it is the first frame (Yes), the process proceeds to step S27, and the clustering result is written to the cluster information storage means 8 as the cluster information (node cluster information PC) of the frame image.
[0074]
On the other hand, if the frame image currently being processed is not the first frame (No in step S25), the following cluster number renumbering process is performed (step S26).
[0075]
That is, if more than half of the nodes belonging to a node cluster separated and generated in step S24 (the node cluster to be processed) are included in a single node cluster of the previous frame image, the cluster number PCcn assigned to that node cluster of the previous frame image is set as the cluster number PCcn of the node cluster to be processed. If no node cluster of the previous frame image contains more than half of the nodes belonging to the node cluster to be processed, “0” is assigned as the cluster number PCcn. Then, an unused cluster number is assigned to every node cluster whose cluster number PCcn is “0”, so that all node clusters receive a unique cluster number.
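The renumbering of step S26 could look like the sketch below, in which clusters are represented as sets of node identifiers that persist across frames through tracking. Two details are assumptions not stated in the text: numbers seen in the previous frame are treated as already used, and if two current clusters would inherit the same number, only the first keeps it while the other receives a fresh one. The function name is hypothetical.

```python
from typing import Dict, List, Set


def renumber_clusters(current: List[Set[int]],
                      previous: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    """current: node-id sets of this frame; previous: cluster number -> node-id set."""
    numbers: List[int] = []
    for members in current:
        inherited = 0  # "0" marks a cluster without a majority match
        for number, prev_members in previous.items():
            # "More than half" of this cluster's nodes were in the previous cluster.
            if 2 * len(members & prev_members) > len(members):
                inherited = number
                break
        numbers.append(inherited)

    used = set(previous) | set(numbers)
    next_free = 1
    result: Dict[int, Set[int]] = {}
    for members, number in zip(current, numbers):
        if number == 0 or number in result:
            while next_free in used:
                next_free += 1
            number = next_free
            used.add(number)
        result[number] = members
    return result
```

Because the clusters of the previous frame image are disjoint, at most one of them can contain more than half of a given cluster's nodes, so the inner loop can stop at the first match.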
[0076]
Then, this clustering result is written in the cluster information storage means 8 as cluster information (node cluster information PC) of each frame image (step S27). With the above operation, the node clustering means 6 classifies (clusters) the nodes for each frame image.
[0077]
(Detailed operation of texture clustering means)
Next, the operation of the texture clustering means 7 will be described in more detail with reference to FIG. 4 (refer to FIG. 1 as appropriate). FIG. 4 is a flowchart showing the operation of the texture clustering means 7. This operation corresponds to step S9 in FIG. 2. In the following description, the symbols used in FIG. 6 are used as appropriate.
[0078]
First, the texture clustering means 7 inputs one node on the frame image (step S31). Then, it is determined whether or not the current number of texture clusters is “0” (step S32). If it is “0” (Yes), a texture cluster containing only the input node is generated (step S33), and the process proceeds to step S38. In this case, the average image feature amount of the texture cluster is the image feature amount in the vicinity of the input node.
[0079]
On the other hand, when the current number of texture clusters is not “0” (No in step S32), the distance between the image feature amount in the vicinity of the input node (neighborhood image feature amount) PIfv and the average image feature amount TCfv of all the nodes of each texture cluster is calculated (step S34). Then, the texture cluster whose distance is the smallest and is equal to or less than a predetermined threshold is searched for (step S35). The search result is then evaluated (step S36). If no such texture cluster is found (Ng), the process proceeds to step S33. If a texture cluster is found (Ok), the input node is added to the nodes TCpt belonging to that texture cluster, and the average image feature amount TCfv of all nodes in the texture cluster is updated (step S37).
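Steps S31 to S37 amount to an incremental nearest-cluster assignment, which might be sketched as follows. The loop over `features` stands in for the "all nodes processed" check of step S38; the class name `RunningCluster`, the function names, and the use of Euclidean distance are assumptions.

```python
class RunningCluster:
    """A texture cluster holding its member features and their running mean."""

    def __init__(self, feature):
        self.members = [list(feature)]  # PIfv of every node in the cluster
        self.mean = list(feature)       # average image feature amount TCfv

    def add(self, feature):
        self.members.append(list(feature))
        n = len(self.members)
        # Incremental update of the element-wise mean.
        self.mean = [m + (f - m) / n for m, f in zip(self.mean, feature)]


def _dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def assign_to_texture_clusters(features, clusters, threshold):
    """Add each node feature to the nearest cluster, or open a new cluster."""
    for fv in features:                                        # S31
        # S32, S34, S35: nearest cluster by distance to TCfv (None if no clusters).
        nearest = min(clusters, key=lambda c: _dist(fv, c.mean), default=None)
        if nearest is None or _dist(fv, nearest.mean) > threshold:  # S36
            clusters.append(RunningCluster(fv))                # S33
        else:
            nearest.add(fv)                                    # S37
    return clusters
```

Passing the same `clusters` list for every frame image of a scene accumulates the texture clusters from the first frame up to the current one.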
[0080]
Here, it is determined whether or not processing has been performed for all nodes on the frame image (step S38). If processing has not been performed for all nodes (No), the process returns to step S31 to continue the operation. On the other hand, when processing has been performed for all nodes on the frame image (Yes in step S38), the following representative texture structure generation processing is performed (step S39).
[0081]
That is, in each texture cluster, the node with the smallest distance between its neighborhood image feature amount PIfv and the average image feature amount TCfv of the texture cluster is selected, and the neighborhood image feature amount PIfv of that node is set as the representative image feature amount PTXfv of the texture cluster. Then, from the frame image corresponding to the frame number of that node, a rectangular area of a predetermined size centered on the coordinates of the node is extracted and used as the representative texture image PTXimg.
[0082]
Then, the representative image feature amount PTXfv and the representative texture image PTXimg are written in the cluster information storage unit 8 as the representative texture structure PTX (step S40).
Through the above operation, the texture clustering unit 7 generates the same number of representative texture structures as the number of texture clusters.
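The selection of the representative image feature amount and the extraction of the representative texture image (steps S39 and S40) can be illustrated as below. The image is assumed to be a list of pixel rows, and the rectangle is clamped so that it stays inside the image; the text does not say how nodes near the border are handled, so the clamping, like the function names, is an assumption. The sketch also assumes the image is at least `tw` by `th` pixels.

```python
def pick_representative(node_features, mean_feature):
    """Index of the node whose feature is closest to the cluster average TCfv."""
    def dist(fv):
        return sum((a - b) ** 2 for a, b in zip(fv, mean_feature)) ** 0.5
    return min(range(len(node_features)), key=lambda i: dist(node_features[i]))


def crop_centered(image, center, tw, th):
    """Cut a tw x th rectangle centered on `center`, clamped to the image bounds."""
    h, w = len(image), len(image[0])
    cx, cy = center
    left = min(max(cx - tw // 2, 0), w - tw)
    top = min(max(cy - th // 2, 0), h - th)
    return [list(row[left:left + tw]) for row in image[top:top + th]]
```

The feature at the chosen index would become PTXfv, and the cropped rectangle would become PTXimg.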
[0083]
(Detailed operation of feature information generation means)
Next, the operation of the feature information generation means 9 will be described in more detail with reference to FIG. 5 (refer to FIG. 1 as appropriate). FIG. 5 is a flowchart showing the operation of the feature information generation means 9. This operation corresponds to step S12 in FIG. 2. In the following description, the symbols used in FIG. 6 are used as appropriate.
[0084]
First, the feature information generation unit 9 initializes image data to be output (sets all pixels to “0”) and opens a text file for writing text data to be output (step S41). Then, the feature information generation unit 9 extracts a cluster that can be traced to the last frame (clusters having the same cluster number exist in the last frame) from the clusters of the first frame (step S42).
[0085]
Here, the feature information generation means 9 analyzes the positions PTIps of all the nodes of the extracted cluster, and sets the coordinates of the four corners of the rectangle that encloses all the nodes most tightly (that is, with the smallest area) (step S43).
Further, the feature information generating unit 9 selects a representative texture structure PTX having a representative image feature PTXfv closest to the average image feature PCfv of all nodes belonging to the cluster (Step S44).
[0086]
Further, the feature information generation means 9 calculates the position centroid (sx, sy) of the cluster in the first frame from the positions PTIps of all the nodes of the cluster (step S45), and calculates the position centroid (ex, ey) of the cluster in the last frame that has the same cluster number as that cluster (step S46). Then, the difference between the position centroids (ex-sx, ey-sy) is calculated as the motion vector of the cluster (step S47).
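The geometry of steps S43 and S45 to S47 is simple enough to show directly. In this sketch the enclosing rectangle is taken to be axis-aligned, which is an assumption (the text only asks for the rectangle of smallest area), and the function names are hypothetical.

```python
def bounding_rectangle(points):
    """Four corners of the axis-aligned rectangle enclosing all node positions."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)
    return [(left, top), (right, top), (right, bottom), (left, bottom)]


def centroid(points):
    """Position centroid (mean x, mean y) of the node positions."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def motion_vector(first_points, last_points):
    """Difference (ex - sx, ey - sy) between last-frame and first-frame centroids."""
    sx, sy = centroid(first_points)
    ex, ey = centroid(last_points)
    return (ex - sx, ey - sy)
```

Because only the first and last frames enter the calculation, the motion of a cluster over the whole scene is reduced to a single straight displacement.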
[0087]
Then, the feature information generation means 9 generates image data in which the inside of the rectangle obtained in step S43 is filled with the representative texture image PTXimg of the representative texture structure PTX selected in step S44, and in which the motion vector (ex-sx, ey-sy) is drawn as a straight line starting from the position centroid (sx, sy) calculated in step S45 (step S48).
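The image data of step S48 could be produced along the following lines. The sketch assumes single-channel integer pixels, integer coordinates, a rectangle given by inclusive bounds, and a simple interpolated line; the function name `render_cluster` and the `line_value` parameter are hypothetical.

```python
def render_cluster(width, height, rect, texture, start, vector, line_value=255):
    """Tile `texture` inside `rect` and draw the motion vector from `start`.

    rect is (left, top, right, bottom) with inclusive pixel bounds.
    """
    canvas = [[0] * width for _ in range(height)]  # S41: all pixels set to 0
    left, top, right, bottom = rect
    th, tw = len(texture), len(texture[0])
    for y in range(top, bottom + 1):               # S48: fill with the texture
        for x in range(left, right + 1):
            canvas[y][x] = texture[(y - top) % th][(x - left) % tw]
    sx, sy = start
    vx, vy = vector
    steps = max(abs(vx), abs(vy), 1)
    for i in range(steps + 1):                     # S48: straight line
        x = round(sx + vx * i / steps)
        y = round(sy + vy * i / steps)
        if 0 <= x < width and 0 <= y < height:
            canvas[y][x] = line_value
    return canvas
```

Repeating the texture with the modulo operator lets a small representative texture image cover a rectangle of any size.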
[0088]
Further, the feature information generation means 9 writes into the text file the first frame number, the last frame number, the coordinates of the four corners of the rectangle (rectangular coordinate information), the representative image feature amount of the representative texture structure PTX selected in step S44, and the motion vector (ex-sx, ey-sy) (step S49).
[0089]
Through the above operation, the feature information generation unit 9 generates and outputs video feature information (scene simple information) characterizing a scene as image data and text data (text file).
[0090]
[Effects of the invention]
As described above, the video feature information generation method, the video feature information generation device, and the video feature information generation program according to the present invention have the following excellent effects.
[0091]
According to the invention described in claim 1, rough features such as the position, size, and motion vector of an object (subject) and the background can be generated from video data as video feature information, and this information can be used as search information. Because the video feature information expresses only rough features, it is easy to express the searcher's search intention, and video data that matches that intention can be retrieved.
[0092]
  In addition, according to the present invention, when the searcher searches from the video data, the number of search conditions (parameters) can be reduced, so that the searcher can easily perform the search operation. Furthermore, since the search conditions are simplified, there is an effect that the search speed can be increased.
In addition, according to the present invention, image data in which a moving area is filled with a texture that characterizes a scene can be generated as video feature information, so that the features of a scene of the video data can be grasped visually. The searcher can thus visually recognize the changes from scene to scene from the video feature information in the form of image data, and can quickly find the desired scene of the video data.
[0093]
According to the invention described in claim 2, scenes of the video data are detected and video feature information is generated for each scene, so that simple video feature information can be generated automatically even for video data in which scenes change, such as movies.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a video feature information generation apparatus according to the present invention.
FIG. 2 is a flowchart showing the operation of the video feature information generation apparatus according to the present invention.
FIG. 3 is a flowchart showing the operation of the node clustering means.
FIG. 4 is a flowchart showing the operation of the texture clustering means.
FIG. 5 is a flowchart showing an operation of feature information generating means.
FIG. 6A is a data structure diagram showing a data structure of node information.
FIG. 6B is a data structure diagram showing the data structure of node cluster information.
FIG. 6C is a data structure diagram showing the data structure of texture cluster information.
FIG. 6D is a data structure diagram showing the data structure of a representative texture structure.
FIG. 7 is a diagram illustrating an example of a frame image in which nodes are set (arranged) in a grid pattern.
FIG. 8 is an explanatory diagram for explaining a state in which nodes are tracked on a frame image.
FIG. 9 is an explanatory diagram for explaining the node cluster separation processing of the node clustering means.
FIG. 10 is a diagram illustrating an example of an image of video feature information generated as image data.
FIG. 11 is a diagram illustrating an example of data content of video feature information generated as text data.
[Explanation of symbols]
1 Video feature information generator
2 Scene division means
3 Frame image sampling means
4 Node initial setting means
4a Neighborhood image feature amount calculation unit
5 Node tracking means
5a Neighborhood image feature amount calculation unit
6 Node clustering means (node classification means)
7 Texture clustering means (class image feature generation means)
8 Cluster information storage means (storage means)
9 Feature information generation means
9a Scene feature information generator
9b Initialization part
20 Video storage means

Claims

A video feature information generation device that generates, for each scene of video data, video feature information characterizing the scene, the device comprising:
frame image sampling means for extracting frame images from the video data at a specific sampling interval;
node initial setting means for setting, in the first frame of the frame images extracted by the frame image sampling means, nodes that serve as references for extracting image features;
node tracking means for tracking the nodes in the frame images following the first frame, based on the image feature amounts of the neighborhood image regions of the nodes set by the node initial setting means;
node classification means for classifying the nodes, for each frame image, based on the image feature amount of the neighborhood image region of each node tracked by the node tracking means;
class image feature amount generation means for generating, as a class image feature amount representing a class, the image feature amount having the smallest distance from the average image feature amount within the class, from among the image feature amounts of the frame images from the first frame to the last frame of the scene that correspond to the neighborhood image regions of the nodes classified into the same class by the node classification means; and
feature information generation means for generating, as video feature information in the form of text data, the class image feature amount generated by the class image feature amount generation means, coordinate information of a rectangular region containing the nodes in the class, and motion information that is the motion vector of the position centroid of the rectangular region from the first frame to the last frame of the scene, and for generating, as video feature information in the form of image data, an image in which the rectangular region, partitioned based on the neighborhood image regions classified by the node classification means, is filled with a texture extracted from the frame image most similar to the class image feature amount.

The video feature information generation device according to claim 1, further comprising scene dividing means for dividing the video data into scenes, wherein the frame image sampling means extracts frame images at a specific sampling interval for each scene divided by the scene dividing means.