JP3838483B2 - Audio summary information, audio video summary information extraction device, playback device, and recording medium

Info

Publication number: JP3838483B2
Application number: JP2000396820A
Authority: JP
Inventors: 勝菅野; 康之中島; 広昌柳原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2000-12-27
Filing date: 2000-12-27
Publication date: 2006-10-25
Anticipated expiration: 2020-12-27
Also published as: JP2002199332A

Description

【０００１】
【発明の属する技術分野】
本発明は、非圧縮または圧縮されたオーディオ情報、または圧縮されたオーディオビデオ情報から、それらの内容を効率的に把握するための概要情報（サマリ）を抽出するオーディオ情報、オーディオビデオ概要情報の抽出装置および記録媒体に関する。また、非圧縮または圧縮されたオーディオ情報、または圧縮されたオーディオビデオ情報から圧縮データ領域で高速に概要情報を抽出することにより、オーディオ情報またはオーディオビデオ情報の高速かつ効率的な閲覧を提供することが可能な、オーディオビデオ概要情報の再生装置に関する。
【０００２】
【従来の技術】
ビデオの自動要約作成については、例えば田中、脇本、神田による「シーン検出による動画情報の自動要約・閲覧技術の開発」、電子情報通信学会技術報告、IE99-20（1999）で研究発表（第１の従来技術）されており、該研究発表では、ビデオのシーン変化点を検出した後、階層構造化を行い、各シーンに優先度を付与することによってビデオの要約を自動的に作成している。特に、話題の転換点直後のシーンには高い優先度が付与される。階層構造においては、上位階層ほど優先度が高くなるように設定される。
【０００３】
また、J.Saarela、B.Merialdoによる「Using content models to build audio-video summaries」、SPIE Conference on Strorage and Retrieval for Image and Video Databases VII（1999）においては（第２の従来技術）、汎用的なビデオの概要情報（サマリ）の作成を、制約つきの最適化問題として捉えている。制約としては、最小のショット長、オーディオとビデオの同期、ビデオの連続性、及びオーディオとビデオの冗長性などである。そして、手動によってビデオの内容モデル（記号的な記述）を構築し、サマリの作成を行っている。
【０００４】
また、R.Lienhartらによる「Video Abstracting」、Communications of ACM、Vol.40、No.12（1997）では（第３の従来技術）、映画の予告編に特化した概要情報の作成を目的としている。主な作成手順としては、ビデオのショットへの分割、特別なイベントを含むクリップの解析、クリップの選択、およびクリップの集約である。特別なイベントとしては、俳優の顔の認識・会話の識別、タイトルからの文字情報の抽出、及び銃撃や爆発などのイベントを抽出している。
【０００５】
【発明が解決しようとする課題】
これらの従来技術によるオーディオビデオの概要情報抽出は、次のような問題を有している。まず、前記第１の従来技術においては、ビデオの階層構造化により効率的な要約作成を提供するが、階層構造化は手動で行う必要があり、長尺のオーディオビデオ情報に関しては概要情報の抽出に必要な前処理としての階層構造化に、多くの時間を要する可能性がある。また、優先度の付与はシーンが属する階層に依存して行われるため、実質的には人手を要することが多くなる。また、ビデオ単体が対象となっているためオーディオビデオへの拡張は可能であるが、オーディオとしての特性を利用していないため、場合によっては重要な内容を含むオーディオ情報を利用していないため、適切な概要情報が得られないことも考えられる。
【０００６】
また、前記第２の従来技術においては、ビデオの内容モデルの構築を手動で行わなければならないほか、効果的にオーディオの特性を利用する方式を採っていない。同時に、ビデオのセグメントの分類においては、圧縮データ上では実現が困難な、高度な認識技術などが必要となるため、概要情報抽出に要する処理が大きくなることが予想される。前記第３の従来技術に関しても、これらの高度な（コンテンツの意味内容にまで立ち入った）処理が必要となっている。
【０００７】
このように、従来技術ではオーディオビデオ情報の入力から概要情報の出力までの過程において、手動による処理が介在することが多く、また、オーディオ情報を効果的に利用しないため、オーディオとしての特性による概要情報の作成が考慮されていない。また、圧縮または非圧縮のオーディオ単体からの自動概要情報抽出については、これまで有効な方式は検討されていない。
【０００８】
また、概要情報の構造的な側面から見ても、オーディオビデオ情報全体から均一に概要情報が抽出される保証はなく、さらに前記第１と第２の従来技術では、外部から指定された概要情報長に近づけるような制御を行う方式も十分には採用されていない。
【０００９】
本発明は、前記した従来技術に鑑みてなされたものであり、その目的は、非圧縮または圧縮されたオーディオ情報、または圧縮されたオーディオビデオ情報から、それらの内容を効率よく把握するための概要情報（サマリ）を抽出する装置において、圧縮データ領域でオーディオ及びビデオの時空間的な特性を解析し、必要に応じてこれらを統合的に評価し、オーディオまたはオーディオビデオ情報の指定した長さを持つ概要情報を、コンテンツ全体から均一に、かつ高速に抽出することを可能とした、オーディオ概要情報、オーディオビデオ概要情報の抽出装置、再生装置および記録媒体を提供することにある。
【００１０】
【課題を解決するための手段】
前記した目的を達成するために、本発明は、オーディオビデオ概要情報の抽出装置において、入力されたオーディオビデオコンテンツのビデオ情報の時間的構造を解析するコンテンツ解析手段と、時間的構造を解析された該ビデオ情報に付随するオーディオ情報のオーディオレベルを評価するオーディオレベル評価手段と、該オーディオレベル評価手段により評価されたオーディオビデオ概要情報を登録する概要情報登録手段とを具備した点に第１の特徴がある
【００１１】
また、本発明は、オーディオ概要情報の抽出装置において、入力された圧縮オーディオコンテンツからサブバンドデータを抽出する手段と、該サブバンドデータから、バンドにより重み付けしたサブバンドエネルギーの総和を計算する手段と、単位時間におけるサブバンドエネルギー総和を評価する手段と、オーディオ概要情報を抽出する手段とを具備した点に第２の特徴がある。
【００１２】
また、本発明は、入力されたオーディオビデオコンテンツのビデオ情報の時間的構造を解析する機能と、時間的構造を解析された該ビデオ情報に付随するオーディオ情報のオーディオレベルを評価する機能と、画像内の動きアクティビティを評価する機能と、前記オーディオレベルと動きアクティビティを評価されたオーディオビデオ概要情報を登録する概要情報登録機能とを含む、コンピュータに実行させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体を提供するようにした点に第３の特徴がある。
【００１３】
さらに、本発明は、オーディオビデオ概要情報の再生装置において、オーディオビデオ概要情報の抽出装置により抽出されたビデオ要素とオーディオ要素を同期して再生する手段を具備した点に第４の特徴がある。
【００１４】
【発明の実施の形態】
以下に、図面を参照して、本発明を詳細に説明する。図１は、本発明の第１の実施形態であるオーディオビデオ概要情報の抽出装置の構成を示すブロック図である。
【００１５】
まず、圧縮されたオーディオビデオ情報ＡＶは、該オーディオビデオ情報ＡＶの時間的構造を解析するコンテンツ解析部１に入力される。該コンテンツ解析部１は、図２に示されているような構成を有しており、該圧縮されたオーディオビデオ情報ＡＶは、まずショット分割部１１に入力する。該ショット分割部１１は、ビデオ情報をショット単位Ｓに分割し、次いでコンテンツ全体に対するショット数をカウントし、全ショット数をＮＳとして保持する。次に、分割数決定部１２は、ショット分割部１１から入力される全ショット数ＮＳと、オーディオビデオ情報ＡＶから得られるコンテンツ長ＣＬと、外部から指定される概要情報長ＳＬを用いて、下記の（１）式により分割数ＮＰを決定する。
【００１６】
ＮＰ＝ＮＳ×ＳＬ／ＣＬ・・・（１）
ただし、ＳＬ＞ＣＬ／ＮＳであるとする。例えば、コンテンツ長ＣＬが１時間のものを５分の概要情報長ＳＬに纏めようとすると、前記全ショット数ＮＳが２４０個の場合、分割数ＮＰは２０となる。そして、コンテンツ分割１３においてコンテンツをＮＰ等分する。等分割された区間ＳＡＶと、各区間に属するショットＳとをマッピングする。
【００１７】
以下では、分割区間入力１４で入力された区間ＳＡＶ毎に処理を行う。
まず、ショット入力１５で最初のショットＳnが入力される。ショット長評価部１６では、入力されたショットＳnのショット長ＳＨＬnが予め指定した長さＴ以下の場合に、概要情報ＳＵＭの候補から除外し、次のショットＳn+1の入力処理へ移行する。一方、ショットＳnのショット長ＳＨＬnがTよりも長い場合には、該当するショットＳnを概要情報ＳＵＭの候補とし、代表フレーム抽出部１７に送る。
【００１８】
代表フレーム抽出部１７では、ショットSnの代表フレームRFSnとして、ショットの先頭フレームSFSnまたは特徴フレームKFSnを抽出する。特徴フレームKFSnの抽出には、例えば特願2000-065259に記載された方法などを用いることができる。さらに、代表フレーム特徴値抽出部１８において、代表フレーム抽出部１７で抽出されたフレームRFSnの特徴値CHFSnを抽出し、ショット特徴値抽出部１９において、ショットSnの特徴値CHSnを抽出する。
【００１９】
ショットSnに対する代表フレームRFSnとしての先頭フレームSFSnまたは特徴フレームKFSnの特徴値CHFSnと、ショットSnの特徴値CHSnのいずれか一方又は両方は、代表ショット評価部１Ａに送られる。先頭フレームSFSnまたは特徴フレームKFSnの特徴値CHFSnとしては、例えばMPEG-7（Moving Picture Experts Group phase 7）で規定されている記述子などを用いることができる。
【００２０】
代表ショット評価部１Ａでは、対象となる分割区間SAVにおいて既に代表ショットRSとして登録されている全てのショットに関する代表フレームRFの特徴値CHFと、代表フレーム特徴値抽出部１８から送られたショットSnの代表フレームRFsの特徴値CHFSnとの間で類似度の判定を行う。ここで、特徴値CHFと入力されたショットSnの特徴値CHFSnとの類似度が大きいと判定された場合には反復ショットであると見なされ、該当するショットSnを概要情報SUMの候補から除外し、次のショットSn+1の入力処理へ移行する。一方、特徴値CHFとショットSnの特徴値CHFSnとの類似度が小さいと判定された場合には、独立ショットであると見なされ、該当するショットSnを概要情報SUMの候補とし、代表ショット登録１Ｂでショットの登録を行ったあと、オーディオレベル評価部２に送る。なお、代表ショットとして登録されたショットSnに関する特徴値は別途保持され、それ以降の代表ショットの評価に用いられる。
【００２１】
ショット類似度の判定においては、同様に既に代表ショットRSとして登録されているショットに関する代表ショットの特徴値CHSと、ショット特徴値抽出部１９から送られたショットSnの特徴値CHSnを代用するか、或いは併用してもよい。
【００２２】
オーディオレベル評価部２では、コンテンツ解析部１から入力されたショットSnに関して、図３のサブバンドデータ抽出部２１でショットSnのオーディオ部分のサブバンドデータSDSnを抽出する。そして、無音解析部２２で無音としての特徴値を計算したあと、無音判定部２３でショットSnの全ての区間、またはショットSnの区間のY%以上が無音であると判定された場合に、該当するショットSnを概要情報SUMの候補から除外し、次のショットSn+1の入力処理へ移行する。一方、無音判定部２３での判定において、ショットSnのオーディオ部分がショットSnの区間の（１００−Ｙ）％以上が有音である場合に、ショットSnは低レベル音解析部２４へ送られる。無音解析部２２での無音解析方法及び無音判定部２３での無音判定方法としては、例えば特願平10-235543号に記載された方法などを用いることができる。
【００２３】
低レベル音解析部２４では、同様にサブバンドデータ抽出部２１で抽出されたサブバンドデータSDSnから該当するオーディオのレベルLSnを推定し、指定された十分に低いレベルTHLL以下のオーディオがショットSnの全ての区間、またはショットSnの区間のＺ％以上を占める場合に、該当するショットSnを概要情報SUMの候補から除外し、次のショットSn+1の入力処理へ移行する。一方、低レベル音解析部２４でショットSnのオーディオ部分が、ショットSnの区間の（１００−Ｚ）％以上がオーディオレベルTHLLを超えるオーディオである場合に、該当するショットSnを概要情報SUMの候補とし、高レベル音解析部２６に送る。
【００２４】
高レベル音解析部２６では、サブバンドデータ抽出部２１から得られたショットSnに関するサブバンドデータSDSnに基づいて、図４の単位時間密度計算部２６１でショットSnにおけるあるレベルTHHL以上のオーディオの単位時間密度Dsnを計算する。このとき、オーディオ情報がMPEGオーディオで符号化されている場合、1秒当りの単位時間密度Dsnは、例えば以下のように求めることができる。
【００２５】
Dsn＝（ＮＡＦ_THHL／ＮＡＦ）×ＡＡＬ_THHL ・・・（２）
ここで、NAF_THHLはレベルTHHL以上を持つ1秒間当りのフレーム数、NAFは1秒当りのオーディオフレーム数、AAL_THHLはNAF_THHLに対する平均レベルである。
【００２６】
また、サブバンドエネルギー総和計算部２６２においてサブバンドにより重み付けされたサブバンドエネルギーの総和SBEを計算し、それに基づいて単位時間サブバンドエネルギー計算部２６３で単位時間でのサブバンドエネルギー総和SBsnを計算する。1秒当りの単位時間サブバンドエネルギーSBsnは、例えば以下の（３）式のように求めることができる。

ここで、αkはサブバンドkに対する重み付け、sbkはあるフレームにおけるサブバンドkのエネルギーである。
【００２７】
次に、単位時間密度判定部２６４で該当するショットSnがある閾値THDを超える単位時間密度Dsnを持つ場合に、単位時間サブバンドエネルギー判定部２６５へ移行する。単位時間サブバンドエネルギー判定部２６５では、閾値THSBを超える単位時間サブバンドエネルギーSBsnが存在する場合、該当するオーディオを含むショットSnを概要情報の候補として判定し、高レベル音解析ルーチンを抜け出し、動きアクティビティ評価部３へ移行する。
【００２８】
これに対し、オーディオレベルTHHL以上のオーディオの単位時間密度DSnが閾値THD未満の場合、或いは単位時間サブバンドエネルギーSBSnが閾値THSB未満の場合、該当するオーディオを含むショットSnを概要情報の候補から除外し、次のショットSn+1の入力処理へ移行する。
【００２９】
なお、図３の構成において、無音解析部２２，低レベル音解析部２４、および高レベル音解析部２６は、必ずしも全部は必要でなく、少なくとも１つを備えておれば良い。
【００３０】
動きアクティビティ評価部３では、オーディオレベル評価部２から入力されたショットSnに関して、図５の動きベクトル抽出部３１でショットSnに属する全てのフレームの動きベクトル情報MVを抽出する。そして、動きアクティビティ計算部３２において、動きベクトル情報MVを用いてショットSn全体としての動きアクティビティMASnを計算し、それを用いて単位時間動きアクティビティ計算部３３において、ある単位時間（例えば1秒、1フレームなど）における動きアクティビティMAを計算する。動きアクティビティ計算部３２及び単位時間動きアクティビティ３３における処理は、例えばMPEG符号化されたビデオに対して以下のように表せる。ここでは1秒当りの動きアクティビティと仮定する。

ここで、ASMVは大きさがX以上の動きベクトルのフレーム内絶対値総和、NMBはフレーム内での大きさがX以上の動きベクトルを持つマクロブロック数、NPSnはショット内の予測符号化フレーム数、NVFは1秒当りのビデオフレーム数である。
【００３１】
動きアクティビティ判定部３４において、求められた動きアクティビティMAと、ある指定された閾値THMAを比較し、MAがTHMAを超える場合に、該当するショットSnを概要情報SUMの候補から除外し、次のショットSn+1の入力処理へ移行する。一方、動きアクティビティ判定部３４でショットSnのある単位時間における動きアクティビティMAが閾値THMAを超えない場合に、該当するショットSnを概要情報SUMの候補とする。
【００３２】
以上の処理により、オーディオビデオ情報AVを等間隔に分割した区間SAVにおいて、概要情報SUMの候補として選択されたショットSは概要情報登録部４に入力され、ショットSの情報が図６のショットメモリ４１に保存される。次いで、区間ＳＡＶでの全ショットの処理が終了したか否かの判断が判断部４２でなされ、この判断が否定の時には次のショットの処理が行われる。
【００３３】
一方、区間ＳＡＶの全ショットの処理が終了した時には、区間SAVにおける全てのショットSに関するショット長の総和SSLは、概要情報長判定部４３において、各区間での平均概要情報長MSL（=SL/NP）に十分近いかどうかを判断される。区間SAVでのショット長総和SSLと平均概要情報長MSLが十分に近いと見なされた場合には、区間SAVでの概要情報抽出処理を終了し、概要情報登録４４で時間情報などが登録される。そして次の区間SAVn+1の入力処理へ移行する。このとき、代表ショット登録１Ｂで登録されていた代表ショットに関する各種特徴値はクリアされる。一方、SSLとMSLが近いと見なされない場合には、閾値変更部４７で、概要情報SUMの候補としての判定に用いる一部または全ての閾値の値を変更し、それまでに抽出されている概要情報SUMの候補を対象として、SSLとMSLが十分近くなるまでショット長評価部１６から動きアクティビティ評価部３での処理を再帰的に繰り返す。
【００３４】
これらの処理を、全ての区間SAVについて行い、最終的にオーディオビデオ情報AVの概要情報SUMを得て、概要情報SUMが登録される。このとき、概要情報の記述が指定されていれば、概要情報記述部５へ移行し、指定されていなければ処理を終了する。
【００３５】
概要情報記述部５では、図７に示されているように、概要情報SUMとして抽出された全てのショットSについて、概要情報記述部５１でそれらの時間情報を少なくとも記述し、概要情報出力部５２で概要情報記述ファイルとして出力する。記述するフォーマットとしては、例えばMPEG-7で規定されているフォーマットなどを用いることができる。全ての区間SAVについて記述が終了すると、全ての処理を終了する。また、概要情報SUMとして抽出された全てのショットSを結合して、別ファイルとして保存することができる。このとき、オーディオビデオファイルとして保存するか、オーディオとビデオを個別に保存することができる。
【００３６】
次に、前記した実施形態のオーディオビデオ概要情報抽出装置の機能は、ソフトウェア（プログラム）で実現することができる。該ソフトウェアは、光ディスク、柔軟ディスク、ハードディスク等の記録媒体に記録することができる。
【００３７】
図８は、該記録媒体１００に記録されるプログラムの一例を示すものであり、該記録媒体１００には、圧縮オーディオビデオ情報のコンテンツ解析機能１１１、オーディオレベル評価機能１１２、動きアクティビティ評価機能１１３、概要情報登録機能１１４、および概要情報記述機能１１５が記録される。なお、該動きアクティビティ評価機能１１３は、省略してもよい。
【００３８】
また、前記コンテンツ解析機能１１１は、ビデオ情報をショットに分割する機能と、入力コンテンツをある基準に従って等間隔の区間に分割する機能と、該等間隔の区間に含まれるショットの長さを評価する機能および反復ショットを判定する機能の少なくとも一方とから構成することができる。また、前記オーディオレベル評価機能１１２は、無音を判定する機能、低レベル音を判定する機能、および高レベル音を判定する機能から構成することができる。
【００３９】
また、前記動きアクティビティ評価機能１１３は、前記ショットに属するフレームの動きベクトルデータを抽出する機能と、該抽出された動きベクトルデータから動きアクティビティを計算する機能と、単位時間における動きアクティビティを計算する機能と、該単位時間における動きアクティビティを用いて概要情報の候補を判定する機能とから構成することができる。
【００４０】
また、抽出された概要情報を、オーディオビデオとして結合するか、またはオーディオとビデオ個別に結合するかをし、該結合した概要情報を別ファイルとして、記録媒体１００に記録することができる。
【００４１】
なお、前記記録媒体１００には、ネットワークのように、データを一時的に記録保持するような伝送媒体も含まれる。
【００４２】
図９は、本発明の第２の実施形態であるオーディオ概要情報の抽出装置の構成を示すブロック図である。
【００４３】
まず、圧縮されたオーディオ情報CAが入力されると、サブバンドデータ抽出部Ａ１でサブバンドデータSDを抽出する。抽出されたサブバンドデータSDは高レベル音評価部Ａ３に送られる。サブバンドデータ抽出部Ａ１の動作としては、第１の実施形態に示した無音・低レベル音評価部５におけるサブバンドデータ抽出部５１と同様である。一方、非圧縮のオーディオ情報UAが入力されると、サブバンド解析部Ａ２で入力オーディオがサブバンド解析され、解析された結果としてのサブバンドデータSDは同様に高レベル音評価部Ａ３に送られる。
【００４４】
高レベル音評価部Ａ３では、第１の実施形態に示した高レベル音解析部２６に含まれるサブバンドエネルギー総和計算部２６２と、単位時間サブバンドエネルギー総和計算部２６３と同様の機能を持つ、図１０のサブバンドエネルギー総和計算部Ａ３１と、単位時間サブバンドエネルギー総和計算部Ａ３２により、入力されたサブバンドデータSDから、それぞれサブバンドエネルギー総和SBEと単位時間でのサブバンドエネルギー総和SBが計算される。
【００４５】
次に、概要情報開始時間決定部Ａ３３において、単位時間サブバンドエネルギー総和SBが最大となる時間位置を、概要情報開始時間T_startとして決定する。また、概要情報終了時間決定部Ａ３４では、単位時間サブバンドエネルギー総和SBが最大値のα倍（0<α<1）となる時間位置を、概要情報終了時間T_endとして決定する。このとき、T_start＞T_endである。
【００４６】
概要情報登録部Ａ４では、高レベル音評価部Ａ３で決定された概要情報開始時間T_startと、概要情報終了時間T_endに基づいて概要情報を登録する。そして、概要情報記述部Ａ５において上記時間情報を少なくとも記述し、概要情報記述ファイルとして出力する。記述するフォーマットとしては、例えばMPEG-7で規定されているフォーマットなどを用いることができる。
【００４７】
オーディオ情報が複数存在する場合には、上記の処理を全てのオーディオ情報に対して行う。
【００４８】
前記した実施形態のオーディオ概要情報抽出装置の機能は、ソフトウェア（プログラム）で実現することができ、該ソフトウェアは、光ディスク、フロッピーディスク、ハードディスク等の記録媒体１００に記録することができる。また、抽出されたオーディオ概要情報は、個別のファイルとして、該記録媒体１００に記録することができる。
【００４９】
図１１は、該記録媒体１００に記録されるプログラムの一例を示すものであり、該記録媒体１００には、サブバンドデータ抽出機能１２１または／およびサブバンド解析機能１２２、高レベル音評価機能１２３、概要情報登録機能１２４、および概要情報記述機能１２５が記録される。
【００５０】
図１２は、本発明のオーディオビデオの概要情報再生装置の一実施形態を、構成図として表したものである。
【００５１】
前記手段により抽出されたオーディオビデオの概要情報SUMが入力されると、オーディオビデオ分離部Ｐ１において、概要情報のビデオ要素VSUMとオーディオ要素ASUMに分離される。次に、ビデオ速度変換部Ｐ２では、外部から与えられた変換速度パラメータSPに従ってビデオ要素VSUMを空間的に間引いてビデオの再生速度を変換する。同様にして、オーディオ要素ASUMはオーディオ速度変換部Ｐ３において変換速度パラメータSPに従ってビデオ要素VSUMと同じ割合で時間的に間引かれ、オーディオの再生速度を変換する。オーディオの再生速度変換としては、例えばオーディオのフレームの周期的なスキップや、繰り返し再生と周期的スキップの組み合わせなどによって実現することができる。ここで、オーディオを1.5倍の速度にする場合、前者では
＜再生するフレーム番号＞ 1、2、4、5、7、8、10、11、…
と連続する2フレームを再生し、次に続く1フレームスキップすることによって達成される。また後者では、
＜再生するフレーム番号＞ 1、1、4、4、7、7、10、10、…
と同一フレームを2回繰り返して再生し、次に続く2フレームをスキップすることによって達成される。
【００５２】
速度を変換されたビデオ要素VSUM´及びオーディオ要素ASUM´は、オーディオビデオ多重化・同期部Ｐ４に入力され、多重化及び同期処理が行われ、速度変換されたオーディオビデオの概要情報SUM´が得られる。得られたオーディオビデオの概要情報SUM´は、表示再生される。
【００５３】
【発明の効果】
以上の説明から明らかなように、本発明によれば、非圧縮または圧縮されたオーディオ情報、または圧縮されたオーディオビデオ情報に関して、それらの内容を高速かつ効率的に把握するための概要情報を抽出することが可能になる。また、この抽出によって、オーディオビデオ情報の高速な閲覧が可能となる。
【００５４】
また、抽出される概要情報の長さは任意に指定することができると同時に、オーディオビデオ情報から均一に概要情報を抽出するため、コンテンツ全体の把握を効率的に行うことが可能となる。
【００５５】
また、概要情報に含まれる時間情報などを記述することにより、該当するオーディオビデオ情報の概要情報としての特徴記述を行うことが可能となり、コンテンツ記述の標準化であるMPEG-7などへも適用することが可能である。
【００５６】
また、抽出された概要情報の表示速度を変換するなどの、高機能な概要情報の再生を提供することが可能となる。
【図面の簡単な説明】
【図１】本発明の一実施形態の全体構成を示すブロック図である。
【図２】図１のコンテンツ解析部の詳細構成を示すブロック図である。
【図３】図１のオーディオレベル評価部の詳細構成を示すブロック図である。
【図４】図３の高レベル音解析部の詳細構成を示すブロック図である。
【図５】図１の動きアクティビティ評価部の詳細構成を示すブロック図である。
【図６】図１の概要情報登録部の詳細構成を示すブロック図である。
【図７】図１の概要情報記述部の詳細構成を示すブロック図である。
【図８】記録媒体に記録されるプログラムの概要を示す図である。
【図９】本発明の他の実施形態のオーディオ概要情報抽出装置の構成を示すブロック図である。
【図１０】図８の高レベル音評価部の詳細構成を示すブロック図である。
【図１１】記録媒体に記録されるプログラムの概要を示す図である。
【図１２】本発明の他の実施形態のオーディオ概要情報再生装置の構成を示すブロック図である。
【符号の説明】
１・・・コンテンツ解析部、２・・・オーディオレベル評価部、３・・・動きアクティビティ評価部、４・・・概要情報登録部、５・・・概要情報記述部，Ａ１・・・サブバンドデータ抽出部，Ａ２・・・サブバンド解析部，Ａ３・・・高レベル音評価部、Ａ４・・・概要情報登録部、Ａ５・・・概要情報記述部、Ｐ１・・・オーディオビデオ分離部、Ｐ２・・・ビデオ速度変換部、Ｐ３・・・オーディオ速度変換部、Ｐ４・・・オーディオビデオ多重化・同期部、１００・・・記録媒体。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an apparatus and a recording medium for extracting audio summary information and audio video summary information, that is, summary information (a summary) for efficiently grasping the content of uncompressed or compressed audio information or of compressed audio video information. It also relates to a playback apparatus for audio video summary information that enables fast and efficient browsing of audio information or audio video information by extracting the summary information at high speed in the compressed data domain.
[0002]
[Prior art]
Automatic video summarization has been reported, for example, by Tanaka, Wakimoto, and Kanda in “Development of Automatic Video Information Summarization and Browsing Technology by Scene Detection”, IEICE Technical Report, IE99-20 (1999) (first prior art). In that work, scene change points in the video are detected, the scenes are organized into a hierarchical structure, and a video summary is created automatically by assigning a priority to each scene. In particular, a high priority is given to the scene immediately after a change of topic. Within the hierarchical structure, higher layers are assigned higher priority.
[0003]
In “Using content models to build audio-video summaries” by J. Saarela and B. Merialdo, SPIE Conference on Storage and Retrieval for Image and Video Databases VII (1999) (second prior art), the creation of general-purpose video summary information (a summary) is treated as a constrained optimization problem. The constraints include minimum shot length, synchronization of audio and video, video continuity, and redundancy between audio and video. A content model of the video (a symbolic description) is constructed manually, and the summary is created from it.
[0004]
R. Lienhart et al., “Video Abstracting”, Communications of the ACM, Vol. 40, No. 12 (1997) (third prior art) aims to create summary information specialized for movie trailers. The main steps are dividing the video into shots, analyzing clips that contain special events, selecting clips, and assembling the clips. The special events extracted include recognition of actors' faces, identification of dialogue, extraction of text information from titles, and events such as gunfire and explosions.
[0005]
[Problems to be solved by the invention]
Extraction of audio video summary information by these prior arts has the following problems. First, the first prior art provides efficient summarization through hierarchical structuring of the video, but the hierarchical structuring must be performed manually, and for long audio video content this preprocessing step may take considerable time. In addition, because priority is assigned according to the layer to which a scene belongs, manual work is often required in practice. Furthermore, although the method targets video alone and could be extended to audio video, it does not exploit the characteristics of the audio, so audio information that may carry important content goes unused and appropriate summary information may not be obtained.
[0006]
In the second prior art, the video content model must be constructed manually, and no method is adopted that makes effective use of audio characteristics. In addition, classifying video segments requires advanced recognition techniques that are difficult to realize on compressed data, so the processing required for extracting summary information is expected to be heavy. The third prior art likewise requires such advanced processing, which goes into the semantic content of the material.
[0007]
As described above, in the prior art, manual processing often intervenes between the input of the audio video information and the output of the summary information, and because audio information is not used effectively, summary creation based on the characteristics of the audio is not considered. Moreover, no effective method has so far been studied for automatic extraction of summary information from compressed or uncompressed audio alone.
[0008]
From the structural viewpoint of the summary information as well, there is no guarantee that summary information is extracted uniformly from the entire audio video information. Furthermore, the first and second prior arts do not adequately provide control that brings the summary close to an externally specified summary information length.
[0009]
The present invention has been made in view of the prior art described above. Its object is to provide an extraction apparatus, a playback apparatus, and a recording medium for audio summary information and audio video summary information that, in extracting summary information (a summary) for efficiently grasping the content of uncompressed or compressed audio information or compressed audio video information, analyze the spatio-temporal characteristics of the audio and video in the compressed data domain, evaluate them in an integrated manner as necessary, and extract summary information of a specified length uniformly from the entire content and at high speed.
[0010]
[Means for Solving the Problems]
In order to achieve the above object, a first feature of the present invention is an audio video summary information extraction apparatus comprising content analysis means for analyzing the temporal structure of the video information of input audio video content, audio level evaluation means for evaluating the audio level of the audio information accompanying the video information whose temporal structure has been analyzed, and summary information registration means for registering the audio video summary information evaluated by the audio level evaluation means.
[0011]
A second feature of the present invention is an audio summary information extraction apparatus comprising means for extracting subband data from input compressed audio content, means for calculating from the subband data a sum of subband energies weighted by band, means for evaluating the subband energy sum per unit time, and means for extracting audio summary information.
[0012]
A third feature of the present invention is the provision of a computer-readable recording medium on which is recorded a program to be executed by a computer, the program including a function for analyzing the temporal structure of the video information of input audio video content, a function for evaluating the audio level of the audio information accompanying the video information whose temporal structure has been analyzed, a function for evaluating motion activity within the image, and a summary information registration function for registering the audio video summary information for which the audio level and the motion activity have been evaluated.
[0013]
Furthermore, the present invention has a fourth feature in that a playback device for audio video summary information includes means for synchronizing and playing back video elements and audio elements extracted by the audio video summary information extraction device.
[0014]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an audio video summary information extraction apparatus according to the first embodiment of the present invention.
[0015]
First, the compressed audio video information AV is input to the content analysis unit 1, which analyzes the temporal structure of the audio video information AV. The content analysis unit 1 has the configuration shown in FIG. 2, and the compressed audio video information AV is first input to the shot division unit 11. The shot division unit 11 divides the video information into shots S, counts the number of shots in the entire content, and holds the total number of shots as NS. Next, the division number determination unit 12 determines the division number NP by the following equation (1), using the total number of shots NS input from the shot division unit 11, the content length CL obtained from the audio video information AV, and the summary information length SL specified from the outside.
[0016]
NP = NS × SL / CL (1)
where SL > CL / NS is assumed. For example, when content with a content length CL of 1 hour is to be condensed into a summary information length SL of 5 minutes and the total number of shots NS is 240, the division number NP is 20. Content division 13 then divides the content into NP equal sections, and each equally divided section SAV is mapped to the shots S that belong to it.
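As an illustration only, equation (1) and the equal division into sections can be sketched as follows. The function names, the rounding of NP to an integer, and the mapping of a shot to a section by its start time are assumptions of this sketch; the specification does not state how these details are handled.

```python
def division_count(total_shots: int, summary_len_s: float, content_len_s: float) -> int:
    """Equation (1): NP = NS * SL / CL, under the stated condition SL > CL / NS."""
    if total_shots <= 0 or content_len_s <= 0:
        raise ValueError("NS and CL must be positive")
    if summary_len_s <= content_len_s / total_shots:
        raise ValueError("equation (1) assumes SL > CL / NS")
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, round(total_shots * summary_len_s / content_len_s))


def section_bounds(content_len_s: float, n_sections: int) -> list[tuple[float, float]]:
    """Divide the content into NP equal sections SAV, as (start, end) in seconds."""
    width = content_len_s / n_sections
    return [(i * width, (i + 1) * width) for i in range(n_sections)]


def assign_shots(shot_starts_s: list[float], content_len_s: float, n_sections: int) -> list[int]:
    """Map each shot to a section index using its start time (an assumption)."""
    width = content_len_s / n_sections
    return [min(int(t // width), n_sections - 1) for t in shot_starts_s]


# The worked example from the text: CL = 1 hour, SL = 5 minutes, NS = 240.
np_sections = division_count(240, 5 * 60, 60 * 60)
```

With the values of the worked example, each of the 20 sections is 180 seconds long.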
[0017]
In the following, processing is performed for each section SAV input by the divided section input 14.
First, the first shot Sn is input by shot input 15. If the shot length SHLn of the input shot Sn is less than or equal to a predetermined length T, the shot length evaluation unit 16 excludes the shot from the candidates for the summary information SUM and proceeds to the input of the next shot Sn+1. If the shot length SHLn is longer than T, the shot Sn is kept as a candidate for the summary information SUM and sent to the representative frame extraction unit 17.
[0018]
The representative frame extraction unit 17 extracts the first frame SFSn or feature frame KFSn of the shot as the representative frame RFSn of the shot Sn. For example, the method described in Japanese Patent Application No. 2000-065259 can be used to extract the feature frame KFSn. Further, the representative frame feature value extraction unit 18 extracts the feature value CHFSn of the frame RFSn extracted by the representative frame extraction unit 17, and the shot feature value extraction unit 19 extracts the feature value CHSn of the shot Sn.
[0019]
Either or both of the feature value CHFSn of the representative frame RFSn (the first frame SFSn or the feature frame KFSn) of shot Sn and the feature value CHSn of shot Sn are sent to the representative shot evaluation unit 1A. As the feature value CHFSn, for example, a descriptor defined in MPEG-7 (Moving Picture Experts Group phase 7) can be used.
[0020]
The representative shot evaluation unit 1A determines the similarity between the feature values CHF of the representative frames RF of all shots already registered as representative shots RS in the current divided section SAV and the feature value CHFSn of the representative frame RFSn of shot Sn sent from the representative frame feature value extraction unit 18. If the similarity between a feature value CHF and the feature value CHFSn of the input shot Sn is judged to be high, the shot is regarded as a repetitive shot, excluded from the candidates for the summary information SUM, and processing moves to the input of the next shot Sn+1. If the similarity is judged to be low, the shot is regarded as an independent shot and kept as a candidate for the summary information SUM; it is registered by representative shot registration 1B and then sent to the audio level evaluation unit 2. The feature values of a shot Sn registered as a representative shot are stored separately and used in subsequent representative shot evaluations.
[0021]
In determining shot similarity, the shot feature values CHS of shots already registered as representative shots RS and the feature value CHSn of shot Sn sent from the shot feature value extraction unit 19 may likewise be used instead of, or together with, the frame feature values.
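The shot length evaluation and the repetitive shot determination within one section can be sketched as follows. This is a minimal sketch under stated assumptions: the feature value is taken to be a normalized histogram and the similarity measure is histogram intersection, neither of which is specified in the text (which only mentions MPEG-7 descriptors as one option). The names `select_representatives`, `min_len_s`, and `sim_threshold` are hypothetical.

```python
def histogram_intersection(a: list[float], b: list[float]) -> float:
    """Similarity in [0, 1] between two normalized histograms (assumed feature type)."""
    return sum(min(x, y) for x, y in zip(a, b))


def select_representatives(shots, min_len_s: float, sim_threshold: float) -> list[int]:
    """Return the indices of shots in one section SAV that remain candidates.

    shots: list of (shot_length_s, feature) tuples in playback order.
    A shot is dropped if its length is <= T (min_len_s), or if its feature is
    similar to that of any shot already registered as a representative shot.
    """
    registered = []  # features of representative shots, cleared for each section
    kept = []
    for index, (length_s, feature) in enumerate(shots):
        if length_s <= min_len_s:
            continue  # too short: excluded by the shot length evaluation
        if any(histogram_intersection(feature, r) >= sim_threshold for r in registered):
            continue  # repetitive shot: excluded
        registered.append(feature)  # independent shot: register and keep
        kept.append(index)
    return kept
```

Creating the `registered` list inside the function mirrors the statement elsewhere in the description that the registered feature values are cleared when processing moves to the next section.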
[0022]
In the audio level evaluation unit 2, the subband data extraction unit 21 of FIG. 3 extracts the subband data SDSn of the audio part of the shot Sn input from the content analysis unit 1. The silence analysis unit 22 then calculates feature values for silence, and if the silence determination unit 23 determines that the whole of shot Sn, or Y% or more of its duration, is silent, the shot is excluded from the candidates for the summary information SUM and processing moves to the input of the next shot Sn+1. If, on the other hand, the silence determination unit 23 determines that (100-Y)% or more of the duration of shot Sn contains sound, the shot is sent to the low level sound analysis unit 24. As the silence analysis method of the silence analysis unit 22 and the silence determination method of the silence determination unit 23, for example, the method described in Japanese Patent Application No. Hei 10-235543 can be used.
[0023]
Similarly, the low level sound analysis unit 24 estimates the audio level LSn from the subband data SDSn extracted by the subband data extraction unit 21. If audio at or below a specified, sufficiently low level THLL occupies the whole of shot Sn, or Z% or more of its duration, the shot is excluded from the candidates for the summary information SUM and processing moves to the input of the next shot Sn+1. If, on the other hand, audio exceeding the level THLL occupies (100-Z)% or more of the duration of shot Sn, the shot is kept as a candidate for the summary information SUM and sent to the high level sound analysis unit 26.
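The two percentage tests above can be sketched as follows. Note that this sketch substitutes a plain per-frame level threshold for the silence determination; the method of the cited application is not reproduced here. The per-frame level list and the parameter names (`silence_level`, `y_percent`, `thll`, `z_percent`) are assumptions for illustration.

```python
def count_at_or_below(levels: list[float], threshold: float) -> int:
    """Number of audio frames whose level is at or below the threshold."""
    return sum(1 for v in levels if v <= threshold)


def passes_quiet_checks(frame_levels: list[float], silence_level: float,
                        y_percent: float, thll: float, z_percent: float) -> bool:
    """Return True if the shot survives the silence and low level sound tests.

    The shot is rejected when Y% or more of its frames are silent, or when
    Z% or more of its frames are at or below the low level THLL.
    """
    n = len(frame_levels)
    if n == 0:
        return False  # no audio frames: treated as silent in this sketch
    # Compare count * 100 with percent * n to avoid floating point rounding
    # of the percentage itself.
    if count_at_or_below(frame_levels, silence_level) * 100 >= y_percent * n:
        return False
    if count_at_or_below(frame_levels, thll) * 100 >= z_percent * n:
        return False
    return True
```

Frames counted as silent are also counted as low level here, since any level at or below the silence threshold is also at or below THLL when THLL is the larger value.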
[0024]
In the high level sound analysis unit 26, based on the subband data SDSn of shot Sn obtained from the subband data extraction unit 21, the unit time density calculation unit 261 of FIG. 4 calculates the unit time density Dsn of audio at or above a certain level THHL in shot Sn. When the audio information is encoded with MPEG audio, the unit time density Dsn per second can be obtained, for example, as follows.
[0025]
Dsn = (NAF_THHL / NAF) × AAL_THHL (2)
Here, NAF_THHL is the number of frames per second having a level of THHL or higher, NAF is the number of audio frames per second, and AAL_THHL is the average level of those NAF_THHL frames.
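Equation (2) can be written out for one second of audio frames as in the following sketch. The representation of the audio as one level value per frame and the function name are assumptions; how the level of a frame is derived from the subband data is not shown.

```python
def unit_time_density(frame_levels: list[float], thhl: float) -> float:
    """Equation (2): Dsn = (NAF_THHL / NAF) * AAL_THHL for one second of frames.

    frame_levels holds one level per audio frame in the second (NAF values).
    """
    naf = len(frame_levels)
    loud = [v for v in frame_levels if v >= thhl]  # the NAF_THHL frames
    if naf == 0 or not loud:
        return 0.0  # no frame reaches THHL: density taken as zero in this sketch
    aal_thhl = sum(loud) / len(loud)  # average level of the loud frames
    return (len(loud) / naf) * aal_thhl
```

For four frames with levels 10, 20, 30, and 40 and THHL = 30, two frames qualify with an average level of 35, so the density is 0.5 × 35 = 17.5.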
[0026]
In addition, the subband energy sum calculation unit 262 calculates the sum SBE of the subband energies weighted by subband, and based on it the unit time subband energy calculation unit 263 calculates the subband energy sum SBsn per unit time. The unit time subband energy SBsn per second can be obtained, for example, by the following equation (3).

Here, αk is a weight for subband k, and sbk is the energy of subband k in a frame.
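Equation (3) itself is not reproduced in this text. The following sketch shows one reading that is consistent with the symbol definitions: the weighted sum SBE of a frame is the sum over subbands k of αk × sbk, and the per-second value is assumed to be the sum of SBE over the frames in that second. Whether the original equation sums or averages over frames is an assumption of this sketch.

```python
def weighted_subband_energy(subband_energies: list[float], weights: list[float]) -> float:
    """SBE for one frame: sum over k of alpha_k * sb_k."""
    if len(subband_energies) != len(weights):
        raise ValueError("one weight is needed per subband")
    return sum(a * e for a, e in zip(weights, subband_energies))


def unit_time_subband_energy(frames: list[list[float]], weights: list[float]) -> float:
    """SBsn for one second: SBE summed over the frames in that second (assumed)."""
    return sum(weighted_subband_energy(frame, weights) for frame in frames)
```

The weights let particular bands, for example those carrying most speech or music energy, contribute more to the total; the choice of weights is left open by the text.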
[0027]
Next, if the unit time density determination unit 264 finds that shot Sn has a unit time density Dsn exceeding a threshold THD, processing moves to the unit time subband energy determination unit 265. If a unit time subband energy SBsn exceeding a threshold THSB exists, the unit time subband energy determination unit 265 judges the shot Sn containing that audio to be a candidate for the summary information, and processing leaves the high level sound analysis routine and moves to the motion activity evaluation unit 3.
[0028]
On the other hand, if the unit time density Dsn of audio at or above the level THHL is below the threshold THD, or if the unit time subband energy SBsn is below the threshold THSB, the shot Sn containing that audio is excluded from the candidates for the summary information and processing moves to the input of the next shot Sn+1.
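The two-stage determination of the high level sound analysis can be sketched as follows. The inputs are assumed to be lists of per-second values for one shot, and both tests are read as "at least one second exceeds the threshold"; the text states this explicitly only for the subband energy, so the same reading for the density is an assumption.

```python
def passes_high_level_check(densities: list[float], energies: list[float],
                            thd: float, thsb: float) -> bool:
    """Return True if the shot remains a candidate after high level sound analysis.

    densities: unit time densities Dsn, one per second of the shot.
    energies: unit time subband energies SBsn, one per second of the shot.
    """
    if not any(d > thd for d in densities):
        return False  # density test fails: shot excluded
    # The energy test is reached only when the density test has passed.
    return any(e > thsb for e in energies)
```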
[0029]
In the configuration of FIG. 3, all of the silence analysis unit 22, the low level sound analysis unit 24, and the high level sound analysis unit 26 are not necessarily required, and at least one may be provided.
[0030]
In the motion activity evaluation unit 3, the motion vector extraction unit 31 of FIG. 5 extracts the motion vector information MV of all frames belonging to the shot Sn input from the audio level evaluation unit 2. The motion activity calculation unit 32 then uses the motion vector information MV to calculate the motion activity MASn of the shot Sn as a whole, and from it the unit time motion activity calculation unit 33 calculates the motion activity MA per unit time (for example, 1 second or 1 frame). For MPEG-encoded video, the processing in the motion activity calculation unit 32 and the unit time motion activity calculation unit 33 can be expressed, for example, as follows. Motion activity per second is assumed here.

Here, ASMV is the sum within a frame of the absolute values of motion vectors whose magnitude is X or more, NMB is the number of macroblocks in the frame that have a motion vector whose magnitude is X or more, NPSn is the number of predictive-coded frames in the shot, and NVF is the number of video frames per second.
[0031]
The motion activity determination unit 34 compares the obtained motion activity MA with a specified threshold THMA. If MA exceeds THMA, the shot Sn is excluded from the candidates for the summary information SUM and processing moves to the input of the next shot Sn+1. If the motion activity MA of shot Sn per unit time does not exceed the threshold THMA, the shot Sn is kept as a candidate for the summary information SUM.
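The motion activity equations are not reproduced in this text, so the following is only one possible reading built from the symbol definitions: the activity of a frame is taken as ASMV / NMB, and the activity of a shot as the mean of that value over its NPSn predictive-coded frames. How NVF enters the per-second value cannot be recovered from the text and is left out. All function names are hypothetical.

```python
import math


def frame_activity(motion_vectors: list[tuple[float, float]], min_magnitude: float) -> float:
    """ASMV / NMB for one frame, using only vectors with magnitude >= X."""
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    large = [m for m in magnitudes if m >= min_magnitude]
    return sum(large) / len(large) if large else 0.0


def shot_activity(frames: list[list[tuple[float, float]]], min_magnitude: float) -> float:
    """Mean frame activity over the predictive-coded frames of a shot (assumed form)."""
    if not frames:
        return 0.0
    return sum(frame_activity(f, min_magnitude) for f in frames) / len(frames)


def keep_shot(motion_activity: float, thma: float) -> bool:
    """As stated in the text, a shot is excluded when MA exceeds THMA."""
    return motion_activity <= thma
```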
[0032]
Through the above processing, each shot S selected as a candidate for the summary information SUM in a section SAV, obtained by dividing the audio video information AV at equal intervals, is input to the summary information registration unit 4, and the information on shot S is stored in the shot memory 41 of FIG. 6. The determination unit 42 then determines whether all shots in the section SAV have been processed; if not, the next shot is processed.
[0033]
When all shots in the section SAV have been processed, the summary information length determination unit 43 determines whether the total SSL of the shot lengths of all shots S in the section SAV is sufficiently close to the average summary information length per section, MSL (= SL / NP). If SSL and MSL are considered sufficiently close, the summary information extraction for the section SAV ends and time information and the like are registered by summary information registration 44. Processing then moves to the input of the next section SAVn+1, and the feature values of the representative shots registered by representative shot registration 1B are cleared. If SSL and MSL are not considered close, the threshold changing unit 47 changes some or all of the thresholds used to judge candidates for the summary information SUM, and the processing from the shot length evaluation unit 16 through the motion activity evaluation unit 3 is repeated recursively on the candidates extracted so far until SSL and MSL are sufficiently close.
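The length control loop can be sketched as follows. To keep the sketch short, only the shot length threshold T is adjusted, and it is raised by a fixed step each time; the text allows any subset of the thresholds to be changed and does not say how. Because the loop re-evaluates only the candidates extracted so far, it can only shorten the total, so it stops once SSL no longer exceeds MSL by more than the tolerance. The names `fit_to_target`, `tolerance_s`, and `step_s` are hypothetical.

```python
def fit_to_target(candidate_lengths_s: list[float], target_len_s: float,
                  tolerance_s: float, min_len_s: float, step_s: float,
                  max_iters: int = 100) -> tuple[list[float], float]:
    """Tighten the shot length threshold until SSL is close enough to MSL.

    candidate_lengths_s: lengths of the shots kept so far in one section.
    target_len_s: average summary length per section, MSL = SL / NP.
    Returns the remaining shot lengths and the final threshold T.
    """
    kept = list(candidate_lengths_s)
    for _ in range(max_iters):
        ssl = sum(kept)
        if ssl - target_len_s <= tolerance_s:
            break  # close enough, or already shorter than the target
        min_len_s += step_s  # threshold change (fixed step is an assumption)
        kept = [s for s in kept if s > min_len_s]  # re-run the length test
    return kept, min_len_s
```

With candidates of 2, 3, 4, 6, and 10 seconds, a target of 15 seconds (SL / NP = 300 / 20 in the earlier example), a tolerance of 1 second, and a starting threshold of 1 second raised in 1 second steps, the loop ends with the 6 and 10 second shots and a threshold of 4 seconds.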
[0034]
These processes are performed for all the sections SAV; finally, the summary information SUM of the audio video information AV is obtained and registered. At this time, if description of the summary information is specified, the process proceeds to the summary information description unit 5; if not, the process is terminated.
[0035]
As shown in FIG. 7, in the summary information description unit 5, the summary information description unit 51 describes at least the time information of all the shots S extracted as the summary information SUM, and the summary information output unit 52 outputs it as a summary information description file. As the description format, for example, a format defined by MPEG-7 can be used. When the description is finished for all the sections SAV, all the processes are finished. Further, all the shots S extracted as the summary information SUM can be combined and saved as a separate file. At this time, they can be saved as a single audio video file, or audio and video can be saved separately.
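As a rough illustration of writing the time information to a description file, the sketch below serializes start times and durations as XML. The element names (`Summary`, `Segment`, `StartTime`, `Duration`) are invented for this example and are not the MPEG-7 schema; an actual implementation would follow the description schemes defined by the standard.

```python
import xml.etree.ElementTree as ET


def describe_summary(shots):
    """Serialize (start, end) times in seconds; element names are made up."""
    root = ET.Element("Summary")
    for start, end in shots:
        seg = ET.SubElement(root, "Segment")
        ET.SubElement(seg, "StartTime").text = f"{start:.2f}"
        ET.SubElement(seg, "Duration").text = f"{end - start:.2f}"
    return ET.tostring(root, encoding="unicode")


# Two hypothetical summary shots given as (start, end) in seconds.
xml_text = describe_summary([(12.0, 15.5), (40.0, 42.0)])
print(xml_text)
```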
[0036]
Next, the functions of the audio video summary information extraction device of the above-described embodiment can be realized by software (a program). The software can be recorded on a recording medium such as an optical disk, a flexible disk, or a hard disk.
[0037]
FIG. 8 shows an example of a program recorded on the recording medium 100. The recording medium 100 records a compressed audio video information content analysis function 111, an audio level evaluation function 112, a motion activity evaluation function 113, a summary information registration function 114, and a summary information description function 115. The motion activity evaluation function 113 may be omitted.
[0038]
The content analysis function 111 can be composed of at least one of a function of dividing video information into shots, a function of dividing input content into equal-interval sections according to a certain criterion, a function of evaluating the length of the shots included in each equal-interval section, and a function of determining repetitive shots. The audio level evaluation function 112 can be composed of a function of determining silence, a function of determining low level sound, and a function of determining high level sound.
[0039]
The motion activity evaluation function 113 has a function of extracting motion vector data of frames belonging to the shot, a function of calculating motion activity from the extracted motion vector data, a function of calculating motion activity per unit time, and a function of determining summary information candidates using the motion activity per unit time.
[0040]
In addition, the extracted summary information can be combined as audio video or audio and video individually, and the combined summary information can be recorded on the recording medium 100 as a separate file.
[0041]
The recording medium 100 includes a transmission medium that temporarily records and holds data, such as a network.
[0042]
FIG. 9 is a block diagram showing a configuration of an audio summary information extraction apparatus according to the second embodiment of the present invention.
[0043]
First, when compressed audio information CA is input, the subband data extraction unit A1 extracts the subband data SD. The extracted subband data SD is sent to the high level sound evaluation unit A3. The operation of the subband data extraction unit A1 is the same as that of the subband data extraction unit 51 in the silence / low level sound evaluation unit 5 shown in the first embodiment. On the other hand, when uncompressed audio information UA is input, the subband analysis unit A2 performs subband analysis on the input audio, and the resulting subband data SD is likewise sent to the high level sound evaluation unit A3.
[0044]
The high level sound evaluation unit A3 has the same functions as the subband energy sum calculation unit 262 and the unit time subband energy sum calculation unit 263 included in the high level sound analysis unit 26 shown in the first embodiment. The subband energy sum calculation unit A31 and the unit time subband energy sum calculation unit A32 in FIG. 10 calculate, respectively, the subband energy sum SBE and the unit time subband energy sum SB from the input subband data SD.
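The two calculations can be sketched as follows. The formulas are assumptions made for illustration: energy is taken as the squared subband sample value, each band has a weight as in the band-weighted sum mentioned in the claims, and a unit of time is a fixed number of audio frames. The function names and sample values are hypothetical.

```python
def subband_energy_sum(frame, weights):
    """SBE of one audio frame: band-weighted sum of squared subband samples."""
    return sum(w * sum(x * x for x in band) for w, band in zip(weights, frame))


def unit_time_energy(frames, weights, frames_per_unit):
    """SB: SBE values accumulated over consecutive units of time."""
    sbe = [subband_energy_sum(f, weights) for f in frames]
    return [sum(sbe[i:i + frames_per_unit])
            for i in range(0, len(sbe), frames_per_unit)]


# Four hypothetical frames, each with two subbands of two samples.
frames = [
    [[1, 1], [2, 0]],
    [[0, 0], [2, 2]],
    [[3, 0], [0, 0]],
    [[1, 0], [0, 2]],
]
weights = [1.0, 0.5]  # assumed per-band weights
sb = unit_time_energy(frames, weights, frames_per_unit=2)
print(sb)
```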
[0045]
Next, the summary information start time determination unit A33 determines the time position at which the unit time subband energy sum SB is maximized as the summary information start time T_start. Further, the summary information end time determination unit A34 determines the time position at which the unit time subband energy sum SB is α times the maximum value (0 < α < 1) as the summary information end time T_end. At this time, T_start < T_end.
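A minimal sketch of the start and end time determination is given below. It assumes that the end time is searched forward from the peak and is the first unit of time at which SB has fallen to α times the maximum or less, and that the last unit is used if that level is never reached. These choices, the function name, and the sample values are illustrative assumptions rather than details stated in the specification.

```python
def summary_interval(sb, alpha, unit_seconds=1.0):
    """Return (T_start, T_end) in seconds from unit-time energy sums SB."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    # T_start: position of the maximum unit-time subband energy sum.
    peak_index = max(range(len(sb)), key=lambda i: sb[i])
    target = alpha * sb[peak_index]
    end_index = len(sb) - 1  # assumed fallback: last unit of the audio
    # T_end: first later position where SB has dropped to alpha * maximum.
    for i in range(peak_index + 1, len(sb)):
        if sb[i] <= target:
            end_index = i
            break
    return peak_index * unit_seconds, end_index * unit_seconds


sb_values = [2.0, 10.0, 8.0, 6.0, 4.0, 1.0]  # hypothetical SB per second
t_start, t_end = summary_interval(sb_values, alpha=0.5)
print(t_start, t_end)
```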
[0046]
The summary information registration unit A4 registers the summary information based on the summary information start time T_start and the summary information end time T_end determined by the high level sound evaluation unit A3. Then, at least the time information is described by the summary information description unit A5 and output as a summary information description file. As the description format, for example, a format defined by MPEG-7 can be used.
[0047]
If there are multiple pieces of audio information, the above process is performed on all of them.
[0048]
The function of the audio summary information extraction device of the above-described embodiment can be realized by software (program), and the software can be recorded on a recording medium 100 such as an optical disk, a floppy disk, or a hard disk. The extracted audio summary information can be recorded on the recording medium 100 as an individual file.
[0049]
FIG. 11 shows an example of a program recorded on the recording medium 100. The recording medium 100 records a subband data extraction function 121 and/or a subband analysis function 122, a high level sound evaluation function 123, a summary information registration function 124, and a summary information description function 125.
[0050]
FIG. 12 is a block diagram showing an embodiment of an audio / video outline information reproducing apparatus according to the present invention.
[0051]
When the audio video summary information SUM extracted by the above means is input, the audio video separation unit P1 separates the summary information into the video element VSUM and the audio element ASUM. Next, the video speed conversion unit P2 converts the video playback speed by spatially thinning out the video element VSUM in accordance with the conversion speed parameter SP given from the outside. Similarly, the audio speed conversion unit P3 converts the audio playback speed by thinning out the audio element ASUM in time, at the same rate as the video element VSUM, in accordance with the conversion speed parameter SP. Audio playback speed conversion can be realized by, for example, periodic skipping of audio frames, or a combination of repeated playback and periodic skipping. Here, to make the audio 1.5 times faster, the former is achieved by playing two consecutive frames and skipping the next frame:
<Frame numbers to play> 1, 2, 4, 5, 7, 8, 10, 11, ...
The latter is achieved by playing the same frame twice and skipping the next two frames:
<Frame numbers to play> 1, 1, 4, 4, 7, 7, 10, 10, ...
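The two frame schedules for 1.5 times speed can be generated as in the following sketch. The generator names are hypothetical; the frame numbering starts at 1 as in the text.

```python
from itertools import count, islice


def skip_schedule():
    """Play two consecutive frames, then skip one."""
    for n in count(1):
        if n % 3 != 0:
            yield n


def repeat_skip_schedule():
    """Play one frame twice, then skip the next two frames."""
    for n in count(1, 3):
        yield n
        yield n


first = list(islice(skip_schedule(), 8))
second = list(islice(repeat_skip_schedule(), 8))
print(first)   # [1, 2, 4, 5, 7, 8, 10, 11]
print(second)  # [1, 1, 4, 4, 7, 7, 10, 10]
```

In both schedules, two frame slots are output for every three source frames, which gives the 1.5 times speed-up.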
[0052]
The speed-converted video element VSUM′ and audio element ASUM′ are input to the audio video multiplexing / synchronizing unit P4, where they are multiplexed and synchronized to obtain the speed-converted audio video summary information SUM′. The obtained audio video summary information SUM′ is displayed and reproduced.
[0053]
[Effects of the Invention]
As is apparent from the above description, according to the present invention, summary information for quickly and efficiently grasping the contents of uncompressed or compressed audio information, or of compressed audio video information, can be extracted. In addition, this extraction enables high-speed browsing of audio video information.
[0054]
In addition, the length of the extracted summary information can be arbitrarily specified, and at the same time, the summary information is uniformly extracted from the audio video information, so that it is possible to efficiently grasp the entire content.
[0055]
Also, by describing the time information included in the summary information, the summary can be described as a feature of the corresponding audio video information, so the invention can be applied to MPEG-7, the standard for content description.
[0056]
In addition, it is possible to provide advanced playback of the summary information, such as converting the display speed of the extracted summary information.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of an embodiment of the present invention.
FIG. 2 is a block diagram showing a detailed configuration of the content analysis unit in FIG. 1.
FIG. 3 is a block diagram showing a detailed configuration of the audio level evaluation unit in FIG. 1.
FIG. 4 is a block diagram showing a detailed configuration of the high level sound analysis unit in FIG. 3.
FIG. 5 is a block diagram showing a detailed configuration of the motion activity evaluation unit in FIG. 1.
FIG. 6 is a block diagram showing a detailed configuration of the summary information registration unit in FIG. 1.
FIG. 7 is a block diagram showing a detailed configuration of the summary information description unit in FIG. 1.
FIG. 8 is a diagram showing an outline of a program recorded on a recording medium.
FIG. 9 is a block diagram showing a configuration of an audio summary information extraction device according to another embodiment of the present invention.
FIG. 10 is a block diagram showing a detailed configuration of the high level sound evaluation unit in FIG. 9.
FIG. 11 is a diagram showing an outline of a program recorded on a recording medium.
FIG. 12 is a block diagram showing a configuration of an audio video summary information reproducing device according to another embodiment of the present invention.
[Explanation of symbols]
1 ... content analysis unit, 2 ... audio level evaluation unit, 3 ... motion activity evaluation unit, 4 ... summary information registration unit, 5 ... summary information description unit, A1 ... subband data extraction unit, A2 ... subband analysis unit, A3 ... high level sound evaluation unit, A4 ... summary information registration unit, A5 ... summary information description unit, P1 ... audio video separation unit, P2 ... video speed conversion unit, P3 ... audio speed conversion unit, P4 ... audio video multiplexing / synchronizing unit, 100 ... recording medium.

Claims

An audio video summary information extraction device comprising:
content analysis means for analyzing the temporal structure of the video information of input audio video content;
audio level evaluation means for evaluating the audio level of the audio information accompanying the video information whose temporal structure has been analyzed; and
summary information registration means for registering the audio video summary information evaluated by the audio level evaluation means,
wherein the content analysis means comprises means for dividing the video information into shots, means for evaluating the length of each shot, and means for determining, as a repetitive shot, a shot having a representative frame or feature value similar to the representative frame or feature value of a shot,
the audio level evaluation means comprises at least one of means for determining silence and means for determining a low level sound, and
the shot is excluded from the audio video summary information when the content analysis means determines that the length of the shot is smaller than a predetermined threshold or that the shot is a repetitive shot, or when the audio level evaluation means evaluates the audio level of the shot as silence or a low level sound.

The audio video summary information extraction device according to claim 1,
wherein the content analysis means further comprises means for dividing the input content into equal-interval sections by a number of divisions obtained from the total number of shots included in the video information, the externally specified length of the summary information, and the length of the input content, and
the exclusion determination is performed for each equal-interval section.

The audio video summary information extraction device according to claim 1 or 2,
wherein the means for determining the repetitive shot comprises means for determining a representative frame of each divided shot, means for extracting a feature value of the representative frame, and means for performing a similar image search by comparing the feature value with the feature values of other representative frames, and
the shot is determined to be a repetitive shot when the similar image search judges it to be a similar image.

The audio video summary information extraction device according to claim 1 or 2,
wherein the means for determining the repetitive shot comprises means for extracting a feature value of each divided shot, and means for performing a similar shot search by comparing the feature value with the feature values of other shots, and
the shot is determined to be a repetitive shot when the similar shot search judges it to be a similar image.

The audio video summary information extraction device according to claim 1 or 2,
further comprising means for determining a high level sound, wherein the means for determining the high level sound comprises means for extracting subband data from compressed audio information, means for estimating the audio level from the subband data, and means for calculating a time density of the audio level from the audio levels exceeding a predetermined threshold, and
a shot having an audio level time density exceeding the threshold is determined to be a candidate for the audio video summary information, and is adopted as the audio video summary information when the total length of the audio video summary information is determined to be close to the externally specified summary information length.

The audio video summary information extraction device according to claim 1 or 2,
further comprising means for determining a high level sound, wherein the means for determining the high level sound comprises means for extracting subband data from compressed audio information, means for calculating, from the subband data, a sum of subband energies weighted by band, and means for calculating the subband energy sum per unit time, and
when the subband energy sum per unit time exceeds a predetermined threshold, a shot containing the corresponding audio information is determined to be a candidate for the summary information, and is adopted as the audio video summary information when the total length of the audio video summary information is determined to be close to the externally specified summary information length.

The audio video summary information extraction device according to claim 5 or 6,
further comprising motion activity evaluation means for evaluating motion within an image,
wherein the motion activity evaluation means comprises means for extracting motion vector data of frames belonging to the shot, means for calculating a motion activity from the extracted motion vector data, and means for calculating the motion activity per unit time, and
the candidate is adopted as the audio video summary information when the total length of the audio video summary information is determined to be close to the externally specified summary information length.

The audio video summary information extraction device according to claim 2, 5, 6, or 7,
wherein the determination process is repeated until the externally specified summary information length is reached, by recursively controlling the threshold that serves as the determination criterion for the summary information.

The audio video summary information extraction device according to any one of claims 1 to 8,
comprising means for describing, as time information of the extracted summary information, at least the start time and end time, or the start time and duration, of the summary information.

The audio video summary information extraction device according to any one of claims 1 to 8, comprising:
means for combining the extracted summary information as audio video, or combining audio and video separately; and
means for outputting and saving the combined summary information as a separate file.

The audio video summary information extraction device according to any one of claims 1 to 8, comprising:
means for extracting the first frame of a shot or the representative frame of a shot as the video element of the extracted summary information; and
means for extracting the audio information accompanying the shot as the audio element of the summary information.

The audio video summary information extraction device according to claim 11, comprising:
means for describing the time position within the content as time information of the extracted video element; and
means for describing, as time information of the extracted audio element, at least the start time and end time within the content, or the start time and duration.

A computer-readable recording medium recording a program for extracting audio video summary information, the program causing a computer to function as:
means for dividing the video information of input audio video content into shots;
means for comparing the length of each shot with a predetermined threshold;
means for determining, as a repetitive shot, a shot having a representative frame or feature value similar to the representative frame or feature value of the shot;
means for determining, among the shots, shots whose audio level is silence and shots whose audio level is a low level sound;
means for excluding a shot from the audio video summary information when the length of the shot is smaller than the threshold, when the shot is determined to be a repetitive shot, or when the audio level of the shot is evaluated as silence or a low level sound; and
means for registering the shots that are not excluded as audio video summary information.