JP3999674B2

JP3999674B2 - Similar voice music search device, similar voice music search program, and recording medium for the program

Info

Publication number: JP3999674B2
Application number: JP2003008083A
Authority: JP
Inventors: 啓敏須賀
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-01-16
Filing date: 2003-01-16
Publication date: 2007-10-31
Anticipated expiration: 2023-01-16
Also published as: JP2004219804A

Description

【０００１】
【発明の属する技術分野】
本発明は，音声音楽信号を検索する技術に関し，特に非定常雑音を含む音声音楽信号でも精度よく，高速に検索可能な類似音声音楽検索装置，類似音声音楽検索プログラムおよびそのプログラムの記録媒体に関するものである。
【０００２】
【従来の技術】
音声音楽信号を高速に検索する従来技術（従来手法１）としては，例えば，特許文献１「高速信号探索方法，装置及びその記録媒体」および非特許文献１に示されているような，検索対象中から検索キーと一致する音声音楽信号を検索する技術がある。
【０００３】
また，雑音を含む音声音楽信号を検索する従来技術（従来手法２）として，検索対象の音声音楽信号を時間周波数領域で多数の小領域に分割し，各領域間で検索キーの音声音楽信号との類似度を計算する手法がある（例えば，特許文献２「情報送受信システム及び方法，情報処理装置及び方法」，非特許文献２参照）。この手法では，各領域の類似度を投票法により積算することで，検索対象と検索キーの音声音楽信号中のナレーションなどの突発的で非定常な雑音部分以外の背景音楽などの部分で一致検索を行う。
【０００４】
この他に，類似する音声音楽信号を検索する従来技術（従来手法３）として，一致する音声音楽信号だけでなく類似する音声音楽信号も検索でき，多次元インデックスを用いることで高速な検索ができる手法がある（例えば，非特許文献３参照）。
【０００５】
ここで，後述する本発明の実施の形態で利用している技術が記載された文献として，例えば非特許文献４，非特許文献５，非特許文献６，非特許文献７がある。
【０００６】
【特許文献１】
特許第３０６５３１４号公報
【特許文献２】
特開２００２−１０２３７号公報
【非特許文献１】
柏野邦夫，ガビンスミス，村瀬洋：“ヒストグラム特徴を用いた音響信号の高速探索法−時系列アクティブ探索法−”：電子情報通信学会論文誌，D-1 ，Vol.J82-D-II NO.9 ，pp.1365-1373，1999
【非特許文献２】
阿部素嗣，西口正之：“背景音楽同定のための自己最適化スペクトル相関法”：電子情報通信学会技術報告，PRMU2001-209，pp.25-30，2002
【非特許文献３】
須賀啓敏，寺本純司，片岡良治，芳西崇：“類似音声検索による映像検索”：電子情報通信学会，第１３回データ工学ワークショップ（DEWS2002 ISSN1347-4413）B1-1，2002
【非特許文献４】
鹿野清宏他，：“ＩＴｔｅｘｔ音声認識システム”，オーム社，2001
【非特許文献５】
Lawrence Rabiner，Biing-Hwang Juang 共著，古井貞煕監訳：“音声認識の基礎（上）”，ＮＴＴアドバンステクノロジ株式会社，1995
【非特許文献６】
Norio Katayama and Shin'ichi Satoh：“The SR-tree ：An Index Structure for High-Dimensional Nearest Neighbor Queries”，in Proc.ACM SIGMOD International Conference On Management of Data ，pp.368-380，May 1997
【非特許文献７】
Yasushi Sakurai ，Masatoshi Yoshikawa ，Shunsuke Uemura ，and Haruhiko Kojima ：“A-tree：An Index Structure for High-Dimensional Space Using Relative Approximations ”，In Proc.of the 26th International Conference on Very Large Data Bases (VLDB)，pp.516-526，Cairo ，September 2000
【０００７】
【発明が解決しようとする課題】
〔課題１〕前述した従来手法１では，一致する音声音楽信号を検索するため，信号に雑音が入ると検索できなくなってしまうという問題がある。
〔課題２〕従来手法２では，非定常な雑音が入った音声音楽信号でも検索が可能であるが，類似度計算の計算量が大きく，計算に時間がかかってしまうという問題がある。
〔課題３〕従来手法３では，音声音楽信号に雑音が入っても信号が類似していれば従来手法２よりも高速な検索が可能であるが，雑音がない場合に比べて検索精度は下がってしまうという問題がある。
【０００８】
本発明は，上記問題点の解決を図り，非定常な雑音を含む音声音楽信号を精度良く，高速に検索できる手段を確立することを目的とする。
【０００９】
【課題を解決するための手段】
本発明の類似音声音楽検索装置は，上記課題を解決するため，検索キー音声音楽信号入力手段と，短時間窓音声音楽特徴量抽出手段と，短時間窓音声音楽特徴量類似検索手段と，音声音楽情報比較統合手段と，音声音楽表示出力手段と，検索対象音声音楽信号入力手段と，特徴量の蓄積手段とを備える。
【００１０】
検索キー音声音楽信号入力手段は，検索キーとして数秒間の音声音楽信号を入力する。
【００１１】
短時間窓音声音楽特徴量抽出手段は，短時間窓を少しずつずらしながら，短時間窓長の音声音楽信号を切り出し，そこから短時間窓音声音楽特徴量を抽出する。短時間窓音声音楽特徴量は多次元ベクトルの形で表現される。
【００１２】
短時間窓音声音楽特徴量類似検索手段は，蓄積された短時間窓音声音楽特徴量の中から，検索キーから抽出されたそれぞれの短時間窓音声音楽特徴量に類似するものを検索する。類似度は，多次元ベクトル間の距離が近いものほど類似度が高いとする。なお，この類似度のことを部分類似度と呼ぶことにする。
【００１３】
音声音楽情報比較統合手段は，前記短時間窓音声音楽特徴量類似検索手段による短時間窓音声音楽特徴量ごとの類似検索結果から正解候補音声音楽区間を作成し，その正解候補音声音楽区間全体と検索キー音声音楽信号全体との間の類似度を計算し，その類似度の高い正解候補音声音楽区間のリストを作成する。
【００１４】
この類似度は，例えば，検索キー音声音楽信号中の短時間窓音声音楽特徴量を表す多次元ベクトルと，それに対応する正解候補音声音楽区間中の短時間窓音声音楽特徴量を表す多次元ベクトルとの間の距離を，対応する短時間窓音声音楽特徴量ごとに計算し，それらの距離のうち距離の近いものだけの和をとり，その和が小さいものほど類似度が高いものとする。なお，この類似度を全体類似度と呼ぶこととする。そして，正解候補音声音楽区間リストを全体類似度の高い順に並び替える。
【００１５】
音声音楽表示出力手段は，ディスプレイ等の表示装置に全体類似度の高い順に正解候補音声音楽区間のリストを表示し，マウス等のポインティングデバイスで選択したリスト中の正解候補音声音楽区間の音声音楽信号をスピーカー等で出力する。
【００１６】
検索対象音声音楽信号入力手段は，検索対象となる長時間の音声音楽信号を入力する。
【００１７】
蓄積手段は，検索対象音声音楽信号から抽出された個々の短時間窓音声音楽特徴量または平均短時間窓音声音楽特徴量を蓄積する。また抽出した短時間窓音声音楽特徴量または平均短時間窓音声音楽特徴量から多次元空間インデックスを構成する。
【００１８】
以上の各手段による処理は，コンピュータとソフトウェアプログラムとによって実現することができ，そのプログラムをコンピュータ読み取り可能な記録媒体に記録することも，ネットワークを通して提供することも可能である。
【００１９】
【発明の実施の形態】
本発明の実施の形態を説明するに先立ち，実施の形態の説明中で用いている言葉の意味について簡単に説明する。
【００２０】
「非定常雑音」：ある区間に対して，その全体に渡って入っていない雑音（例えば，雑音としての人の話し声は，息継ぎなどのために音が途切れるので，非定常雑音である）。
【００２１】
「短時間窓」：約２０ミリ秒から４０ミリ秒程度の時間窓。
【００２２】
「短時間窓音声音楽特徴量」：短時間窓長の音声音楽信号から抽出される特徴量。多次元ベクトルで表される。
【００２３】
「検索キー音声音楽信号」：検索キーとして入力される数秒（例えば４秒など）の音声音楽信号。
【００２４】
「検索対象音声音楽信号」：検索対象となる長時間の音声音楽信号（例えば，テレビ番組１週間分，ＣＤ音源１０００曲分等）。
【００２５】
「部分類似度」：多次元ベクトルで表される短時間窓音声音楽特徴量間（または平均短時間窓音声音楽特徴量間）の類似度である。多次元ベクトル間の距離が近いものほど，この類似度は高い。
【００２６】
「全体類似度」：検索キーと検索対象中の検索キーと同じ長さの音声音楽信号との間の類似度である。例えば，検索キー音声音楽信号中の短時間窓音声音楽特徴量を表す多次元ベクトルと，それに対応する検索対象中の検索キーの長さの音声音楽信号から抽出された短時間窓音声音楽特徴量を表す多次元ベクトルとの間の距離を，対応する短時間窓音声音楽特徴量ごとに計算し，それらの距離のうち距離の近いものだけの和をとり，その和の小さいものほどこの類似度が高い。
【００２７】
「正解候補音声音楽区間」：検索キーの短時間窓音声音楽特徴量の検索キー音声音楽信号中での位置と，部分類似度の高い検索対象中の短時間窓音声音楽信号の位置が同じになるように切り出した，検索対象中の音声音楽信号区間。これが全体類似度も高い正解の候補とする。
【００２８】
「平均短時間窓音声音楽特徴量」：短時間窓音声音楽特徴量を時間順に並ぶ複数ごとに平均をとったもの。これを短時間窓音声音楽特徴量の代わりとして扱うことにより，類似検索処理の際に検索回数を少なくでき，高速化が図れる。
【００２９】
以下，図面を用いて本発明の実施の形態を説明する。
【００３０】
〔実施の形態１〕
本実施の形態１では，ＣＤなどから切り出された雑音の入っていない音声音楽信号を検索キーとし，検索対象として用意される長時間のテレビ映像音声などからその音声音楽信号がオリジナルのまま使われている部分や背景音楽として使われている部分を検索する。背景音楽として使われる部分には非定常な雑音が入っているが，本発明では，そのような雑音が入っている音声音楽信号でも高速に検索が行える。
【００３１】
図１は，本発明の実施の形態における類似音声音楽検索装置の構成例を示す図である。類似音声音楽検索装置１０は，短時間窓音声音楽特徴量抽出部（検索フェーズ）１１，短時間窓音声音楽特徴量類似検索部１２，音声音楽情報比較統合部１３，短時間窓音声音楽特徴量抽出部（蓄積フェーズ）１４，蓄積部１５，記憶部１６とから構成されており，検索キー音声音楽信号入力装置２０，音声音楽表示出力装置２１，検索対象音声音楽信号入力装置２２と接続されている。
【００３２】
類似音声音楽検索装置１０の動作は，検索キーの短時間窓音声音楽特徴量で検索対象の短時間窓音声音楽特徴量を検索することにより類似音声音楽を検索する検索フェーズＰ１と，検索対象の音声音楽信号と短時間窓音声音楽特徴量とを蓄積する蓄積フェーズＰ２からなる。
【００３３】
図２は，本実施の形態における類似音声音楽検索処理フローチャートである。この例では，検索キー入力処理ステップＳ１０において，ＣＤなどの雑音の入っていない音源を入力する。そこから数秒程度の音声音楽信号を検索キーとして切り出す処理を行い，検索キー音声音楽信号を得る。
【００３４】
次に，特徴量抽出処理ステップＳ２０において，短時間窓音声音楽特徴量抽出部（検索フェーズ）１１は，約２０ミリ秒から４０ミリ秒程度の短時間窓を少しずつずらしながら，検索キー入力処理ステップＳ１０で得られた検索キー音声音楽信号から音声音楽信号を切り出し，その切り出した音声音楽信号から短時間窓音声音楽特徴量を抽出する。
【００３５】
ここで，短時間窓音声音楽特徴量としては，例えば，非特許文献４に述べられているメル周波数ケプストラム係数や，フィルタバンク分析による各帯域の音声パワーや，非特許文献５に述べられている重み付きケプストラム係数等を用いることができる。なお，短時間窓音声音楽特徴量は，多次元ベクトルとして表される。
【００３６】
類似検索処理ステップＳ３０による検索のために，あらかじめ短時間窓音声音楽特徴量抽出部（蓄積フェーズ）１４が，長時間の検索対象音声音楽信号から上記特徴量抽出処理ステップＳ２０の特徴量抽出処理と同様にして短時間窓音声音楽特徴量を抽出し，蓄積部１５が，抽出された短時間窓音声音楽特徴量を記憶部１６に蓄積しておく。また，それらの短時間窓音声音楽特徴量から，非特許文献６に述べられているＳＲ−ｔｒｅｅや，非特許文献７に述べられているＡ−ｔｒｅｅなどの多次元空間インデックスを構成しておく。
【００３７】
類似検索処理ステップＳ３０において，短時間窓音声音楽特徴量類似検索部１２は，検索キーから抽出された個々の短時間窓音声音楽特徴量を入力し，それぞれに類似する検索対象中の短時間窓音声音楽特徴量を，多次元空間インデックスを使って高速に検索する。検索キーの短時間窓音声音楽特徴量ごとに，部分類似度の高い検索対象中の短時間窓音声音楽特徴量のリストを作成する。
【００３８】
部分類似度は，短時間窓音声音楽特徴量を表す多次元ベクトル間の距離が近いほど高いものとする。なお，多次元空間インデックスを使うことで，使わない場合と比較した時に約１０倍高速に検索できていることが確認されている。
【００３９】
続いて，比較統合処理ステップＳ４０に進む。図３は，本実施の形態における比較統合処理フローチャートである。本実施の形態１における音声音楽情報比較統合部１３による比較統合処理ステップＳ４０は，図３のフローチャートを用いて詳細に説明する。
【００４０】
ステップＳ４１０において，類似検索処理ステップＳ３０で得られた類似検索の結果の部分類似度の高い短時間窓音声音楽特徴量のリストを入力し，検索キーの短時間窓音声音楽特徴量の位置と，対応する部分類似度の高い検索対象中の短時間窓音声音楽特徴量の位置とが同じ位置になるように合わせ，検索対象音声音楽信号から検索キーと同一の長さの音声音楽信号を切り出して正解候補音声音楽区間を作成する。これを入力されたすべての部分類似度の高い短時間窓音声音楽特徴量について行い，正解候補音声音楽区間のリストを作成する。
【００４１】
図４は，上記ステップＳ４１０の処理における検索対象からの正解候補音声音楽区間の切り出しを説明する図である。検索キー音声音楽信号における０，１，…，９および検索対象音声音楽信号におけるａ，ｂ，…は，それぞれ短時間窓音声音楽特徴量を表している。まず，図４（Ａ）のように，検索キー音声音楽信号中の短時間窓音声音楽特徴量の位置と，類似度が高い検索対象音声音楽信号中の短時間窓音声音楽特徴量の位置とを合わせる。図４（Ａ）の例では，検索キー短時間窓音声音楽特徴量「４」と検索対象の短時間窓音声音楽特徴量「ｈ」との類似度が高いので，その位置を合わせる。
【００４２】
次に，図４（Ｂ）のように，検索対象音声音楽信号から，検索キー音声音楽信号と同じ長さの区間を正解候補音声音楽区間として切り出す。図４（Ｂ）の例では，検索対象音声音楽信号（「ａ」〜…）から，検索キー音声音楽信号（「０」〜「９」）と同じ長さの区間（「ｄ」〜「ｍ」）が正解候補音声音楽区間として切り出される。
【００４３】
次に，図３のステップＳ４２０において，正解候補音声音楽区間のリストを入力し，そのリスト中の最上位にある正解候補音声音楽区間中の短時間窓音声音楽特徴量を読み込む。また，ステップＳ４３０において，ステップＳ４２０で読み込まれた正解候補音声音楽区間のリストの最上位の正解候補音声音楽区間をリストから削除する。
【００４４】
続いて，ステップＳ４４０において，ステップＳ４２０で読み込まれた正解候補音声音楽区間の全体の短時間窓音声音楽特徴量を入力し，それと検索キー全体の短時間窓音声音楽特徴量との全体類似度を計算をする。音声音楽情報比較統合部１３は，この正解候補音声音楽区間と全体類似度の組を蓄積部１５に出力し，蓄積部１５はそれらを記憶部１６に保持する。
【００４５】
全体類似度の計算方法としては，例えば次のような方法を用いることができる。検索キー音声音楽信号中の短時間窓音声音楽特徴量を表す多次元ベクトルと，それに対応する正解候補音声音楽区間中の短時間窓音声音楽特徴量を表す多次元ベクトルとの間の距離を，対応する短時間窓音声音楽特徴量ごとに計算し，それらの距離のうち距離の近いものの上位何個かの和をとり，その和が小さいものほど全体類似度が高いものとする。
【００４６】
すなわち，例えば検索キー音声音楽信号から短時間窓で切り出した音声音楽信号が３００個である場合に，検索キーと正解候補音声音楽区間との間において，短時間窓音声音楽特徴量を表す多次元ベクトル間の距離を，対応する短時間窓音声音楽特徴量ごとに計算し，それらの距離のうち距離が近い値の上位１００個だけの和を検索キーと正解候補音声音楽区間との距離とし，その距離が近いものほど全体類似度が高いものであると定義する。
【００４７】
これにより，雑音が入っていない部分や雑音の影響が少ない部分だけを扱って全体類似度の計算ができるため，非定常な雑音の影響を低減した検索をすることができる。なお，距離の近い上位のもののうち，いくつの距離の和とするかは，あらかじめ設定しておくものとする。上位何個の和を全体類似度して用いるかを，ユーザが設定できるようにするためのＧＵＩ（Graphical User Interface）を設ける実施も好適である。短時間窓音声音楽特徴量を表す多次元ベクトル間の距離が近いもののうちの和をとる個数を，雑音が多いときは少なく，雑音が少ないときは多くすることで，検索精度をさらに向上させることが可能である。
【００４８】
図５に従って全体類似度の計算方法の具体例を説明する。図５の例では，まず，検索キーと正解候補音声音楽区間との部分類似度を計算し，部分類似度の距離が小さい上位６件（「３」，「４」，「５」，「７」，「９」，「１０」）の和を，検索キーと正解候補音声音楽区間との全体類似度の距離としている。これによって，雑音の影響により部分類似度の距離が大きい部分（「１」，「２」，「６」，「８」）を除くことができ，非定常な雑音があっても類似する音声音楽信号を検索することができる。
【００４９】
図３のステップＳ４５０において，正解候補音声音楽区間リストを入力し，このリストがすでに空であれば，ステップＳ４６０に進む。空でなければステップＳ４２０に戻り，同様に処理を繰り返す。
【００５０】
すべての正解候補音声音楽区間について，ステップＳ４２０〜Ｓ４４０の処理が終了し，正解候補音声音楽区間リストが空になったならば，ステップＳ４６０では，ステップＳ４４０において記憶部１６に保持されたすべての正解候補音声音楽区間とその全体類似度の組を蓄積部１５から入力し，それらを全体類似度の高い順に並び替えてリストを作成する。
【００５１】
これらのステップＳ４１０からステップＳ４６０までの処理を行うことで，図２のフローチャートの比較統合処理ステップＳ４０は，類似検索処理ステップＳ３０の類似検索の結果の部分類似度の高い短時間窓音声音楽特徴量のリストを入力し，全体類似度の高い順に並び替えられた正解候補音声音楽区間のリストを出力することができる。
【００５２】
その後，図２の表示出力処理ステップＳ５０において，全体類似度の高い順の正解候補音声音楽区間のリストを，ディスプレイ等の音声音楽表示出力装置２１に出力し，マウス等のポインティングデバイスで選択されたリスト中の正解候補音声音楽区間の音声音楽信号を，スピーカー等の音声音楽表示出力装置２１で出力する。
【００５３】
〔実施の形態２〕
本実施の形態２では，放送されているテレビ映像音声などから非定常な雑音が含まれるような数秒の楽曲の音声音楽信号を逐次的に切り出して検索キーとし，検索対象として用意されているＣＤ等の雑音が含まれない音声音楽信号を格納した音楽データベースから，その雑音が入った楽曲の音声音楽信号と同じ楽曲の同じ部分を検索する。これにより，放送されている映像音声中の雑音が入っているような楽曲部分の楽曲名とそれが楽曲中のどの部分であるかを検索することができる。
【００５４】
本実施の形態２における類似音声音楽検索装置の構成例は，前述した実施の形態１と同様に，図１に示される構成例となる。また，本実施の形態２における類似音声音楽検索処理フローチャートは，前述の実施の形態１と同様に，図２に示されるフローチャートとなる。以下，本実施の形態２について，図２のフローチャートを用いて説明するが，前述した実施の形態１とは，検索キー入力処理ステップＳ１０と表示出力処理ステップＳ５０とが異なる。
【００５５】
特徴量抽出処理ステップＳ２０，類似検索処理ステップＳ３０，比較統合処理ステップＳ４０については，前述した実施の形態１における処理と同様の処理であるので，説明を省略する。
【００５６】
検索キー入力処理ステップＳ１０において，放送中のＴＶ番組の音声などのリアルタイムに流れている音声音楽を入力し，そこから逐次的に数秒程度の音声音楽信号を検索キーとして切り出す処理を行い，検索キー音声音楽信号を得る。
【００５７】
表示出力処理ステップＳ５０では，全体類似度の高い順の正解候補音声音楽区間のリストを，音声音楽表示出力装置２１（ディスプレイ等）に出力し，マウス等のポインティングデバイスで選択されたリスト中の正解候補音声音楽区間の音声音楽信号を，音声音楽表示出力装置２１（スピーカー等）で出力する。本実施の形態２では，この処理を逐次的に繰り返す。これによって，リアルタイムに流れている音声音楽信号に対して，その背景で使われている楽曲を検索することができる。
【００５８】
〔実施の形態３〕
本実施の形態３では，前述した実施の形態１，実施の形態２の検索時間をより高速化するため，類似検索する際に，短時間窓音声音楽特徴量をそのまま使わずに，時間順に並んだ複数個の短時間窓音声音楽特徴量の平均となる平均短時間窓音声音楽特徴量を使って類似検索を行う。平均短時間窓音声音楽特徴量は，それぞれの短時間窓音声音楽特徴量を表す多次元ベクトルの平均ベクトルにより表される。これにより類似検索の回数の削減と検索対象のデータ数が削減されるため，処理の高速化が図れる。
【００５９】
本実施の形態３における類似音声音楽検索装置の構成例は，前述した実施の形態１，実施の形態２と同様に，図１に示される構成例となる。また，本実施の形態３における類似音声音楽検索処理フローチャートは，前述した実施の形態１，実施の形態２と同様に，図２に示されるフローチャートとなる。以下，本実施の形態３について，図２のフローチャートを用いて説明するが，前述した実施の形態１，実施の形態２とは，類似検索処理ステップＳ３０と比較統合処理ステップＳ４０とが異なる。
【００６０】
検索キー入力処理ステップＳ１０，特徴量抽出処理ステップＳ２０，表示出力処理ステップＳ５０については，前述した実施の形態１，実施の形態２における処理と同様の処理であるので，説明を省略する。
【００６１】
図６は，本実施の形態３における類似検索処理フローチャートである。本実施の形態３における類似検索処理ステップＳ３０の処理を，図６のフローチャートを用いて詳細に説明する。
【００６２】
類似検索処理のために，あらかじめ以下のステップＳ３１０〜Ｓ３３０による蓄積フェーズＰ２を実行する。ステップＳ３１０において，短時間窓音声音楽特徴量抽出部（蓄積フェーズ）１４が，検索対象となる長時間の音声音楽信号を入力し，特徴量抽出処理ステップＳ２０と同様にして短時間窓音声音楽特徴量を抽出し，蓄積部１５が，抽出された短時間窓音声音楽特徴量を記憶部１６に蓄積しておく。
【００６３】
ステップＳ３２０において，検索対象音声音楽信号から抽出したすべての短時間窓音声音楽特徴量を入力し，それらの短時間窓音声音楽特徴量の時間順に並んだＫ個分ずつの平均をとって平均短時間窓音声音楽特徴量を作成する。例えば，Ｋ＝６とした場合，時間順に並ぶ６個ずつの短時間窓音声音楽特徴量の平均をとったものを平均短時間窓音声音楽特徴量とする。
【００６４】
ステップＳ３３０において，ステップＳ３２０で作成した平均短時間窓音声音楽特徴量を入力し，それらの短時間窓音声音楽特徴量から，前述した実施の形態１，実施の形態２と同様に，多次元空間インデックスを構築しておく。
【００６５】
検索フェーズＰ１では，ステップＳ３４０において，検索キーの短時間窓音声音楽特徴量の時間順に並んだＫ個分ずつの平均をとり，平均短時間窓音声音楽特徴量を作成する。例えば，Ｋ＝６とした場合，時間順に並ぶ６個ずつの短時間窓音声音楽特徴量の平均をとったものを平均短時間窓音声音楽特徴量とする。
【００６６】
ステップＳ３５０において，短時間窓音声音楽特徴量類似検索部１２は，検索キーの平均短時間窓音声音楽特徴量を入力し，それらの検索キーの平均短時間窓音声音楽特徴量と類似するものを，蓄積されている検索対象の平均短時間窓音声音楽特徴量の中から検索し，検索キーの平均短時間窓音声音楽特徴量ごとに，部分類似度の高い平均短時間窓音声音楽特徴量のリストを作成する。
【００６７】
ここでの部分類似度は，平均短時間窓音声音楽特徴量を表す多次元ベクトル間の距離が近いほど高いものとする。この際に，ステップＳ３３０で構築した多次元空間インデックスを使用することで高速に検索することができる。
【００６８】
また，例えば，Ｋ＝６として短時間窓音声音楽特徴量の６個分の平均を平均短時間窓音声音楽特徴量とすると，前述した実施の形態１，実施の形態２と比較して多次元インデックスを構成するデータ数は６分の１となり，さらに多次元インデックスを用いて行う検索回数も６分の１となることにより，検索の高速化が図られる。
【００６９】
本実施の形態３における比較統合処理ステップＳ４０については，図３に示すフローチャートのステップＳ４１０の処理（正解候補音声音楽区間のリストを作成する処理）だけが前述した実施の形態１，実施の形態２と異なる。ステップＳ４２０からステップＳ４６０までについては，前述した実施の形態１，実施の形態２と同様であるので説明を省略する。
【００７０】
以下，本実施の形態３における平均短時間窓音声音楽特徴量のリストから正解候補音声音楽区間のリストを作成する方法の例を説明するが，平均短時間窓音声音楽特徴量のリストから正解候補音声音楽区間のリストを作成する方法は，以下の例に限られるものではない。
【００７１】
図７は，本実施の形態３における正解候補音声音楽区間リスト作成処理フローチャートである。本実施の形態３における正解候補音声音楽区間のリストを作成する処理（前述した実施の形態１，実施の形態２において，図３のステップＳ４１０に該当する処理）は，図７のフローチャートを用いて詳細に説明する。
【００７２】
ステップＳ４１１において，本実施の形態３における類似検索処理ステップＳ３０の結果である平均短時間窓音声音楽特徴量のリストを入力し，このリストの最上位の平均短時間窓音声音楽特徴量を読み込む。また，ステップＳ４１２において，ステップＳ４１１で読み込んだ平均短時間窓音声音楽特徴量のリストの最上位の平均短時間窓音声音楽特徴量をリストから削除する。
【００７３】
ステップＳ４１３において，ステップＳ４１１で読み込んだ平均短時間窓音声音楽特徴量を入力し，この平均短時間窓音声音楽特徴量の平均をとった元であるＫ個の短時間窓音声音楽特徴量を，記憶部１６から蓄積部１５を介して読み込む。
【００７４】
ステップＳ４１４において，平均をとった元のＫ個の短時間窓音声音楽特徴量ごとに，その平均をとった元の短時間窓音声音楽特徴量の位置が，検索キー中の対応する短時間窓音声音楽特徴量（例えば，平均をとった区間の中央の短時間窓音声音楽特徴量）と同じ位置になるように検索対象音声音楽信号の位置を合わせ，正解候補音声音楽区間を切り出す。切り出された正解候補音声音楽区間は合計でＫ個となる。
【００７５】
ステップＳ４１５において，Ｋ個の正解候補音声音楽区間を入力し，それらＫ個の正解候補音声音楽区間を正解候補音声音楽区間のリストに記載する。
【００７６】
ステップＳ４１６において，平均短時間窓音声音楽特徴量のリストを入力し，そのリストが空でなければＳ４１１に戻り，空になったならば正解候補音声音楽区間のリストを出力する。以上のステップＳ４１１〜Ｓ４１６の処理を，すべての平均短時間窓音声音楽特徴量のリストについて実行する。
【００７７】
図８は，本実施の形態における類似度が高い平均短時間窓音声音楽特徴量から正解候補音声音楽区間を作成する例を説明する図である。図８の例では，短時間窓音声音楽特徴量の３個（Ｋ＝３）の平均を平均短時間窓音声音楽特徴量としている。また，検索キーの対応する短時間窓音声音楽特徴量を，平均をとった区間の中央の短時間窓音声音楽特徴量としている。
【００７８】
図中，「ｓＸ」（Ｘ＝０，１，２，…）は検索キーにおける短時間窓音声音楽特徴量を表し，「Ｍｅａｎ−ｓＸ」（Ｘ＝０，１，２，…）は検索キーにおける平均短時間窓音声音楽特徴量を表す。また，「ｔＸ」（Ｘ＝０，１，２，…）は検索対象における短時間窓音声音楽特徴量を表し，「Ｍｅａｎ−ｔＸ」（Ｘ＝０，１，２，…）は検索対象における平均短時間窓音声音楽特徴量を表す。
【００７９】
図８（Ａ）において，検索キー音声音楽信号の「Ｍｅａｎ−ｓ１」と検索対象音声音楽信号の「Ｍｅａｎ−ｔ３」との間の類似度が高いものとする。「Ｍｅａｎ−ｓ１」の元になっている短時間窓音声音楽特徴量は「ｓ３」，「ｓ４」，「ｓ５」であり，「Ｍｅａｎ−ｔ３」の元になっている短時間窓音声音楽特徴量は「ｔ９」，「ｔ１０」，「ｔ１１」である。検索キーの対応する短時間窓音声音楽特徴量を，平均をとった区間の中央の短時間窓音声音楽特徴量とすると，ここでは「ｓ４」である。
【００８０】
これをもとに正解候補音声音楽区間を切り出す場合，図８（Ｂ）に示すように，検索対象音声音楽信号の「ｔ９」，「ｔ１０」，「ｔ１１」位置を，それぞれ検索キー音声音楽信号の「ｓ４」の位置に合わせて，「ｔ９」，「ｔ１０」，「ｔ１１」ごとに検索キーの長さと同じ長さで音声音楽信号を切り出し，正解候補音声音楽区間を作成する。Ｋ＝３であるので，「ｓ４」の位置に「ｔ９」を合わせたもの，「ｔ１０」を合わせたもの，「ｔ１１」を合わせたものの３つの正解候補音声音楽区間が作成される。
【００８１】
以上の図７，図８によって，平均短時間窓音声音楽特徴量のリストから正解候補音声音楽区間を作成する方法の一例を示したが，これに限られるものではなく，例えば，検索キーの対応する短時間窓音声音楽特徴量を，平均をとった区間の中央の短時間窓音声音楽特徴量ではなく，他のものにすることも可能である。また，例えば図８の例において，作成する正解候補音声音楽区間の数は，Ｋ＝３個に限らず，Ｋ＋２＝５個，Ｋ−１＝１個のように任意の数を設定することも可能である。
【００８２】
【発明の効果】
本発明は，検索キーとそれと同じ長さに切り出された検索対象の音声音楽信号との全体類似度を表す距離を，短時間窓音声音楽特徴量間の部分類似度を表す距離のうち距離の近い上位のものだけの和とすることによって，非定常な雑音の影響を低減した音声音楽信号の類似検索が可能になるという効果を有する（課題１，課題３の解決）。
【００８３】
また，短時間窓音声音楽特徴量間の部分類似度の高いものを検索する際に多次元空間インデックスを用いることにより，高速な検索ができるという効果を有する（課題２の解決）。
【図面の簡単な説明】
【図１】本発明の実施の形態における類似音声音楽検索装置の構成例を示す図である。
【図２】本実施の形態における類似音声音楽検索処理フローチャートである。
【図３】本実施の形態における比較統合処理フローチャートである。
【図４】本実施の形態における検索対象からの正解候補音声音楽区間の切り出しを説明する図である。
【図５】本実施の形態における全体類似度の計算方法を説明する図である。
【図６】本実施の形態における類似検索処理フローチャートである。
【図７】本実施の形態における正解候補音声音楽区間リスト作成処理フローチャートである。
【図８】本実施の形態における類似度が高い平均短時間窓音声音楽特徴量から正解候補音声音楽区間を作成する例を説明する図である。
【符号の説明】
Ｐ１検索フェーズ
Ｐ２蓄積フェーズ
１０類似音声音楽検索装置
１１短時間窓音声音楽特徴量抽出部（検索フェーズ）
１２短時間窓音声音楽特徴量類似検索部
１３音声音楽情報比較統合部
１４短時間窓音声音楽特徴量抽出部（蓄積フェーズ）
１５蓄積部
１６記憶部
２０検索キー音声音楽信号入力装置
２１音声音楽表示出力装置
２２検索対象音声音楽信号入力装置
[0001]
[Technical Field of the Invention]
The present invention relates to a technology for searching audio music signals, and more particularly to a similar audio music search device, a similar audio music search program, and a recording medium for that program, which can search accurately and at high speed even audio music signals that contain non-stationary noise.
[0002]
[Prior art]
As a conventional technique for searching audio music signals at high speed (conventional method 1), there is a technique that searches the search target for an audio music signal matching the search key, as shown, for example, in Patent Document 1, “High-speed signal search method, apparatus and recording medium thereof,” and in Non-Patent Document 1.
[0003]
In addition, as a conventional technique for searching audio music signals that contain noise (conventional method 2), there is a method that divides the audio music signal to be searched into many small regions in the time-frequency domain and calculates, region by region, the similarity to the audio music signal of the search key (see, for example, Patent Document 2, “Information Transmission / Reception System and Method, Information Processing Device and Method,” and Non-Patent Document 2). In this method, the similarities of the individual regions are accumulated by a voting method, so that matching is performed on the parts of the search target and search key signals, such as background music, that lie outside sudden, non-stationary noise such as narration.
[0004]
Further, as a conventional technique for searching for similar audio music signals (conventional method 3), there is a method that can find not only matching audio music signals but also similar ones, and that achieves high-speed search by using a multidimensional index (see, for example, Non-Patent Document 3).
[0005]
Techniques used in the embodiments of the present invention described later are described in, for example, Non-Patent Documents 4, 5, 6, and 7.
[0006]
[Patent Document 1]
Japanese Patent No. 3065314
[Patent Document 2]
JP 2002-10237 A
[Non-Patent Document 1]
Kunio Kashino, Gavin Smith, Hiroshi Murase: "High-speed search method of acoustic signals using histogram features - Time series active search method -": IEICE Transactions, D-1, Vol. J82-D-II, No. 9, pp. 1365-1373, 1999
[Non-Patent Document 2]
Mototsugu Abe, Masayuki Nishiguchi: “Self-optimized spectral correlation method for background music identification”: IEICE Technical Report, PRMU2001-209, pp. 25-30, 2002
[Non-Patent Document 3]
Suga Keitoshi, Teramoto Junji, Kataoka Ryoji, Yoshinishi Takashi: "Video Search by Similar Voice Search": IEICE, 13th Data Engineering Workshop (DEWS2002 ISSN1347-4413) B1-1, 2002
[Non-Patent Document 4]
Kiyohiro Shikano et al.: “IT Text: Speech Recognition System”, Ohmsha, 2001
[Non-Patent Document 5]
Lawrence Rabiner and Biing-Hwang Juang, translation supervised by Sadaoki Furui: “Basics of Speech Recognition (Part 1)”, NTT Advanced Technology Corporation, 1995
[Non-Patent Document 6]
Norio Katayama and Shin'ichi Satoh: “The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries”, in Proc. ACM SIGMOD International Conference On Management of Data, pp. 368-380, May 1997
[Non-Patent Document 7]
Yasushi Sakurai, Masatoshi Yoshikawa, Shunsuke Uemura, and Haruhiko Kojima: “A-tree: An Index Structure for High-Dimensional Space Using Relative Approximations”, in Proc. of the 26th International Conference on Very Large Data Bases (VLDB), pp. 516-526, Cairo, September 2000
[0007]
[Problems to be solved by the invention]
[Problem 1] Conventional method 1 searches for audio music signals that match the search key, so it can no longer find a signal once noise enters it.
[Problem 2] Conventional method 2 can search even audio music signals containing non-stationary noise, but the similarity calculation is computationally expensive and takes a long time.
[Problem 3] Conventional method 3 can search faster than conventional method 2 as long as the signals remain similar even when noise enters the audio music signal, but its search accuracy is lower than in the noise-free case.
[0008]
An object of the present invention is to solve the above problems and to establish a means of searching audio music signals containing non-stationary noise accurately and at high speed.
[0009]
[Means for Solving the Problems]
In order to solve the above problems, the similar audio music search device of the present invention comprises a search key audio music signal input means, a short-time window audio music feature extraction means, a short-time window audio music feature similarity search means, an audio music information comparison and integration means, an audio music display output means, a search target audio music signal input means, and a feature accumulation means.
[0010]
The search key audio music signal input means inputs an audio music signal of several seconds as the search key.
[0011]
The short-time window audio music feature extraction means cuts out audio music signals of the short-time window length while shifting the short-time window little by little, and extracts a short-time window audio music feature from each of them. A short-time window audio music feature is expressed as a multidimensional vector.
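The sliding-window cut-out described above can be sketched as follows. This is a minimal illustration, not the feature extraction of the invention: the function names, the 16 kHz sampling rate, the 25 ms window, and the 10 ms shift are all assumptions, and `toy_feature` is a hypothetical two-dimensional placeholder standing in for features such as mel-frequency cepstrum coefficients.

```python
def frame_signal(samples, window_len, hop_len):
    """Cut frames of window_len samples, shifting the window by hop_len each time."""
    if window_len <= 0 or hop_len <= 0:
        raise ValueError("window_len and hop_len must be positive")
    frames = []
    start = 0
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += hop_len
    return frames


def toy_feature(frame):
    """Placeholder feature vector: [mean energy, zero-crossing rate].

    Hypothetical stand-in only; the text names cepstrum-type features instead.
    """
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return [energy, crossings / max(len(frame) - 1, 1)]


if __name__ == "__main__":
    sample_rate = 16000                      # assumed sampling rate
    window_len = sample_rate * 25 // 1000    # 25 ms window (within 20 to 40 ms)
    hop_len = sample_rate * 10 // 1000       # assumed 10 ms shift
    signal = [((i % 50) - 25) / 25 for i in range(sample_rate)]  # 1 s test tone
    features = [toy_feature(f) for f in frame_signal(signal, window_len, hop_len)]
    print(len(features), "feature vectors of dimension", len(features[0]))
```

Because consecutive windows overlap, a key of a few seconds yields a few hundred feature vectors, which is the order of magnitude used in the later example of 300 cut-out signals.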
[0012]
The short-time window audio music feature similarity search means searches the accumulated short-time window audio music features for those similar to each of the short-time window audio music features extracted from the search key. The shorter the distance between the multidimensional vectors, the higher the similarity is taken to be. This similarity is referred to as the partial similarity.
[0013]
The audio music information comparison and integration means creates correct candidate audio music sections from the similarity search results obtained for each short-time window audio music feature by the short-time window audio music feature similarity search means, calculates the similarity between each entire correct candidate audio music section and the entire search key audio music signal, and creates a list of the correct candidate audio music sections with high similarity.
[0014]
This similarity is obtained, for example, as follows. The distance between the multidimensional vector representing a short-time window audio music feature in the search key audio music signal and the multidimensional vector representing the corresponding short-time window audio music feature in the correct candidate audio music section is calculated for each corresponding pair of features, only the small distances among them are summed, and the smaller the sum, the higher the similarity. This similarity is referred to as the overall similarity. The list of correct candidate audio music sections is then sorted in descending order of overall similarity.
[0015]
The audio music display output means displays the list of correct candidate audio music sections on a display device such as a display in descending order of overall similarity, and outputs through a speaker or the like the audio music signal of the correct candidate audio music section selected from the list with a pointing device such as a mouse.
[0016]
The search target audio music signal input means inputs the long-duration audio music signal to be searched.
[0017]
The storage means stores the individual short-time window audio music feature values or the average short-time window audio music feature values extracted from the search target audio music signal. In addition, a multidimensional spatial index is constructed from the extracted short-time window audio music feature quantity or average short-time window audio music feature quantity.
[0018]
The processing by each means described above can be realized by a computer and a software program, and the program can be recorded on a computer-readable recording medium or provided through a network.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Before the embodiments of the present invention are described, the meanings of the terms used in their description are briefly explained.
[0020]
"Non-stationary noise": Noise that is not present throughout the whole of a given section (for example, a human voice acting as noise is non-stationary noise, because the sound is interrupted by pauses for breath and the like).
[0021]
“Short time window”: A time window of about 20 milliseconds to 40 milliseconds.
[0022]
“Short-time window audio music feature”: A feature extracted from an audio music signal of short-time window length. It is represented as a multidimensional vector.
[0023]
“Search key voice music signal”: a voice music signal of several seconds (for example, 4 seconds) input as a search key.
[0024]
“Search target audio music signal”: The long-duration audio music signal to be searched (for example, one week of TV programs or 1,000 songs from CD sources).
[0025]
“Partial similarity”: similarity between short-time window speech and music feature quantities (or average short-time window speech and music feature quantities) represented by a multidimensional vector. The closer the distance between multidimensional vectors, the higher the similarity.
[0026]
“Overall similarity”: The similarity between the search key and an audio music signal in the search target that has the same length as the search key. For example, the distance between the multidimensional vector representing a short-time window audio music feature in the search key audio music signal and the multidimensional vector representing the corresponding short-time window audio music feature extracted from the search-key-length audio music signal in the search target is calculated for each corresponding pair of features, only the small distances among them are summed, and the smaller the sum, the higher this similarity.
[0027]
“Correct candidate audio music section”: A section of the audio music signal in the search target, cut out so that the position of a short-time window audio music feature of the search key within the search key audio music signal coincides with the position of a short-time window audio music feature in the search target that has high partial similarity to it. Such a section is treated as a candidate for a correct answer whose overall similarity is also high.
[0028]
“Average short-time window audio music feature”: The average taken over each group of several short-time window audio music features that are consecutive in time. Using it in place of the short-time window audio music feature reduces the number of searches performed during similarity search processing and thus speeds up the search.
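As a rough illustration of this definition, the sketch below averages each group of K consecutive feature vectors. The function name `average_features` is hypothetical, and dropping a trailing group with fewer than K vectors is an assumption made here for simplicity.

```python
def average_features(features, k):
    """Average each run of k time-consecutive feature vectors.

    Groups do not overlap, so the output has about len(features) / k vectors.
    A trailing group shorter than k is dropped (an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    averaged = []
    for start in range(0, len(features) - k + 1, k):
        group = features[start:start + k]
        dim = len(group[0])
        averaged.append([sum(v[d] for v in group) / k for d in range(dim)])
    return averaged


if __name__ == "__main__":
    feats = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [1.0, 1.0], [9.0, 9.0]]
    print(average_features(feats, 2))  # the fifth vector has no partner and is dropped
```

With K vectors per group, both the number of indexed vectors and the number of queries shrink by roughly a factor of K, which is where the speed-up comes from.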
[0029]
Embodiments of the present invention will be described below with reference to the drawings.
[0030]
[Embodiment 1]
In the first embodiment, a noise-free audio music signal cut out from a CD or the like is used as the search key, and a long recording of TV video and audio or the like, prepared as the search target, is searched for parts where that audio music signal is used in its original form and parts where it is used as background music. Parts used as background music contain non-stationary noise, but the present invention can search even audio music signals containing such noise at high speed.
[0031]
FIG. 1 is a diagram illustrating a configuration example of the similar audio music search device according to an embodiment of the present invention. The similar audio music search device 10 comprises a short-time window audio music feature extraction unit (search phase) 11, a short-time window audio music feature similarity search unit 12, an audio music information comparison and integration unit 13, a short-time window audio music feature extraction unit (accumulation phase) 14, an accumulation unit 15, and a storage unit 16, and is connected to a search key audio music signal input device 20, an audio music display output device 21, and a search target audio music signal input device 22.
[0032]
The operation of the similar audio music search device 10 consists of a search phase P1, in which similar audio music is found by searching the short-time window audio music features of the search target with those of the search key, and an accumulation phase P2, in which the audio music signal of the search target and its short-time window audio music features are accumulated.
[0033]
FIG. 2 is a flowchart of the similar audio music search process in the present embodiment. In this example, in search key input processing step S10, a noise-free sound source such as a CD is input. An audio music signal of about several seconds is then cut out from it as the search key, yielding the search key audio music signal.
[0034]
Next, in feature extraction processing step S20, the short-time window audio music feature extraction unit (search phase) 11 cuts out audio music signals from the search key audio music signal obtained in search key input processing step S10 while shifting a short-time window of about 20 to 40 milliseconds little by little, and extracts a short-time window audio music feature from each cut-out signal.
[0035]
Here, the short-time window audio music feature can be, for example, the mel-frequency cepstrum coefficients described in Non-Patent Document 4, the audio power of each band obtained by filter bank analysis, or the weighted cepstrum coefficients described in Non-Patent Document 5. Note that the short-time window audio music feature is expressed as a multidimensional vector.
[0036]
In preparation for the search in similarity search processing step S30, the short-time window audio music feature extraction unit (accumulation phase) 14 extracts short-time window audio music features from the long-duration search target audio music signal in advance, in the same way as in feature extraction processing step S20, and the accumulation unit 15 accumulates the extracted features in the storage unit 16. In addition, a multidimensional spatial index such as the SR-tree described in Non-Patent Document 6 or the A-tree described in Non-Patent Document 7 is constructed from these short-time window audio music features.
[0037]
In similarity search processing step S30, the short-time window audio music feature similarity search unit 12 receives the individual short-time window audio music features extracted from the search key and, using the multidimensional spatial index, searches at high speed for the short-time window audio music features in the search target that are similar to each of them. For each short-time window audio music feature of the search key, it creates a list of the short-time window audio music features in the search target that have high partial similarity.
[0038]
It is assumed that the partial similarity is higher as the distance between multidimensional vectors representing the short-time window audio music feature amount is shorter. It has been confirmed that by using a multidimensional spatial index, the search can be performed about 10 times faster than when not using it.
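The per-feature search of step S30 can be sketched as a nearest-neighbor query. The sketch below deliberately uses a plain linear scan rather than an SR-tree or A-tree, so it shows only what the search returns, not how the index accelerates it. The function name, the use of Euclidean distance, and the `top_n` parameter are assumptions; the text only requires that a shorter vector distance mean a higher partial similarity.

```python
import heapq
import math


def partial_similarity_search(key_features, target_features, top_n):
    """For each key feature, list the top_n nearest target features.

    Returns one list per key feature, each holding (distance, target_index)
    pairs sorted nearest first. Linear scan: a stand-in for an indexed search.
    """
    results = []
    for query in key_features:
        scored = (
            (math.dist(query, target), index)
            for index, target in enumerate(target_features)
        )
        results.append(heapq.nsmallest(top_n, scored))
    return results


if __name__ == "__main__":
    key = [[0.0, 0.0], [1.0, 1.0]]
    target = [[3.0, 4.0], [0.0, 1.0], [10.0, 10.0], [1.0, 1.5]]
    for hits in partial_similarity_search(key, target, top_n=2):
        print(hits)
```

A spatial index would return the same lists while visiting far fewer target vectors, which is consistent with the roughly tenfold speed-up reported in the text.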
[0039]
Subsequently, the process proceeds to comparison and integration processing step S40. FIG. 3 is a comparison and integration process flowchart according to the present embodiment. The comparison and integration processing step S40 by the audio and music information comparison and integration unit 13 in the first embodiment will be described in detail with reference to the flowchart of FIG.
[0040]
In step S410, the list of short-time window audio music features with high partial similarity obtained as the result of similarity search processing step S30 is input. The position of a short-time window audio music feature of the search key is aligned with the position of the corresponding short-time window audio music feature in the search target that has high partial similarity, and an audio music signal of the same length as the search key is cut out from the search target audio music signal to create a correct candidate audio music section. This is done for every input short-time window audio music feature with high partial similarity, and a list of correct candidate audio music sections is created.
[0041]
FIG. 4 is a diagram explaining how a correct candidate audio music section is cut out from the search target in the processing of step S410. The symbols 0, 1, …, 9 in the search key audio music signal and a, b, … in the search target audio music signal each represent a short-time window audio music feature. First, as shown in FIG. 4(A), the position of a short-time window audio music feature in the search key audio music signal is aligned with the position of a short-time window audio music feature in the search target audio music signal that has high similarity to it. In the example of FIG. 4(A), the search key feature “4” and the search target feature “h” have high similarity, so their positions are aligned.
[0042]
Next, as shown in FIG. 4(B), a section of the same length as the search key audio music signal is cut out from the search target audio music signal as a correct candidate audio music section. In the example of FIG. 4(B), the section ("d" to "m"), which has the same length as the search key audio music signal ("0" to "9"), is cut out from the search target audio music signal ("a" to ...) as the correct candidate audio music section.
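The alignment in FIG. 4 amounts to simple index arithmetic, sketched below with the figure's own labels. The function name `candidate_section` is hypothetical, and returning `None` when the aligned section would run past either end of the search target is an assumption; the text does not say how such cases are handled.

```python
def candidate_section(key_pos, target_pos, key_len, target_len):
    """Return (start, end) of the target slice that places key_pos on target_pos.

    end is exclusive. Returns None if the slice would fall outside the target
    (boundary handling is an assumption of this sketch).
    """
    start = target_pos - key_pos
    end = start + key_len
    if start < 0 or end > target_len:
        return None
    return start, end


if __name__ == "__main__":
    key = list("0123456789")            # search key features "0" to "9"
    target = list("abcdefghijklmnop")   # search target features "a", "b", ...
    span = candidate_section(
        key_pos=key.index("4"),
        target_pos=target.index("h"),
        key_len=len(key),
        target_len=len(target),
    )
    start, end = span
    print("".join(target[start:end]))   # section running from "d" to "m"
```

Key position 4 lands on target position 7, so the section starts at position 3 ("d") and covers ten features, matching the figure.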
[0043]
Next, in step S420 in FIG. 3, a list of correct candidate speech music sections is input, and the short-time window speech music feature amount in the correct candidate speech music section at the top of the list is read. In step S430, the highest-ranked candidate speech music segment in the list of correct candidate speech music segments read in step S420 is deleted from the list.
[0044]
Subsequently, in step S440, the short-time window audio music features of the entire correct candidate speech music section read in step S420 are input, and the overall similarity between them and the short-time window audio music features of the entire search key is calculated. The audio music information comparison and integration unit 13 outputs the pair of the correct candidate speech music section and its overall similarity to the accumulation unit 15, which stores it in the storage unit 16.
[0045]
The overall similarity can be calculated, for example, as follows. The distance between the multidimensional vector representing a short-time window audio music feature in the search key audio music signal and the multidimensional vector representing the corresponding short-time window audio music feature in the correct candidate speech music section is calculated for each pair of corresponding features, and the sum of a certain number of the smallest of these distances is taken. The smaller the sum, the higher the overall similarity.
[0046]
For example, when 300 short-time windows are cut out of the search key audio music signal, the distance between the multidimensional vectors representing the short-time window audio music features of the search key and of the correct candidate speech music section is calculated for each pair of corresponding features, and the sum of only the 100 smallest of these distances is defined as the distance between the search key and the correct candidate speech music section. The smaller this distance, the higher the overall similarity.
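The calculation described above can be sketched in Python as follows. The function name `overall_distance` and the choice of Euclidean distance are assumptions for illustration; the specification only speaks of a distance between multidimensional vectors.

```python
import math


def overall_distance(key_feats, cand_feats, top_n):
    """Sum of the top_n smallest per-window distances (smaller = more similar)."""
    if len(key_feats) != len(cand_feats):
        raise ValueError("key and candidate must have the same number of windows")
    # Distance for each pair of corresponding short-time window features.
    dists = sorted(math.dist(k, c) for k, c in zip(key_feats, cand_feats))
    # Keep only the smallest distances, ignoring windows dominated by noise.
    return sum(dists[:top_n])


key = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
cand = [(0.0, 0.0), (1.0, 2.0), (9.0, 9.0), (3.0, 3.0)]  # third window hit by noise
print(overall_distance(key, cand, top_n=3))  # prints: 1.0
```

In this toy input the noisy third window contributes a large distance, which is excluded when only the three smallest distances are summed. For the 300-window example in the text, `top_n` would be 100.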
[0047]
As a result, the overall similarity is calculated from only the parts that contain no noise or are little affected by noise, so a search with reduced influence of non-stationary noise is possible. The number of smallest distances to be summed is assumed to be set in advance. It is also preferable to provide a GUI (Graphical User Interface) that lets the user set how many of the smallest distances are summed to obtain the overall similarity. Retrieval accuracy can be further improved by changing the number of smallest distances summed between the multidimensional vectors representing the short-time window audio music features according to whether there is much or little noise.
[0048]
A specific example of the method for calculating the overall similarity will be described with reference to FIG. 5. In the example of FIG. 5, the partial similarities between the search key and the correct candidate speech music section are calculated first, and the sum of the six smallest partial-similarity distances (“3”, “4”, “5”, “7”, “9”, “10”) is taken as the distance representing the overall similarity between the search key and the correct candidate speech music section. In this way, the portions whose partial-similarity distance is large because of noise (“1”, “2”, “6”, “8”) are excluded, and similar audio music signals can be found even in the presence of non-stationary noise.
[0049]
In step S450 of FIG. 3, the correct candidate speech music section list is input, and if this list is already empty, the process proceeds to step S460. If it is not empty, the process returns to step S420 and the process is repeated in the same manner.
[0050]
When steps S420 to S440 have been completed for all correct candidate speech music sections and the list of correct candidate speech music sections is empty, in step S460 all the pairs of a correct candidate speech music section and its overall similarity that were stored in the storage unit 16 in step S440 are input via the accumulation unit 15 and sorted in descending order of overall similarity to create a list.
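The loop over steps S420 to S460 can be sketched as below. This is an illustrative outline only: `rank_candidates`, the `(section_id, features)` tuple layout, and the toy distance function are hypothetical, and an in-memory list stands in for the storage unit 16.

```python
def rank_candidates(candidates, key_feats, distance_fn):
    """candidates: list of (section_id, feats).
    Returns [(section_id, distance)] with the most similar section first."""
    pending = list(candidates)   # working copy of the candidate section list
    results = []                 # stands in for the storage unit 16
    while pending:               # S450: repeat until the list is empty
        section_id, feats = pending.pop(0)  # S420 and S430: read the head, then delete it
        results.append((section_id, distance_fn(key_feats, feats)))  # S440
    # S460: the smallest distance corresponds to the highest overall similarity.
    results.sort(key=lambda r: r[1])
    return results


def toy_distance(a, b):
    """Hypothetical stand-in for the overall-similarity distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


key = [1, 2, 3]
cands = [("A", [1, 2, 9]), ("B", [1, 2, 3]), ("C", [0, 2, 3])]
print(rank_candidates(cands, key, toy_distance))  # prints: [('B', 0), ('C', 1), ('A', 6)]
```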
[0051]
By performing steps S410 to S460, the comparison and integration processing step S40 of the flowchart of FIG. 2 takes as input the list of short-time window audio music features with high partial similarity obtained by the similarity search in the similarity search processing step S30, and outputs a list of correct candidate speech music sections sorted in descending order of overall similarity.
[0052]
Thereafter, in the display output processing step S50 of FIG. 2, the list of correct candidate speech music sections in descending order of overall similarity is output to the audio music display output device 21 (a display or the like), and the audio music signal of a correct candidate speech music section selected from the list with a pointing device such as a mouse is output by the audio music display output device 21 (a speaker or the like).
[0053]
[Embodiment 2]
In the second embodiment, audio music signals of a few seconds of music containing non-stationary noise are sequentially cut out of the audio of a TV broadcast as search keys, and the same part of the same piece of music as the noisy audio music signal is searched for in a music database, prepared as the search target, that stores noise-free audio music signals such as those from CDs. This makes it possible to find the title of the music playing in a noisy music portion of the video and audio being broadcast, and which part of that music it is.
[0054]
The configuration example of the similar speech and music search device according to the second embodiment is the configuration example shown in FIG. 1 as in the first embodiment. Further, the similar voice music search processing flowchart in the second embodiment is the flowchart shown in FIG. 2, as in the first embodiment. Hereinafter, the second embodiment will be described with reference to the flowchart of FIG. 2, but the search key input processing step S10 and the display output processing step S50 are different from the first embodiment described above.
[0055]
Since the feature amount extraction processing step S20, the similarity search processing step S30, and the comparison integration processing step S40 are the same as the processing in the first embodiment described above, description thereof is omitted.
[0056]
In the search key input processing step S10, audio music arriving in real time, such as the sound of a TV program being broadcast, is input, and audio music signals of a few seconds each are sequentially cut out of it to obtain the search key audio music signals.
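The sequential cut-out of search keys can be illustrated with the following Python sketch. It operates on an already buffered sequence of samples rather than a true real-time stream, and the names `sequential_keys`, `key_seconds`, and `hop_seconds` are assumptions; the specification says only that keys of a few seconds are cut out sequentially and does not specify whether or how far consecutive keys overlap.

```python
def sequential_keys(samples, sample_rate, key_seconds, hop_seconds):
    """Yield (start_index, chunk) search keys of key_seconds, one every hop_seconds."""
    key_len = int(key_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    if key_len <= 0 or hop <= 0:
        raise ValueError("key length and hop must be at least one sample")
    # Only full-length keys are produced; a trailing partial chunk is skipped.
    for start in range(0, len(samples) - key_len + 1, hop):
        yield start, samples[start:start + key_len]


# Toy example: 10 samples at 1 sample per second, 4-second keys every 2 seconds.
for start, chunk in sequential_keys(list(range(10)), 1, 4, 2):
    print(start, chunk)
```

Each yielded chunk would then be passed through steps S20 to S50 as an independent search key.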
[0057]
In the display output processing step S50, the list of correct candidate speech music sections in descending order of overall similarity is output to the audio music display output device 21 (a display or the like), and the audio music signal of a correct candidate speech music section selected from the list with a pointing device such as a mouse is output by the audio music display output device 21 (a speaker or the like). In the second embodiment, this process is repeated sequentially. As a result, the music used in the background of audio music arriving in real time can be searched for.
[0058]
[Embodiment 3]
The third embodiment shortens the search time of the first and second embodiments described above. When performing the similarity search, the short-time window audio music features are not used as they are; instead, the search uses average short-time window audio music features, each of which is the average of several short-time window audio music features consecutive in time order. An average short-time window audio music feature is represented by the mean vector of the multidimensional vectors representing the individual short-time window audio music features. This reduces both the number of similarity searches and the amount of data to be searched, so processing is faster.
[0059]
The configuration example of the similar speech and music search device in the third embodiment is the configuration example shown in FIG. 1 as in the first and second embodiments. Further, the similar voice music search processing flowchart according to the third embodiment is the flowchart shown in FIG. 2 as in the first and second embodiments. Hereinafter, the third embodiment will be described with reference to the flowchart of FIG. 2, but the similar search processing step S30 and the comparison integration processing step S40 are different from the above-described first and second embodiments.
[0060]
The search key input processing step S10, the feature amount extraction processing step S20, and the display output processing step S50 are the same as the processing in the first and second embodiments described above, and thus the description thereof is omitted.
[0061]
FIG. 6 is a flowchart of the similarity search processing according to the third embodiment. The similarity search processing step S30 in the third embodiment will be described in detail with reference to the flowchart of FIG. 6.
[0062]
For the similarity search processing, the accumulation phase P2 is executed in advance in steps S310 to S330 below. In step S310, the short-time window audio music feature extraction unit (accumulation phase) 14 inputs the long audio music signal to be searched and extracts short-time window audio music features in the same manner as in the feature amount extraction processing step S20, and the accumulation unit 15 stores the extracted short-time window audio music features in the storage unit 16.
[0063]
In step S320, all short-time window audio music features extracted from the search target audio music signal are input and averaged K at a time in time order to create average short-time window audio music features. For example, when K = 6, the average of six consecutive short-time window audio music features is taken as one average short-time window audio music feature.
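The averaging of step S320 can be sketched as follows. The function name `average_features` is hypothetical, the groups are taken as non-overlapping runs of K features (consistent with the example of FIG. 8), and dropping a trailing group of fewer than K features is an assumption not stated in the specification.

```python
def average_features(feats, k):
    """Average each non-overlapping run of k consecutive feature vectors."""
    if k < 2:
        raise ValueError("K must be an integer of 2 or more")
    means = []
    for i in range(0, len(feats) - k + 1, k):
        group = feats[i:i + k]
        # Component-wise mean vector of the k feature vectors in the group.
        means.append([sum(col) / k for col in zip(*group)])
    return means  # assumed policy: a trailing incomplete group is dropped


feats = [[0, 0], [2, 4], [4, 8], [1, 1], [1, 1], [1, 1], [5, 5]]
print(average_features(feats, 3))  # prints: [[2.0, 4.0], [1.0, 1.0]]
```

The same function would apply to the search key features in the search phase.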
[0064]
In step S330, the average short-time window audio music features created in step S320 are input, and a multidimensional spatial index is built from these average short-time window audio music features in the same manner as in the first and second embodiments.
[0065]
In the search phase P1, in step S340, the short-time window audio music features of the search key are averaged K at a time in time order to create average short-time window audio music features. For example, when K = 6, the average of six consecutive short-time window audio music features is taken as one average short-time window audio music feature.
[0066]
In step S350, the short-time window audio music feature similarity search unit 12 inputs the average short-time window audio music features of the search key, searches the stored average short-time window audio music features of the search target for those similar to each of them, and, for each average short-time window audio music feature of the search key, creates a list of the search-target average short-time window audio music features with high partial similarity.
[0067]
Here, the partial similarity is higher as the distance between the multidimensional vectors representing the average short-time window audio music feature amount is closer. At this time, it is possible to search at high speed by using the multidimensional spatial index constructed in step S330.
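As a rough illustration of the search in step S350, the sketch below finds the stored vectors closest to a query by exhaustive linear scan. This is a stand-in, not the multidimensional spatial index of step S330: it returns the nearest vectors but without the speed-up an index provides. The name `similar_features`, the parameter `top_m`, and the Euclidean distance are assumptions.

```python
import heapq
import math


def similar_features(query, stored, top_m):
    """Return indices of the top_m stored vectors closest to query, nearest first."""
    # Brute-force scan over all stored vectors; an index would avoid visiting them all.
    return heapq.nsmallest(
        top_m, range(len(stored)), key=lambda i: math.dist(query, stored[i])
    )


stored = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (10.0, 10.0)]
print(similar_features((0.9, 0.9), stored, top_m=2))  # prints: [2, 0]
```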
[0068]
Further, for example, when K = 6 and the average of six short-time window audio music features is used as one average short-time window audio music feature, the number of data items making up the multidimensional index is 1/6 of that in the first and second embodiments, and the number of searches performed with the multidimensional index is also 1/6, which speeds up the search.
[0069]
In the comparison and integration processing step S40 of the third embodiment, only step S410 (the process of creating the list of correct candidate speech music sections) in the flowchart of FIG. 3 differs from the first and second embodiments. Steps S420 to S460 are the same as in the first and second embodiments described above, and their description is omitted.
[0070]
In the following, an example of a method for creating a list of correct candidate speech music sections from the list of average short-time window speech music feature values according to the third embodiment will be described. The method of creating a list of speech music segments is not limited to the following example.
[0071]
FIG. 7 is a flowchart of the process of creating the list of correct candidate speech music sections according to the third embodiment. This process (corresponding to step S410 of FIG. 3 in the first and second embodiments described above) will be described in detail with reference to the flowchart of FIG. 7.
[0072]
In step S411, a list of average short-time window sound and music feature values as a result of the similarity search step S30 in the third embodiment is input, and the highest average short-time window sound and music feature value of this list is read. In step S412, the highest average short-time window audio music feature amount in the list of average short-time window audio music feature values read in step S411 is deleted from the list.
[0073]
In step S413, the average short-time window audio music feature read in step S411 is input, and the K short-time window audio music features from which it was averaged are read from the storage unit 16 via the accumulation unit 15.
[0074]
In step S414, for each of the K original short-time window audio music features, the search target audio music signal is positioned so that the original feature coincides with the corresponding short-time window audio music feature in the search key (for example, the feature at the center of the averaged interval), and a correct candidate speech music section is cut out. A total of K correct candidate speech music sections are cut out in this way.
[0075]
In step S415, the K correct candidate speech music sections are input and added to the list of correct candidate speech music sections.
[0076]
In step S416, a list of average short-time window audio music features is input. If the list is not empty, the process returns to S411, and if empty, a list of correct candidate audio music intervals is output. The processes in steps S411 to S416 described above are executed for all the average short time window audio music feature quantity lists.
[0077]
FIG. 8 is a diagram for explaining an example of creating correct candidate speech music sections from an average short-time window audio music feature with high similarity according to the present embodiment. In the example of FIG. 8, the average of three short-time window audio music features (K = 3) is used as an average short-time window audio music feature, and the feature at the center of the averaged interval is used as the corresponding short-time window audio music feature in the search key.
[0078]
In the figure, “sX” (X = 0, 1, 2, ...) represents a short-time window audio music feature of the search key, and “Mean-sX” (X = 0, 1, 2, ...) represents an average short-time window audio music feature of the search key. Likewise, “tX” (X = 0, 1, 2, ...) represents a short-time window audio music feature of the search target, and “Mean-tX” (X = 0, 1, 2, ...) represents an average short-time window audio music feature of the search target.
[0079]
In FIG. 8A, it is assumed that the similarity between “Mean-s1” of the search key audio music signal and “Mean-t3” of the search target audio music signal is high. The short-time window audio music features from which “Mean-s1” was averaged are “s3”, “s4”, and “s5”, and those from which “Mean-t3” was averaged are “t9”, “t10”, and “t11”. If the corresponding short-time window audio music feature in the search key is defined as the feature at the center of the averaged interval, it is “s4” here.
[0080]
When correct candidate speech music sections are cut out on this basis, as shown in FIG. 8B, the position of “s4” in the search key audio music signal is aligned in turn with the positions of “t9”, “t10”, and “t11” in the search target audio music signal, and for each of them an audio music signal of the same length as the search key is cut out to create a correct candidate speech music section. Since K = 3, three correct candidate speech music sections are created: one with “s4” aligned to “t9”, one with “s4” aligned to “t10”, and one with “s4” aligned to “t11”.
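The creation of K candidate sections from one matched pair of average features (steps S413 and S414) can be sketched as below. The function name `candidate_starts`, the use of `k // 2` to pick the center feature, the lengths passed in the example, and the discarding of sections that fall outside the search target are assumptions for illustration.

```python
def candidate_starts(key_mean_idx, target_mean_idx, k, key_len, target_len):
    """Return the start indices (in target feature units) of the candidate sections."""
    # Center feature of the averaged interval in the search key, e.g. s4 for Mean-s1.
    key_pos = key_mean_idx * k + k // 2
    starts = []
    for offset in range(k):
        # Each original target feature of the matched average, e.g. t9, t10, t11.
        target_pos = target_mean_idx * k + offset
        start = target_pos - key_pos
        if start >= 0 and start + key_len <= target_len:
            starts.append(start)  # assumed policy: keep only in-range sections
    return starts


# FIG. 8 example: K = 3, Mean-s1 matched with Mean-t3.
# key_len = 9 and target_len = 30 are hypothetical lengths.
print(candidate_starts(1, 3, 3, key_len=9, target_len=30))  # prints: [5, 6, 7]
```

The three starts correspond to aligning “s4” with “t9”, “t10”, and “t11” respectively.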
[0081]
FIG. 7 and FIG. 8 show one example of a method for creating correct candidate speech music sections from the list of average short-time window audio music features, but the present invention is not limited to it. The short-time window audio music feature used for alignment may be one other than the feature at the center of the averaged interval. Further, in the example of FIG. 8, the number of correct candidate speech music sections created is not limited to K = 3; an arbitrary number, such as K + 2 = 5 or just 1, may also be set.
[0082]
[Effects of the Invention]
According to the present invention, the distance representing the overall similarity between the search key and a search-target audio music signal cut out to the same length as the search key is defined as the sum of only the smallest of the distances representing the partial similarities between short-time window audio music features. This has the effect of enabling a similarity search of audio music signals with reduced influence of non-stationary noise (solution of Problem 1 and Problem 3).
[0083]
In addition, it has the effect of being able to perform a high-speed search by using a multidimensional spatial index when searching for a part with a high degree of partial similarity between short-time window audio music feature quantities (solution of Problem 2).
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration example of a similar speech music search device according to an embodiment of the present invention.
FIG. 2 is a flowchart of similar voice music search processing in the present embodiment.
FIG. 3 is a flowchart of comparison and integration processing in the present embodiment.
FIG. 4 is a diagram for describing extraction of correct candidate speech music sections from search targets according to the present embodiment.
FIG. 5 is a diagram for explaining a method of calculating the overall similarity according to the present embodiment.
FIG. 6 is a flowchart of similarity search processing in the present embodiment.
FIG. 7 is a flowchart of processing for creating a correct candidate speech music section list according to the present embodiment.
FIG. 8 is a diagram for explaining an example of creating a correct candidate speech music section from an average short-time window speech music feature amount having a high similarity according to the present embodiment.
[Explanation of symbols]
P1 search phase
P2 accumulation phase
10 Similar voice music search device
11 Short-time window audio music feature extraction unit (search phase)
12 Short-time window audio music feature similarity search unit
13 Audio music information comparison and integration unit
14 Short-time window audio music feature extraction unit (accumulation phase)
15 Accumulation unit
16 Storage unit
20 Search key voice music signal input device
21 Voice music display output device
22 Search target audio music signal input device

Claims

A similar audio music search device that searches a search-target audio music signal for audio music signals similar to an audio music signal serving as a search key, comprising:
search key input means for inputting the audio music signal serving as the search key;
feature extraction means for extracting short-time window audio music features from the audio music signal serving as the search key using a short-time window;
similarity search means for using the extracted short-time window audio music features to search the stored short-time window audio music features of the search-target audio music signal for short-time window audio music features with high partial similarity;
comparison and integration means for, based on the result of the similarity search, aligning the position of a short-time window audio music feature of the search key within the audio music signal of the search key with the position of the short-time window audio music feature with high partial similarity within the search-target audio music signal, cutting out the audio music signal corresponding to the search key from the aligned position in the search-target audio music signal to create a correct candidate speech music section, calculating, for each pair of corresponding short-time window audio music features of the correct candidate speech music section and the search key, the distance between the multidimensional vectors representing those features, taking the sum of a number of the smallest of those distances, and calculating an overall similarity whose value is rated higher the smaller the sum; and
display output means for outputting the correct candidate speech music sections in descending order of the overall similarity.

A similar audio music search device that searches a search-target audio music signal for audio music signals similar to an audio music signal serving as a search key, comprising:
search key input means for inputting the audio music signal serving as the search key;
feature extraction means for extracting short-time window audio music features from the audio music signal serving as the search key using a short-time window;
similarity search means for calculating, from the extracted short-time window audio music features, average short-time window audio music features, each being the average of K short-time window audio music features consecutive in time order (K being any integer of 2 or more), and for using the average short-time window audio music features of the search key to search, among average short-time window audio music features that are stored in advance and are each the average of K time-ordered short-time window audio music features of the search-target music signal (K being any integer of 2 or more), for average short-time window audio music features with high partial similarity;
comparison and integration means for, based on the result of the similarity search, aligning the position of one short-time window audio music feature included in the interval over which an average short-time window audio music feature of the search key was calculated with the position of any one of the short-time window audio music features included in the interval over which the average short-time window audio music feature with high partial similarity in the search-target audio music signal was calculated, cutting out the audio music signal corresponding to the search key from the aligned position in the search-target audio music signal to create a correct candidate speech music section, calculating, for each pair of corresponding short-time window audio music features of the correct candidate speech music section and the search key, the distance between the multidimensional vectors representing those features, taking the sum of a number of the smallest of those distances, and calculating an overall similarity whose value is rated higher the smaller the sum; and
display output means for outputting the correct candidate speech music sections in descending order of the overall similarity.

A similar audio music search program that causes a computer to function as each of the means constituting the similar audio music search device according to claim 1 or claim 2.

A computer-readable recording medium on which the similar audio music search program according to claim 3 is recorded.