JP2004348594A

JP2004348594A - Time series data search method, device, and program, and program storage medium

Info

Publication number: JP2004348594A
Application number: JP2003146794A
Authority: JP
Inventors: Yasushi Sakurai; 保志櫻井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-05-23
Filing date: 2003-05-23
Publication date: 2004-12-09
Anticipated expiration: 2023-05-23
Also published as: JP4355824B2

Abstract

PROBLEM TO BE SOLVED: To provide a time series data search method for the speed-up of time series data search using dynamic time warping, a time series data search device, a time series data search program, and a program storage medium.

SOLUTION: A computer system having a database storing a plurality of sequences executes an approximation step, in which a distance between a search objective sequence and a sequence read from the database is approximated by using a coefficient value of discrete Fourier transform, and a distance computing step in which a dynamic time warping distance between the search objective sequence and the sequence read from the database is found if the approximate distance found in the approximation step is within a predetermined range.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、データベース検索技術に関し、より詳しくは、ダイナミックタイムワーピングに基づいて時系列データを検索する時系列データ検索方法、時系列データ検索装置、時系列データ検索プログラム、およびプログラム記録媒体に関する。
【０００２】
【従来の技術】
時系列データは、時間軸に沿って要素値が定められているシーケンスとして表現される。
【０００３】
このような時系列データを扱う手法として、ダイナミックタイムワーピング（ＤＴＷ：ＤｙｎａｍｉｃＴｉｍｅＷａｒｐｉｎｇ、時間軸正規化）と呼ばれる手法が知られている。ダイナミックタイムワーピングとは、時間軸に沿ってシーケンスを伸長させ、二つのシーケンスの距離を最小化する変換のことである。
【０００４】
一般に、ユークリッド距離においては、長さやサンプリングレートの異なる時系列データを扱うことが難しいが、ダイナミックタイムワーピングの手法を用いれば、そのような時系列データを比較的容易に扱うことができ、二つのシーケンス間の距離、すなわち類似度をより的確に求めることができる。
【０００５】
ダイナミックタイムワーピング距離について、さらに具体的に説明する。
長さｎのシーケンスＰ＝｛ｐ_０，ｐ_１，・・・，ｐ_ｎ−１｝と長さｍのシーケンスＱ＝｛ｑ_０，ｑ_１，・・・，ｑ_ｍ−１｝が与えられているとき、ダイナミックタイムワーピング距離Ｄ_ＤＴＷ（Ｐ，Ｑ）は、次式で定義される：
【数１】

ここで、ｇ_ｓｅｇ（−１， −１）＝０であり、さらに、任意の整数ｉ，ｊに対してｇ_ｓｅｇ（ｉ，−１）＝ｇ_ｓｅｇ（−１，ｊ）＝∞である。また、式（２）の右辺第２項は、ｇ_ｓｅｇ（ｉ−１，ｊ）、ｇ_ｓｅｇ（ｉ，ｊ−１）、ｇ_ｓｅｇ（ｉ−１，ｊ−１）のうちの最小値を意味する。さらに、式（３）の右辺のαの値は任意であるが、以後の説明においては、便宜上α＝２であるとする。
【０００６】
二つのシーケンスＰとＱの距離は、各シーケンスの要素を昇順にマッチングすることによって得られる。すなわち、ダイナミックタイムワーピングでは、二つのシーケンスの長さが異なっていても、シーケンス間の距離を定義することができる。このようなダイナミックタイムワーピングは、ダイナミックプログラミングと呼ばれるアルゴリズムにしたがって計算され、その計算コストは、およそＯ（ｎｍ）のオーダーであることが知られている。したがって、シーケンス長が長くなると、非常に多くの計算コストが必要となる。
【０００７】
従来、この計算コストを低減するためのさまざまな技術が開示されている。
【０００８】
このうち、非特許文献１に開示されている技術（以後、この技術を従来技術１と呼ぶ）は、ダイナミックタイムワーピングに基づく二つのシーケンス間の距離であるダイナミックタイムワーピング距離を近似し、時系列データの検索を高速化するための手法である。この手法では、シーケンスをなす要素から、（時間的に最初の要素値、時間的に最後の要素値、要素の最小値、要素の最大値）を抽出し、これら４つの要素から成る４次元ベクトルを作成し、この４次元ベクトル間のユークリッド距離をシーケンス間の距離の近似値として採用する。この近似値は、ダイナミックタイムワーピングの下限距離（厳密な距離と等しいか、または下回る値をとる）を示しており、このような近似値を用いることによって、探索漏れを発生させずに、ダイナミックタイムワーピングによる厳密な距離計算回数の削減を図っている。
【０００９】
また、非特許文献２に開示されている技術（以後、この技術を従来技術２と呼ぶ）では、シーケンスを等間隔に分割してサブシーケンスを作成し、このサブシーケンスのうちの最大値と最小値を計算し、そのユークリッド距離を、ダイナミックタイムワーピングの近似値としている。ちなみにこの手法は、ＰＣＡ（ＰｉｅｃｅｗｉｓｅＣｏｎｓｔａｎｔＡｐｐｒｏｘｉｍａｔｉｏｎ）と呼ばれている。
【００１０】
【非特許文献１】
Ｓａｎｇ−ＷｏｏｋＫｉｍ，ＳａｎｇｈｙａｕｎＰａｒｋ，ａｎｄＷｅｓｌｅｙＷ．Ｃｈｕ， ”ＡｎＩｎｄｅｘ−ｂａｓｅｄＡｐｐｒｏａｃｈｆｏｒＳｉｍｉｌａｒｉｔｙＳｅａｒｃｈＳｕｐｐｏｒｔｉｎｇＴｉｍｅＷａｒｐｉｎｇｉｎＬａｒｇｅＳｅｑｕｅｎｃｅＤａｔａｂａｓｅｓ”，ｉｎＰｒｏｃｅｅｄｉｎｇｓｏｆＩＣＤＥ，ｐｐ．６０７−６１４（Ａｐｒｉｌ２００１）．
【００１１】
【非特許文献２】
ＥａｍｏｎｎＪ．Ｋｅｏｇｈ， ”ＥｘａｃｔＩｎｄｅｘｉｎｇｏｆＤｙｎａｍｉｃＴｉｍｅＷａｒｐｉｎｇ”，ｉｎＰｒｏｃｅｅｄｉｎｇｓｏｆＶＬＤＢ，ｐｐ．４０６−４１７（Ａｕｇｕｓｔ２００２）．
【００１２】
【発明が解決しようとする課題】
しかしながら、上述した従来技術１の場合、近似値を求めるために採用する４次元ベクトルは、時系列データ全体に対して要素数が少ないため、近似値の精度が低く、ダイナミックタイムワーピング距離の計算回数を十分に削減することができないという問題があった。
【００１３】
また、従来技術２の場合にも、従来技術１と同様の問題、すなわち、近似値の精度が低く、ダイナミックタイムワーピング距離の計算回数を十分に削減できないという問題があった。
【００１４】
本発明はこのような事情に鑑みてなされたものであり、その目的は、ダイナミックタイムワーピングを用いた時系列データの検索を高速化することのできる時系列データ検索方法、時系列データ検索装置、時系列データ検索プログラム、およびプログラム記録媒体を提供することにある。
【００１５】
【課題を解決するための手段】
上記目的を達成するために、請求項１記載の発明は、時間軸に沿って要素値が定められているシーケンスとして表現される時系列データをダイナミックタイムワーピングに基づいて検索する時系列データ検索方法であって、複数のシーケンスを格納して記憶するデータベースを備えたコンピュータシステムが、（Ａ）検索対象となるシーケンスと前記データベースから読み出したシーケンスの距離を、離散フーリエ変換の係数値を用いて近似する近似ステップと、（Ｂ）前記（Ａ）の近似ステップで近似した近似距離が所定の範囲内にある場合には、前記検索対象となるシーケンスと前記データベースから読み出したシーケンスのダイナミックタイムワーピング距離を求める距離算出ステップとを実行することを要旨とする。
【００１６】
請求項２記載の発明は、請求項１記載の発明において、前記（Ａ）の近似ステップと、前記（Ｂ）の距離算出ステップを、前記データベースに記憶した全てのシーケンスに対して繰り返し実行し、（Ｃ）この繰り返しの結果に基づいて、前記検索対象となるシーケンスの近傍に位置するシーケンスを検索結果として出力する検索結果出力ステップをさらに実行することを要旨とする。
【００１７】
請求項３記載の発明は、時間軸に沿って要素値が定められているシーケンスとして表現される時系列データをダイナミックタイムワーピングに基づいて検索する時系列データ検索方法であって、複数のシーケンスを格納して記憶するデータベースを備えたコンピュータシステムが、（Ａ）近接する要素から順番に結合することによってセグメント化した時系列データであるセグメンティッド・シーケンスを作成する作成ステップと、（Ｂ）検索対象となるシーケンスと前記データベースから読み出したシーケンスの距離を、前記（Ａ）の作成ステップで作成した各々のシーケンスに対応するセグメンティッド・シーケンスを用いて近似する近似ステップと、（Ｃ）この近似ステップで求めた近似距離が所定の範囲内にある場合には、前記検索対象となるシーケンスと前記データベースから読み出したシーケンスのダイナミックタイムワーピング距離を求める距離算出ステップとを実行することを要旨とする。
【００１８】
請求項４記載の発明は、請求項３記載の発明において、前記（Ａ）の作成ステップから前記（Ｃ）の距離算出ステップに至る処理を、前記データベースに記憶した全てのシーケンスに対して繰り返し実行し、（Ｄ）この繰り返しの結果に基づいて、前記検索対象となるシーケンスの近傍に位置するシーケンスを検索結果として出力する検索結果出力ステップをさらに実行することを要旨とする。
【００１９】
請求項５記載の発明は、時間軸に沿って要素値が定められているシーケンスとして表現される時系列データをダイナミックタイムワーピングに基づいて検索する時系列データ検索方法であって、複数のシーケンスを格納して記憶するデータベースを備えたコンピュータシステムが、（Ａ）検索対象となるシーケンスと前記データベースから読み出したシーケンスの距離を、離散フーリエ変換の係数値を用いて近似する第１の近似ステップと、（Ｂ）前記（Ａ）の第１の近似ステップで近似した近似距離が所定の範囲内にある場合には、近接する要素から順番に結合することによってセグメント化した時系列データであるセグメンティッド・シーケンスを作成する作成ステップと、（Ｃ）検索対象となるシーケンスと前記データベースから読み出したシーケンスの距離を、前記（Ｂ）の作成ステップで作成した各々のシーケンスに対応するセグメンティッド・シーケンスを用いて近似する第２の近似ステップと、（Ｄ）前記（Ｃ）の第２の近似ステップで求めた近似距離が所定の範囲内にある場合には、前記検索対象となるシーケンスと前記データベースから読み出したシーケンスのダイナミックタイムワーピング距離を求める距離算出ステップとを実行することを要旨とする。
【００２０】
請求項６記載の発明は、請求項５記載の発明において、前記（Ａ）の第１の近似ステップから前記（Ｄ）の距離算出ステップに至る処理を、前記データベースに記憶した全てのシーケンスに対して繰り返し実行し、（Ｅ）この繰り返しの結果に基づいて、前記検索対象となるシーケンスの近傍に位置するシーケンスを検索結果として出力する検索結果出力ステップをさらに実行することを要旨とする。
【００２１】
請求項７記載の発明は、時間軸に沿って要素値が定められているシーケンスとして表現される時系列データをダイナミックタイムワーピングに基づいて検索する時系列データ検索装置であって、複数のシーケンスを格納して記憶する記憶手段と、検索対象となるシーケンスと前記記憶手段で記憶したシーケンスの距離を、離散フーリエ変換の係数値を用いて近似する近似手段と、この近似手段で近似した近似距離が所定の範囲内にある場合には、前記検索対象となるシーケンスと前記記憶手段から読み出したシーケンスのダイナミックタイムワーピング距離を求める距離算出手段とを備えたことを要旨とする。
【００２２】
請求項８記載の発明は、時間軸に沿って要素値が定められているシーケンスとして表現される時系列データをダイナミックタイムワーピングに基づいて検索する時系列データ検索装置であって、複数のシーケンスを格納して記憶する記憶手段と、近接する要素から順番に結合することによってセグメント化した時系列データであるセグメンティッド・シーケンスを各々のシーケンスに対して作成する作成手段と、検索対象となるシーケンスと前記記憶手段で記憶したシーケンスの距離を、前記作成手段で作成した各々のシーケンスに対応するセグメンティッド・シーケンスを用いて近似する近似手段と、この近似手段で求めた近似距離が所定の範囲内にある場合には、前記検索対象となるシーケンスと前記記憶手段から読み出したシーケンスのダイナミックタイムワーピング距離を求める距離算出手段とを備えたことを要旨とする。
【００２３】
請求項９記載の発明は、時間軸に沿って要素値が定められているシーケンスとして表現される時系列データをダイナミックタイムワーピングに基づいて検索する時系列データ検索装置であって、複数のシーケンスを格納して記憶する記憶手段と、検索対象となるシーケンスと前記記憶手段で記憶したシーケンスの距離を、離散フーリエ変換の係数値を用いて近似する第１の近似手段と、この第１の近似手段で近似した近似距離が所定の範囲内にある場合には、近接する要素から順番に結合することによってセグメント化した時系列データであるセグメンティッド・シーケンスを作成する作成手段と、検索対象となるシーケンスと前記記憶手段から読み出したシーケンスの距離を、前記作成手段で作成した各々のシーケンスに対応するセグメンティッド・シーケンスを用いて近似する第２の近似手段と、この第２の近似手段で求めた近似距離が所定の範囲内にある場合には、前記検索対象となるシーケンスと前記記憶手段で記憶したシーケンスのダイナミックタイムワーピング距離を求める距離算出手段とを備えたことを要旨とする。
【００２４】
請求項１０記載の発明は、請求項１乃至６のいずれか１項に記載した時系列データ検索方法をコンピュータに実行させることを要旨とする。
【００２５】
請求項１１記載の発明は、請求項１乃至６のいずれか１項に記載した時系列データ検索方法をコンピュータに実行させるための時系列データ検索プログラムを記録したことを要旨とする。
【００２６】
【発明の実施の形態】
以下、添付図面を参照して本発明の実施の形態を説明する。
【００２７】
（第１の実施形態）
図１は、本発明の第１の実施形態に係る時系列データ検索方法を実行するコンピュータシステムである時系列データ検索装置の概略構成を示す機能ブロック図である。同図に示す時系列データ検索装置１は、各種データを入力するためのキーボード、マウス等の入力装置から成る入力部１１、中央処理装置（ＣＰＵ：ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）を備え、後述する各種処理の制御および演算を行う制御演算部１２、入力部１１からの入力情報や制御演算部１２からの演算結果等を格納して記憶する記憶部１３、この記憶部１３で記憶した情報を出力するための（液晶）ディスプレイ画面等の出力装置から成る出力部１４を少なくとも有している。
【００２８】
記憶手段の少なくとも一部をなす記憶部１３は、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等から構成される主記憶装置と、ハードディスクドライブ、フレキシブルディスクドライブ、ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｃＲｅａｄＯｎｌｙＭｅｍｏｒｙ）ドライブ、ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｋ）ドライブ、光磁気ディスクドライブ、ＰＣカードドライブ等から構成される補助記憶装置とを具備しており、データ検索に必要なシーケンス情報を管理、記憶するデータベースとしての機能を有する。また、後述する計算結果を随時格納するために必要なメモリ領域が確保されている。
【００２９】
なお、本実施形態の各種処理は、一つの電子的な装置が実行する場合だけでなく、各ステップの実行を適宜分割して二つ以上の電子的な装置から構築されたシステムが全体で実行する場合も含まれる。この意味で、本実施形態に係る「時系列データ検索装置」は、一つまたは複数のコンピュータ（システム）によって構成されることはいうまでもない。この点は、本発明の全ての実施の形態に共通する事項である。
【００３０】
次に、以上の構成を有する時系列データ検索装置１が実行する時系列データ検索処理の詳細を説明する。
【００３１】
最初に、ダイナミックタイムワーピング距離の近似方法（距離近似方法）について説明した後、この距離近似方法を用いた時系列データ検索方法について説明する。
【００３２】
＜距離近似方法＞
本実施形態に係る時系列データ検索方法では、ダイナミックタイムワーピング距離Ｄ_ＤＴＷの計算回数をできるだけ削減するために、離散フーリエ変換（ＤＦＴ：ＤｉｓｃｒｅｔｅＦｏｕｒｉｅｒＴｒａｎｓｆｏｒｍ）を用いて近似距離の計算を行う。
【００３３】
長さｎのシーケンスＰ＝｛ｐ_０，ｐ_１，・・・，ｐ_ｎ−１｝が与えられているとき、このシーケンスＰの離散フーリエ変換Ｆ（Ｐ）＝｛ｆ_０（Ｐ），ｆ_１（Ｐ），・・・，ｆ_ｎ−１（Ｐ）｝の各成分（ＤＦＴ係数）は、次式で定義される：
【数２】

ここで、ｋ≠０に対して、ｆ_ｎ−ｋ（Ｐ）はｆ_ｋ（Ｐ）の複素共役となる。本実施形態では、シーケンスＰの各要素が実数であることを仮定する。このため、ｆ_０（Ｐ）は式（４）により実数となる。
【００３４】
逆に、離散フーリエ変換Ｆ（Ｐ）が与えられているとき、シーケンスＰの要素は、Ｆ（Ｐ）に対して逆離散フーリエ変換（ＩＤＦＴ：ＩｎｖｅｒｓｅＤｉｓｃｒｅｔｅＦｏｕｒｉｅｒＴｒａｎｓｆｏｒｍ）
【数３】

を施すことによって求めることができる。
【００３５】
以上のように定義される離散フーリエ変換を用いて、検索対象となるシーケンス（問い合わせシーケンスと呼ぶ）Ｑ＝｛ｑ_０，ｑ_１，・・・，ｑ_ｎ−１｝から任意のシーケンスＰまでの距離近似方法を説明する。ここでは、両シーケンスの位置関係に応じて近似距離を求める。なお、以後の説明においては、ｍｉｎ（Ｐ）およびｍｉｎ（Ｑ）を、それぞれシーケンスＰおよびＱの最小値とする一方、ｍａｘ（Ｐ）およびｍａｘ（Ｑ）を、それぞれシーケンスＰおよびＱの最大値とする。
【００３６】
二つのシーケンスＰとＱの位置関係は、図２の概念図に示すように、次の３つの場合がある：
（１）ｍｉｎ（Ｐ）≧ｍａｘ（Ｑ）（図２（ａ）：この配置をｄｉｓｊｏｉｎｔと呼ぶ）
（２）ｍｉｎ（Ｐ）＜ｍａｘ（Ｑ）かつｍａｘ（Ｐ）＞ｍｉｎ（Ｑ）（図２（ｂ）：この配置をｏｖｅｒｌａｐと呼ぶ）
（３）ｍａｘ（Ｐ）≦ｍｉｎ（Ｑ）（図２（ｃ）：この配置をｄｉｓｊｏｉｎｔ（−）と呼ぶ）
以下、これら３つの位置関係の場合について、それぞれ近似距離を計算する。
【００３７】
（１）ｄｉｓｊｏｉｎｔ
問合せシーケンスＱの下限シーケンスＱ_{ｌｏｗｅｒ}＝｛ｑｌ_０，ｑｌ_１，・・・，ｑｌ_ｎ−１｝と上限シーケンスＱ_{ｕｐｐｅｒ}＝｛ｑｕ_０，ｑｕ_１，・・・，ｑｕ_ｎ−１｝を求める。二つのシーケンスの要素は、ｉ＝０，１，・・・，ｎ−１として、それぞれ次のように定義される。
【数４】

ここで、ｍｉｎ（ｑｕ_ｉ：ｑｕ_ｉ＋ｊ）とｍａｘ（ｑｕ_ｉ：ｑｕ_ｉ＋ｊ）は、部分シーケンス｛ｑ_ｉ，ｑ_ｉ＋１，・・・，ｑ_ｉ＋ｊ｝の最小値と最大値をそれぞれ表す。
【００３８】
式（６）および（７）の右辺に出てくるｌ_ｅｘｔは、ダイナミックタイムワーピングによってシーケンスが伸長する場合の最大伸長の値を表す。換言すれば、ダイナミックタイムワーピングによるシーケンスの伸長は、たかだかｌ_ｅｘｔに制限される。このｌ_ｅｘｔは、ｍｉｎ（Ｐ）≧０かつｍｉｎ（Ｑ）≧０のとき、
【数５】

である。ここで、右辺に用いられている記号［ｘ］は、ｘの値を超えない最大の整数を意味している。
【００３９】
式（８）の右辺において、
【数６】

はシーケンスＰとＱの２乗ユークリッド距離である。ｗはダイナミックタイムワーピングの幅であり、特に制限がなければ、ｗ＝ｎ−１である。Ｅ（Ｐ）とＥ（Ｑ）は、シーケンスＰとＱのエネルギーであり、それぞれ次式で与えられる：
【数７】

また、δ（Ｐ，Ｑ）＝（ｍｉｎ（Ｐ）−ｍａｘ（Ｑ））^２である。
【００４０】
ここで、ｍｉｎ（Ｐ）＜ｍｉｎ（Ｑ）かつｍｉｎ（Ｐ）＜０のときには、式（１０）のＥ（Ｐ）と式（１１）のＥ（Ｑ）を、以下のＥ’（Ｐ）とＥ’（Ｑ）にそれぞれ置き換える。
【数８】

【００４１】
また、ｍｉｎ（Ｐ）＞ｍｉｎ（Ｑ）かつｍｉｎ（Ｑ）＜０のときには、式（１２）および式（１３）のｍｉｎ（Ｐ）をｍｉｎ（Ｑ）としてｌ_ｅｘｔを求めることができる。
【００４２】
シーケンスＰが全てのＤＦＴ係数を用いて展開されている場合には、２乗ユークリッド距離Ｄ^２ _{Ｅｕｃｌｉｄ}（Ｐ，Ｑ）は式（９）で与えられるが、そのＤＦＴ係数のうちの（ｍ＋１）個（ｍ≦ｎ−１）のみを用いて展開されている場合には、式（６）において、Ｄ^２ _{Ｅｕｃｌｉｄ}（Ｐ，Ｑ）の代わりに、以下で定義されるＤ_{ｍｏｄｉｆｉｅｄ}（Ｐ，Ｑ）を用いる：
【数９】

【００４３】
以上により、ｄｉｓｊｏｉｎｔの場合のダイナミックタイムワーピング距離Ｄ_ＤＴＷ（Ｐ，Ｑ）の近似距離は、次のように求められる。
【数１０】

【００４４】
図３は、Ｄ_{ｄｉｓｊｏｉｎｔ}（Ｐ，Ｑ）の計算を概念的に示す説明図である。同図において、問合せシーケンスＱは、下限シーケンスＱ_{ｌｏｗｅｒ}と上限シーケンスＱ_{ｕｐｐｅｒ}に囲まれている。ＰとＱの位置関係がｄｉｓｊｏｉｎｔ（ｍｉｎ（Ｐ）≧ｍａｘ（Ｑ））の場合、Ｑ_{ｕｐｐｅｒ}を用いて近似距離の計算を行う（式（１５）を参照）。このとき、Ｄ_{ｄｉｓｊｏｉｎｔ}（Ｐ，Ｑ）は、ダイナミックタイムワーピング距離Ｄ_ＤＴＷ（Ｐ，Ｑ）を近似しており、なおかつＤ_ＤＴＷ（Ｐ，Ｑ）以下の値をとる。
【００４５】
（２）ｏｖｅｒｌａｐ
図２（ｂ）に示すように、二つのシーケンスＰとＱの少なくとも一部が重なり合うとき、最大伸長の値は、ｌ_ｅｘｔ＝ｗである。上述したＱ_{ｌｏｗｅｒ}およびＱ_{ｕｐｐｅｒ}に加えて、
【数１１】

を用いることにより、ｏｖｅｒｌａｐにおける近似距離Ｄ_{ｏｖｅｒｌａｐ}（Ｐ，Ｑ）を次のように求めることができる：
【数１２】

【００４６】
（３）ｄｉｓｊｏｉｎｔ（−）
本質的には、上記（１）ｄｉｓｊｏｉｎｔの場合と同じであり、近似距離
Ｄ_{ｄｉｓｊｏｉｎｔ（−）}（Ｐ，Ｑ）は、次式で定義される：
【数１３】

【００４７】
このｄｉｓｊｏｉｎｔ（−）の場合、式（８）におけるδ（Ｐ，Ｑ）＝（ｍｉｎ（Ｑ）−ｍａｘ（Ｐ））^２であるが、他の量については、上述したｄｉｓｊｏｉｎｔの場合と同じである。
【００４８】
＜時系列データ検索方法＞
時系列データ検索処理を高速化するための特徴量として、本実施形態では離散フーリエ変換に基づいたＤＦＴ特徴ベクトルを用いる。
【００４９】
シーケンスＰのＤＦＴ特徴ベクトルＶ（Ｐ）は、次のように定義される：
Ｖ（Ｐ）＝｛Ｆ（Ｐ），ｍｉｎ（Ｐ），ｍａｘ（Ｐ），Ｅ_ｒｅｓｔ（Ｐ）， γ（Ｐ）｝（２１）
ここで、Ｐの離散フーリエ変換Ｆ（Ｐ）が（ｍ＋１）次元であるとすると、このＤＦＴ特徴ベクトルＶ（Ｐ）の次元は２ｍ＋５である。
【００５０】
このような多次元データの問合せとして、ここではｋ近傍距離（ｋは所定の正の整数）以下の近似距離を有するシーケンスを探索するｋ近傍問合せを採用する。これ以外に、例えば範囲指定問合せを用いても、時系列データ検索のアルゴリズムは本質的に同じである。この点に関しては、後述する実施形態でも同様である。
【００５１】
図４は、本実施形態に係る時系列データ検索方法において、離散フーリエ変換を用いた近傍探索処理の流れを示すフローチャート図である。
【００５２】
まず、問合せシーケンスＱを入力してこのシーケンスＱのＤＦＴ特徴ベクトルＶ（Ｑ）を計算するとともに、近傍探索数ｋを入力する（ステップＳ１０１）。
【００５３】
その後、シーケンスを順序付けしてカウントするカウンタｉの値を０とする（ステップＳ１０２）。
【００５４】
次に、ループＬ１１を実行する。
【００５５】
ループＬ１１では，まずカウンタｉの値を１増加させ（ステップＳ１１１）、記憶部１３からシーケンスＰおよびこのシーケンスのＤＦＴ特徴ベクトルＶ（Ｐ）をロードし（ステップＳ１１２）、ＰとＱのＤＦＴ特徴ベクトルを用いた近似距離Ｄ_１を計算する（ステップＳ１１３）。
【００５６】
このステップでは、近似距離Ｄ_１として、Ｄ_{ｄｉｓｊｏｉｎｔ}（Ｐ，Ｑ），Ｄ_{ｏｖｅｒｌａｐ}（Ｐ，Ｑ），Ｄ_{ｄｉｓｊｏｉｎｔ（−）}（Ｐ，Ｑ）のいずれかを計算することになるが、全てのシーケンスに対して近似距離を求める計算を行うと、多くのＣＰＵコストを要することになる。そこで、制御演算部１２では、記憶部１３に格納されているシーケンスにアクセスする前に、問合せシーケンスＱにおけるさまざまなシーケンス伸長ｌ_１，ｌ_２，・・・，ｌ_ｈを想定し、これらのｈ個のシーケンス伸長の上限および下限を用意することもできる。
【００５７】
一例として、Ｄ_{ｄｉｓｊｏｉｎｔ}（Ｐ，Ｑ）が伸長ｌ_ｅｘｔを要求する場合、ｌ_ｉ≧ｌ_ｅｘｔを満足するシーケンス伸長の中で最小のものを選択し、ｌ’_ｅｘｔとする：
【数１４】

制御演算部１２がＤ_{ｄｉｓｊｏｉｎｔ}（Ｐ，Ｑ）を計算する際には、このｌ’_ｅｘｔに関する上限および下限を用いる。他の近似距離についても同様である。
【００５８】
ステップＳ１１３での計算の結果、求めたＤ_１がｋ近傍距離以下であれば（ステップＳ１１４でＹＥＳ）、厳密なダイナミックタイムワーピング距離Ｄ_３＝Ｄ_ＤＴＷ（Ｐ，Ｑ）を求める（ステップＳ１１５）。
【００５９】
ステップＳ１１５で求めたＤ_３がｋ近傍距離以下であれば（ステップＳ１１６でＹＥＳ）、シーケンスＰのＩＤと距離Ｄ_３を記憶部１３の最近傍リストへ格納し、ソートする（ステップＳ１１７）。
【００６０】
ステップＳ１１７が終了後、または、ステップＳ１１４かステップＳ１１６の判断でＮＯの場合には、その時点でのシーケンスの数ｉが記憶部１３に格納されているシーケンスの数に達するまで（ｉ＞＝ｅｎｔ）、ステップＳ１１１に戻って処理を繰り返す。すなわち、記憶部１３に格納されているシーケンスの各々に対して、ループＬ１１の処理が繰り返し実行される。
【００６１】
いうまでもなく、このループＬ１１では、最初のｋ回は常にステップＳ１１４およびステップＳ１１６における判断がＹＥＳとなる。
【００６２】
ループＬ１１が終了後、最近傍リストに格納されている候補シーケンスを最終的な探索結果として出力する（ステップＳ１２１）。
【００６３】
ここで説明した近似距離Ｄ_１とダイナミックタイムワーピング距離Ｄ_３の計算は、時系列データ検索装置１の制御演算部１２で行われる。したがって、制御演算部１２が、検索対象となる問合せシーケンスＱと記憶部１３に格納されているシーケンスＰの距離をＤＦＴ係数値を用いて近似する近似手段としての機能と、近似距離が所定の範囲内にある場合に、ＰとＱの間のダイナミックタイムワーピング距離を求める距離算出手段としての機能とを具備していることはいうまでもない。
【００６４】
近似距離の計算は、厳密なダイナミックタイムワーピングの距離計算よりも計算コストが低い。したがって、以上説明したように、まず近似距離を計算した上で、この計算した近似距離がその時点での最近傍処理よりも大きければ、厳密な計算をせずに初めから除外することができ、厳密な計算は、最近傍距離よりも小さいものに関してのみ行うことが可能となる。
【００６５】
以上説明した本発明の第１の実施形態によれば、ダイナミックタイムワーピングに基づく時系列データの検索を、検索漏れを発生させずに高速化することが可能となる。
【００６６】
本実施形態における時系列データ検索方法は、時系列データとして表現することができる画像、映像、音声、文書等を対象とする広範囲の時系列データ検索に適用可能である。
【００６７】
なお、本実施形態に係る時系列データ検索処理は、時系列データ検索プログラムがインストールされた所定のコンピュータを用いて実施しても同様の効果を得ることができる。
【００６８】
さらに、そのような時系列データ検索プログラムを記録したコンピュータ読み取り可能なプログラム記録媒体をコンピュータに装着し、そのプログラム記録媒体に格納されているプログラムを読み出すことによって、コンピュータが上述した処理を実行するようにしてもよい。ここで、「コンピュータ読み取り可能な」プログラム記録媒体としては、ハードディスク、フレキシブルディスク、ＣＤ−ＲＯＭ、ＤＶＤ、光磁気ディスク、ＰＣカード等を用いることができる。このようなプログラム記録媒体を提供することによって、本実施形態の時系列データ検索プログラムを広く流通させることができるようになる。
【００６９】
これらの時系列データプログラムならびに当該プログラムを記録したプログラム記録媒体については、本発明の全ての実施の形態において同様のことがいえる。
【００７０】
（第１の実施形態の変形例）
本実施形態においては、二つのシーケンスＰとＱの長さが等しい場合を取り扱ったが、異なる長さのシーケンス長を計算するために、全てのシーケンスの長さが等しくなるように調整することも可能である。
【００７１】
この場合、まず基準長ｎ_ｂａｓｅを決定し、この基準長ｎ_ｂａｓｅよりも長いシーケンス長を縮小する一方で、基準長ｎ_ｂａｓｅよりも短いシーケンス長を伸長する。このような調整を行った後は、二つのシーケンスの長さが見かけ上等しくなるため、第１の実施形態と同じ方法で近似距離を求めることが可能となる。
【００７２】
シーケンス長の調整法をより詳細に説明する。
【００７３】
まず、Ｐ（の要素）を用いることにより、Ｐの下限シーケンスＰ_{ｌｏｗｅｒ}と上限シーケンスＰ_{ｕｐｐｅｒ}を以下のように計算する：
【数１５】

ここで、下限シーケンスＰ_{ｌｏｗｅｒ}の要素ｐｌ_ｉと上限シーケンスＰ_{ｕｐｐｅｒ}の要素ｐｕ_ｉ（ともにｉ＝０，１， … ，ｎ_ｂａｓｅ）は、
【数１６】

であり、これらのシーケンス長は、基準長ｎ_ｂａｓｅである。
【００７４】
以上のように計算したＰ_{ｌｏｗｅｒ}およびＰ_{ｕｐｐｅｒ}の少なくともいずれか一方を、もとのシーケンスＰの代わりに用いることにより、ダイナミックタイムワーピングの近似距離を求める。
【００７５】
問合せシーケンスＱについても、その長さを調整する。まず、式（１７）、（１８）と同様にＱ_{ｌｏｗｅｒ}とＱ_{ｕｐｐｅｒ}を作成する。その後、前述したＱ_{ｌｏｗｅｒ}から下限シーケンスＱ’_{ｌｏｗｅｒ}＝｛ｑｌ’_０，ｑｌ’_１，・・・，ｑｌ’_ｎ−１｝を、Ｑ_{ｕｐｐｅｒ}から上限シーケンスＱ’_{ｕｐｐｅｒ}＝｛ｑｕ’_０，ｑｕ’_１，・・・，ｑｕ’_ｎ−１｝を計算する。ここで、
【数１７】

である。式（２４）および式（２５）の右辺の定義は、それぞれ式（６）および式（７）の右辺の定義と同じである。
【００７６】
ここで、ＰとＱの位置関係がｄｉｓｊｏｉｎｔの場合、ｌ_ｅｘｔは、式（８）において、ＰにＰ_{ｌｏｗｅｒ}を代入し、ＱにＱ_{ｕｐｐｅｒ}を代入することによって計算される。また、ＰとＱの位置関係がｏｖｅｒｌａｐの場合、シーケンスの伸長幅はｌ_ｅｘｔ＝ｗである。ＰとＱの位置関係がｄｉｓｊｏｉｎｔ（−）の場合、ｌ_ｅｘｔは、式（８）において、ＰにＰ_{ｕｐｐｅｒ}を代入し、ＱにＱ_{ｌｏｗｅｒ}を代入することによって計算される。
【００７７】
ダイナミックタイムワーピングの近似距離は、ｄｉｓｊｏｉｎｔを計算するためにＰ_{ｌｏｗｅｒ}とＱ’_{ｕｐｐｅｒ}を用い、ｄｉｓｊｏｉｎｔ（−）を計算するためにＰ_{ｕｐｐｅｒ}とＱ’_{ｌｏｗｅｒ}を用いる。また、ｏｖｅｒｌａｐを計算するために、４つの量全てを用いる。
【００７８】
以上の点を除く時系列データ検索方法については、上記第１の実施形態と同じである。
【００７９】
したがって、このような第１の実施形態の変形例においても、上記第１の実施形態と同様の効果を得ることができる。
【００８０】
（第２の実施形態）
本発明の第２の実施形態に係る時系列データ検索方法は、ダイナミックタイムワーピング距離を計算するための近似値の算出に、セグメンティッド・シーケンスを用いることを特徴とする。
【００８１】
本実施形態に係る時系列データ検索装置の基本的な構成は、上記第１の実施形態で説明したものと同様である（図１を参照）。
【００８２】
＜セグメンティッド・シーケンス＞
まず、セグメンティッド・シーケンスについて説明する。
【００８３】
範囲ｒと長さｎのシーケンスＰが与えられたとき、セグメンティッド・シーケンスＳ_ｒは次のように定義される。
【００８４】
Ｓ_ｒ＝｛ｓ_０，ｓ_１，・・・，ｓ_ｎｓ−１｝（２６）
ｓ_ｉ＝｛ｓｌ_ｉ，ｓｕ_ｉ，ｓｒ_ｉ｝
（ｎ_ｓ≦ｎ，１≦ｓｒ_ｉ≦ｎ，ｓｕ_ｉ−ｓｌ_ｉ≦ｒ）
ここで、ｓｌ_ｉはセグメンティッド・シーケンスＳ_ｒの中のセグメントｓ_ｉの最小値であり、ｓｕ_ｉはセグメンティッド・シーケンスＳ_ｒの中のセグメントｓ_ｉの最大値である。したがって、ｓｕ_ｉ−ｓｌ_ｉはｓ_ｉの範囲であり、ｓｕ_ｉ−ｓｌ_ｉ≦ｒである。また、ｓｒ_ｉはｓ_ｉの長さを、ｎ_ｓはＳ_ｒにおけるセグメントの数を示している。
【００８５】
図５は、セグメンティッド・シーケンスを作成するためのアルゴリズムを示すフローチャート図である。
【００８６】
まず、シーケンスの長さｎ、範囲ｒ、およびシーケンスＰの要素ｐ_ｉ（ｉ＝０，１，・・・，ｎ−１）を入力し（ステップＳ２０１）、カウンタｉを０にセットする（ステップＳ２０２）。
【００８７】
次に、ループＬ２１を実行する。
【００８８】
このループＬ２１では、隣接するｉ番目の要素ｐ_ｉと（ｉ＋１）番目の要素のｐ_ｉ＋１の差｜ｐ_ｉ−ｐ_ｉ＋１｜（ＬＩＳＴ．ｄｉｆｆ_ｉという関数として定義）をｉ＝ｎ−１まで全て求める（ステップＳ２１１、Ｓ２１２）。
【００８９】
ループＬ２１が終了後、記憶部１３に格納されているデータをＬＩＳＴ．ｄｉｆｆ_ｉのｉの昇順にソートする（ステップＳ２２１）。
【００９０】
その後、カウンタｉを０にリセットし（ステップＳ２２２）、ループＬ２３に移行する。
【００９１】
ループＬ２３では、近接するシーケンスの要素のペアを、要素の値の差が小さいものから順番に結合してセグメントにする（ステップＳ２３２）。
【００９２】
そして、近接するセグメントを結合して、より大きなセグメントにする（ステップＳ２３１、Ｓ２３２、Ｓ２３３、Ｓ２３４）。
【００９３】
近接する要素もしくはセグメントの差が、閾値ｒを超えた時点でループＬ２３を終了する。
【００９４】
この後、ふたたびセグメントとカウンタを初期値に戻し（ステップＳ２４１）、ループＬ２５に移行する。
【００９５】
ループＬ２５では、最初のセグメントから順に生成されたセグメントの数を数えていく。この処理は、シーケンスの長さｎに達するまで処理を続ける（ステップＳ２５１）。
【００９６】
最終的なｉの値をｎ_ｓとして（ステップＳ２６１）、セグメンティッド・シーケンスＳ_ｒとセグメント数ｎ_ｓを結果として出力する（ステップＳ２６２）。
【００９７】
図６は、このようにして生成されたセグメンティッド・シーケンスＳ_ｒの例を示す図である。
【００９８】
同図に示すセグメンティッド・シーケンスＳ_ｒは、３つのセグメントｓ_０，ｓ_１，ｓ_２から構成されている。これらのセグメントの範囲ｓｕ_ｉ−ｓｌ_ｉ（ｉ＝０，１，２）は、所定の範囲（閾値）ｒよりも小さい。
【００９９】
＜距離近似方法＞
本実施形態に係る時系列データ検索方法は、セグメンティッド・シーケンスを用いることによって近似を行うものである。
【０１００】
セグメント数ｎ_ｓのセグメンティッド・シーケンスＳ＝｛ｓ_０，ｓ_１，・・・，ｓ_ｎｓ−１｝とセグメント数ｎ_ｔのセグメンティッド・シーケンスＴ＝｛ｔ_０，ｔ_１，・・・，ｔ_ｎｔ−１｝が与えられているとき、近似値Ｄ_{ｓｅｇｍｅｎｔ}（Ｓ，Ｔ）は、次の演算によって得られる。
【０１０１】
【数１８】

【０１０２】
ここで、ｇ_ｕｒ（−１，−１）＝０であり、さらに、任意の整数ｉ，ｊに対してｇ_ｌｒ（ｉ，−１）＝ｇ_ｕｒ（ｉ，−１）＝ｇ_ｕｌ（−１，ｊ）＝ｇ_ｕｒ（−１，ｊ）＝∞であるとする。また、ｇ_ｓｅｇ（ｉ，ｊ）の定義式の右辺にあるαの値は任意であるが、少なくとも上述したユークリッド２乗距離におけるαと同じ値でなければならないので、ここではα＝２とするが、これにより、本実施形態が特段の限定を受けるわけでないことは勿論である。ちなみに、ｇ_ｌｌ，ｇ_ｌｒ，ｇ_ｕｌ，ｇ_ｕｒは、それぞれ各セグメントにおける左下、右下、左上、右上の計算結果である。
【０１０３】
図７は、セグメンティッド・シーケンスを用いて行うシーケンスＳとＴの近似距離の計算を概念的に示す図である。同図に示す場合には、二つのシーケンスＳとＴのセグメントの個数が、それぞれｎ_ｓ＝３，ｎ_ｔ＝５に対応している。したがって、この場合のシーケンス間の距離の近似値Ｄ_{ｓｅｇｍｅｎｔ}（Ｓ，Ｔ）は、Ｄ_{ｓｅｇｍｅｎｔ}（Ｓ，Ｔ）＝ｇ_ｕｒ（２，４）となる。
【０１０４】
＜時系列データ検索方法＞
図８は、時系列データ検索方法においてセグメンティッド・シーケンスを用いた時系列データ検索処理の流れを示すフローチャート図である。なお、ここでの探索処理においても、ｋ近傍問合せを採用する。これ以外に、例えば範囲指定問合せを用いても、時系列データ検索のアルゴリズムは本質的に同じであることは第１の実施形態と同様である。
【０１０５】
まず、問合せシーケンスＱを入力して、このＱのセグメンティッド・シーケンスを上述した方法に従って計算するとともに、記憶部１３に格納されているシーケンスの数を入力する（ステップＳ３０１）。
【０１０６】
その後、シーケンスを順序付けしてカウントするカウンタｉの値を０とする（ステップＳ３０２）。
【０１０７】
次に、ループＬ３１を実行する。
【０１０８】
ループＬ３１では，まずカウンタｉの値を１増加させ（ステップＳ３１１）、記憶部１３からシーケンスＰのセグメンティッド・シーケンスをロードして（ステップＳ３１２）、ＰとＱのセグメンティッド・シーケンスを用いた近似距離Ｄ_２＝Ｄ_{ｓｅｇｍｅｎｔ}（Ｐ，Ｑ）を計算する（ステップＳ３１３）。
【０１０９】
ステップＳ３１３で求めたＤ_２がｋ近傍距離以下であれば（ステップＳ３１４でＹＥＳ）、厳密な距離であるダイナミックタイムワーピング距離Ｄ_３＝Ｄ_ＤＴＷ（Ｐ，Ｑ）を求める（ステップＳ３１５）。
【０１１０】
このステップで求めたＤ_３がｋ近傍距離以下であれば（ステップＳ３１６でＹＥＳ）、シーケンスＰのＩＤと距離Ｄ_３を記憶部１３の最近傍リストへ格納し、ソートする（ステップＳ３１７）。
【０１１１】
ステップＳ３１７が終了後、またはステップＳ３１４かステップＳ３１６の判断でＮＯの場合には、その時点でのシーケンスの数ｉが記憶部１３に格納されているシーケンスの数に達するまで（ｉ＞＝ｅｎｔ）、ステップＳ３１１に戻って処理を繰り返す。すなわち、記憶部１３に格納されているシーケンスの各々に対して、ループＬ３１の処理を行う。
【０１１２】
このループＬ３１でも、最初のｋ回は常にステップＳ３１４およびステップＳ３１６での判断がＹＥＳとなる。
【０１１３】
ループＬ３１が終了後、最近傍リストに格納されている候補シーケンスを最終的な探索結果として出力する（ステップＳ３２１）。
【０１１４】
ここで説明した近似距離Ｄ_２とダイナミックタイムワーピング距離Ｄ_３の計算は、時系列データ検索装置１の制御演算部１２で行われる。したがって、本実施形態においては、制御演算部１２が、ＰとＱ各々において近接する要素から順番に要素を結合することによってセグメント化した時系列データであるセグメンティッド・シーケンスを作成する作成手段としての機能と、ＰとＱの距離の近似値を、各々のシーケンスに対応するセグメンティッド・シーケンスを用いて計算する近似手段としての機能と、その近似距離が所定の範囲内にある場合に、ＰとＱの間のダイナミックタイムワーピング距離を求める距離算出手段としての機能とを具備していることはいうまでもない。
【０１１５】
以上説明した本発明の第２の実施形態によれば、第１の実施形態と同様の効果を得ることができる。
【０１１６】
（第３の実施形態）
本発明の第３の実施形態に係る時系列データ検索方法は、上述した二つの実施形態の距離近似方法（ＤＦＴ，セグメンティッド・シーケンス）を共に利用するものである。
【０１１７】
本実施形態に係る時系列データ検索装置の構成は、図１に示す時系列データ検索装置１と同じである。
【０１１８】
このために、本実施形態においては、インデックスを構成する。図９は、このインデックスを概念的に示すと共に、そのインデックスの構成要素を示す説明図である。同図に示すように、インデックス５１は、多次元インデックス５３とシーケンスファイル５５から構成されている。
【０１１９】
多次元インデックス５３は、ＤＦＴ特徴ベクトルＶ（Ｐ）（式（１５）を参照）と包囲矩形から構成される一方で、シーケンスファイル５５は、シーケンスＰとセグメンティッド・シーケンス（式（２０）を参照）から構成される。
【０１２０】
本実施形態では、多次元インデックス５３として、ＤＦＴ特徴ベクトルＶ（Ｐ）を（２ｍ＋５）次元空間の点とみなし、これらの点を、予め超矩形面によって包囲することにより形成される包囲矩形が、検索木におけるノードの構造を有している場合を想定し、以後の説明を行う。
【０１２１】
この仮想的な検索木は、最上位層の根ノードから出発して枝が延出し、各ノードが包囲矩形に対応している。根ノード以外のノードのうち、一つの包囲矩形に囲まれた複数の点（すなわち、複数のＤＦＴ特徴ベクトル）を格納したノードをリーフノードと呼ぶことにする。また、ここでは延出する枝の数（レベルの数に対応）が同じ値で終端するような平衡木の構造を有する場合を想定するが、これが一例に過ぎないのは勿論である。
【０１２２】
＜時系列データ検索処理＞
図１０、図１１および図１２は、本実施形態に係る時系列データ検索処理の流れを示すフローチャート図である。これらの図においては、多次元インデックスを用いた探索アルゴリズムを示しており、ＤＦＴ特徴ベクトルＶ（Ｐ）を用いた近似距離計算と同様に、ＤＦＴ特徴ベクトルＶ（Ｐ）を包含する包囲矩形と問合せシーケンスＱとの近似距離計算についても式（１７）、（１９）、（２０）のいずれかを用いて行う。
【０１２３】
まず、図１０を用いて処理を説明する。
【０１２４】
問合せシーケンスＱにおけるｈ個のペアの下限、上限シーケンスを作成し、それらのＤＦＴ特徴ベクトルＶ（Ｐ）を計算するとともに、制御演算部１２において、キューに根ノードへのポインタ、および距離０を設定する（ステップＳ４０１）。
【０１２５】
次に、ループＬ４１の処理を行う。
【０１２６】
まず、問合せ点（問合せシーケンスＱ）から最も近いノードＮをキューから取り出し、Ｎで示されるノード内のエントリ数を入力する（ステップＳ４１１）。
【０１２７】
この後、ノードＮがリーフノードか否かを判定し（ステップＳ４１２）、リーフノード以外のとき（ステップＳ４１２でＮＯ）、図１１に示す処理Ａを実行する。他方、ノードＮがリーフノードのとき（ステップＳ４１２でＹＥＳ）、処理Ｂを実行する（ステップＳ４１５）。処理ＡおよびＢについては、後述する。
【０１２８】
これらの処理を経て、キューが空になるまでループＬ４１を繰り返し実行した後、最近傍リストに格納されている候補シーケンスを探索結果として出力する（ステップＳ４２１）。
【０１２９】
ここで、図１１のフローチャート図を用いて、処理Ａの詳細を説明する。
【０１３０】
まず、カウンタｉを初期化して０とし（ステップＡ１）、ループＬＡ１を実行する。ループＬＡ１では、カウンタｉの設定後（ステップＡ２）、ノードＮの中でｉ番目に格納されているＤＦＴ特徴ベクトルを用いて近似距離Ｄ_１を計算する（ステップＡ３、Ａ４）。近似距離Ｄ_１として、Ｄ_{ｄｉｓｊｏｉｎｔ}（Ｐ，Ｑ），Ｄ_{ｏｖｅｒｌａｐ}（Ｐ，Ｑ），Ｄ_{ｄｉｓｊｏｉｎｔ（−）}（Ｐ，Ｑ）のいずれかを計算することはいうまでもない。
【０１３１】
Ｄ_１がｋ近傍距離以下であれば（ステップＡ５でＹＥＳ）、ノードＮのｉ番目のエントリに格納されている子ポインタと近似距離Ｄ_１をキューへ格納する（ステップＡ６）。
【０１３２】
Ｄ_１がｋ近傍距離より大きい場合（ステップＡ５でＮＯ）、ステップＡ２に戻って処理を繰り返す。
【０１３３】
このようにして、ループＬＡ１では、ノードＮに格納されている全エントリ（全データオブジェクト）に対する計算を行い、ループＬＡ１が終了後、キューの先頭データを削除してデータを前詰めし、近似距離Ｄ_１の昇順にキュー内のデータをソートする（ステップＡ７）。
【０１３４】
以上の処理が終了後、図１０のフローチャート図に戻って処理を行う。
【０１３５】
次に、図１２のフローチャート図を用いて、Ｎがリーフノードであるときの処理Ｂの詳細を説明する。
【０１３６】
まず、カウンタｉを初期化して０とし（ステップＢ１）、ループＬＢ１を実行する。ループＬＢ１では、カウンタｉの設定後（ステップＢ２）、ノードＮの中でｉ番目に格納されているＤＦＴ特徴ベクトルを用いて近似距離Ｄ_１を計算する（ステップＢ３、Ｂ４）。ここでも、近似距離Ｄ_１は、Ｄ_{ｄｉｓｊｏｉｎｔ}（Ｐ，Ｑ），Ｄ_{ｏｖｅｒｌａｐ}（Ｐ，Ｑ），Ｄ_{ｄｉｓｊｏｉｎｔ（−）}（Ｐ，Ｑ）のいずれかである。
【０１３７】
Ｄ_１がｋ近傍距離以下であれば（ステップＢ４でＹＥＳ）、セグメンティッド・シーケンスを用いて近似距離Ｄ_２＝Ｄ_{ｓｅｇｍｅｎｔ}（Ｐ，Ｑ）を計算する（ステップＢ５）。
【０１３８】
このＤ_２もｋ近傍距離以下であれば（ステップＢ６でＹＥＳ）、厳密なダイナミックタイムワーピング距離Ｄ_３＝Ｄ_ＤＴＷ（Ｐ，Ｑ）を計算する（ステップＢ８）。
【０１３９】
ステップＢ８で求めたＤ_３がｋ近傍距離以下であれば（ステップＢ９でＹＥＳ）、シーケンスＰのＩＤと距離Ｄ_３を記憶部１３の最近傍リストへ格納し、ソートする（ステップＢ１０）。
【０１４０】
ステップＢ１０が終了後、または、ステップＢ５、Ｂ７，Ｂ９のいずれかにおける判断でＮＯの場合には、その時点でのシーケンスの数ｉが記憶部１３に格納されているシーケンスの数に達するまで（ｉ＞＝ｅｎｔ）、ステップＢ２に戻って処理を繰り返す。すなわち、記憶部１３に格納されているシーケンスの各々に対して、ループＬＢ１の処理が繰り返し実行される。
【０１４１】
ループＬＢ１が終了後、最近傍リストによるキューのフィルタリングを行い、キューの中でアクセスする必要のないデータを削除する（ステップＢ１１）。
【０１４２】
その後、キューの先頭データを削除してデータを前詰めする（ステップＢ１２）。
【０１４３】
以上の処理が終了後、図１０のフローチャート図に戻って処理を行う。
【０１４４】
なお、本実施形態においても、ｋ近傍問合せの代わりに、範囲問合せを用いても勿論かまわない。
【０１４５】
ここで説明した近似距離Ｄ_１，Ｄ_２，ならびにダイナミックタイムワーピング距離Ｄ_３の計算は、時系列データ検索装置１の制御演算部１２で行われる。したがって、本実施形態においては、制御演算部１２が、検索対象となる問合せシーケンスＱと記憶部１３に格納されているシーケンスＰの距離の近似値を、ＤＦＴ係数値を用いて計算する第１の近似手段としての機能と、この近似距離が所定の範囲内にある場合には、ＰとＱ各々において近接する要素から順番に要素を結合することによってセグメント化した時系列データであるセグメンティッド・シーケンスを作成する作成手段としての機能と、検索対象となるシーケンスと前記記憶手段から読み出したシーケンスの距離を、各々のシーケンスに対応するセグメンティッド・シーケンスを用いてシーケンス間の距離の近似値を計算する第２の近似手段としての機能と、この第２の近似手段で求めた近似距離が所定の範囲内にある場合に、ＰとＱの間のダイナミックタイムワーピング距離を求める距離算出手段としての機能とを具備している。
【０１４６】
以上説明した本発明の第３の実施形態によれば、包囲矩形に相当するノードがリーフノードのときには、まず精度はやや劣るが高速に計算可能な離散フーリエ変換による近似で候補を絞った後、速度的にはやや劣るものの精度がよいセグメンティッド・シーケンスによる近似でさらに候補を絞り込むことにより、実際に厳密なダイナミックタイムワーピング距離の計算を行う回数を削減することができ、時系列データのさらに高速な検索が可能となる。
【０１４７】
【実施例】
本発明の一実施例として、第３の実施形態で説明した時系列データ検索方法を用いた場合を例にとり、検索時間を調査し、従来技術１（４次元ベクトルを用いる方法）および従来技術２（ＰＣＡを用いる方法）と比較した結果を示す。
【０１４８】
実験条件は以下の通りである：
・手法の性能を計測するための時系列データは、ランダムウォーク関数を用いて人工的に長さ１０２４のシーケンスデータを、漸化式ｐ_ｉ＝ｐ_ｉ−１＋ｘ_ｉに基づいて１００，０００件作成している。ここで、ｐ_０は各々のシーケンスの最初の要素であり、範囲（０，１０）からランダムに取得している。また、ｘ_ｉは正規分布関数によって取得しており、その正規分布関数の分散は１である。
【０１４９】
・最近傍探索の探索数（ｋ）は２０である。
【０１５０】
・ＣＰＵ時間は、ＳＵＮＵｌｔｒａＳＰＡＲＣ−ＩＩ４５０ＭＨｚによって計測した。
【０１５１】
・多次元インデックスとして、Ｒ＊−ｔｒｅｅ（ＮｏｒｂｅｒｔＢｅｃｋｍａｎｎ，Ｈａｎｓ−ＰｅｔｅｒＫｒｉｅｇｅｌ，ＲａｌｆＳｃｈｎｅｉｄｅｒ，ａｎｄＢｅｒｎｈａｒｄＳｅｅｇｅｒ， ”ＴｈｅＲ＊−Ｔｒｅｅ：ＡｎＥｆｆｉｃｉｅｎｔａｎｄＲｏｂｕｓｔＡｃｃｅｓｓＭｅｔｈｏｄｆｏｒＰｏｉｎｔｓａｎｄＲｅｃｔａｎｇｌｅｓ”，Ｐｒｏｃ．ｏｆＩｎｔ．Ｃｏｎｆ．ｏｎＡＣＭＳＩＧＭＯＤ，ｐｐ３２２−３３１（Ｊｕｎｅ１９９０）．を参照）を用いた。
【０１５２】
・データサイズは２５，０００件から１００，０００件まで変化させた。
【０１５３】
・１５次元（ｍ＝５）のＤＦＴ特徴ベクトルＶ（Ｐ）を用いてインデックスを構築し、検索処理の前には１３ペアの下限シーケンスおよび上限シーケンスのＤＦＴ係数をそれぞれ計算した。
【０１５４】
・セグメンティッド・シーケンスの範囲ｒとして、（ｍａｘ（Ｐ）−ｍｉｎ（Ｐ））／１４の値の平均を用いた。
【０１５５】
図１３は、ＣＰＵ時間に関する比較結果を示す図であり、横軸がデータサイズ、縦軸がＣＰＵ時間を表している。本実施例に基づく実験結果は、直線７１で与えられる。
【０１５６】
同図からも明らかなように、従来技術１（直線８１）および２（直線８２）の実験結果は、本発明の一実施例と比較して、同じデータサイズの時系列データの検索に多くのＣＰＵ時間を要していることを示している。具体的には、本発明のデータ検索方法によれば、データ検索時間を従来法よりも最大で１３倍程度軽減できることが分かる。
【０１５７】
【発明の効果】
以上の説明からも明らかなように、本発明によれば、ダイナミックタイムワーピングを用いた時系列データの検索を高速化することのできる時系列データ検索方法、時系列データ検索装置、時系列データ検索プログラム、およびプログラム記録媒体を提供することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態に係る時系列データ検索装置の基本構成を示すブロック図である。
【図２】二つのシーケンスの位置関係を模式的に示す図である。
【図３】Ｄ_{ｄｉｓｊｏｉｎｔ}（Ｐ，Ｑ）の計算を概念的に示す説明図である。
【図４】本発明の第１の実施形態に係る時系列データ検索方法の処理の流れを示すフローチャート図である。
【図５】本発明の第２の実施形態において近似距離の計算に用いられるセグメンティッド・シーケンスの作成処理のアルゴリズムを示すフローチャート図である。
【図６】セグメンティッド・シーケンスの一例を示す図である。
【図７】セグメンティッド・シーケンスを用いた近似距離計算を概念的に示す説明図である。
【図８】本発明の第２の実施形態に係る時系列データ検索方法の処理の流れを示すフローチャート図である。
【図９】本発明の第３の実施形態において用いられるインデックスを概念的に説明する図である。
【図１０】本発明の第３の実施形態に係る時系列データ検索方法の処理の流れを示すフローチャート図である。
【図１１】図１０の処理Ａの詳細を示すフローチャート図である。
【図１２】図１０の処理Ｂの詳細を示すフローチャート図である。
【図１３】本発明の一実施例の計算結果を示す図である。
【符号の説明】
１時系列データ検索装置
１１入力部
１２制御演算部
１３記憶部
１４出力部
５１インデックス
５３多次元インデックス
５５シーケンスファイル

[0001]
[Technical field of the invention]
The present invention relates to a database search technique, and more particularly, to a time-series data search method, a time-series data search device, a time-series data search program, and a program recording medium for searching for time-series data based on dynamic time warping.
[0002]
[Prior art]
Time-series data is expressed as a sequence in which element values are determined along a time axis.
[0003]
As a method for handling such time-series data, a technique called dynamic time warping (DTW, also known as time-axis normalization) is known. Dynamic time warping is a transformation that stretches sequences along the time axis so as to minimize the distance between two sequences.
[0004]
In general, the Euclidean distance has difficulty handling time-series data with different lengths or sampling rates. With dynamic time warping, such time-series data can be handled relatively easily, and the distance between two sequences, that is, their similarity, can be obtained more accurately.
[0005]
The dynamic time warping distance will be described more specifically.
Given a sequence P = {p_0, p_1, ..., p_{n-1}} of length n and a sequence Q = {q_0, q_1, ..., q_{m-1}} of length m, the dynamic time warping distance D_DTW(P, Q) is defined by:
(Equation 1)

Here, g_seg(-1, -1) = 0, and for any integers i and j, g_seg(i, -1) = g_seg(-1, j) = ∞. The second term on the right side of equation (2) denotes the minimum of g_seg(i-1, j), g_seg(i, j-1), and g_seg(i-1, j-1). The value of α on the right side of equation (3) is arbitrary, but the following description assumes α = 2 for convenience.
[0006]
The distance between two sequences P and Q is obtained by matching the elements of each sequence in ascending order. That is, with dynamic time warping, the distance between sequences can be defined even if the two sequences have different lengths. The dynamic time warping distance is computed by dynamic programming, and its computational cost is known to be roughly O(nm). Therefore, the computational cost becomes very large as the sequences grow longer.
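As a minimal sketch of the dynamic programming computation described above, the following Python function evaluates the recurrence as it can be read from the surrounding text: each cell adds a local cost |p_i - q_j|^α (with α = 2) to the minimum of its three predecessor cells, with g_seg(-1, -1) = 0 and every other border cell set to infinity. The function name `dtw_distance` and the table layout are illustrative assumptions; the equations themselves are not reproduced in this text.

```python
import math


def dtw_distance(p, q, alpha=2):
    """Dynamic time warping distance by dynamic programming, O(n*m) time."""
    n, m = len(p), len(q)
    # g[i + 1][j + 1] holds g_seg(i, j); row and column 0 stand for index -1.
    g = [[math.inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0  # g_seg(-1, -1) = 0
    for i in range(n):
        for j in range(m):
            cost = abs(p[i] - q[j]) ** alpha
            # Minimum over g_seg(i-1, j), g_seg(i, j-1), g_seg(i-1, j-1).
            g[i + 1][j + 1] = cost + min(g[i][j + 1], g[i + 1][j], g[i][j])
    return g[n][m]
```

Because the table has (n + 1) × (m + 1) cells and each cell is filled in constant time, the cost grows with the product of the two sequence lengths, which is why the text looks for cheaper approximations.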
[0007]
Conventionally, various techniques for reducing the calculation cost have been disclosed.
[0008]
Among these, the technique disclosed in Non-Patent Document 1 (hereinafter, Conventional Technique 1) approximates the dynamic time warping distance, that is, the distance between two sequences based on dynamic time warping, in order to speed up time-series data search. This technique extracts four values from the elements of a sequence (the first element value in time, the last element value in time, the minimum element value, and the maximum element value), forms a four-dimensional vector from them, and adopts the Euclidean distance between these four-dimensional vectors as an approximation of the distance between the sequences. This approximation is a lower bound of the dynamic time warping distance (it is equal to or less than the exact distance), so using it reduces the number of exact dynamic time warping distance computations without causing search omissions.
[0009]
In the technique disclosed in Non-Patent Document 2 (hereinafter, Conventional Technique 2), a sequence is divided at equal intervals into subsequences, the maximum and minimum values of each subsequence are computed, and the Euclidean distance based on these values is used as an approximation of the dynamic time warping distance. This method is called PCA (Piecewise Constant Approximation).
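To make the feature extraction of Conventional Technique 2 concrete, here is a small sketch that splits a sequence into equal-width subsequences and records the minimum and maximum of each. The function name `piecewise_min_max` and the choice to let the last subsequence absorb any remainder are assumptions made for illustration. The text above does not spell out how the distance is then computed from these values, so that part is deliberately left out.

```python
def piecewise_min_max(seq, num_pieces):
    """Split seq into num_pieces near-equal subsequences; return (min, max) per piece."""
    if not 1 <= num_pieces <= len(seq):
        raise ValueError("num_pieces must be between 1 and len(seq)")
    width = len(seq) // num_pieces
    features = []
    for k in range(num_pieces):
        start = k * width
        # The last piece absorbs the remainder when len(seq) is not divisible.
        end = len(seq) if k == num_pieces - 1 else start + width
        piece = seq[start:end]
        features.append((min(piece), max(piece)))
    return features
```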
[0010]
[Non-patent document 1]
Sang-Wook Kim, Sanghyun Park, and Wesley W. Chu, "An Index-based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases", in Proceedings of ICDE, pp. 607-614 (April 2001).
[0011]
[Non-patent document 2]
Eamonn J. Keogh, "Exact Indexing of Dynamic Time Warping", in Proceedings of VLDB, pp. 406-417 (August 2002).
[0012]
[Problems to be solved by the invention]
However, in Conventional Technique 1 described above, the four-dimensional vector used to obtain the approximation has very few elements compared with the time-series data as a whole. As a result, the accuracy of the approximation is low, and the number of dynamic time warping distance computations cannot be reduced sufficiently.
[0013]
Conventional Technique 2 has the same problem as Conventional Technique 1: the accuracy of the approximation is low, and the number of dynamic time warping distance computations cannot be reduced sufficiently.
[0014]
The present invention has been made in view of these circumstances, and its object is to provide a time-series data search method, a time-series data search device, a time-series data search program, and a program recording medium capable of speeding up time-series data search using dynamic time warping.
[0015]
[Means for Solving the Problems]
In order to achieve the above object, the invention according to claim 1 is a time-series data search method for searching, based on dynamic time warping, time-series data expressed as a sequence in which element values are determined along a time axis, wherein a computer system including a database that stores a plurality of sequences executes: (A) an approximation step of approximating the distance between a sequence to be searched and a sequence read from the database using coefficient values of a discrete Fourier transform; and (B) a distance calculation step of obtaining the dynamic time warping distance between the sequence to be searched and the sequence read from the database when the approximate distance obtained in the approximation step (A) is within a predetermined range.
[0016]
The invention according to claim 2 is the invention according to claim 1, wherein the approximation step (A) and the distance calculation step (B) are repeatedly executed for all the sequences stored in the database, and (C) a search result output step is further executed that, based on the result of this repetition, outputs sequences located in the neighborhood of the sequence to be searched as the search result.
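The two-step structure of claims 1 and 2 (a cheap approximation first, the exact distance only when the approximation falls within range, repeated over the whole database) can be sketched as a k-nearest-neighbor scan. This is an illustration under stated assumptions, not the claimed implementation: the names `knn_search` and `endpoint_lower_bound` are hypothetical, and the approximation used here is a simple stand-in (the larger of the squared differences of the first and last elements, which DTW always matches to each other) rather than the DFT-based approximation of the claims. Any approximation that never exceeds the exact distance can be passed in its place.

```python
import heapq
import math


def dtw_distance(p, q):
    """Exact DTW distance with squared local cost (alpha = 2)."""
    g = [[math.inf] * (len(q) + 1) for _ in range(len(p) + 1)]
    g[0][0] = 0.0
    for i, pi in enumerate(p, start=1):
        for j, qj in enumerate(q, start=1):
            g[i][j] = (pi - qj) ** 2 + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[-1][-1]


def endpoint_lower_bound(p, q):
    """Stand-in approximation (NOT the DFT-based one): a cheap lower bound on DTW."""
    return max((p[0] - q[0]) ** 2, (p[-1] - q[-1]) ** 2)


def knn_search(database, query, k, approx=endpoint_lower_bound):
    """Return (sorted list of (distance, index) for the k nearest, exact DTW call count)."""
    best = []  # max-heap through negated distances: (-distance, index)
    exact_calls = 0
    for idx, seq in enumerate(database):
        # Current k-th nearest distance; infinite until k candidates exist.
        kth = -best[0][0] if len(best) == k else math.inf
        if approx(seq, query) > kth:
            continue  # pruned without computing the exact distance
        d = dtw_distance(seq, query)
        exact_calls += 1
        if len(best) < k:
            heapq.heappush(best, (-d, idx))
        elif d < kth:
            heapq.heapreplace(best, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in best), exact_calls
```

Because the stand-in approximation never exceeds the exact distance, a pruned sequence cannot belong to the result, so the scan returns the same neighbors as a brute-force search while calling `dtw_distance` less often.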
[0017]
The invention according to claim 3 is a time-series data search method for searching, based on dynamic time warping, time-series data expressed as a sequence in which element values are determined along a time axis, wherein a computer system including a database that stores a plurality of sequences executes: (A) a creation step of creating a segmented sequence, which is time-series data segmented by joining elements in order starting from adjacent elements; (B) an approximation step of approximating the distance between a sequence to be searched and a sequence read from the database using the segmented sequences corresponding to the respective sequences created in the creation step (A); and (C) a distance calculation step of obtaining the dynamic time warping distance between the sequence to be searched and the sequence read from the database when the approximate distance obtained in the approximation step is within a predetermined range.
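A rough sketch of the creation step (A) may help. The original description characterizes each segment by its minimum, maximum, and length, bounds the range (maximum minus minimum) of every segment by a threshold r, and joins adjacent elements in ascending order of their value difference. The code below follows that outline; the function name `segment_sequence`, the union-find bookkeeping, and the tie-breaking order among equal differences are assumptions for illustration only.

```python
def segment_sequence(seq, r):
    """Bottom-up segmentation: return a list of (min, max, length), one per segment."""
    n = len(seq)
    parent = list(range(n))  # union-find over element positions
    lo = list(seq)           # minimum of the segment rooted at each position
    hi = list(seq)           # maximum of the segment rooted at each position

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Visit adjacent pairs (i, i + 1), smallest value difference first.
    order = sorted(range(n - 1), key=lambda i: abs(seq[i] - seq[i + 1]))
    for i in order:
        if abs(seq[i] - seq[i + 1]) > r:
            break  # every remaining pair differs by more than r
        a, b = find(i), find(i + 1)
        new_lo, new_hi = min(lo[a], lo[b]), max(hi[a], hi[b])
        if new_hi - new_lo <= r:  # the merged segment must keep its range within r
            parent[b] = a
            lo[a], hi[a] = new_lo, new_hi

    # Only neighbors are ever merged, so segments are contiguous runs.
    segments, start = [], 0
    for i in range(1, n + 1):
        if i == n or find(i) != find(start):
            root = find(start)
            segments.append((lo[root], hi[root], i - start))
            start = i
    return segments
```

For example, `segment_sequence([1, 2, 3, 10, 11, 5], 2)` groups the first three elements, then the next two, and leaves the last element as a segment of length one.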
[0018]
The invention according to claim 4 is the invention according to claim 3, wherein the processing from the creation step (A) to the distance calculation step (C) is repeatedly executed for all the sequences stored in the database, and (D) a search result output step is further executed that, based on the result of this repetition, outputs sequences located in the neighborhood of the sequence to be searched as the search result.
[0019]
The invention according to claim 5 is a time-series data search method for searching, based on dynamic time warping, time-series data expressed as a sequence in which element values are determined along a time axis, wherein a computer system including a database that stores a plurality of sequences executes: (A) a first approximation step of approximating the distance between a sequence to be searched and a sequence read from the database using coefficient values of a discrete Fourier transform; (B) a creation step of creating, when the approximate distance obtained in the first approximation step (A) is within a predetermined range, a segmented sequence, which is time-series data segmented by joining elements in order starting from adjacent elements; (C) a second approximation step of approximating the distance between the sequence to be searched and the sequence read from the database using the segmented sequences corresponding to the respective sequences created in the creation step (B); and (D) a distance calculation step of obtaining the dynamic time warping distance between the sequence to be searched and the sequence read from the database when the approximate distance obtained in the second approximation step (C) is within a predetermined range.
[0020]
The invention according to claim 6 is the invention according to claim 5, wherein the processing from the first approximation step (A) to the distance calculation step (D) is repeatedly executed for all the sequences stored in the database, and (E) a search result output step is further executed that outputs, based on the result of the repetition, the sequences located in the vicinity of the search target sequence as the search result.
[0021]
The invention according to claim 7 is a time-series data search device that searches, based on dynamic time warping, time-series data expressed as a sequence in which element values are determined along a time axis. The device comprises: storage means for storing a plurality of sequences; approximation means for approximating the distance between the search target sequence and a sequence stored in the storage means by using coefficient values of a discrete Fourier transform; and distance calculation means for obtaining the dynamic time warping distance between the search target sequence and the sequence read from the storage means when the approximate distance obtained by the approximation means is within a predetermined range.
[0022]
The invention according to claim 8 is a time-series data search device that searches, based on dynamic time warping, time-series data expressed as a sequence in which element values are determined along a time axis. The device comprises: storage means for storing a plurality of sequences; creation means for creating, for each sequence, a segmented sequence, that is, time-series data segmented by joining adjacent elements in order; approximation means for approximating the distance between the search target sequence and a sequence stored in the storage means by using the segmented sequences corresponding to the respective sequences created by the creation means; and distance calculation means for obtaining the dynamic time warping distance between the search target sequence and the sequence read from the storage means when the approximate distance obtained by the approximation means is within a predetermined range.
[0023]
The invention according to claim 9 is a time-series data search device that searches, based on dynamic time warping, time-series data expressed as a sequence in which element values are determined along a time axis. The device comprises: storage means for storing a plurality of sequences; first approximation means for approximating the distance between the search target sequence and a sequence stored in the storage means by using coefficient values of a discrete Fourier transform; creation means for creating, when the approximate distance obtained by the first approximation means is within a predetermined range, a segmented sequence, that is, time-series data segmented by joining adjacent elements in order; second approximation means for approximating the distance between the search target sequence and the sequence read from the storage means by using the segmented sequences corresponding to the respective sequences created by the creation means; and distance calculation means for obtaining the dynamic time warping distance between the search target sequence and the sequence stored in the storage means when the approximate distance obtained by the second approximation means is within a predetermined range.
[0024]
The invention according to claim 10 is a time-series data search program that causes a computer to execute the time-series data search method according to any one of claims 1 to 6.
[0025]
The invention according to claim 11 is a program recording medium on which is recorded a time-series data search program for causing a computer to execute the time-series data search method according to any one of claims 1 to 6.
[0026]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.
[0027]
(1st Embodiment)
FIG. 1 is a functional block diagram showing the schematic configuration of a time-series data search device, which is a computer system that executes the time-series data search method according to the first embodiment of the present invention. The time-series data search device 1 shown in FIG. 1 includes at least: an input unit 11 composed of input devices such as a keyboard and a mouse for inputting various data; a control operation unit 12 that has a central processing unit (CPU) and performs the control and computation for the various processes described later; a storage unit 13 that stores input information from the input unit 11, computation results from the control operation unit 12, and the like; and an output unit 14 composed of an output device such as a (liquid crystal) display screen for outputting the information stored in the storage unit 13.
[0028]
The storage unit 13, which forms at least a part of the storage means, includes a main storage device including a random access memory (RAM), and auxiliary storage devices such as a hard disk drive, a flexible disk drive, a CD-ROM (Compact Disc Read Only Memory) drive, a DVD (Digital Versatile Disc) drive, a magneto-optical disk drive, and a PC card drive. It functions as a database that manages and stores the sequence information necessary for data search. A memory area for storing, as needed, the calculation results described later is also secured.
[0029]
Note that the present embodiment also covers the case in which the various processes are performed not by a single electronic device but by a system constructed from two or more electronic devices that divide the execution of the steps among themselves as appropriate. In this sense, the “time-series data search device” according to the present embodiment is configured by one or more computers (systems). This point is common to all embodiments of the present invention.
[0030]
Next, the details of the time-series data search process executed by the time-series data search device 1 having the above configuration will be described.
[0031]
First, a method of approximating the dynamic time warping distance (distance approximation method) will be described, and then a time-series data search method using this distance approximation method will be described.
[0032]
<Distance approximation method>
In the time-series data search method according to the present embodiment, an approximate distance is calculated using a discrete Fourier transform (DFT) in order to reduce, as far as possible, the number of times the dynamic time warping distance D_DTW must be calculated.
[0033]
Given a sequence P = {p_0, p_1, ..., p_{n-1}} of length n, each component (DFT coefficient) of the discrete Fourier transform of this sequence, F(P) = {f_0(P), f_1(P), ..., f_{n-1}(P)}, is defined by the following equation:
(Equation 2)

In the present embodiment, each element of the sequence P is assumed to be a real number. Consequently, for k ≠ 0, f_{n-k}(P) is the complex conjugate of f_k(P), and f_0(P) is a real number according to equation (4).
[0034]
Conversely, when the discrete Fourier transform F(P) is given, the elements of the sequence P can be obtained by the inverse discrete Fourier transform (IDFT) of F(P):
(Equation 3)

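The conjugate symmetry and the DFT/IDFT round trip described above can be checked numerically. The following is only a minimal sketch, not the implementation of this specification: equations (4) and (5) are not reproduced in this text, so the unitary scaling 1/sqrt(n) and the function names dft and idft are assumptions introduced for illustration.

```python
import cmath
import math


def dft(p):
    """DFT of a real sequence p (the 1/sqrt(n) scaling is an assumption)."""
    n = len(p)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(p[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(f):
    """Inverse transform matching dft() above."""
    n = len(f)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(f[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n))
        for t in range(n)
    ]


# Hypothetical example data.
P = [1.0, 3.0, 2.0, 5.0, 4.0]
F = dft(P)
P_back = idft(F)
```

With a real-valued P, F[0] has a vanishing imaginary part, F[n-k] equals the conjugate of F[k] up to rounding, and P_back reproduces P up to rounding. This symmetry is why only part of the coefficients needs to be kept in a feature vector.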
[0035]
Using the discrete Fourier transform defined above, a method of approximating the distance between a sequence to be searched (called an inquiry sequence) Q = {q_0, q_1, ..., q_{n-1}} and an arbitrary sequence P will be described. The approximate distance is obtained according to the positional relationship between the two sequences. In the following description, min(P) and min(Q) denote the minimum values of the sequences P and Q, respectively, and max(P) and max(Q) denote their maximum values.
[0036]
As shown in the conceptual diagram of FIG. 2, there are three cases of the positional relationship between the two sequences P and Q:
(1) min(P) ≧ max(Q) (FIG. 2(a); this arrangement is called “disjoint”)
(2) min(P) < max(Q) and max(P) > min(Q) (FIG. 2(b); this arrangement is called “overlap”)
(3) max(P) ≦ min(Q) (FIG. 2(c); this arrangement is called “disjoint (−)”)
Hereinafter, the approximate distance is calculated for each of these three positional relationships.
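The three cases above can be written as a small classification function. This is a sketch for illustration; the function name positional_relationship and the returned labels are assumptions, and when both boundary conditions hold at once (for example two identical constant sequences) the sketch simply reports the first matching case.

```python
def positional_relationship(p, q):
    """Classify sequences p and q by the value ranges they occupy."""
    if min(p) >= max(q):
        return "disjoint"      # case (1): P lies entirely at or above Q
    if max(p) <= min(q):
        return "disjoint(-)"   # case (3): P lies entirely at or below Q
    return "overlap"           # case (2): the value ranges intersect


# Hypothetical example data.
above = positional_relationship([5, 6, 7], [1, 2, 3])
mixed = positional_relationship([1, 5, 2], [3, 8, 4])
below = positional_relationship([1, 2, 3], [5, 6, 7])
```

Case (2) needs no explicit test because it is exactly the complement of cases (1) and (3).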
[0037]
(1) disjoint
For the inquiry sequence Q, a lower limit sequence Q_lower = {ql_0, ql_1, ..., ql_{n-1}} and an upper limit sequence Q_upper = {qu_0, qu_1, ..., qu_{n-1}} are obtained. The elements of the two sequences are defined for i = 0, 1, ..., n−1 as follows.
(Equation 4)

Here, min(q_i : q_{i+j}) and max(q_i : q_{i+j}) denote the minimum and maximum values, respectively, of the partial sequence {q_i, q_{i+1}, ..., q_{i+j}}.
[0038]
The quantity l_ext appearing on the right side of equations (6) and (7) represents the maximum extension when a sequence is stretched by dynamic time warping. In other words, the extension of a sequence by dynamic time warping is limited to at most l_ext. When min(P) ≧ 0 and min(Q) ≧ 0, this l_ext is given by
(Equation 5)

Here, the symbol [x] used on the right side means the largest integer that does not exceed the value of x.
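To make the role of the extension l_ext concrete, the lower and upper limit sequences can be sketched as sliding minimum and maximum values of Q. Equations (6) and (7) are not reproduced in this text, so the window used here (from q_i to q_{i+l_ext}, truncated at the end of the sequence) is an assumption based only on the notation min(q_i : q_{i+j}); the function name envelopes is likewise hypothetical.

```python
def envelopes(q, l_ext):
    """Return (lower, upper) limit sequences of q for extension l_ext.

    Assumption: element i covers q[i] .. q[i + l_ext]; the window is cut
    off at the end of the sequence.
    """
    n = len(q)
    lower = [min(q[i:i + l_ext + 1]) for i in range(n)]
    upper = [max(q[i:i + l_ext + 1]) for i in range(n)]
    return lower, upper


# Hypothetical example data.
Q = [3, 1, 4, 1, 5]
Q_lower, Q_upper = envelopes(Q, l_ext=1)
```

Whatever the exact window, the inquiry sequence always lies between the two limit sequences, and a larger l_ext can only widen the band between them.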
[0039]
On the right side of equation (8),
(Equation 6)

is the squared Euclidean distance between the sequences P and Q. w is the width of the dynamic time warping, and w = n−1 unless otherwise specified. E(P) and E(Q) are the energies of the sequences P and Q, respectively, given by:
(Equation 7)

Also, δ(P, Q) = (min(P) − max(Q))².
[0040]
Here, when min(P) < min(Q) and min(P) < 0, E(P) in equation (10) and E(Q) in equation (11) are replaced by the following E′(P) and E′(Q), respectively.
(Equation 8)

[0041]
When min(P) > min(Q) and min(Q) < 0, l_ext can be obtained by replacing min(P) in equations (12) and (13) with min(Q).
[0042]
If the sequence P is expanded using all DFT coefficients, the squared Euclidean distance D²_Euclid(P, Q) is given by equation (9). If the expansion uses only (m + 1) (m ≦ n−1) of the DFT coefficients, D_modified(P, Q) defined below is used in equation (6) in place of D²_Euclid(P, Q):
(Equation 9)

[0043]
As described above, the approximate distance of the dynamic time warping distance D_DTW(P, Q) in the “disjoint” case is obtained as follows.
(Equation 10)

[0044]
FIG. 3 is an explanatory diagram conceptually showing the calculation of D_disjoint(P, Q). In the figure, the inquiry sequence Q is enclosed by the lower limit sequence Q_lower and the upper limit sequence Q_upper. When the positional relationship between P and Q is disjoint (min(P) ≧ max(Q)), Q_upper is used to calculate the approximate distance (see equation (15)). In this case, D_disjoint(P, Q) is a lower bound of the dynamic time warping distance D_DTW(P, Q); that is, it takes a value no greater than D_DTW(P, Q).
[0045]
(2) overlap
As shown in FIG. 2(b), when the two sequences P and Q overlap at least partially, the maximum extension is l_ext = w. Using, in addition to Q_lower and Q_upper described above, the quantities
(Equation 11)

the approximate distance D_overlap(P, Q) in the overlap case can be determined as follows:
(Equation 12)

[0046]
(3) disjoint (-)
This is essentially the same as case (1) disjoint above, and the approximate distance D_{disjoint (−)}(P, Q) is defined by:
(Equation 13)

[0047]
In the disjoint (−) case, δ(P, Q) in equation (8) is δ(P, Q) = (min(Q) − max(P))², but the other quantities are the same as in the disjoint case described above.
[0048]
<Time series data search method>
In the present embodiment, a DFT feature vector based on a discrete Fourier transform is used as a feature amount for speeding up the time-series data search processing.
[0049]
The DFT feature vector V (P) of the sequence P is defined as:
V (P) = ｛F (P), min (P), max (P), E_rest(P), γ (P)｝ (21)
Here, assuming that the discrete Fourier transform F (P) of P has (m + 1) dimensions, the dimension of this DFT feature vector V (P) is 2m + 5.
[0050]
Here, as a query for such multidimensional data, a k-nearest-neighbor query is employed, which searches for sequences whose approximate distance is less than or equal to the k-nearest-neighbor distance (k is a predetermined positive integer). The algorithm of the time-series data search is essentially the same when, for example, a range query is used instead. The same applies to the embodiments described later.
[0051]
FIG. 4 is a flowchart illustrating a flow of a neighborhood search process using a discrete Fourier transform in the time-series data search method according to the present embodiment.
[0052]
First, an inquiry sequence Q is input, a DFT feature vector V (Q) of the sequence Q is calculated, and the number k of neighbor searches is input (step S101).
[0053]
Thereafter, the value of a counter i for counting the sequence is set to 0 (step S102).
[0054]
Next, a loop L11 is executed.
[0055]
In the loop L11, first, the value of the counter i is increased by 1 (step S111), the sequence P and the DFT feature vector V(P) of this sequence are loaded from the storage unit 13 (step S112), and the approximate distance D₁ is calculated using them (step S113).
[0056]
In this step, one of D_disjoint(P, Q), D_overlap(P, Q), and D_{disjoint (−)}(P, Q) is calculated as the approximate distance D₁. However, if the full calculation for obtaining the approximate distance is performed for every sequence, a large CPU cost is required. Therefore, before accessing the sequences stored in the storage unit 13, the control operation unit 12 can prepare various sequence extensions l_1, l_2, ..., l_h for the inquiry sequence Q, together with the upper and lower limit sequences for these h extensions.
[0057]
As an example, when D_disjoint(P, Q) is calculated for an extension l_ext, the smallest prepared extension l_i satisfying l_i ≧ l_ext is selected and denoted l′_ext:
[Equation 14]

When the control operation unit 12 calculates D_disjoint(P, Q), the upper and lower limit sequences for this l′_ext are used. The same applies to the other approximate distances.
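Selecting the smallest prepared extension that is at least l_ext is a plain lookup in a sorted list. The sketch below assumes the h prepared extensions are kept in ascending order; the function name select_extension and the decision to raise an error when no prepared extension is large enough are assumptions, since the text does not say how that case is handled.

```python
from bisect import bisect_left


def select_extension(prepared, l_ext):
    """Return the smallest l_i in the ascending list `prepared` with l_i >= l_ext."""
    idx = bisect_left(prepared, l_ext)
    if idx == len(prepared):
        # Assumption: the prepared list is expected to cover every needed extension.
        raise ValueError("no prepared extension is large enough")
    return prepared[idx]


# Hypothetical prepared extensions l_1 .. l_h.
prepared_extensions = [2, 4, 8, 16]
chosen = select_extension(prepared_extensions, 5)
```

Because the chosen extension is never smaller than the required one, the limit sequences prepared for it enclose those that would have been computed for l_ext itself, which trades some tightness for the saved computation.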
[0058]
If D₁ obtained as a result of the calculation in step S113 is less than or equal to the k-nearest-neighbor distance (YES in step S114), the exact dynamic time warping distance D₃ = D_DTW(P, Q) is obtained (step S115).
[0059]
If D₃ obtained in step S115 is less than or equal to the k-nearest-neighbor distance (YES in step S116), the ID of the sequence P and the distance D₃ are stored in the nearest neighbor list of the storage unit 13, and the list is sorted (step S117).
[0060]
After step S117 ends, or in the case of NO in step S114 or step S116, the process returns to step S111 and is repeated until the number i of sequences processed so far reaches the number ent of sequences stored in the storage unit 13 (i >= ent). That is, the processing of the loop L11 is repeatedly executed for each of the sequences stored in the storage unit 13.
[0061]
Needless to say, in this loop L11, the determination in steps S114 and S116 is always YES for the first k times.
[0062]
After the loop L11 ends, the candidate sequence stored in the nearest neighbor list is output as a final search result (step S121).
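The loop of FIG. 4 follows a filter-and-refine pattern, which can be sketched as below. This is an illustration under stated assumptions, not the method of this specification: the exact distance is written as a cumulative DTW with squared element cost (α = 2), and the DFT-based approximate distance is replaced by a much simpler stand-in lower bound that uses only the first and last elements. The names dtw, endpoint_lower_bound and knn_search are hypothetical.

```python
import math


def dtw(p, q):
    """Exact DTW distance with squared element cost, by dynamic programming."""
    n, m = len(p), len(q)
    g = [[math.inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (p[i - 1] - q[j - 1]) ** 2
            g[i][j] = cost + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]


def endpoint_lower_bound(p, q):
    """Stand-in for D1: every warping path contains the first and last cell."""
    first = (p[0] - q[0]) ** 2
    if len(p) == 1 and len(q) == 1:
        return first
    return first + (p[-1] - q[-1]) ** 2


def knn_search(query, database, k, lower_bound=endpoint_lower_bound):
    nearest = []      # sorted list of (distance, sequence ID), at most k long
    exact_calls = 0
    for seq_id, p in enumerate(database):
        # Until k candidates exist, the k-nearest-neighbor distance is infinite,
        # so the first k sequences always pass both tests.
        kth = nearest[k - 1][0] if len(nearest) >= k else math.inf
        if lower_bound(p, query) > kth:   # NO in step S114: prune
            continue
        d3 = dtw(p, query)                # step S115
        exact_calls += 1
        if d3 <= kth:                     # YES in step S116
            nearest.append((d3, seq_id))  # step S117
            nearest.sort()
            del nearest[k:]
    return nearest, exact_calls


# Hypothetical example data.
database = [[0, 1, 2, 3], [10, 11, 12, 13], [0, 1, 1, 3], [5, 5, 5, 5], [0, 2, 2, 3]]
query = [0, 1, 2, 3]
result, exact_calls = knn_search(query, database, k=2)
```

A sequence is skipped only when its lower bound already exceeds the current k-nearest-neighbor distance, so its exact distance must exceed it as well. That is why pruning of this kind cannot drop a true answer, provided the approximate distance never exceeds the exact one.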
[0063]
The approximate distance D₁ and the dynamic time warping distance D₃ described here are calculated by the control operation unit 12 of the time-series data search device 1. Therefore, the control operation unit 12 functions both as approximation means that approximates the distance between the inquiry sequence Q to be searched and a sequence P stored in the storage unit 13 by using DFT coefficient values, and as distance calculation means that obtains the dynamic time warping distance between P and Q when the approximate distance is within the predetermined range.
[0064]
The approximate distance is cheaper to calculate than the exact dynamic time warping distance. Therefore, as described above, the approximate distance is calculated first. If it is larger than the k-nearest-neighbor distance at that time, the sequence can be excluded without performing the exact calculation, and the exact calculation is performed only for sequences whose approximate distance does not exceed that distance.
[0065]
According to the first embodiment of the present invention described above, it is possible to speed up the search for time-series data based on dynamic time warping without causing search omission.
[0066]
The time-series data search method according to the present embodiment is applicable to a wide range of time-series data search for images, videos, sounds, documents, and the like that can be expressed as time-series data.
[0067]
Note that the same effect can be obtained even if the time-series data search processing according to the present embodiment is performed using a predetermined computer in which a time-series data search program is installed.
[0068]
Further, a computer-readable program recording medium storing such a time-series data search program may be mounted on a computer, and the computer may execute the above-described processing by reading out the program stored in the program recording medium. Here, as the “computer-readable” program recording medium, a hard disk, a flexible disk, a CD-ROM, a DVD, a magneto-optical disk, a PC card, or the like can be used. By providing such a program recording medium, the time-series data search program of the present embodiment can be widely distributed.
[0069]
These remarks on the time-series data search program and on the program recording medium on which the program is recorded apply to all embodiments of the present invention.
[0070]
(Modification of First Embodiment)
The present embodiment deals with the case where the two sequences P and Q have equal length. However, to handle sequences of different lengths, it is also possible to adjust all the sequences to the same length.
[0071]
In this case, a reference length n_base is first determined. Sequences longer than the reference length n_base are reduced, while sequences shorter than n_base are extended. After such adjustment, the two sequences have apparently equal lengths, so the approximate distance can be obtained by the same method as in the first embodiment.
[0072]
A method for adjusting the sequence length will be described in more detail.
[0073]
First, using (the elements of) P, the lower limit sequence P_lower and the upper limit sequence P_upper of P are calculated as follows:
(Equation 15)

Here, the element pl_i of the lower limit sequence P_lower and the element pu_i of the upper limit sequence P_upper (both for i = 0, 1, ..., n_base) are given by
(Equation 16)

and the length of these sequences is the reference length n_base.
[0074]
By using at least one of P_lower and P_upper calculated as above in place of the original sequence P, the approximate distance of dynamic time warping is obtained.
[0075]
The length of the inquiry sequence Q is also adjusted. First, Q_lower and Q_upper are created as in equations (17) and (18). Then, the lower limit sequence Q′_lower = {ql′_0, ql′_1, ..., ql′_{n-1}} is calculated from Q_lower, and the upper limit sequence Q′_upper = {qu′_0, qu′_1, ..., qu′_{n-1}} is calculated from Q_upper, where
[Equation 17]

The definitions on the right side of equations (24) and (25) are the same as those on the right side of equations (6) and (7), respectively.
[0076]
Here, if the positional relationship between P and Q is disjoint, l_ext is calculated by substituting P_lower for P and Q_upper for Q in equation (8). When the positional relationship between P and Q is overlap, the extension width of the sequence is l_ext = w. When the positional relationship between P and Q is disjoint (−), l_ext is calculated by substituting P_upper for P and Q_lower for Q in equation (8).
[0077]
As for the approximate distance of dynamic time warping, P_lower and Q′_upper are used to calculate it in the disjoint case, and P_upper and Q′_lower are used in the disjoint (−) case. All four quantities are used in the overlap case.
[0078]
The time-series data search method except for the above points is the same as in the first embodiment.
[0079]
Therefore, even in such a modified example of the first embodiment, the same effect as that of the first embodiment can be obtained.
[0080]
(Second embodiment)
The time-series data search method according to the second embodiment of the present invention is characterized in that a segmented sequence is used to calculate an approximate value of the dynamic time warping distance.
[0081]
The basic configuration of the time-series data search device according to the present embodiment is the same as that described in the first embodiment (see FIG. 1).
[0082]
<Segmented sequence>
First, the segmented sequence will be described.
[0083]
Given a sequence P of length n and a range r, the segmented sequence S_r is defined as follows:
[0084]
S_r = {s_0, s_1, ..., s_{n_s−1}} (26)
s_i = {sl_i, su_i, sr_i}
(n_s ≦ n, 1 ≦ sr_i ≦ n, su_i − sl_i ≦ r)
Here, sl_i is the minimum value of segment s_i in the segmented sequence S_r, and su_i is the maximum value of segment s_i. Therefore, su_i − sl_i is the range of s_i, and su_i − sl_i ≦ r. Also, sr_i is the length of s_i, and n_s is the number of segments in S_r.
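The definition above maps directly onto a small record type with a validity check. The following sketch is only illustrative: the names Segment and is_valid_segmented_sequence are hypothetical, and the requirement that the segment lengths sum to n is an assumption (the text states only the three constraints in parentheses, but the segments are described as covering the whole sequence).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    sl: float  # minimum element value in the segment
    su: float  # maximum element value in the segment
    sr: int    # number of elements the segment covers


def is_valid_segmented_sequence(segments, n, r):
    """Check the constraints n_s <= n, 1 <= sr_i <= n and su_i - sl_i <= r."""
    return (
        0 < len(segments) <= n
        and all(1 <= s.sr <= n and s.su - s.sl <= r for s in segments)
        and sum(s.sr for s in segments) == n  # assumption: segments cover P exactly
    )


# Hypothetical example: a sequence of length 6 summarized by three segments.
example = [Segment(1.0, 1.2, 3), Segment(5.0, 5.3, 2), Segment(9.0, 9.0, 1)]
example_ok = is_valid_segmented_sequence(example, n=6, r=0.5)
```

Each segment keeps only three numbers regardless of how many elements it covers, which is what makes a distance calculation over segments cheaper than one over the original elements.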
[0085]
FIG. 5 is a flowchart illustrating an algorithm for creating a segmented sequence.
[0086]
First, the sequence length n, the range r, and the elements p_i (i = 0, 1, ..., n−1) of the sequence P are input (step S201), and the counter i is set to 0 (step S202).
[0087]
Next, a loop L21 is executed.
[0088]
In this loop L21, the difference |p_i − p_{i+1}| between each i-th element p_i and the adjacent (i+1)-th element p_{i+1} is defined as LIST.diff_i, up to i = n−1 (steps S211 and S212).
[0089]
After the loop L21 ends, the LIST.diff_i stored in the storage unit 13 are sorted in ascending order (step S221).
[0090]
Thereafter, the counter i is reset to 0 (step S222), and the flow shifts to loop L23.
[0091]
In the loop L23, pairs of elements in adjacent sequences are combined in order from the one with the smallest difference in element values to form a segment (step S232).
[0092]
Then, adjacent segments are combined into a larger segment (steps S231, S232, S233, and S234).
[0093]
When the difference between adjacent elements or segments exceeds the threshold value r, the loop L23 ends.
[0094]
Thereafter, the segment and the counter are returned to the initial values again (step S241), and the flow shifts to loop L25.
[0095]
In the loop L25, the number of segments generated sequentially from the first segment is counted. This process continues until the sequence length n is reached (step S251).
[0096]
The final value of i is taken as n_s (step S261), and the segmented sequence S_r and the number of segments n_s are output as the result (step S262).
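The idea of joining adjacent elements, closest first, until the range threshold r would be exceeded can be sketched as a greedy bottom-up merge. This is one possible reading of the procedure and not a transcription of the flowchart of FIG. 5: instead of sorting the differences once, the sketch rescans the adjacent pairs after every merge, which is simpler to state but slower. The function name segment_sequence is hypothetical.

```python
def segment_sequence(p, r):
    """Greedily merge adjacent segments while the merged range stays <= r.

    Returns a list of (sl, su, sr) tuples: minimum, maximum and length.
    """
    segments = [[x, x, 1] for x in p]  # start with one segment per element
    while len(segments) > 1:
        best_index, best_range = None, None
        for i in range(len(segments) - 1):
            lo = min(segments[i][0], segments[i + 1][0])
            hi = max(segments[i][1], segments[i + 1][1])
            if best_range is None or hi - lo < best_range:
                best_index, best_range = i, hi - lo
        if best_range > r:
            break  # even the closest adjacent pair would exceed the threshold
        a, b = segments[best_index], segments[best_index + 1]
        merged = [min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2]]
        segments[best_index:best_index + 2] = [merged]
    return [tuple(s) for s in segments]


# Hypothetical example data.
P = [1.0, 1.2, 1.1, 5.0, 5.3, 9.0]
S_r = segment_sequence(P, r=0.5)
n_s = len(S_r)
```

For this input the three values near 1 and the two values near 5 each collapse into one segment, while 9.0 stays on its own because joining it to its neighbor would give a range far above r.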
[0097]
FIG. 6 is a diagram showing an example of the segmented sequence S_r generated in this manner.
[0098]
The segmented sequence S_r shown in FIG. 6 is composed of three segments s_0, s_1, and s_2. The range su_i − sl_i of each of these segments (i = 0, 1, 2) is smaller than the predetermined range (threshold) r.
[0099]
<Distance approximation method>
The time-series data search method according to the present embodiment performs approximation by using a segmented sequence.
[0100]
Given a segmented sequence S = {s_0, s_1, ..., s_{n_s−1}} with n_s segments and a segmented sequence T = {t_0, t_1, ..., t_{n_t−1}} with n_t segments, the approximate value D_segment(S, T) is obtained by the following calculation.
[0101]
(Equation 18)

[0102]
Here, g_ur(−1, −1) = 0, and for any integers i and j, g_lr(i, −1) = g_ur(i, −1) = g_ul(−1, j) = g_ur(−1, j) = ∞. The value of α on the right side of the defining expression of g_seg(i, j) is arbitrary, but it must at least be the same value as α in the squared Euclidean distance described above; here α = 2. Of course, the present embodiment is not limited to this value. Note that g_ll, g_lr, g_ul, and g_ur are the calculation results at the lower left, lower right, upper left, and upper right of each segment, respectively.
[0103]
FIG. 7 is a diagram conceptually showing the calculation of the approximate distance between the sequences S and T performed using segmented sequences. In the case shown in the figure, the numbers of segments in the two sequences S and T are n_s = 3 and n_t = 5. Therefore, the approximate value of the distance between the sequences in this case is D_segment(S, T) = g_ur(2, 4).
[0104]
<Time series data search method>
FIG. 8 is a flowchart illustrating the flow of the time-series data search process using segmented sequences in the time-series data search method. A k-nearest-neighbor query is employed in this search process as well. As in the first embodiment, the algorithm of the time-series data search is essentially the same when, for example, a range query is used instead.
[0105]
First, an inquiry sequence Q is input, a segmented sequence of this Q is calculated according to the above-described method, and the number of sequences stored in the storage unit 13 is input (step S301).
[0106]
Thereafter, the value of a counter i for counting the sequence is set to 0 (step S302).
[0107]
Next, a loop L31 is executed.
[0108]
In the loop L31, first, the value of the counter i is increased by 1 (step S311), the segmented sequence of the sequence P is loaded from the storage unit 13 (step S312), and the approximate distance D₂ = D_segment(P, Q) is calculated using the segmented sequences of P and Q (step S313).
[0109]
If D₂ obtained in step S313 is less than or equal to the k-nearest-neighbor distance (YES in step S314), the dynamic time warping distance D₃ = D_DTW(P, Q), which is the exact distance, is obtained (step S315).
[0110]
If D₃ obtained in this step is less than or equal to the k-nearest-neighbor distance (YES in step S316), the ID of the sequence P and the distance D₃ are stored in the nearest neighbor list of the storage unit 13, and the list is sorted (step S317).
[0111]
After step S317 is completed, or if the determination in step S314 or step S316 is NO, the process returns to step S311 and is repeated until the number i of sequences processed so far reaches the number ent of sequences stored in the storage unit 13 (i >= ent). That is, the processing of the loop L31 is performed for each of the sequences stored in the storage unit 13.
[0112]
Also in this loop L31, the determination in steps S314 and S316 is always YES for the first k times.
[0113]
After the loop L31 ends, the candidate sequence stored in the nearest neighbor list is output as a final search result (step S321).
[0114]
The approximate distance D₂ and the dynamic time warping distance D₃ described here are calculated by the control operation unit 12 of the time-series data search device 1. Therefore, in the present embodiment, the control operation unit 12 functions as creation means that creates, for each of P and Q, a segmented sequence, that is, time-series data segmented by joining elements in order starting from the closest ones; as approximation means that calculates an approximate value of the distance between P and Q by using the segmented sequences corresponding to the respective sequences; and as distance calculation means that obtains the dynamic time warping distance between P and Q when the approximate distance is within a predetermined range.
[0115]
According to the second embodiment of the present invention described above, the same effects as those of the first embodiment can be obtained.
[0116]
(Third embodiment)
The time-series data search method according to the third embodiment of the present invention uses both the distance approximation methods (DFT, segmented sequence) of the above-described two embodiments.
[0117]
The configuration of the time-series data search device according to the present embodiment is the same as that of the time-series data search device 1 shown in FIG. 1.
[0118]
For this purpose, in the present embodiment, an index is configured. FIG. 9 is an explanatory diagram conceptually showing the index and showing constituent elements of the index. As shown in the figure, the index 51 is composed of a multidimensional index 53 and a sequence file 55.
[0119]
The multidimensional index 53 is composed of DFT feature vectors V(P) (see equation (15)) and enclosing rectangles, while the sequence file 55 contains the sequences P and their segmented sequences (see equation (20)).
[0120]
In the present embodiment, the multidimensional index 53 is described on the assumption that each DFT feature vector V(P) is regarded as a point in a (2m + 5)-dimensional space, and that the enclosing rectangles, formed in advance by enclosing these points with hyper-rectangles, are organized as the nodes of a search tree.
[0121]
In this virtual search tree, branches extend from the root node at the top layer, and each node corresponds to an enclosing rectangle. Among the nodes other than the root node, a node that stores the points (that is, the DFT feature vectors) enclosed by one enclosing rectangle is called a leaf node. A balanced tree structure, in which every branch ends at the same depth (number of levels), is assumed here, but this is of course only an example.
[0122]
<Time-series data search processing>
FIG. 10, FIG. 11, and FIG. 12 are flowcharts illustrating the flow of the time-series data search process according to the present embodiment. These figures show a search algorithm using the multidimensional index. As with the approximate distance calculation using the DFT feature vector V(P), the approximate distance between an enclosing rectangle containing DFT feature vectors V(P) and the inquiry sequence Q is calculated using one of equations (17), (19), and (20).
[0123]
First, the processing will be described with reference to FIG. 10.
[0124]
Lower limit and upper limit sequences for h extensions of the inquiry sequence Q are created, their DFT feature vectors are calculated, and a pointer to the root node together with the distance 0 is placed in the queue in the control operation unit 12 (step S401).
[0125]
Next, the processing of the loop L41 is performed.
[0126]
First, the node N closest to the inquiry point (inquiry sequence Q) is extracted from the queue, and the number of entries in the node indicated by N is input (step S411).
[0127]
Thereafter, it is determined whether or not the node N is a leaf node (step S412). If the node N is not a leaf node (NO in step S412), the process A shown in FIG. 11 is executed. On the other hand, if the node N is a leaf node (YES in step S412), the process B is executed (step S415). Processes A and B will be described later.
[0128]
After these processes, the loop L41 is repeatedly executed until the queue becomes empty, and then the candidate sequence stored in the nearest neighbor list is output as a search result (step S421).
[0129]
Here, the details of the process A will be described with reference to the flowchart of FIG. 11.
[0130]
First, the counter i is initialized to 0 (step A1), and the loop LA1 is executed. In the loop LA1, after the counter i is set (step A2), the approximate distance D₁ is calculated using the DFT feature vector stored in the i-th entry of the node N (steps A3 and A4). As the approximate distance D₁, one of D_disjoint(P, Q), D_overlap(P, Q), and D_{disjoint (−)}(P, Q) is calculated.
[0131]
If D₁ is less than or equal to the k-nearest-neighbor distance (YES in step A5), the child pointer stored in the i-th entry of the node N and the approximate distance D₁ are stored in the queue (step A6).
[0132]
If D₁ is larger than the k-nearest-neighbor distance (NO in step A5), the process returns to step A2 and is repeated.
[0133]
In this way, in the loop LA1, the calculation is performed for all entries (all data objects) stored in the node N. After the loop LA1, the head data of the queue is deleted, the remaining data is shifted forward, and the queue is sorted in ascending order of the approximate distance D₁ (step A7).
[0134]
After the above processing is completed, the processing returns to the flowchart of FIG. 10.
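The queue handling of loop L41 and process A can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `Node` class, the one-dimensional bounding interval, and the `interval_distance` function (a toy stand-in for the DFT-based approximate distance D₁) are hypothetical, the k-nearest-neighbor distance is held fixed, and Python's `heapq` replaces the explicit delete, left-justify, and sort operations of step A7.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical index node with a 1-D bounding interval [low, high]."""
    is_leaf: bool
    low: float
    high: float
    children: List["Node"] = field(default_factory=list)


def interval_distance(q: float, node: Node) -> float:
    """Toy stand-in for D1: distance from q to the node's bounding interval."""
    if q < node.low:
        return node.low - q
    if q > node.high:
        return q - node.high
    return 0.0


def process_a(queue, node, q, k_dist, tie):
    """Steps A2 to A6: enqueue every child whose distance is within k_dist."""
    for child in node.children:
        d1 = interval_distance(q, child)
        if d1 <= k_dist:
            # The tie counter keeps heap entries comparable when d1 is equal.
            heapq.heappush(queue, (d1, next(tie), child))


def leaves_in_visit_order(root, q, k_dist):
    """Loop L41: repeatedly take the closest node until the queue is empty."""
    tie = itertools.count(1)
    queue = [(0.0, 0, root)]           # step S401: root pointer, distance 0
    visited = []
    while queue:
        _, _, node = heapq.heappop(queue)   # step S411: closest node first
        if node.is_leaf:
            visited.append(node)            # process B would run here
        else:
            process_a(queue, node, q, k_dist, tie)
    return visited


leaf1 = Node(is_leaf=True, low=0.0, high=1.0)
leaf2 = Node(is_leaf=True, low=5.0, high=6.0)
leaf3 = Node(is_leaf=True, low=20.0, high=30.0)
root = Node(is_leaf=False, low=0.0, high=30.0, children=[leaf3, leaf2, leaf1])
order = leaves_in_visit_order(root, q=4.5, k_dist=10.0)
```

In this toy run the leaf covering [20, 30] is never enqueued because its distance from the query exceeds the threshold, and the two remaining leaves are visited in ascending order of distance.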
[0135]
Next, the details of the process B, which is executed when N is a leaf node, will be described with reference to the flowchart of FIG. 12.
[0136]
First, the counter i is initialized to 0 (step B1), and the loop LB1 is executed. In the loop LB1, after the counter i is set (step B2), the approximate distance D₁ is calculated using the DFT feature vector stored in the i-th entry of the node N (steps B3 and B4). Again, the approximate distance D₁ is one of D_disjoint(P, Q), D_overlap(P, Q), and D_{disjoint(-)}(P, Q).
[0137]
If D₁ is less than or equal to the k-nearest-neighbor distance (YES in step B5), the approximate distance D₂ = D_segment(P, Q) is calculated using the segmented sequences (step B6).
[0138]
If this D₂ is also less than or equal to the k-nearest-neighbor distance (YES in step B7), the exact dynamic time warping distance D₃ = D_DTW(P, Q) is calculated (step B8).
[0139]
If D₃ obtained in step B8 is less than or equal to the k-nearest-neighbor distance (YES in step B9), the ID of the sequence P and the distance D₃ are stored in the nearest neighbor list in the storage unit 13, and the list is sorted (step B10).
[0140]
After step B10 is completed, or if the determination in any of steps B5, B7, and B9 is NO, the process returns to step B2 and is repeated until the counter i reaches the number of sequences stored in the storage unit 13 (i >= ent). That is, the processing of the loop LB1 is repeated for each of the sequences stored in the storage unit 13.
[0141]
After the loop LB1, the queue is filtered by the nearest neighbor list, and data that does not need to be accessed in the queue is deleted (step B11).
[0142]
Thereafter, the head data of the queue is deleted and the data is left-justified (step B12).
[0143]
After the above processing is completed, the processing returns to the flowchart of FIG. 10.
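The three-stage filtering of process B can be sketched as below. This is an illustrative sketch only. The functions `lb_endpoints` and `lb_nearest_value` are simple, well-known style lower bounds of the DTW distance that stand in for the DFT-based distance D₁ and the segmented-sequence distance D₂ of the embodiment, which are not reproduced here. `dtw_distance` assumes the usual recurrence with squared differences (α = 2) and the boundary conditions g(−1, −1) = 0 and g(i, −1) = g(−1, j) = ∞. The queue filtering of steps B11 and B12 is omitted.

```python
def dtw_distance(p, q):
    """Exact DTW with squared differences; row/column 0 act as index -1."""
    inf = float("inf")
    n, m = len(p), len(q)
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (p[i - 1] - q[j - 1]) ** 2
            g[i][j] = cost + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]


def lb_endpoints(p, q):
    """Stand-in for D1: the first and last elements are always aligned."""
    if len(p) == 1 and len(q) == 1:
        return (p[0] - q[0]) ** 2
    return (p[0] - q[0]) ** 2 + (p[-1] - q[-1]) ** 2


def lb_nearest_value(p, q):
    """Costlier stand-in for D2: each p_i is aligned with at least one q_j."""
    return sum(min((x - y) ** 2 for y in q) for x in p)


def process_b(entries, query, k, neighbors):
    """Scan leaf entries (id, sequence) and update the nearest neighbor list.

    neighbors is a list of (distance, id) kept sorted and at most k long.
    Returns how many exact DTW distances had to be computed.
    """
    dtw_calls = 0
    for seq_id, p in entries:
        # Until k candidates exist, nothing can be pruned.
        k_dist = neighbors[-1][0] if len(neighbors) >= k else float("inf")
        if lb_endpoints(p, query) > k_dist:       # first filter (D1)
            continue
        if lb_nearest_value(p, query) > k_dist:   # second filter (D2)
            continue
        d3 = dtw_distance(p, query)               # exact distance (D3)
        dtw_calls += 1
        if d3 <= k_dist:
            neighbors.append((d3, seq_id))
            neighbors.sort()
            del neighbors[k:]                     # keep only the k best
    return dtw_calls


query = [0, 1, 2, 1, 0]
entries = [
    (1, [0, 1, 2, 1, 0]),
    (2, [0, 1, 1, 2, 1, 0]),
    (3, [10, 11, 12, 11, 10]),
    (4, [0, 5, 5, 5, 0]),
]
neighbors = []
calls = process_b(entries, query, k=1, neighbors=neighbors)
```

With these toy inputs, sequence 3 is rejected by the first filter and sequence 4 by the second, so the quadratic-time exact computation runs for only two of the four entries.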
[0144]
In the present embodiment, a range query may be used instead of the k-nearest-neighbor query.
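The difference from the k-nearest-neighbor case is that the pruning threshold is a fixed radius rather than the current k-th best distance. A hedged sketch follows; the function `range_query`, its parameters, and the toy distance functions in the usage lines are hypothetical. In particular, the usage lines use a squared Euclidean distance purely as a placeholder for the exact distance, not DTW.

```python
def range_query(entries, query, eps, filters, exact):
    """Return (id, distance) for every entry whose exact distance is <= eps.

    filters: lower-bound functions, cheapest first; each must never exceed
    the exact distance, otherwise true answers could be discarded.
    """
    hits = []
    for seq_id, p in entries:
        # any() stops at the first filter that already exceeds the radius.
        if any(f(p, query) > eps for f in filters):
            continue
        d = exact(p, query)
        if d <= eps:
            hits.append((seq_id, d))
    return hits


# Toy placeholders (NOT the distances of the embodiment).
def first_element_bound(p, q):
    return (p[0] - q[0]) ** 2


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


data = [(1, [0.0, 1.0]), (2, [0.0, 3.0]), (3, [5.0, 1.0])]
hits = range_query(data, [0.0, 1.0], eps=1.0,
                   filters=[first_element_bound], exact=squared_euclidean)
```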
[0145]
The approximate distances D₁ and D₂ and the dynamic time warping distance D₃ described here are calculated by the control operation unit 12 of the time-series data search device 1. Therefore, in the present embodiment, the control operation unit 12 has the following functions: a function as first approximation means that approximates the distance between the query sequence Q to be searched and a sequence P stored in the storage unit 13 using DFT coefficient values; a function as creation means that, when this approximate distance is within a predetermined range, creates for each of P and Q a segmented sequence, which is time-series data segmented by joining elements in order starting from adjacent elements; a function as second approximation means that approximates the distance between the sequence to be searched and the sequence read from the storage unit using the segmented sequences corresponding to the respective sequences; and a function as distance calculation means that obtains the dynamic time warping distance between P and Q when the approximate distance obtained by the second approximation means is within a predetermined range.
[0146]
According to the third embodiment of the present invention described above, when the node corresponding to the enclosing rectangle is a leaf node, candidates are first narrowed down by the approximation based on the discrete Fourier transform, which is slightly less accurate but fast to compute, and are then narrowed down further by the approximation based on segmented sequences, which is slightly slower but more accurate. This reduces the number of times the dynamic time warping distance is actually calculated, so that time-series data can be searched even faster.
[0147]
【Example】
As an example of the present invention, the search time was measured for the time-series data search method described in the third embodiment, and the results were compared with prior art 1 (a method using a four-dimensional vector) and prior art 2 (a method using PCA).
[0148]
The experimental conditions are as follows:
- As the time-series data for measuring the performance of the method, 100,000 sequences of length 1024 were created artificially with a random walk function based on the recurrence formula p_i = p_{i-1} + x_i. Here, p_0 is the first element of each sequence and is drawn randomly from the range (0, 10). Also, x_i is drawn from a normal distribution whose variance is 1.
[0149]
- The number of sequences retrieved (k) in the nearest neighbor search was 20.
[0150]
- CPU time was measured on a Sun UltraSPARC-II 450 MHz.
[0151]
- As the multidimensional index, the R*-tree (Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 322-331 (June 1990)) was used.
[0152]
- The data size was varied from 25,000 to 100,000.
[0153]
An index was constructed using the 15-dimensional (m = 5) DFT feature vector V(P), and the DFT coefficients of the 13 pairs of lower-limit and upper-limit sequences were calculated before the search processing.
[0154]
As the range r of the segmented sequence, the average of the values of (max(P) - min(P)) / 14 was used.
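The data generation and the choice of r described in the conditions above might be reproduced along the following lines. This is a sketch under assumptions: the function names are hypothetical, the mean of the normal distribution is taken to be 0 (the text gives only the variance), `random.uniform` is used for the initial value although it may return an endpoint of the range, and the average for r is taken over all sequences in the data set. A small data set is generated here to keep the run short.

```python
import random


def random_walk_sequence(length, rng):
    """One sequence following p_i = p_{i-1} + x_i."""
    p = [rng.uniform(0.0, 10.0)]                # p_0 drawn from (0, 10)
    for _ in range(length - 1):
        p.append(p[-1] + rng.gauss(0.0, 1.0))   # x_i: mean 0 (assumed), variance 1
    return p


def make_dataset(count, length=1024, seed=0):
    """Create `count` random-walk sequences, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [random_walk_sequence(length, rng) for _ in range(count)]


def segment_range(sequences, divisor=14):
    """Average of (max(P) - min(P)) / divisor over all sequences P."""
    spreads = [(max(p) - min(p)) / divisor for p in sequences]
    return sum(spreads) / len(spreads)


dataset = make_dataset(count=100)   # the experiment used up to 100,000
r = segment_range(dataset)
```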
[0155]
FIG. 13 is a diagram illustrating a comparison result regarding the CPU time. The horizontal axis indicates the data size, and the vertical axis indicates the CPU time. The experimental result based on the present embodiment is given by a straight line 71.
[0156]
As is clear from the figure, the experimental results of prior art 1 (straight line 81) and prior art 2 (straight line 82) show that more CPU time is required to search time-series data of the same data size than in the embodiment of the present invention. Specifically, it can be seen that the data search method of the present invention reduces the data search time by a factor of up to about 13 compared with the conventional methods.
[0157]
【Effects of the Invention】
As is apparent from the above description, according to the present invention, it is possible to provide a time-series data search method, a time-series data search device, a time-series data search program, and a program recording medium capable of speeding up the search for time-series data using dynamic time warping.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a basic configuration of a time-series data search device according to a first embodiment of the present invention.
FIG. 2 is a diagram schematically showing a positional relationship between two sequences.
FIG. 3 is an explanatory diagram conceptually showing the calculation of D_disjoint(P, Q).
FIG. 4 is a flowchart illustrating a processing flow of a time-series data search method according to the first embodiment of the present invention.
FIG. 5 is a flowchart illustrating an algorithm of a segmented sequence creation process used for calculating an approximate distance in the second embodiment of the present invention.
FIG. 6 is a diagram showing an example of a segmented sequence.
FIG. 7 is an explanatory diagram conceptually showing an approximate distance calculation using a segmented sequence.
FIG. 8 is a flowchart illustrating a processing flow of a time-series data search method according to a second embodiment of the present invention.
FIG. 9 is a diagram conceptually illustrating an index used in a third embodiment of the present invention.
FIG. 10 is a flowchart illustrating a processing flow of a time-series data search method according to a third embodiment of the present invention.
FIG. 11 is a flowchart illustrating details of a process A in FIG. 10;
FIG. 12 is a flowchart illustrating details of a process B in FIG. 10;
FIG. 13 is a diagram showing calculation results of an example of the present invention.
[Explanation of symbols]
1 Time-series data search device
11 Input section
12 Control operation unit
13 Storage unit
14 Output section
51 Index
53 Multidimensional Index
55 sequence file

Claims

A time-series data search method for searching, based on dynamic time warping, for time-series data expressed as a sequence in which element values are determined along a time axis, wherein
a computer system provided with a database that stores a plurality of sequences executes:
(A) an approximation step of approximating a distance between a sequence to be searched and a sequence read from the database using coefficient values of a discrete Fourier transform; and
(B) a distance calculation step of obtaining a dynamic time warping distance between the sequence to be searched and the sequence read from the database when the approximate distance approximated in the approximation step (A) is within a predetermined range.

The time-series data search method according to claim 1, wherein the approximation step (A) and the distance calculation step (B) are repeatedly executed for all sequences stored in the database, and the method further executes:
(C) a search result output step of outputting, as a search result, a sequence located in the neighborhood of the sequence to be searched, based on a result of the repetition.

A time-series data search method for searching, based on dynamic time warping, for time-series data expressed as a sequence in which element values are determined along a time axis, wherein
a computer system provided with a database that stores a plurality of sequences executes:
(A) a creation step of creating a segmented sequence, which is time-series data segmented by sequentially joining adjacent elements;
(B) an approximation step of approximating a distance between a sequence to be searched and a sequence read from the database using the segmented sequences corresponding to the respective sequences created in the creation step (A); and
(C) a distance calculation step of obtaining a dynamic time warping distance between the sequence to be searched and the sequence read from the database when the approximate distance obtained in the approximation step is within a predetermined range.

The time-series data search method according to claim 3, wherein the processing from the creation step (A) to the distance calculation step (C) is repeatedly executed for all sequences stored in the database, and the method further executes:
(D) a search result output step of outputting, as a search result, a sequence located in the neighborhood of the sequence to be searched, based on a result of the repetition.

A time-series data search method for searching, based on dynamic time warping, for time-series data expressed as a sequence in which element values are determined along a time axis, wherein
a computer system provided with a database that stores a plurality of sequences executes:
(A) a first approximation step of approximating a distance between a sequence to be searched and a sequence read from the database using coefficient values of a discrete Fourier transform;
(B) a creation step of creating, when the approximate distance approximated in the first approximation step (A) is within a predetermined range, a segmented sequence, which is time-series data segmented by sequentially joining adjacent elements;
(C) a second approximation step of approximating the distance between the sequence to be searched and the sequence read from the database using the segmented sequences corresponding to the respective sequences created in the creation step (B); and
(D) a distance calculation step of obtaining a dynamic time warping distance between the sequence to be searched and the sequence read from the database when the approximate distance obtained in the second approximation step (C) is within a predetermined range.

The time-series data search method according to claim 5, wherein the processing from the first approximation step (A) to the distance calculation step (D) is repeatedly executed for all sequences stored in the database, and the method further executes:
(E) a search result output step of outputting, as a search result, a sequence located in the neighborhood of the sequence to be searched, based on a result of the repetition.

A time-series data search device that searches, based on dynamic time warping, for time-series data expressed as a sequence in which element values are determined along a time axis, the device comprising:
storage means for storing a plurality of sequences;
approximation means for approximating a distance between a sequence to be searched and a sequence stored in the storage means using coefficient values of a discrete Fourier transform; and
distance calculation means for obtaining a dynamic time warping distance between the sequence to be searched and the sequence read from the storage means when the approximate distance approximated by the approximation means is within a predetermined range.

A time-series data search device that searches, based on dynamic time warping, for time-series data expressed as a sequence in which element values are determined along a time axis, the device comprising:
storage means for storing a plurality of sequences;
creation means for creating, for each sequence, a segmented sequence, which is time-series data segmented by sequentially joining adjacent elements;
approximation means for approximating a distance between a sequence to be searched and a sequence stored in the storage means using the segmented sequences corresponding to the respective sequences created by the creation means; and
distance calculation means for obtaining a dynamic time warping distance between the sequence to be searched and the sequence read from the storage means when the approximate distance obtained by the approximation means is within a predetermined range.

A time-series data search device that searches, based on dynamic time warping, for time-series data expressed as a sequence in which element values are determined along a time axis, the device comprising:
storage means for storing a plurality of sequences;
first approximation means for approximating a distance between a sequence to be searched and a sequence stored in the storage means using coefficient values of a discrete Fourier transform;
creation means for creating, when the approximate distance approximated by the first approximation means is within a predetermined range, a segmented sequence, which is time-series data segmented by sequentially joining adjacent elements;
second approximation means for approximating the distance between the sequence to be searched and the sequence read from the storage means using the segmented sequences corresponding to the respective sequences created by the creation means; and
distance calculation means for obtaining a dynamic time warping distance between the sequence to be searched and the sequence stored in the storage means when the approximate distance obtained by the second approximation means is within a predetermined range.

A time-series data search program for causing a computer to execute the time-series data search method according to any one of claims 1 to 6.

A program recording medium on which is recorded a time-series data search program for causing a computer to execute the time-series data search method according to any one of claims 1 to 6.