JP2006252259A

JP2006252259A - Data analysis apparatus and method

Info

Publication number: JP2006252259A
Application number: JP2005068885A
Authority: JP
Inventors: Shuhei Kuwata (桑田 修平); Masatoshi Nishimura (西村 正寿); Tsutomu Matsunaga (松永 務)
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2005-03-11
Filing date: 2005-03-11
Publication date: 2006-09-21
Anticipated expiration: 2025-03-11
Also published as: JP4394020B2

Abstract

PROBLEM TO BE SOLVED: To provide a data analysis apparatus that, by comparing the financial data of a plurality of companies, can identify as abnormal values the data produced by window dressing of accounts and the like.

SOLUTION: The apparatus comprises a first subspace creation unit 31 that creates a first subspace from the data of each of a plurality of samples over a certain past period; a second subspace creation unit 32 that creates a second subspace from the data of each of the plurality of samples at the present time; a similarity calculation unit 4 that obtains the similarity between the subspaces; a coordinate value calculation unit 7 that, based on the similarities of the first and second subspaces among the samples, obtains two-dimensional coordinate values for each sample for both the certain period and the present time; and a result output unit 8 that outputs the obtained coordinate values.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a data analysis apparatus and method suitable for detecting abnormal values contained in data representing samples.

For example, transaction data generated by credit card use, network traffic log data, and the like form large volumes of time-series data that change from moment to moment. While processing proceeds normally, comparing past and current trends in such time-series data reveals no significant difference.
However, it is known that when a credit card is used fraudulently or a network is intruded upon, a trend different from the past appears in the time-series data. By analyzing the time-series data and detecting such a differing trend, an abnormal state caused by fraudulent use, intrusion, or the like can be detected dynamically. A differing trend in time-series data may take the form of a value far removed from the normal trend, as shown in FIG. 11(a), or of a shift to a trend different from the past one, as shown in FIG. 11(b). A location where the former occurs is called an "abnormal value" of the time-series data, and a location where the latter occurs is called a "change point".

However, extracting differing trends from such large volumes of data is not easy, and various detection methods have been proposed. For example, the following four methods have been proposed for abnormal value detection.
(1) A method that detects abnormal values from the correlation between pairs of series among all the series (see Patent Document 1 and Non-Patent Document 1).
(2) A method that assumes a probabilistic model generating all the series and detects abnormal values from the difference between the past and present models (see Non-Patent Document 2).
(3) A method that uses principal component analysis and detects abnormal values from differences in the principal components (see Non-Patent Document 3).
(4) A method that detects abnormal values based on a prediction formula learned from past series (see Patent Document 2).

For change point detection, a method that uses method (2) above has been proposed (see Non-Patent Document 2).
These four methods are described below.

[Method (1)]
First, as shown in FIG. 12, when for example four data series are input, method (1) selects two of the four series and checks, for every such combination, whether the two selected series are correlated. Each correlated pair of series is stored as a rule, and an abnormal value is detected when newly input data deviates from the stored rules.
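Method (1) can be illustrated with a minimal sketch. The cited documents are not reproduced here, so everything below is an assumption made for illustration: the rule is taken to be a linear fit between two strongly correlated series, and the names `learn_correlation_rules`, `violated_rules`, `min_abs_corr`, and the three-sigma tolerance are hypothetical.

```python
import itertools
import numpy as np

def learn_correlation_rules(history, min_abs_corr=0.9):
    """Store a linear rule for every strongly correlated pair of series.

    history has shape (time, n_series). The threshold and the linear form
    of the rule are illustrative assumptions, not taken from the patent.
    """
    rules = {}
    for i, j in itertools.combinations(range(history.shape[1]), 2):
        r = np.corrcoef(history[:, i], history[:, j])[0, 1]
        if abs(r) >= min_abs_corr:
            slope, intercept = np.polyfit(history[:, i], history[:, j], 1)
            resid = history[:, j] - (slope * history[:, i] + intercept)
            # Tolerance: three residual standard deviations (assumed).
            rules[(i, j)] = (slope, intercept, 3.0 * resid.std() + 1e-9)
    return rules

def violated_rules(rules, new_row):
    """Return the pairs whose stored rule the new observation deviates from."""
    return [pair for pair, (a, b, tol) in rules.items()
            if abs(new_row[pair[1]] - (a * new_row[pair[0]] + b)) > tol]
```

Because only pairs are examined, behavior that shows up only in the joint relationship of three or more series would not be captured by a check of this kind.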

[Method (2)]
Next, as shown in FIG. 13, when for example four data series are input, method (2) builds from all past series a probabilistic model capable of generating the four input series. An abnormal value is detected when the difference between this model and a model built from all series including the newly input data is large. When method (2) is used for change point detection, a change point is detected when the difference between the model built from all past series and the model built with the newly input data included changes greatly compared with the average value obtained from past data.
In method (2), abnormal values are detected from changes in the probabilistic model as described above. Examples of the probabilistic model used include the probability density function of the histogram method, which applies to discrete values, and the Gaussian mixture distribution, which applies to continuous values.

The specific way method (2) detects abnormal values and change points is described below. First, for abnormal value detection, let x be the series vector and p^(t)(x) the probabilistic model estimated from all series up to time t. Detection is performed by determining whether the average value over all series obtained up to time t, given by Equation (1), is larger than the average value obtained in the past.

In Equation (1), p^(t)(x) is the probabilistic model estimated from all series up to time t, and p^(t-1)(x) is the probabilistic model estimated from all series up to time t-1.

In method (2), a change point is detected from the average change of the probabilistic model, that is, by determining whether the average Shannon information of the probabilistic model q over the period T' in Equation (2) is larger than its past average. Here, q is the probabilistic model estimated from y_t.

In Equation (2), y_i is the average over the period T of the Shannon information of the probabilistic model p of the series x described above, and is given by Equation (3).

[Method (3)]
Next, as shown in FIG. 14, method (3) detects an abnormal value according to whether the projection distance between the first principal component vector, obtained by principal component analysis of all the series, and the data making up a newly input vector has become larger than past projection distances.
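The projection-distance test of method (3) can be sketched as follows. This is an illustration under stated assumptions rather than the cited method itself: the first principal component is taken from an SVD of the mean-centered history, and the decision rule (`factor` times the largest past distance) and all function names are hypothetical.

```python
import numpy as np

def first_pc(data):
    """Unit first principal component and mean of data with shape (time, n_series)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return vt[0], mean

def projection_distance(x, pc, mean):
    """Distance from x to the line through `mean` in the direction `pc`."""
    centered = x - mean
    return float(np.linalg.norm(centered - (centered @ pc) * pc))

def is_abnormal(history, x, factor=3.0):
    """Flag x when its projection distance clearly exceeds the past ones.

    The threshold `factor * max(past distances)` is an assumed rule.
    """
    pc, mean = first_pc(history)
    past = np.array([projection_distance(row, pc, mean) for row in history])
    return bool(projection_distance(x, pc, mean) > factor * past.max() + 1e-12)
```

Only the first principal component enters the test, so when that component explains little of the variance, much of the structure in the series is ignored.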

[Method (4)]
Finally, as shown in FIG. 15, method (4) classifies series data into a class based on some of the values of the past and current series data, obtains a predicted value from the prediction formula associated in advance with that class, and detects an abnormal value based on the size of the error relative to the actual value.
Specifically, as shown in (1) of FIG. 15, the series data is first classified into the applicable class based on the data enclosed by the broken line. The classification method is learned in advance from past data, and the data in the figure is classified into class 2.
Next, as shown in (2) of FIG. 15, the predicted value for pixel 4 at time 4 is calculated from the prediction formula of class 2. The prediction formula associated with each class is derived from past data and set in advance. The error between the calculated predicted value and the actual value "23" of pixel 4 at time 4 is then calculated. Using preset thresholds 1 and 2, the data is judged to be an abnormal value when the number of times the error exceeds threshold 1 exceeds threshold 2.

Japanese Patent Laid-Open No. 5-256741
Japanese Patent Laid-Open No. 7-87481
Kenji Yamanishi, "Latest Trends in Data and Text Mining", Applied Mathematics, 2002
Takehisa Yairi et al., "Anomaly Detection Method from Satellite Telemetry Data Based on Time-Series Correlation Rule Mining", National Conference of the Japanese Society for Artificial Intelligence, 2001
Yuji Izumi et al., "A Study on Network Feature Extraction Methods for Anomaly Detection", IEICE General Conference, 2004

However, methods (1) to (4) above either perform detection based only on relationships among some of the series rather than on the input series data as a whole, or make certain assumptions about the target series data and perform detection based on those assumptions. For example, method (1) cannot handle the correlation among three or more series at once, so when the behavior cannot be captured by correlations between two series, abnormal values and change points cannot be detected. Methods (2) and (4) assume a probabilistic model or a prediction formula, so detection results are not guaranteed when the assumption deviates greatly from the actual series data. Furthermore, method (3) can address the correlation of all series through principal component analysis, but because it considers only the first principal component, it cannot capture the correlations in the series data when, for example, the contribution ratio of the first principal component is small, and abnormal values and change points cannot be detected.

Targets of abnormal value detection include not only credit card fraud and network intrusion as described above but also, for example, window-dressed data contained in corporate financial data. In this application, window-dressed data means data obtained by processing that, for example, distorts the contents of financial statements without following proper accounting standards so that profits or losses appear larger or smaller than they are. In other words, window-dressed data results from intentionally altering how the data is processed. For example, processing data that would be an abnormal value if handled properly so that it does not appear as an abnormal value is also a form of window dressing. In such cases, detection is considered difficult using only the time-series relationships of the data. One conceivable approach is to analyze the data of several other companies in the same industry as the company under analysis, which may allow abnormal values to be analyzed more appropriately.

However, as described above, with the conventional techniques the analysis of correlation behavior may be inadequate even when analyzing the data of a single company. To apply them to the analysis of data from a plurality of companies, it is desirable to provide more appropriate processing and an appropriate way of grasping, outputting, or displaying the analysis results.

The present invention has been made in view of the above circumstances. Its object is to provide a data analysis apparatus and method with a more suitable configuration that, for example, by comparing the financial data of a plurality of companies, makes it possible to identify as abnormal values the data resulting from window dressing (window-dressed accounts) and the like.

To solve the above problem, the invention according to claim 1 comprises: first subspace creation means for creating a first subspace using the data of each of a plurality of samples in a certain period; second subspace creation means for creating a second subspace using the data of each of the plurality of samples at a predetermined time point after the certain period; similarity calculation means for obtaining the similarity between the samples using the first subspace and the second subspace; coordinate value calculation means for obtaining, using the obtained similarities, two-dimensional coordinate values corresponding to each of the plurality of samples in the certain period and two-dimensional coordinate values corresponding to each of the plurality of samples at the predetermined time point; and output means for outputting the coordinate values obtained by the coordinate value calculation means.

The invention according to claim 2 is characterized in that the coordinate value calculation means calculates an array representing the coordinate values by applying multidimensional scaling to an array representing the similarities.

The invention according to claim 3 is characterized in that the output means displays identifiers indicating the samples, arranged on two-dimensional coordinates in accordance with the coordinate values obtained by the coordinate value calculation means.

The invention according to claim 4 is characterized in that the output means displays arrows drawn from each coordinate value for the certain period toward the corresponding coordinate value at the predetermined time point, as obtained by the coordinate value calculation means.

The invention according to claim 5 includes: a first subspace creation step of creating a first subspace using the data of each of a plurality of samples in a certain period; a second subspace creation step of creating a second subspace using the data of each of the plurality of samples at a predetermined time point after the certain period; a similarity calculation step of obtaining the similarity between the samples using the first subspace and the second subspace; a coordinate value calculation step of obtaining, using the obtained similarities, two-dimensional coordinate values corresponding to each of the plurality of samples in the certain period and two-dimensional coordinate values corresponding to each of the plurality of samples at the predetermined time point; and an output step of outputting the coordinate values obtained in the coordinate value calculation step.

According to the present invention, the similarity between samples is obtained from the subspace of each sample, so, unlike existing methods, there is no need to create (or update) a calculation formula in advance. In addition, for the same sample, the data for a certain past period (the past one or more years) and the data at a subsequent predetermined time point (for example, this year) are treated as separate items, and multidimensional scaling or the like is applied to both at once, which makes it easier to grasp changes in the relative relationship between past data and current data. Furthermore, drawing an arrow from each past sample to the corresponding sample at the predetermined time point makes changes in the relative relationship easier to grasp.

A data analysis apparatus and method according to an embodiment of the present invention are described below with reference to the drawings. FIG. 1 is a schematic block diagram showing an analysis apparatus 10 according to this embodiment. In the analysis apparatus 10, the data input unit 1 receives time-series data for a plurality of samples in which abnormal values, change points, and the like are to be detected.

The input data may be various kinds of transaction data, log data, and so on. This description takes as an example the financial data of a plurality of companies over a plurality of fiscal years, as shown in FIG. 2. The data in FIG. 2 consists of financial data made up of indices 1 to 5 for each sample over a plurality of fiscal years. The samples to be analyzed are N companies, such as company A, company B, company C, company D, and so on, all of which are assumed here to belong to the same industry. Each of the indices 1 to 5 making up the financial data corresponds to the value of a predetermined item set in advance in, for example, the balance sheet or the profit and loss statement.

The data storage unit 2 in FIG. 1 sequentially stores the data input through the data input unit 1. The subspace creation unit 3 reads the latest data and the data for a certain past period from the data storage unit 2, extracts the subspaces formed from the read data, that is, the models used for detection, and records the subspace information in the model storage unit 5. In the example of FIG. 2, the latest data is this year's data, that is, the data for fiscal 2004, and the data for the certain past period is the data for fiscal 2003 and earlier, going back, for example, three years. The subspace creation unit 3 consists of a first subspace creation unit 31, which creates the first subspace using the data of each of the plurality of samples in the certain past period, and a second subspace creation unit 32, which creates the second subspace using the (latest) data of each of the plurality of samples at the predetermined time point after the certain period.

FIG. 3 illustrates the subspace, that is, the model, constructed by the subspace creation unit 3. A subspace represents the characteristics of the data for a plurality of time-series points (in this example, a plurality of fiscal years) and is expressed by basis vectors. The number of dimensions of the subspace depends on the characteristics of the data. In the example of FIG. 3, the input data consists of four samples A1 to A4, and the data of each sample is expressed by five indices. To create the subspaces, the input data is placed in a five-dimensional space (the original space) corresponding to the five indices, and a subspace is created so as to represent the characteristics of each placed sample. Depending on the characteristics of each sample, the subspace may be a three-dimensional subspace or, when all the samples lie on one plane, a two-dimensional plane. In this application, even a plane is referred to as a subspace.

In the example of FIG. 2, when the first subspace creation unit 31 creates subspaces from the three years of data for fiscal 2001 to 2003, the data of each sample is placed in the five-dimensional space and a subspace is calculated for each sample. When the second subspace creation unit 32 creates a subspace from the single year of data for fiscal 2004, however, the subspace is expressed as vector data.
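As a rough illustration of subspace creation, the sketch below builds an orthonormal basis for the span of one sample's yearly index vectors. The text here does not state how the basis is computed, so the use of an SVD, the rank tolerance `tol`, and the name `make_subspace` are assumptions made for this example.

```python
import numpy as np

def make_subspace(rows, tol=1e-8):
    """Orthonormal basis (d x k) for the span of one sample's yearly data.

    rows: array-like of shape (years, d), or a single d-vector for one year.
    The SVD-based construction and the tolerance are illustrative assumptions.
    """
    a = np.atleast_2d(np.asarray(rows, dtype=float))
    _, s, vt = np.linalg.svd(a, full_matrices=False)
    k = int(np.sum(s > tol * s.max()))  # effective dimension of the span
    return vt[:k].T

# Three fiscal years of five indices for one hypothetical company.
basis = make_subspace([[1, 2, 3, 4, 5], [2, 1, 0, 1, 2], [0, 1, 1, 0, 3]])
print(basis.shape)
```

With three linearly independent years the basis has three columns, with coplanar data it has two, and with a single year it reduces to one unit vector, which matches the description of a subspace whose dimension depends on the data.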

Next, the similarity calculation unit 4 in FIG. 1 calculates the similarity between the samples using the created first and second subspaces and stores it in the similarity storage unit 6. The similarity between samples can be obtained from the angle (canonical angle) formed between the subspaces of the samples. FIG. 4 illustrates how the similarity is calculated. The similarity is calculated from the vectors or subspaces extracted by the subspace creation unit 3, and depending on whether each of the two is a vector or a subspace, there are the following three calculation patterns.

(Pattern 1)
First, as shown in FIG. 4(a), consider the case where the vector extracted by the subspace creation unit 3 is the vector x_1 and the subspace is the vector y_1. Here, x_1 and y_1 are expressed as transposes of d-dimensional vectors, that is, x_1 = {x_11, x_12, ..., x_1d}^T and y_1 = {y_11, y_12, ..., y_1d}^T, where T denotes transposition.
In pattern 1, the first similarity S is calculated as the inner product of the two vectors normalized by their magnitudes, that is, as the cosine of the angle between the vectors, and is given by Equation (4).
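The cosine described for pattern 1 can be computed as in the short sketch below. The function name `cosine_similarity` is chosen for this example, and both inputs are assumed to be nonzero.

```python
import numpy as np

def cosine_similarity(x, y):
    """Inner product of x and y normalized by their magnitudes (cosine of the angle)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Vectors at 45 degrees to each other.
print(cosine_similarity([1, 0], [1, 1]))
```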

(Pattern 2)
Next, as shown in FIG. 4(b), consider the case where the vector extracted by the subspace creation unit 3 is the vector x_1 and the subspace is the space y_1, y_2, y_3 spanned by a plurality of vectors. Here, x_1 and y_1, y_2, y_3 are each expressed as the transpose of a d-dimensional vector, that is, x_1 = {x_11, x_12, ..., x_1d}^T, y_1 = {y_11, y_12, ..., y_1d}^T, y_2 = {y_21, y_22, ..., y_2d}^T, and y_3 = {y_31, y_32, ..., y_3d}^T, where T denotes transposition.
In pattern 2, the first similarity S is given by Equation (5).

When the subspace is a single vector, the value of Equation (5) is the same as the value of Equation (4).
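Equation (5) is not reproduced in this text, so the following is only a sketch of one standard way to measure the angle between a vector and a subspace: the cosine is taken as the length of the vector's orthogonal projection onto the subspace divided by the length of the vector. The function name and the requirement that the columns of `Y` be linearly independent are assumptions.

```python
import numpy as np

def vector_subspace_similarity(x, Y):
    """Cosine of the angle between vector x and the span of the columns of Y.

    Y has shape (d, k) with linearly independent columns (assumed).
    """
    x = np.asarray(x, dtype=float)
    q, _ = np.linalg.qr(np.asarray(Y, dtype=float))  # orthonormal basis of span(Y)
    return float(np.linalg.norm(q.T @ x) / np.linalg.norm(x))

# The columns of Y span the x-y plane in three dimensions.
Y = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
print(vector_subspace_similarity([1, 0, 1], Y))
```

When `Y` has a single column, this value equals the absolute value of the cosine between the two vectors, so it agrees with the pattern 1 cosine whenever that cosine is non-negative.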

(Pattern 3)
Next, as shown in FIG. 4(c), consider the case where the vector extracted by the subspace creation unit 3 is the space x_1, x_2 and the subspace is the space y_1, y_2, y_3. Here, as in FIG. 4(b), x_1, x_2 and y_1, y_2, y_3 are expressed as transposes of d-dimensional vectors, that is, x_1 = {x_11, x_12, ..., x_1d}^T, x_2 = {x_21, x_22, ..., x_2d}^T, y_1 = {y_11, y_12, ..., y_1d}^T, y_2 = {y_21, y_22, ..., y_2d}^T, and y_3 = {y_31, y_32, ..., y_3d}^T, where T denotes transposition.
In pattern 3, the first similarity S is calculated as the maximum eigenvalue μ_max of Equation (6), where X = [x_1, x_2] and Y = [y_1, y_2, y_3].

When X is a vector and Y is a vector, the maximum eigenvalue μ_max of Equation (6) equals the value calculated by Equation (4); when X is a vector and Y is a space, it equals the value calculated by Equation (5).
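Equation (6) is not reproduced in this text. As a hedged illustration of a canonical-angle similarity between two subspaces, the sketch below uses the standard result that the cosines of the canonical angles are the singular values of Qx^T Qy, where Qx and Qy are orthonormal bases of the two subspaces. Whether the eigenvalue in Equation (6) corresponds to this cosine or to its square cannot be confirmed from the text; the function names are chosen for this example.

```python
import numpy as np

def _orthonormal_basis(a):
    """Orthonormal basis for the columns of a; a 1-D input is treated as one column."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    q, _ = np.linalg.qr(a)  # assumes linearly independent columns
    return q

def subspace_similarity(X, Y):
    """Cosine of the smallest canonical angle between span(X) and span(Y)."""
    qx, qy = _orthonormal_basis(X), _orthonormal_basis(Y)
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(min(s[0], 1.0))  # clip rounding error above 1

# Two planes in four dimensions that share the direction e1.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
Y = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(subspace_similarity(X, Y))
```

With two single vectors the function returns the absolute value of their cosine, and with a vector and a subspace it returns the projection-based cosine, which mirrors the reduction to patterns 1 and 2 described above.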

The flow of processing in the subspace creation unit 3 and the similarity calculation unit 4 of FIG. 1 is now described with reference to FIG. 5. In the example of FIG. 5, with the input for fiscal year X as the latest data, the first subspace creation unit 31 creates subspaces from the past data going back three years (the data for fiscal years X-1 to X-3). For example, when X = 2004, the first subspace creation unit 31 retrieves the data for fiscal 2001 to 2003 from the data storage unit 2 (step S11). Next, the second subspace creation unit 32 retrieves the data for fiscal 2004 from the data storage unit 2 (step S12).

Next, the first subspace creation unit 31 creates data representing the first subspace as the past data and stores it in the model storage unit 5 (step S13). The second subspace creation unit 32 creates data representing the second subspace as this year's data and stores it in the model storage unit 5 (step S14). In this case, the first subspace is a subspace calculated from the index data for fiscal years X-1, X-2, and X-3, and the second subspace is a vector corresponding to the index data for fiscal year X.

Once the subspace creation unit 3 has created the subspaces, the similarity calculation unit 4 calculates the similarity between subspaces as described above, based on the subspace data stored in the model storage unit 5. The similarity calculation unit 4 compares a total of 2N subspaces and calculates their similarities (step S15). The calculated similarities are stored in the similarity storage unit 6 as a 2N × 2N similarity matrix. In this matrix, the first N rows are the entries for the samples of fiscal year X and the remaining N rows are the entries for the samples of fiscal years X-1 to X-3. Likewise, the first N columns are the entries for the samples of fiscal year X and the remaining N columns are the entries for the samples of fiscal years X-1 to X-3.
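The layout of the 2N × 2N similarity matrix can be sketched as follows, with the N current-year entries first and the N past-period entries after them. The similarity used here is the cosine of the smallest canonical angle computed from a QR factorization and an SVD, which is an assumption standing in for Equations (4) to (6); the function names and the sample values are hypothetical.

```python
import numpy as np

def orthonormal_basis(rows):
    """rows: (years, d) or (d,) -> (d, k) orthonormal basis (independent rows assumed)."""
    q, _ = np.linalg.qr(np.atleast_2d(np.asarray(rows, dtype=float)).T)
    return q

def canonical_cosine(qa, qb):
    """Cosine of the smallest canonical angle between two orthonormal bases."""
    return float(min(np.linalg.svd(qa.T @ qb, compute_uv=False)[0], 1.0))

def similarity_matrix(current, past):
    """2N x 2N matrix: the first N entries are year X, the last N are years X-1 to X-3."""
    bases = [orthonormal_basis(c) for c in current] + [orthonormal_basis(p) for p in past]
    n = len(bases)
    sim = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = canonical_cosine(bases[i], bases[j])
    return sim

# Two hypothetical companies, five indices, one current year and three past years each.
current = [np.array([1.0, 2, 3, 4, 5]), np.array([5.0, 4, 3, 2, 1])]
past = [np.array([[1.0, 2, 3, 4, 5], [2, 1, 0, 1, 2], [0, 1, 1, 0, 3]]),
        np.array([[5.0, 4, 3, 2, 1], [1, 0, 0, 2, 1], [0, 3, 1, 1, 0]])]
print(similarity_matrix(current, past).round(3))
```

In this layout, entry (i, N + i) compares a company's current-year vector with its own past subspace, which is the pairing later used to see how far each company has moved.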

Next, the configuration and processing of the coordinate value calculation unit 7 and the result output unit 8 of FIG. 1 are described. In the configuration example of FIG. 1, the coordinate value calculation unit 7 calculates the two-dimensional coordinate values corresponding to each sample for the certain period and for the predetermined time point, based on the similarities of the first and second subspaces among the samples stored in the similarity storage unit 6. The result output unit 8 then outputs the calculated coordinate values to a predetermined display device or to another storage or print medium.

The processing of the coordinate value calculation unit 7 and the result output unit 8 is now described with reference to FIG. 6. The coordinate value calculation unit 7 takes the similarity matrix between samples as input and uses MDS (Multi-Dimensional Scaling) to calculate, for each sample, two-dimensional coordinate values that represent the relative structure among the samples (step S21). MDS is a method that uses a measure of similarity or dissimilarity between objects to express the underlying relative structure in an “easy-to-understand form”. Concretely, when the similarities among a plurality of samples are given in matrix form as shown in FIG. 7, the “easy-to-understand form” means mapping each sample onto a two-dimensional plane (see FIGS. 8 and 9).

FIG. 7 shows an example in which the similarity matrix is a 3 × 3 matrix representing the similarities (or dissimilarities) among three samples A to C. The similarity takes a value from 0 to 1, and a value closer to 1 indicates greater similarity. For example, the similarity between sample A and sample B is 0.2, and the similarity between sample A and sample C is 0.6. In matrix form, the relative structure (relative relationships) among samples becomes difficult to grasp as the number of samples grows. The similarity matrix is therefore first converted into coordinates for each sample (FIG. 8), and an identifier indicating each sample is then displayed at its coordinates (FIG. 9).

The similarity matrix can be converted into coordinate data as follows, for example. Assume that the input similarity matrix is the one shown in FIG. 7, that is, the one given by equation (7).

First, to turn this into a dissimilarity matrix, it is subtracted from a matrix whose elements are all 1. The result is the matrix D representing the dissimilarity (distance) between the data items (equations (8) to (9)).

Next, the matrix D⁽²⁾ (equation (10)), whose elements are the squares of the elements (distances) of D, is obtained.

Next, the transformation of equation (11) is applied to each element d²_ij of D⁽²⁾, and a matrix P whose elements are the resulting p_ij is generated. In equation (11), the subscript i* denotes the mean of the i-th row, *j denotes the mean of the j-th column (identical to the mean of the j-th row, because D⁽²⁾ is symmetric), and ** denotes the overall mean.
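Equation (11) itself is not reproduced in this text. In classical MDS, the transformation described by these row, column, and overall means is usually the double-centering step shown below; this is offered as the standard form on the assumption that equation (11) corresponds to it, and n here denotes the number of rows of D⁽²⁾.

```latex
p_{ij} = -\frac{1}{2}\left( d^{2}_{ij} - d^{2}_{i*} - d^{2}_{*j} + d^{2}_{**} \right),
\qquad
d^{2}_{i*} = \frac{1}{n}\sum_{j=1}^{n} d^{2}_{ij},\quad
d^{2}_{*j} = \frac{1}{n}\sum_{i=1}^{n} d^{2}_{ij},\quad
d^{2}_{**} = \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} d^{2}_{ij}
```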

Then the vector of eigenvalues λ of the matrix P and the eigenvectors S_t corresponding to the nonzero eigenvalues are calculated. Letting $D_{\lambda t}^{1/2}$ be the diagonal matrix whose diagonal elements are the square roots of the nonzero eigenvalues, computing $X_{t*} = S_t D_{\lambda t}^{1/2}$ gives the result of equation (12).

Through the above processing, the matrix of FIG. 7 can be converted into the array of two-dimensional coordinate values of FIG. 8. When MDS is applied to the 2N × 2N similarity matrix calculated for N samples per period, 2N pairs of first and second coordinates are obtained: the first N pairs are the coordinates for fiscal year X, and the remaining N pairs are the coordinates for fiscal years X-1 to X-3.
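The conversion steps above (subtract from the all-ones matrix, square the elements, double-center, eigendecompose, scale the eigenvectors) can be sketched with NumPy as follows. This is a minimal sketch of classical MDS under stated assumptions rather than the patented implementation: the function name is hypothetical, negative eigenvalues are clipped to zero as a convenience, and the B-C similarity of 0.4 in the sample matrix is an invented value, since the text only states the A-B (0.2) and A-C (0.6) entries of FIG. 7.

```python
import numpy as np


def classical_mds(similarity, n_dims=2):
    """Map a similarity matrix with values in [0, 1] to n_dims coordinates."""
    s = np.asarray(similarity, dtype=float)
    n = s.shape[0]
    d = 1.0 - s                            # dissimilarity: all-ones matrix minus S
    d2 = d ** 2                            # element-wise squares, D^(2)
    j = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    p = -0.5 * j @ d2 @ j                  # double centering gives P
    eigvals, eigvecs = np.linalg.eigh(p)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_dims]   # indices of the largest ones
    lam = np.clip(eigvals[order], 0.0, None)     # assumption: clip negatives
    return eigvecs[:, order] * np.sqrt(lam)      # X = S_t D^(1/2)


# Samples A, B, C. The 0.4 entries (B-C) are hypothetical.
FIG7_LIKE = [
    [1.0, 0.2, 0.6],
    [0.2, 1.0, 0.4],
    [0.6, 0.4, 1.0],
]
coords = classical_mds(FIG7_LIKE)
```

For a 2N × 2N input, rows 0 to N-1 of the returned array are the fiscal-year-X coordinates and rows N to 2N-1 are the coordinates for the past period, mirroring the ordering of the similarity matrix.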

Next, in step S22 of FIG. 6, each coordinate is plotted on a two-dimensional coordinate plane based on the calculated coordinates. An identifier indicating the sample (a symbol such as ○ or □ marking the coordinate point, together with text such as “sample A” (for example, a company name)) is placed on the two-dimensional coordinates at each coordinate value. In the example of FIG. 6, the data for the past three years (fiscal years X-1 to X-3) and this year's data (fiscal year X) are distinguished by different symbols (or colors).

Next, in step S23, an arrow is displayed for each sample, pointing from the older fiscal years to the newer one. That is, for each sample, an arrow is placed and displayed from the coordinate value for the past three years, as calculated by the coordinate value calculation unit 7 of FIG. 1, toward the coordinate value for this year's data. The map with the arrows is then output.
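The arrow construction of step S23 can be sketched as below. The function name and the returned dictionary layout are hypothetical; the only thing taken from the text is that each arrow starts at a sample's past coordinate and ends at its fiscal-year-X coordinate, with the 2N coordinates ordered current-first. The arrow length is included as a convenience for comparing how far each sample moved; the text itself presents this visually.

```python
import math


def arrows_from_coordinates(coords, names):
    """Build one arrow per sample from its past point to its current point.

    coords: 2N (x, y) pairs; the first N are fiscal year X, the last N are
            the past period, in the same sample order.
    names:  N sample identifiers (for example, company names).
    """
    n = len(names)
    if len(coords) != 2 * n:
        raise ValueError("expected 2N coordinates for N names")
    arrows = {}
    for i, name in enumerate(names):
        cur_x, cur_y = coords[i]          # fiscal year X
        past_x, past_y = coords[n + i]    # fiscal years X-1 to X-3
        arrows[name] = {
            "start": (past_x, past_y),
            "end": (cur_x, cur_y),
            "length": math.hypot(cur_x - past_x, cur_y - past_y),
        }
    return arrows


# Hypothetical coordinates for two samples.
example_arrows = arrows_from_coordinates(
    [(3.0, 4.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0)], ["A", "B"]
)
```

Each entry can be handed to any plotting routine that draws an arrow from `start` to `end`.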

FIG. 10 shows an example of output from the result output unit 8 of FIG. 1. In this example, the coordinate values obtained from the financial data of companies A, B, C, D, and E in a certain industry for the past several years and for this year are displayed on the same coordinate plane, and arrows indicate the direction and magnitude of the change. In addition, for each fiscal year, the mean of the data over all samples is added as one more sample representing an average company, and the change in its coordinate values (labeled “industry average”) is also shown. In this example, it can be seen that companies A and E have changed substantially relative to the other companies in the same industry. This indicates that the changes in their financial data deviate from the trend of the industry as a whole.
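Adding the average company as an extra sample can be sketched as follows. The data layout (a dictionary from sample name to a list of item values for one fiscal year) and the function name are assumptions made for illustration; the text only states that the per-year mean over the samples is added as one more sample.

```python
def add_industry_average(year_data, label="industry average"):
    """Return a copy of one fiscal year's data with the item-wise mean
    over all samples appended as an additional sample."""
    rows = list(year_data.values())
    if not rows:
        raise ValueError("year_data must contain at least one sample")
    n_items = len(rows[0])
    mean_row = [sum(row[k] for row in rows) / len(rows) for k in range(n_items)]
    extended = dict(year_data)   # leave the caller's data untouched
    extended[label] = mean_row
    return extended


# Hypothetical two-item financial data for two companies in one fiscal year.
year_x = {"A": [1.0, 4.0], "B": [3.0, 8.0]}
year_x_with_average = add_industry_average(year_x)
```

Applying the same step to each fiscal year gives the average company its own past and current points, so its arrow can be drawn alongside those of the real samples.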

The processing procedure of this embodiment is summarized as follows, where N is the number of samples.

1. Obtain a subspace from the past (multi-year) data of each sample and use it as the feature of that sample (N subspaces).

2. Obtain a subspace from this year's data of each sample and use it as the feature of that sample (N subspaces).

3. Obtain the similarities among the past-data samples, among this year's samples, and between the past-data samples and this year's samples. The similarity is expressed by the angle (canonical angle) between the subspaces representing the features of the samples. (The similarity matrix is a 2N × 2N matrix.)

4. Based on the similarity matrix, use MDS to obtain the coordinate values of the samples (N past samples and N samples for this year).

5. Based on the coordinate values, place all 2N samples on a two-dimensional plane.

6. For ease of understanding, draw an arrow from each past sample to the corresponding sample for this year.

This embodiment has the following features. Treating the past sample and this year's sample of the same entity as separate samples and applying MDS to them simultaneously makes it easy to grasp how the relative relationship between the past data and this year's data has changed. Drawing an arrow from each past sample to the corresponding sample for this year makes the change in relative relationships still easier to grasp. Furthermore, unlike existing methods, there is no need to create (or update) a calculation formula in advance.

The method is particularly effective for finding anomalous samples by comparing a plurality of companies at once. Because a plurality of samples are compared at once, it not only finds anomalous samples but also reveals the overall trend, enabling a wide range of services such as tracking industry trends. It also makes the similarity between samples across fiscal years visible, which can lead to the discovery of new knowledge. Moreover, anomalous samples can be found without prior knowledge of the data being analyzed, so more efficient work can be expected.

The above configuration is only an example; for instance, the past data is not limited to three years, and the period may be changed. When a plurality of samples can be compared, for example when a plurality of similar systems exist, the analysis apparatus 10 of this embodiment can also be used for network intrusion detection from network traffic log data, or for anomaly detection in space systems such as artificial satellites based on satellite telemetry data acquired from sensors. It can further be used for detecting fraudulent credit card use based on transaction data at the time of card payment, detecting impersonated use of mobile phones, and detecting exceptional events and suspicious data in insurance claim data and medical fee statements.

The analysis apparatus 10 described above contains a computer system. The abnormal value detection, change point detection, and analysis processing described above are stored as a program on a computer-readable recording medium, and the processing is performed when a computer reads and executes this program. Here, a computer-readable recording medium means a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, or the like. The computer program may also be distributed to a computer over a communication line, and the computer receiving the distribution may execute the program.

FIG. 1 is a block diagram showing the internal configuration of the analysis apparatus according to this embodiment.
FIG. 2 is a diagram showing the structure of the data input to the data input unit 1 in the embodiment.
FIG. 3 is a diagram for explaining the subspaces created by the subspace creation unit in the embodiment.
FIG. 4 is a diagram for explaining the similarity calculation method in the embodiment.
FIG. 5 is a flowchart showing the subspace creation and similarity calculation processing in the embodiment.
FIG. 6 is a flowchart showing the coordinate value calculation processing in the embodiment.
FIG. 7 is a diagram showing input data for explaining the coordinate value calculation processing in the embodiment.
FIG. 8 is a diagram showing a calculation result for explaining the coordinate value calculation processing in the embodiment.
FIG. 9 is a diagram showing a plotted example of the calculation result for explaining the coordinate value calculation processing in the embodiment.
FIG. 10 is a diagram showing an example of the output result in the embodiment.
FIG. 11 is a diagram for explaining the prior art.
FIG. 12 is a diagram (part 1) for explaining method (1) of the prior art.
FIG. 13 is a diagram for explaining method (2) of the prior art.
FIG. 14 is a diagram for explaining method (3) of the prior art.
FIG. 15 is a diagram for explaining method (4) of the prior art.

Explanation of symbols

1 Data input unit
2 Data storage unit
3 Subspace creation unit
31 First subspace creation unit
32 Second subspace creation unit
4 Similarity calculation unit
5 Model storage unit
6 Similarity storage unit
7 Coordinate value calculation unit
10 Analysis apparatus

Claims

A data analysis apparatus comprising:
first subspace creation means for creating a first subspace using the data of each of a plurality of samples in a certain period;
second subspace creation means for creating a second subspace using the data of each of the plurality of samples at a predetermined time point after the certain period;
similarity calculation means for obtaining the similarities between the samples using the first subspace and the second subspace;
coordinate value calculation means for obtaining, using the obtained similarities, two-dimensional coordinate values corresponding to each of the plurality of samples in the certain period and two-dimensional coordinate values corresponding to each of the plurality of samples at the predetermined time point; and
output means for outputting the coordinate values obtained by the coordinate value calculation means.

The data analysis apparatus according to claim 1, wherein the coordinate value calculation means calculates an array representing the coordinate values by applying multidimensional scaling to an array representing the similarities.

The data analysis apparatus according to claim 1 or 2, wherein the output means places and displays, on two-dimensional coordinates, an identifier indicating a sample at each coordinate value obtained by the coordinate value calculation means.

The data analysis apparatus according to any one of claims 1 to 3, wherein the output means displays arrows placed from the coordinate values for the certain period, obtained by the coordinate value calculation means, toward the coordinate values at the predetermined time point.

A data analysis method comprising:
a first subspace creation step of creating a first subspace using the data of each of a plurality of samples in a certain period;
a second subspace creation step of creating a second subspace using the data of each of the plurality of samples at a predetermined time point after the certain period;
a similarity calculation step of obtaining the similarities between the samples using the first subspace and the second subspace;
a coordinate value calculation step of obtaining, using the obtained similarities, two-dimensional coordinate values corresponding to each of the plurality of samples in the certain period and two-dimensional coordinate values corresponding to each of the plurality of samples at the predetermined time point; and
an output step of outputting the coordinate values obtained in the coordinate value calculation step.