JP6935551B2

JP6935551B2 - Methods and systems for detecting the root cause of anomalies in datasets

Info

Publication number: JP6935551B2
Application number: JP2020116162A
Authority: JP
Inventors: クマルケー．ピーシャラス; マリヤサガヤムマリエ
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2019-07-18
Filing date: 2020-07-06
Publication date: 2021-09-15
Anticipated expiration: 2040-07-06
Also published as: JP2021018813A

Description

The present disclosure relates to the field of data analysis. In particular, but not exclusively, the present disclosure relates to a method and system for detecting the root cause of anomalies in a dataset.

In general, identifying anomalies in large networks is a difficult task. Datasets collected from the distributed systems of a large network can be very large and can contain noisy data points, so manually checking a dataset to detect one or more anomalies is time-consuming and error-prone. Automated systems are therefore used to detect one or more anomalies in a dataset by ranking graphical representations of the variation of a target variable with respect to a plurality of variables of the dataset. The one or more identified anomalies must be corrected, and so the root cause of the one or more anomalies is needed. Existing graphical representation systems automatically generate graphical representations based on pre-assigned mapping rules in order to represent one or more data variables from the dataset. The pre-assigned mapping rules are based on data characteristics that match the recommended charts. In addition, existing graphical representation systems continuously rank the graphical representations based on user selection history and recommend them to the user. Furthermore, the ranking is based on the nature of the data fields, and many users rely on exploratory data analysis techniques to identify interesting insights from the data relevant to the use case. Consequently, selecting an appropriate graphical representation and finding the right chart when a large number of data fields is available is a time-consuming task.

The problems with the existing techniques are the time taken to generate the ranking and the reliance on poor selection history. This leads to inaccurate rankings and, ultimately, to suboptimal chart recommendations.

A further problem with the existing techniques is that the user's past selection bias makes the recommendations redundant, and that using historical data on the user's past selections and data from the user profile to assign ranks makes the ranking inaccurate for new datasets.

The information disclosed in this background section is intended only to enhance understanding of the general background of the invention, and should not be taken as an acknowledgement or any form of suggestion that this information forms prior art already known to a person skilled in the art.

By providing the method of the present disclosure, one or more shortcomings of the prior art are overcome and additional advantages are provided.

Additional features and advantages are realized through the techniques of the present disclosure. Other embodiments and aspects of the disclosure are described in detail herein and are considered a part of the claimed disclosure.

Disclosed herein is a method of ranking variables for detecting the root cause of anomalies in a dataset. The method includes obtaining a plurality of variables from the dataset and a target variable in the dataset. Further, the method includes identifying a variation of the target variable with respect to the plurality of variables, and detecting one or more anomalies in that variation based on outliers present in the variation. Furthermore, the method includes identifying, from the plurality of variables, one or more variables causing the detected one or more anomalies, based on one or more statistical analyses performed on the plurality of variables. Finally, the method includes ranking the variation of the target variable with respect to each of the identified one or more variables, wherein each variation is displayed based on the ranking for detecting the root cause of the anomalies in the dataset.

Further, the present disclosure discloses a ranking system including a processor and a memory communicatively coupled to the processor, wherein the memory stores processor instructions which, on execution, cause the processor to obtain a plurality of variables from a dataset and a target variable in the dataset. Further, the processor is configured to identify a variation of the target variable with respect to the plurality of variables, and to detect one or more anomalies in that variation based on outliers present in the variation. Furthermore, the processor is configured to identify, from the plurality of variables, one or more variables causing the detected one or more anomalies, based on one or more statistical analyses performed on the plurality of variables. Finally, the processor is configured to rank the variation of the target variable with respect to each of the identified one or more variables, wherein each variation is displayed based on the ranking for detecting the root cause of the anomalies in the dataset.

The foregoing summary is illustrative only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.

The novel features and characteristics of the disclosure are set forth in the accompanying drawings. The disclosure itself, however, as well as a preferred mode of use and further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles. In the figures, the left-most digit of a reference number identifies the figure in which the reference number first appears. One or more embodiments are now described, by way of example only, with reference to the accompanying drawings, in which like reference numerals represent like elements.

A diagram illustrating an exemplary system for detecting the root cause of anomalies in a dataset, according to some embodiments of the present disclosure.
A detailed block diagram illustrating a ranking system, according to some embodiments of the present invention.
A flowchart illustrating a method of detecting the root cause of anomalies in a dataset, according to some embodiments of the present disclosure.
A diagram illustrating an exemplary dataset, according to some embodiments of the present disclosure.
A diagram illustrating an exemplary aggregated dataset based on the categorical data type of a plurality of variables, according to some embodiments of the present disclosure.
A diagram illustrating an exemplary aggregated dataset based on the temporal data type of a plurality of variables, according to some embodiments of the present disclosure.
A diagram illustrating exemplary metadata for identifying the data types of a plurality of variables in a dataset, according to some embodiments of the present disclosure.
A diagram illustrating exemplary metadata for selecting a graphical representation based on the data types of a plurality of variables, according to some embodiments of the present disclosure.
A diagram illustrating exemplary metadata for selecting a data analysis method based on the data types of a plurality of variables, according to some embodiments of the present disclosure.
A diagram illustrating an exemplary detection of outliers present in a variation, according to some embodiments of the present disclosure.
A diagram illustrating an exemplary cluster analysis, according to some embodiments of the present disclosure.
A diagram illustrating an exemplary quartile analysis, according to some embodiments of the present disclosure.
A diagram illustrating an exemplary time series analysis, according to some embodiments of the present disclosure.
A diagram illustrating an exemplary identification of one or more variables using correlation analysis, according to some embodiments of the present disclosure.
A diagram illustrating a general-purpose computer system for ranking variables to detect the root cause of anomalies in a dataset, according to some embodiments of the present disclosure.

It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in a computer-readable medium and executed by a computer or processor, whether or not such a computer or processor is explicitly shown.

In this document, the word "exemplary" is used to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. It should be understood, however, that there is no intent to limit the disclosure to the forms disclosed; on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.

The terms "comprises", "includes", "comprising", "including", or any other variations thereof are intended to cover a non-exclusive inclusion, such that a setup, device, or method that comprises a list of components or steps does not include only those components or steps, but may include other components or steps not expressly listed or inherent to such setup, device, or method. In other words, one or more elements in a system or apparatus introduced by "comprises" or "includes" do not, without further constraints, preclude the existence of other or additional elements in the system or method.

The present disclosure describes a method of ranking variables for detecting the root cause of anomalies in a dataset. The method includes obtaining a plurality of variables from the dataset and a target variable in the dataset. Further, the method includes identifying a variation of the target variable with respect to the plurality of variables, and detecting one or more anomalies in that variation based on outliers present in the variation. Furthermore, the method includes identifying, from the plurality of variables, one or more variables causing the detected one or more anomalies, based on one or more statistical analyses performed on the plurality of variables. Finally, the method includes ranking the variation of the target variable with respect to each of the identified one or more variables, wherein each variation is displayed based on the ranking for detecting the root cause of the anomalies in the dataset.
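The four steps of the method can be sketched end to end as follows. This is only an illustration under stated assumptions, not the claimed implementation: the function names, the sample rows (apart from the values "M101" and "Bangalore", which appear in the description of FIG. 4A), and in particular the scoring rule and threshold are hypothetical. The disclosure itself leaves the concrete detection and ranking criteria to the analyses described later.

```python
from collections import defaultdict
from statistics import median

def variation(rows, variable, target):
    """Step 2a: total of the target variable per distinct value of one variable."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[variable]] += row[target]
    return dict(totals)

def outlier_score(totals):
    """Steps 2b/3 (hypothetical rule): largest deviation of any total from
    the median total, expressed as a multiple of the median."""
    values = list(totals.values())
    centre = median(values)
    if centre == 0:
        return 0.0
    return max(abs(v - centre) for v in values) / centre

def rank_variables(rows, variables, target, threshold=1.0):
    """Step 4: keep variables whose variation contains an outlier and
    order them from the strongest to the weakest deviation."""
    scores = {v: outlier_score(variation(rows, v, target)) for v in variables}
    flagged = [v for v in variables if scores[v] > threshold]
    return sorted(flagged, key=lambda v: scores[v], reverse=True)

# Step 1: hypothetical rows; "quantity" plays the role of the target variable.
rows = [
    {"merchant_id": "M101", "location": "Bangalore", "mcc": "5411", "quantity": 10},
    {"merchant_id": "M102", "location": "Bangalore", "mcc": "5812", "quantity": 12},
    {"merchant_id": "M101", "location": "Mumbai", "mcc": "5411", "quantity": 11},
    {"merchant_id": "M102", "location": "Mumbai", "mcc": "5812", "quantity": 9},
    {"merchant_id": "M103", "location": "Delhi", "mcc": "5411", "quantity": 60},
    {"merchant_id": "M103", "location": "Delhi", "mcc": "5812", "quantity": 50},
]
ranking = rank_variables(rows, ["merchant_id", "location", "mcc"], "quantity")
```

With these rows, the totals per merchant and per location each contain one value far from the rest, whereas the totals per MCC are close together, so only the first two variables are ranked.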

In the following detailed description of embodiments of the disclosure, reference is made to the accompanying drawings, which form a part hereof and in which are shown, by way of illustration, specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.

FIG. 1 shows an exemplary system for detecting the root cause of anomalies in a dataset (101), according to some embodiments of the present disclosure.

In one embodiment, a user may provide a dataset (101) comprising a plurality of rows and columns. In another embodiment, the dataset may be obtained from the memory (202) of the ranking system (109). Further, the user selects a target variable (102) from the dataset (101) for analysis. The dataset (101) is a collection of related, discrete items of data that may be accessed individually or in combination, or managed as a whole entity. The columns of the dataset (101) constitute a plurality of variables, and each row of the dataset (101) constitutes the values taken by the plurality of variables. The target variable (102), also called the dependent variable, is the column of the dataset (101) under observation. The variables of the dataset (101) other than the target variable (102) are called independent variables.

The variables of the dataset (101) other than the target variable (102) are aggregated based on their data types (for example, numeric, categorical, temporal, spatial). The metadata (103) of the dataset (101) includes configuration rules for identifying the data types of the plurality of variables. Further, the variation of the target variable (102) with respect to the aggregated variables corresponding to each data type is identified. From the variation corresponding to each data type, one or more anomalies are detected by the outlier detection unit (105) based on the outliers present in the variation. An outlier denotes at least one data point that deviates from the plurality of data points of the respective variation. The outliers detected for the aggregated variables are stored in the outlier dataset (106). Further, one or more variables (805), from the plurality of variables, that cause the detected one or more anomalies are identified by the variable identification unit (107) based on one or more statistical analyses performed on the plurality of variables. The identified one or more variables (805) are stored in the identified-variable dataset (108). The one or more statistical analyses suitable for the plurality of variables are obtained from the metadata (103) associated with the dataset (101).

Finally, the variation of the target variable (102) with respect to each of the identified one or more variables (805) is ranked by the ranking unit (109). For example, if V₁, V₂, and V₃ are the identified one or more variables, then, as shown in FIG. 1, rank 1 is assigned to the variation of the target variable with the identified variable V₁, rank 2 is assigned to the variation of the target variable with the identified variable V₂, and rank 3 is assigned to the variation of the target variable with the identified variable V₃. For the ranked variations (110), an appropriate graphical representation (111) is selected from the metadata (103) associated with the dataset (101), based on the data type of the one or more variables (805) in the variation. For example, let the data type of the variable V₁ be numeric and the data type of the variable V₂ be spatial. Then, as shown in FIG. 1, a line chart is selected to display the variation of the variable V₁ with respect to the target variable, and a Geo chart is selected to display the variation of the variable V₂ with respect to the target variable. In one embodiment, the order in which the graphical representations (111) are presented may be based on the ranks associated with the one or more variables. For example, the line chart indicating that the corresponding variable V₃ is ranked 1 may be displayed at the top and, similarly, the Geo chart indicating that the corresponding variable V₂ is ranked 2 may be displayed below the line chart. The ranked variations, along with the corresponding graphical representations (111), are displayed to the user for detecting the root cause of the anomalies in the dataset (101).
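The chart selection and display order described for FIG. 1 can be illustrated with a small lookup. The sketch below assumes that the ranking has already been computed; the names `CHART_BY_TYPE` and `charts_in_rank_order` are hypothetical. The numeric and spatial entries follow the example in the text, while the categorical and temporal entries, and the data type given to V3, are assumptions made for the illustration.

```python
# Chart lookup keyed by the data type of the ranked variable.
CHART_BY_TYPE = {
    "numeric": "line chart",     # from the FIG. 1 example (V1)
    "spatial": "Geo chart",      # from the FIG. 1 example (V2)
    "categorical": "bar chart",  # assumption for this sketch
    "temporal": "line chart",    # assumption for this sketch
}

def charts_in_rank_order(ranked_variables, data_types):
    """Return (rank, variable, chart) tuples, best-ranked variable first.

    ranked_variables: variable names ordered from rank 1 downwards.
    data_types: mapping from variable name to its data type.
    """
    return [
        (rank, name, CHART_BY_TYPE[data_types[name]])
        for rank, name in enumerate(ranked_variables, start=1)
    ]

display_order = charts_in_rank_order(
    ["V1", "V2", "V3"],
    {"V1": "numeric", "V2": "spatial", "V3": "categorical"},
)
```

Keeping the mapping in a plain table mirrors the idea that the chart choice comes from metadata rather than from user selection history.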

FIG. 2 shows a detailed block diagram of the ranking system (200), according to some embodiments of the present invention.

The ranking system (200) may include a central processing unit ("CPU" or "processor") (203) and a memory (202) storing instructions executable by the processor (203). The processor (203) may include at least one data processor that executes program components for carrying out user-generated or system-generated requests. The memory (202) may be communicatively coupled to the processor (203). The ranking system (200) further includes an input/output (I/O) interface (201). The I/O interface (201) may be coupled to the processor (203), and input signals and/or output signals may be communicated through it. In one embodiment, the ranking system (200) may receive the dataset (101) and the target variable (102) through the I/O interface (201).

In some implementations, the ranking system (200) may include data (204) and modules (207). As an example, the data (204) and the modules (207) may be stored in the memory (202) configured in the ranking system (200), as shown in FIG. 2. In one embodiment, the data (204) may include, for example, the dataset (101), the metadata (103), a normal-value dataset (205), the outlier dataset (106), the identified variables (108), and other data (206). The modules (207) shown in FIG. 2 are described herein in detail.

In one embodiment, the dataset (101) is a collection of related, discrete items of data that may be accessed individually or in combination, or managed as a whole entity. The dataset (101) includes an arrangement of related data in a plurality of rows and columns. The columns of the dataset (101) constitute a plurality of variables, and each row of the dataset (101) constitutes the values taken by the plurality of variables. FIG. 4A shows an exemplary dataset (101) arranged in the form of a table comprising a plurality of rows and columns. The columns (for example, date, merchant ID, customer ID, transaction quantity, location, and merchant category code (MCC)) constitute the plurality of variables. The rows of the table of FIG. 4A constitute the values taken by the corresponding variables in the columns. For example, considering row 1 of the table of FIG. 4A, "M101" is the value taken by the variable "merchant ID", "Bangalore" is the value taken by the variable "location", and so on.
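The row-and-column structure can be sketched as a small in-memory table. Only the column names and the values "M101" and "Bangalore" come from the description of FIG. 4A; every other value, the helper names, and the choice of "quantity" as the target variable are assumptions made for this illustration.

```python
# Columns are the variables; each row holds the values those variables take.
COLUMNS = ["date", "merchant_id", "customer_id", "quantity", "location", "mcc"]
ROWS = [
    ("2019-01-01", "M101", "C001", 5, "Bangalore", "5411"),  # partly hypothetical
    ("2019-01-02", "M102", "C002", 7, "Mumbai", "5812"),     # hypothetical
]

def column(name):
    """All values taken by one variable, in row order."""
    index = COLUMNS.index(name)
    return [row[index] for row in ROWS]

def independent_variables(target):
    """Every variable of the dataset except the chosen target variable."""
    return [name for name in COLUMNS if name != target]

merchant_ids = column("merchant_id")
predictors = independent_variables("quantity")
```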

In one embodiment, metadata is data that provides information about the dataset (101). The distinct types of metadata are descriptive metadata, structural metadata, administrative metadata, reference metadata, and statistical metadata. Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords. Structural metadata is metadata about containers of data and indicates how compound objects are put together, for example, how pages are ordered to form chapters. It describes the types, versions, relationships, and other characteristics of digital materials. Administrative metadata provides information that helps manage a resource, such as when and how it was created, the file type and other technical information, and who can access it. Reference metadata describes the contents and quality of statistical data. Statistical metadata may also describe the processes that collect, process, or produce statistical data; such data are also called process data.

Further, the metadata (103) includes at least one of: a set of rules for identifying the data types of the plurality of variables; one or more graphical representations (111) for displaying the variation of the target variable (102) with respect to each of the identified one or more variables (805), together with the one or more data types supported by each graphical representation (111); and one or more data analysis methods for analyzing the variation of the target variable (102) with respect to the plurality of variables, together with the one or more data types supported by the corresponding data analysis method. As shown in the table of FIG. 5A, the metadata (103) includes one or more rules for identifying the data types of the plurality of variables in the dataset (101). For example, considering row 1 of the table of FIG. 5A, one or more variables (805) of the dataset (101) having values of integer or double type are identified as being of the numeric data type, and so on. As shown in the table of FIG. 5B, the metadata (103) includes one or more graphical representations (111) for displaying the variation of the target variable (102) with respect to each of the identified one or more variables (805). For example, considering column 1 of the table of FIG. 5B, a "bar chart" is used for the numeric and categorical data types on the X-axis and the numeric data type on the Y-axis, and so on. As shown in the table of FIG. 5C, the metadata (103) includes one or more data analysis methods for analyzing the variation of the target variable (102) with respect to the plurality of variables, and the one or more data types supported by the corresponding data analysis method. For example, considering column 1 of the table of FIG. 5C, for the variation of a target variable (102) of the temporal or numeric data type with respect to variables of the temporal or numeric data type, the "time series analysis (606)" method is used to analyze the variation, a line chart or a bar chart is used to represent the variation, and so on.
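A rule set of the kind described for FIG. 5A might look like the following sketch. Only the first rule (integer or double values are numeric) is taken from the text; the temporal and categorical rules, and the function name `infer_data_type`, are assumptions added so that the example is complete.

```python
from datetime import date

def infer_data_type(values):
    """Classify one variable (column) from the Python types of its values."""
    if not values:
        raise ValueError("cannot infer a data type from an empty column")
    # Rule from the description of FIG. 5A: integer or double -> numeric.
    # bool is excluded because it is a subclass of int in Python.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    # Assumed rule: calendar dates (and datetimes) are temporal.
    if all(isinstance(v, date) for v in values):
        return "temporal"
    # Assumed fallback: everything else is treated as categorical.
    return "categorical"

quantity_type = infer_data_type([5, 7.5, 12])
date_type = infer_data_type([date(2019, 1, 1), date(2019, 1, 2)])
merchant_type = infer_data_type(["M101", "M102"])
```

A spatial rule is omitted here because the text does not say how spatial values would be recognized.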

In one embodiment, the normal-value dataset (205) includes a subset of the rows and columns of the dataset (101), excluding the rows identified as outliers in the variation of the target variable (102) with respect to the plurality of variables of the corresponding data type. Further, the normal-value dataset (205) may include aggregates of the normal-value dataset (205) obtained from the variation of the target variable (102) corresponding to the one or more data types.

In one embodiment, the outlier dataset (106) includes a subset of the rows and columns of the dataset (101) identified as outliers in the variation of the target variable (102) with respect to the plurality of variables of the corresponding data type. The outliers in the variation are identified based on one or more data analysis methods, including at least one of cluster analysis (601), quartile analysis (604), and time series analysis (606). Further, an outlier denotes at least one data point that deviates from the plurality of data points of the respective variation. Further, the outlier dataset (106) may include aggregates of the outlier dataset (106) obtained from the variation of the target variable (102) corresponding to the one or more data types.
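As one way to picture the split into a normal-value dataset and an outlier dataset, the sketch below applies a quartile-based test to aggregated totals. It assumes the conventional fence of 1.5 times the interquartile range beyond the lower and upper quartiles; that multiplier, the function name, and the sample totals are assumptions, since this passage does not fix them.

```python
from statistics import quantiles

def split_by_quartiles(totals, k=1.5):
    """Split {value: aggregated target} into (normal, outliers).

    A total is treated as an outlier when it lies more than k interquartile
    ranges below the lower quartile or above the upper quartile.
    """
    q1, _, q3 = quantiles(list(totals.values()), n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    normal = {key: v for key, v in totals.items() if lower <= v <= upper}
    outliers = {key: v for key, v in totals.items() if not lower <= v <= upper}
    return normal, outliers

# Hypothetical transaction totals aggregated by location.
totals_by_location = {
    "Bangalore": 20, "Mumbai": 22, "Chennai": 21,
    "Pune": 23, "Kolkata": 19, "Delhi": 110,
}
normal_values, outlier_values = split_by_quartiles(totals_by_location)
```

For these totals the quartiles are 20.25 and 22.75, so the fences are 16.5 and 26.5 and only the Delhi total falls outside them.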

In one embodiment, the identified variables (108) include the one or more variables (805), from the plurality of variables, that cause the detected one or more anomalies, identified based on one or more statistical analyses performed on the plurality of variables. The one or more statistical analyses include at least one of correlation analysis (801), regression, and learning algorithms. For example, considering the table of FIG. 4A, the variable "location" may be identified as one of the variables, from the plurality of variables, that cause the detected one or more anomalies (for example, a high transaction quantity in one month).
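Correlation analysis is one of the statistical analyses named above. The sketch below uses the Pearson coefficient between each numeric candidate variable and the target and keeps the candidates whose absolute correlation reaches a threshold. The threshold of 0.8, the function names, and the sample series are assumptions; a categorical variable such as "location" would first need a numeric encoding, which is not shown.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no correlation information
    return covariance / (spread_x * spread_y)

def identify_variables(candidates, target, threshold=0.8):
    """Return {name: r} for candidates with |r| at or above the threshold."""
    scores = {name: pearson(values, target) for name, values in candidates.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Hypothetical series: "a" moves with the target, "b" does not.
target_values = [1, 2, 3, 4, 5]
identified = identify_variables(
    {"a": [2, 4, 6, 8, 10], "b": [5, 1, 4, 2, 3]},
    target_values,
)
```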

In one embodiment, the other data (206) may include data relating to the one or more data analysis methods, including at least one of cluster analysis (601), quartile analysis (604), and time series analysis (606), and to the one or more statistical analyses, including at least one of correlation analysis (801), regression, and a learning algorithm. The other data (206) may further include data relating to the distance measure (603) between two data points in a variation, data obtained by calculating and comparing the interquartile distance (605) between the upper and lower quartiles, and data obtained by calculating and comparing the prediction limits (607) between an upper bound and a lower bound.

In some embodiments, the data (204) may be stored in the memory (202) in the form of various data structures. In addition, the data (204) may be organized using a data model such as a relational or hierarchical data model. The other data (206) may store data, including temporary data and temporary files, generated by the modules (207) in performing the various functions of the ranking system (200).

In some embodiments, the data (204) in the memory (202) may be processed by the modules (207) of the ranking system (200). The modules (207) may be stored in the memory (202). In one example, the modules (207), communicatively coupled to the processor (203) configured in the ranking system (200), may also reside outside the memory (202), as shown in FIG. 2, and may be implemented as hardware. As used herein, the term module (207) may refer to an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), an electronic circuit, a processor (203) (shared, dedicated, or group) and memory (202) that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality. In some other embodiments, the modules (207) may be implemented using at least one of an ASIC and an FPGA.

In one implementation, the modules (207) may include, for example, an outlier detection unit (105), a variable identification unit (107), an output module (209), an input module (208), a ranking unit (109), and other modules (210). It will be appreciated that the modules (207) described above may be represented as a single module or as a combination of different modules (207).

In one embodiment, the outlier detection unit (105) is used to identify outliers present in the plurality of variables based on the result of the at least one data analysis method applied. An outlier denotes at least one data point that deviates from the plurality of data points of the respective variation. The one or more data analysis methods include at least one of cluster analysis (601), quartile analysis (604), and time series analysis (606). Examples of detected outliers are shown in FIGS. 7A, 7B, and 7C.

In one embodiment, the variable identification unit (107) is used to identify, from among the plurality of variables, one or more variables (805) that cause the one or more detected anomalies, based on one or more statistical analyses performed on the plurality of variables. First, the variable identification unit (107) applies at least one of the one or more statistical analyses to the aggregated outlier dataset (106) and to the aggregated normal value dataset (205) of the plurality of variables without outliers. The variable identification unit (107) then identifies the one or more variables (805) by calculating the difference between the results of the statistical analysis for the aggregated outlier dataset (106) and for the aggregated normal value dataset (205). The one or more statistical analyses include at least one of correlation analysis (801), regression, and a learning algorithm. For example, considering the table of FIG. 4A, the "location" variable may be identified as one of the variables that cause a detected anomaly (for example, a high transaction volume during one month).

In one embodiment, the ranking unit (109) is responsible for assigning a rank to the variation of the target variable (102) with respect to each of the one or more identified variables (805). The rank is assigned based on the calculated difference between the results of the statistical analysis for the aggregated outlier dataset (106) and for the aggregated normal value dataset (205) of the plurality of variables without outliers. Further, a graph display (111) is selected from one or more graph displays (111), and the variation is displayed to the user using the output module (209).

In one embodiment, the output module (209) is responsible for displaying the graph display (111) of each variation of the target variable (102) with respect to each of the one or more identified variables (805), based on the rank assigned by the ranking unit (109). Displaying the ranked graph displays (111) to the user enables the user to detect the root cause of an anomaly in the dataset (101).

In one embodiment, the input module (208) is responsible for obtaining the plurality of variables from the dataset (101) and the target variable (102) in the dataset (101), in order to detect the root cause of an anomaly in the dataset (101).

In one embodiment, the other modules (210) are responsible for measuring the distance between two data points in a variation, measuring the distance between a data point and a group of data points, calculating and comparing the interquartile distance (605) between the upper and lower quartiles, and calculating and comparing the prediction limits (607) between an upper bound and a lower bound.

FIG. 3 is a flowchart illustrating a method of detecting the root cause of an anomaly in a dataset (101), according to some embodiments of the present disclosure.

The order in which the method (300) is described is not intended to be construed as a limitation, and any number of the described method blocks may be combined in any order to implement the method. In addition, individual blocks may be deleted from the method without departing from the spirit and scope of the subject matter described herein. Further, the method may be implemented in any suitable hardware, software, firmware, or combination thereof.

In step (301), the ranking system (200) obtains from the user a plurality of variables from the dataset (101) and a target variable (102) in the dataset (101). The columns of the dataset (101) constitute the plurality of variables, and each row of the dataset (101) constitutes the values taken by those variables. An example dataset (101) is shown in the table of FIG. 4A. The target variable (102) obtained from the user corresponds to one of the plurality of variables in the dataset. For example, consider "transaction volume" as the target variable (102) of the dataset (101) obtained from the user.

In one embodiment, the plurality of variables from the dataset (101) are aggregated into one or more data types based on metadata (103) of the plurality of variables. The metadata (103) may include one or more rules for identifying the one or more data types of the plurality of variables. The one or more data types include at least one of a numeric data type, a categorical data type, a temporal data type, and a spatial data type.

As shown in the table of FIG. 4A, "date" and "transaction volume" are identified as the numeric data type, while "merchant ID", "customer ID", "location", and "merchant category code (MCC)" are identified as the categorical data type. The data types of the plurality of variables are identified based on the metadata (103), as shown in the table of FIG. 5A. Further, an aggregated categorical dataset (104B), corresponding to the categorical variables with respect to "transaction volume", is obtained as shown in the table of FIG. 4B, and an aggregated numeric dataset (104A), corresponding to the numeric variables with respect to "transaction volume", is obtained as shown in the table of FIG. 4C.
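
The grouping of variables by data type can be illustrated with a minimal Python sketch. The actual rules held in the metadata (103) are those of FIG. 5A and are not reproduced here; the two predicates, the date format, the column names, and the sample values below are all hypothetical, and a real rule set may type a column such as "date" differently.

```python
from datetime import datetime

# Hypothetical stand-ins for the metadata rules; the real rules are in FIG. 5A.
def _is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def _is_date(value, fmt="%Y-%m-%d"):
    try:
        datetime.strptime(str(value), fmt)
        return True
    except ValueError:
        return False

# Ordered (data type, predicate) pairs; the first rule matching every value wins.
RULES = [("numeric", _is_number), ("temporal", _is_date)]

def infer_data_type(values):
    """Return the first data type whose rule accepts every value, else 'categorical'."""
    for type_name, predicate in RULES:
        if all(predicate(v) for v in values):
            return type_name
    return "categorical"

def group_columns_by_type(dataset):
    """Map each data type to the list of column names assigned to it."""
    groups = {}
    for column, values in dataset.items():
        groups.setdefault(infer_data_type(values), []).append(column)
    return groups

# Invented sample columns, loosely modelled on the variables named for FIG. 4A.
dataset = {
    "transaction_volume": [120, 95, 400],
    "location": ["Tokyo", "Osaka", "Tokyo"],
    "merchant_id": ["M001", "M002", "M001"],
}
groups = group_columns_by_type(dataset)
print(groups)
```

An identifier made only of digits would be typed as numeric by these value-based predicates, which is one reason to keep explicit per-variable rules in the metadata rather than infer types from values alone.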

In step (302), the ranking system (200) identifies the variations of the target variable (102) with respect to the plurality of variables and detects one or more anomalies in the variations based on the outliers present in them. As shown in FIG. 6, the one or more anomalies are detected by applying to each variation at least one of the one or more data analysis methods, chosen according to the data types of the plurality of variables.

In one embodiment, cluster analysis (601) is applied to the numeric dataset (104A), quartile analysis (604) is applied to the categorical dataset (104B) or the spatial dataset (104D), and time series analysis (606) is applied to the temporal dataset (104C). Cluster analysis (601) is the task of grouping a set of data points from the dataset (101) into cluster groups (602) such that the data points in one cluster group (602) are more similar to the other data points in the same cluster than to the data points in other cluster groups (602). A data point is identified as an outlier when it lies farther than a threshold distance from the center of its cluster. In one embodiment, the threshold distance may be set to half the inter-cluster distance.

Quartile is a statistical term for the values that divide a set of data points in the dataset (101) into four defined intervals based on the values of the data points. The values that separate the intervals are called the first, second, and third quartiles. Assuming that the dataset (101) contains 2n data points, the first quartile (Q1) is calculated as the median of the n smallest entries in the dataset (101), the second quartile (Q2) is the median of all 2n entries, and the third quartile (Q3) is calculated as the median of the n largest entries. In one embodiment, the third quartile (Q3) is referred to as the upper quartile and the first quartile (Q1) as the lower quartile. The interquartile range (IQR) is calculated as the distance between the third quartile Q3 and the first quartile Q1. Outliers are identified as data points below (Q1 - 1.5 × IQR) and/or above (Q3 + 1.5 × IQR). A box plot is used to depict graphically the groups of data points in the dataset (101) through their quartiles. As shown in FIG. 7B, a box plot may also have lines extending vertically from the box that indicate the variability of the data points outside the upper and lower quartiles. The band inside the box represents the second quartile. Outliers may be plotted as individual points, as shown in FIG. 7B.

A time series is a series of data points indexed in time order, that is, a sequence taken at successive, equally spaced points in time. A time series therefore forms a sequence of discrete-time data. Time series analysis (606) includes methods for analyzing time series data to extract meaningful statistics (for example, autocorrelation, cross-correlation, and mean) and other characteristics of the data. For example, an autoregressive integrated moving average (ARIMA) model may be used to perform the time series analysis.
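
The quartile rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the multiplier parameter `k`, and the sample values are assumptions, and the sketch handles only the even-sized (2n) case that the description defines.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 for 2n data points, using the median-of-halves rule."""
    data = sorted(values)
    if len(data) < 2 or len(data) % 2:
        raise ValueError("this sketch assumes 2n data points, n >= 1")
    n = len(data) // 2
    # Q1: median of the n smallest; Q2: median of all; Q3: median of the n largest.
    return median(data[:n]), median(data), median(data[n:])

def iqr_outliers(values, k=1.5):
    """Return the values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented transaction volumes for one category; 95 is the planted outlier.
volumes = [10, 12, 11, 13, 12, 11, 95, 12]
print(quartiles(volumes))     # Q1 = 11, Q2 = 12, Q3 = 12.5
print(iqr_outliers(volumes))  # [95]
```

Here IQR = 12.5 - 11 = 1.5, so the limits are 8.75 and 14.75, and only the value 95 falls outside them.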

Further, the outliers present in the plurality of variables are identified based on the result of the at least one data analysis method applied. An outlier denotes at least one data point that deviates from the plurality of data points of the respective variation. The identified outliers are used to generate an aggregated dataset (101) of outliers and an aggregated normal value dataset (205) of the plurality of variables without outliers. In one embodiment, as shown in FIG. 6, a distance criterion is used to identify outliers based on the one or more cluster groups (602) generated by cluster analysis (601), the interquartile distance is used to identify outliers based on the result of quartile analysis (604), and the prediction limits (607) are used to identify outliers based on the result of time series analysis (606).

Further, FIG. 7A shows example outliers identified using the distance criterion of a data point from the center of its cluster, based on the one or more cluster groups (602) generated by cluster analysis (601). FIG. 7B shows example outliers identified by comparing interquartile distances (605). FIG. 7C shows example outliers identified based on time series analysis.
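
The distance criterion of FIG. 7A can be sketched as follows, assuming that clustering has already produced the cluster centers. Reading the "inter-cluster distance" as the smallest center-to-center distance is an assumption made for this sketch, as are the function name and the sample coordinates.

```python
from itertools import combinations
from math import dist

def distance_outliers(points, centers):
    """Flag points farther from their nearest center than the threshold.

    The threshold is half the smallest center-to-center distance (assumed
    interpretation of 'half the inter-cluster distance').
    """
    if len(centers) < 2:
        raise ValueError("need at least two cluster centers")
    threshold = min(dist(a, b) for a, b in combinations(centers, 2)) / 2
    return [p for p in points if min(dist(p, c) for c in centers) > threshold]

# Invented cluster centers and data points; (1.0, 8.0) belongs to neither cluster.
centers = [(1.0, 1.0), (9.0, 9.0)]
points = [(1.2, 0.8), (0.5, 1.5), (9.1, 8.7), (8.6, 9.4), (1.0, 8.0)]
print(distance_outliers(points, centers))  # [(1.0, 8.0)]
```

With these centers the threshold is about 5.66, and the point (1.0, 8.0) lies 7.0 from its nearest center, so it is the only point flagged.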

As shown in FIG. 6, the identified outliers are separated from the aggregated dataset (101) corresponding to each data type and stored in the outlier dataset (101), and the remaining data of the aggregated dataset (101) for that data type, excluding the outlier data, is stored in the normal value dataset (205). The outlier datasets (106) and normal value datasets (205) of the one or more data types are combined by data aggregation (608) to generate, as shown in FIG. 6, the aggregated outlier dataset (106) and the aggregated normal value dataset (205) of the plurality of variables without outliers.

In one embodiment, the aggregated outlier dataset corresponds to the one or more detected anomalies.

In step (303), the ranking system (200) identifies, from among the plurality of variables, one or more variables (805) that cause the one or more detected anomalies, based on one or more statistical analyses performed on the plurality of variables. In one embodiment, at least one of the one or more statistical analyses is applied to the aggregated outlier dataset (106) and to the aggregated normal value dataset (205) of the plurality of variables without outliers, and the one or more variables (805) are identified by calculating the difference between the results of the statistical analysis for the two datasets. The one or more statistical analyses include at least one of correlation analysis (801), regression, and a learning algorithm.

FIG. 8 shows an example correlation analysis (801) that identifies, from among the plurality of variables, one or more variables (805) that cause the one or more detected anomalies. Correlation analysis (801) identifies the statistical relationship between two or more variables. The correlation between the data points of the outlier dataset (106) and the target variable (102) is identified to generate the outlier data correlation (802). The correlation between the data points of the normal value dataset (205) and the target variable (102) is identified to generate the normal value data correlation (803). The correlation analysis (801) yields a numerical measure of correlation called the correlation coefficient. The value of the correlation coefficient ranges from -1 to +1, where +1 indicates a strong positive correlation (or agreement) of the data points with the target variable (102), 0 indicates no correlation (or agreement), and -1 indicates a strong negative correlation (or agreement). As shown in FIG. 8, the outlier data correlation (802) shows V_1 with a correlation coefficient of +0.9, V_2 with a correlation coefficient of +0.2, and V_3 with a correlation coefficient of -1, where V_1, V_2, and V_3 denote the plurality of variables in the outlier dataset (106). The normal value data correlation (803) shows V_1 with a correlation coefficient of +0.7, V_2 with a correlation coefficient of +0.3, and V_3 with a correlation coefficient of +1, where V_1, V_2, and V_3 denote the plurality of variables in the normal value dataset (205), as shown in FIG. 8. Further, the absolute correlation difference (804) between the correlation coefficient of the outlier data correlation (802) and that of the normal value data correlation (803) is calculated. V_1 has a correlation difference (804) of 0.2 (|+0.9 - 0.7|), V_2 has a correlation difference (804) of 0.1 (|+0.2 - 0.3|), and V_3 has a correlation difference (804) of 2 (|-1 - 1|). The variable V_3, which has the largest absolute correlation difference (804) compared with the variables V_1 and V_2, is identified as the one or more variables (805) that cause the one or more detected outliers, based on the correlation analysis (801) performed on the plurality of variables.
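
The arithmetic of this example can be reproduced with a short Python sketch. The coefficient values are those quoted for FIG. 8; the `pearson` helper shows one common way such coefficients could be obtained and is an assumption, since the description does not say which correlation measure is used. All function and variable names are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (assumed measure)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_differences(outlier_corr, normal_corr):
    """Absolute difference between outlier-data and normal-data coefficients."""
    return {name: abs(outlier_corr[name] - normal_corr[name]) for name in outlier_corr}

# Coefficients as quoted for FIG. 8.
outlier_corr = {"V1": 0.9, "V2": 0.2, "V3": -1.0}
normal_corr = {"V1": 0.7, "V2": 0.3, "V3": 1.0}

diffs = correlation_differences(outlier_corr, normal_corr)
suspect = max(diffs, key=diffs.get)  # variable with the largest absolute difference
print({name: round(d, 2) for name, d in diffs.items()}, suspect)
```

The differences come out as approximately 0.2, 0.1, and 2.0 for V1, V2, and V3, so V3 is selected, matching the example.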

In one embodiment, a regression model fitted to the plurality of variables may be used to identify the one or more variables (805) that cause the one or more detected anomalies. The regression model is fitted using the following equation.

Y = β_0 + β_1 × V_1 + β_2 × V_2 + … + β_n × V_n

In the equation, V_1, V_2, …, V_n denote the plurality of variables, Y denotes the target variable (102), and β_0, β_1, …, β_n denote the weights of the regression model. The regression model is fitted separately to the outlier dataset and to the normal value dataset (205). The difference between the weights of the model fitted to the outlier dataset and the weights of the model fitted to the normal value dataset (205) is calculated, and the one or more variables (805) whose difference exceeds a threshold are identified as the cause of the one or more detected anomalies, based on the regression performed on the plurality of variables.
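
The selection rule can be written compactly as below. The superscripts "out" and "norm", the symbol τ for the threshold, and the use of the absolute value (by analogy with the absolute correlation difference (804)) are notational assumptions, not taken from the description.

```latex
\Delta\beta_i = \left|\beta_i^{(\mathrm{out})} - \beta_i^{(\mathrm{norm})}\right|,
\qquad
V_i \text{ is identified as a cause if } \Delta\beta_i > \tau,
\quad i = 1, \dots, n
```

Here β_i^(out) is the weight of V_i in the model fitted to the outlier dataset and β_i^(norm) is its weight in the model fitted to the normal value dataset (205).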

In one embodiment, a decision tree or a neural network that employs a learning algorithm may be used to identify the one or more variables (805) that cause the one or more detected anomalies, based on the correlation analysis (801) performed on the plurality of variables.

In step (304), the ranking system (200) ranks the variations of the target variable (102) with respect to each of the one or more identified variables (805). The variations are ranked based on the calculated difference between the results of the statistical analysis for the aggregated dataset (101) of outliers and for the aggregated normal value dataset (205) of the plurality of variables without outliers. Further, a graph display (111) for displaying each variation is selected from one or more graph displays (111) using the metadata (103). Each variation is then displayed to the user according to its assigned rank. The user may use the ranked variations to detect the root cause of an anomaly in the dataset (101).

In one embodiment, the variations of the target variable (102) with respect to each of the one or more identified variables (805) are ranked based on the difference calculated in step (303). As shown in FIG. 8, V_3 is assigned rank 1, V_2 is assigned rank 2, and V_1 is assigned rank 3. Further, the graph displays (111) for the variations of V_3 with respect to the target variable (102), V_2 with respect to the target variable (102), and V_1 with respect to the target variable (102) are selected from the one or more graph displays (111) using the metadata (103) of the dataset (101). As shown in Table 2 of FIG. 5, an appropriate chart or graph display (111) is selected based on the data type of the one or more variables represented on the X axis and the data type of the target variable (102) represented on the Y axis. As an example, considering column 1 of Table 2 of FIG. 5, a bar graph is selected as the graph display (111) for displaying the variation of a target variable (102) whose data type is numeric with respect to each identified variable (805) whose data type is categorical or numeric. Further, the graph displays (111) selected for the variations of the target variable (102) with respect to each of the one or more identified variables (805) are displayed to the user according to the assigned ranks, as shown in FIG. 1.

In one embodiment, the user may select one or more of the detected anomalies, based on the ranked graph displays (111) presented, in order to identify the root cause of the anomaly using existing drill-down techniques.

As an example, consider a network of distributed systems in which the root cause of an anomaly is to be detected. The network of distributed systems includes one or more compute nodes interconnected to form a network. Every compute node in the network generates and collects logs of, for example, CPU utilization, the amount of free memory, the percentage of time during which CPU input/output was blocked, the number of blocks read per second, and the number of blocks written per second. One or more anomalies are detected using the information in the logs. The one or more anomalies include erroneous behavior in the network of distributed systems or unexpectedly long response times from the network. These anomalies may be caused by hardware problems, network communication congestion, or software bugs in distributed system components. The one or more anomalies may be detected using time series analysis. The one or more detected anomalies are ranked and displayed to the user. Further, the root cause of an anomaly may be detected by performing a drill-down analysis based on the one or more ranked anomalies.
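
Detecting an anomaly in such a log with prediction limits can be sketched as follows. The description names ARIMA as one possible model; this sketch deliberately substitutes a much simpler stand-in, a trailing-window mean with limits of k standard deviations, purely to show how upper and lower limits flag a data point. The window size, the multiplier `k`, and the CPU utilization values are all invented.

```python
from statistics import mean, stdev

def prediction_limit_outliers(series, window=5, k=3.0):
    """Return (index, value) pairs outside mean +/- k*stdev of the preceding window.

    A simple stand-in for model-based prediction limits; not an ARIMA model.
    """
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        center, spread = mean(history), stdev(history)
        lower, upper = center - k * spread, center + k * spread
        if not lower <= series[i] <= upper:
            flagged.append((i, series[i]))
    return flagged

# Invented CPU utilization samples (percent) at equally spaced times.
cpu_utilization = [41, 43, 42, 44, 42, 43, 97, 44, 42, 43]
print(prediction_limit_outliers(cpu_utilization))  # [(6, 97)]
```

Once the spike enters the trailing window it widens the limits for the next few points, which is one reason a fitted forecasting model is usually preferable to this stand-in.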

As another example, consider a banking or financial services domain in which card-based transactions are carried out from various sources such as point-of-sale counters, e-commerce sites, and mobile apps. Detection of one or more anomalies, such as suspicious transactions, service outages, or a sudden drop in transactions at a location, may be performed by the ranking system (109). The one or more anomalies may degrade the quality of service the bank provides to its customers. The bank therefore needs to monitor transaction patterns continuously in order to detect the one or more anomalies and identify their root causes. The one or more anomalies may be detected using information recorded in the bank payment network, such as transaction logs and customer relationship data. Consider an anomaly such as a service outage at a location. The ranking system (109) may identify the service outage using time series and categorical analysis. The one or more variables that cause the detected anomaly may be identified by the ranking system (109). The one or more variables may include, for example, a failure of the payment network due to a very large number of accesses at that location. The one or more identified variables are ranked and displayed to the user using an appropriate graph display (111). Further, the root cause of the anomaly may be detected by performing a drill-down analysis based on the one or more ranked anomalies.

Computer System

FIG. 9 is a block diagram of an example computer system (900) for implementing embodiments consistent with the present disclosure. In one embodiment, the computer system (900) may be used to implement a method of ranking variables for detecting the root cause of an anomaly in a dataset. The computer system (900) may include a central processing unit ("CPU" or "processor") (902). The processor (902) may include at least one data processor that executes program components so as to allocate resources dynamically at run time. The processor (902) may include specialized processing units such as integrated system (bus) controllers, memory (202) management control units, floating point units, graphics processing units, and digital signal processing units.

The processor (902) may be disposed in communication with one or more input/output (I/O) devices (not shown) via an I/O interface (901). The I/O interface (901) may employ communication protocols or methods such as, without limitation, audio, analog, digital, monaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11 b/g/n/x, Bluetooth®, and cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like).

Using the I/O interface (901), the computer system (900) may communicate with one or more I/O devices. For example, the input device (910) may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, stylus, scanner, storage device, transceiver, video device/source, or the like. The output device (911) may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, plasma display panel (PDP), organic light-emitting diode (OLED) display, or the like), audio speaker, or the like.

In some embodiments, the computer system (900) is connected to service operators through a communication network (909). The processor (902) may be disposed in communication with the communication network (909) via a network interface (903), and the network interface (903) may communicate with the communication network (909). The network interface (903) may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, and IEEE 802.11a/b/g/n/x. The communication network (909) may include, without limitation, a direct interconnection, an e-commerce network, a peer-to-peer (P2P) network, a local area network (LAN), a wide area network (WAN), a wireless network (e.g., using Wireless Application Protocol), the Internet, and Wi-Fi. Using the network interface (903) and the communication network (909), the computer system (900) may communicate with one or more service operators.

In some embodiments, the processor (902) may be disposed in communication with a memory (905) (e.g., RAM, ROM, etc., not shown in FIG. 9) via a storage interface (904). The storage interface (904) may connect to the memory (905), which includes, without limitation, memory drives and removable disk drives, using connection protocols such as Serial Advanced Technology Attachment (SATA), Integrated Drive Electronics (IDE), IEEE-1394, Universal Serial Bus (USB), Fibre Channel, and Small Computer System Interface (SCSI). The memory drives may further include a drum, a magnetic disk drive, a magneto-optical drive, an optical drive, a redundant array of independent disks (RAID), solid-state memory devices, solid-state drives, and the like.

The memory (905) may store a collection of program or database components, including, without limitation, a user interface (906), an operating system (907), a web server (908), and the like. In some embodiments, the computer system (900) may store user/application data (906), such as the data, variables, and records described in the present disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase.

The operating system (907) may facilitate resource management and operation of the computer system (900). Examples of operating systems include, without limitation, APPLE® MACINTOSH® OS X®, UNIX®, UNIX-like system distributions (e.g., BERKELEY SOFTWARE DISTRIBUTION® (BSD), FREEBSD®, NETBSD®, OPENBSD®, etc.), LINUX® distributions (e.g., RED HAT®, UBUNTU®, KUBUNTU®, etc.), IBM® OS/2®, MICROSOFT® WINDOWS® (XP®, VISTA®, 7/8, 10, etc.), APPLE® IOS®, GOOGLE™ ANDROID®, BLACKBERRY® OS, and the like.

In some embodiments, the computer system (900) may implement a web browser (908) stored program component. The web browser (908) may be a hypertext viewing application such as MICROSOFT® INTERNET EXPLORER®, GOOGLE™ CHROME™, MOZILLA® FIREFOX®, or APPLE® SAFARI®. Secure web browsing may be provided using Secure Hypertext Transfer Protocol (HTTPS), Secure Sockets Layer (SSL), Transport Layer Security (TLS), and the like. The web browser (908) may utilize facilities such as AJAX, DHTML, ADOBE® FLASH®, JAVASCRIPT®, JAVA®, and application programming interfaces (APIs).

In some embodiments, the computer system (900) may implement a mail server stored program component. The mail server may be an Internet mail server such as Microsoft Exchange. The mail server may utilize facilities such as Active Server Pages (ASP), ACTIVEX®, ANSI® C++/C#, MICROSOFT® .NET, CGI SCRIPTS, JAVA®, JAVASCRIPT®, PERL®, PHP, PYTHON®, and WEBOBJECTS®. The mail server may utilize communication protocols such as Internet Message Access Protocol (IMAP), Messaging Application Programming Interface (MAPI), MICROSOFT® Exchange, Post Office Protocol (POP), and Simple Mail Transfer Protocol (SMTP).

In some embodiments, the computer system (900) may implement a mail client stored program component. The mail client may be a mail viewing application such as APPLE® MAIL, MICROSOFT® ENTOURAGE®, MICROSOFT® OUTLOOK®, or MOZILLA® THUNDERBIRD®.

Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor (203) may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions that cause the processors to perform steps or stages consistent with the embodiments described herein. The term "computer-readable medium" should be understood to include tangible items and to exclude carrier waves and transient signals, i.e., to be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, compact disc (CD) ROMs, digital video discs (DVDs), flash drives, disks, and any other known physical storage media.

The terms "an embodiment", "embodiment", "embodiments", "the embodiment", "the embodiments", "one or more embodiments", "some embodiments", and "one embodiment" mean "one or more (but not all) embodiments of the invention" unless expressly specified otherwise.

The terms "including", "comprising", "having", and variations thereof mean "including but not limited to" unless expressly specified otherwise.

An enumerated list of items does not imply that any or all of the items are mutually exclusive, unless expressly specified otherwise. The terms "a", "an", and "the" mean "one or more", unless expressly specified otherwise.

A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention.

When a single device or article is described herein, it will be readily apparent that more than one device/article (whether or not they cooperate) may be used in place of the single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be readily apparent that a single device/article may be used in place of the more than one device or article, or that a different number of devices/articles may be used in place of the shown number of devices or programs. Alternatively, the functionality and/or features of a device may be embodied by one or more other devices that are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.

The illustrated operations of FIG. 3 show certain events occurring in a certain order. In alternative embodiments, certain operations may be performed in a different order, modified, or removed. Moreover, steps may be added to the logic described above while still conforming to the described embodiments. Further, the operations described herein may occur sequentially, or certain operations may be processed in parallel. Furthermore, the operations may be performed by a single processing unit or by distributed processing units.

Accordingly, the method of ranking variables for detecting the root cause of anomalies in the dataset (101) uses the metadata (103) associated with the dataset (101) to recommend one or more graph displays (111) that show the variation of the target variable (102) with respect to each of the identified one or more variables (805). In addition, the ranked graph displays (111) provide drill-down insights based on the related graph displays (111), which describe the one or more anomalies detected in the dataset (101) and the one or more variables (805) that cause them.

Finally, the language used in this specification has been selected principally for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description but by any claims that issue on an application based on it. Accordingly, the disclosure of the embodiments of the invention is intended to illustrate, and not to limit, the scope of the invention set forth in the following claims.

While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting; the true scope and spirit are indicated by the following claims.

101 Dataset
102 Target variable
103 Metadata
104A Numerical dataset
104B Categorical dataset
104C Temporal dataset
104D Spatial dataset
105 Outlier detection unit
106 Outlier dataset
107 Variable identification unit
108 Identified variable dataset
109 Ranking unit
110 Ranked variations
111 Graph display
200 Ranking system
201 Input/output interface
202 Memory
203 Processor
204 Data
205 Normal value dataset
206 Other data
207 Modules
208 Input module
209 Output module
210 Other modules
601 Cluster analysis
602 Cluster groups
603 Distance measure
604 Quartile analysis
605 Comparison with interquartile range
606 Time series analysis
607 Prediction limits
608 Data aggregation
801 Correlation analysis
802 Outlier data correlation
803 Normal value data correlation
804 Correlation difference
805 One or more variables
900 Computer system
901 Input/output interface
902 Processor
903 Network interface
904 Storage interface
905 Memory
906 User interface
907 Operating system
908 Web server
909 Communication network
910 Input device
911 Output device
912 Remote device

Claims

A method of ranking variables for detecting the root cause of anomalies in a dataset (101), the method comprising:
obtaining, by a ranking system (200), a plurality of variables from the dataset (101) and a target variable (102) in the dataset (101);
identifying, by the ranking system (200), variations of the target variable (102) with respect to the plurality of variables, and detecting one or more anomalies in the variations of the target variable (102) with respect to the plurality of variables based on outliers present in the variations;
identifying, by the ranking system (200), one or more variables, from the plurality of variables, that cause the detected one or more anomalies, based on one or more statistical analyses performed on the plurality of variables; and
ranking, by the ranking system (200), the variations of the target variable (102) with respect to each of the identified one or more variables (805), wherein the variations are each displayed based on the ranking in order to detect the root cause of the anomalies in the dataset (101).

The method of claim 1, wherein the plurality of variables are aggregated into one or more data types based on metadata (103) of the plurality of variables.

The method of claim 2, wherein the one or more data types include at least one of a numerical data type, a categorical data type, a temporal data type, and a spatial data type.

The method of claim 2, wherein the metadata (103) includes at least one of: a rule set for identifying the data types of the plurality of variables; one or more graph displays (111) for displaying the variation of the target variable (102) with respect to each of the identified one or more variables (805), together with the one or more data types supported by each graph display (111); and one or more data analysis methods for analyzing the variations of the target variable (102) with respect to the plurality of variables, together with the one or more data types supported by the corresponding data analysis method.
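As an illustration of the rule set mentioned in this claim, the following is a minimal sketch of how a variable's data type might be inferred from its values. The function name `infer_data_type` and the individual rules (numbers, ISO-format dates, two-element coordinate tuples) are hypothetical choices made for this example; the claims do not specify the rules.

```python
from datetime import datetime


def _is_number(value):
    """Return True if the value can be read as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def _is_timestamp(value):
    """Return True if the value parses as an ISO-format date or datetime."""
    try:
        datetime.fromisoformat(str(value))
        return True
    except ValueError:
        return False


def _is_point(value):
    """Return True for a (latitude, longitude)-style pair of numbers."""
    return (
        isinstance(value, tuple)
        and len(value) == 2
        and all(_is_number(v) for v in value)
    )


def infer_data_type(values):
    """Classify a variable using a hypothetical, ordered rule set."""
    if all(_is_point(v) for v in values):
        return "spatial"
    if all(_is_number(v) for v in values):
        return "numeric"
    if all(_is_timestamp(v) for v in values):
        return "temporal"
    # Anything that matches no other rule is treated as categorical.
    return "categorical"
```

In this sketch the rules are checked in a fixed order and the first match wins; a real rule set could equally be stored as data inside the metadata.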

The method of claim 4, wherein the one or more data analysis methods include at least one of cluster analysis (601), quartile analysis (604), and time series analysis (606).

The method of claim 4, wherein detecting the one or more anomalies comprises:
applying, to each variation, at least one data analysis method from the one or more data analysis methods based on the data types of the plurality of variables; and
identifying the outliers present in the plurality of variables based on a result of the applied at least one data analysis method.

The method of claim 1, wherein the outliers, which indicate at least one data point deviating from the plurality of data points of each variation, are used to generate an aggregated outlier dataset (106) and an aggregated normal value dataset (205) of the plurality of variables without the outliers.
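The split into an outlier dataset (106) and a normal value dataset (205) can be sketched with a quartile analysis. The example below is only an illustration under stated assumptions: the function name `split_by_iqr`, the row-as-dictionary layout, and the conventional 1.5 x IQR fence are choices made here and are not taken from the claims.

```python
import statistics


def split_by_iqr(rows, key, k=1.5):
    """Split rows into (outliers, normal) using interquartile-range fences.

    rows: list of dicts; key: name of the numeric field to test.
    k: fence multiplier (1.5 is a common convention, assumed here).
    """
    values = [row[key] for row in rows]
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    outliers = [row for row in rows if not (low <= row[key] <= high)]
    normal = [row for row in rows if low <= row[key] <= high]
    return outliers, normal


# Hypothetical transaction counts; the value 95 lies far outside the fences.
rows = [{"amount": a} for a in (10, 11, 12, 11, 10, 12, 11, 95)]
outliers, normal = split_by_iqr(rows, "amount")
```

Rows inside the fences form the normal value dataset and the rest form the outlier dataset, so both sets keep all of their variables for the later statistical analysis.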

The method of claim 1, wherein identifying the one or more variables (805) comprises:
applying at least one statistical analysis, from the one or more statistical analyses, to an aggregated outlier dataset (106) and to an aggregated normal value dataset (205) of the plurality of variables without the outliers; and
identifying the one or more variables (805) by calculating a difference between the results of the statistical analysis for the aggregated outlier dataset (106) and for the aggregated normal value dataset (205).
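One way to picture this step is with correlation analysis (801): compute the correlation of each variable with the target separately on the outlier dataset and on the normal value dataset, then take the difference. The sketch below assumes Pearson correlation and an absolute difference; the function names and sample data are hypothetical, and the claims leave the exact statistic open.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlation_differences(outlier_rows, normal_rows, target, variables):
    """Absolute difference between outlier-set and normal-set correlations."""
    diffs = {}
    for var in variables:
        r_out = pearson([r[var] for r in outlier_rows],
                        [r[target] for r in outlier_rows])
        r_norm = pearson([r[var] for r in normal_rows],
                         [r[target] for r in normal_rows])
        diffs[var] = abs(r_out - r_norm)
    return diffs


# Hypothetical data: "load" flips its relationship with "txn" in the outliers,
# while "temp" relates to "txn" the same way in both sets.
normal_rows = [{"load": l, "temp": t, "txn": x}
               for l, t, x in [(1, 5, 10), (2, 7, 11), (3, 6, 12), (4, 8, 13)]]
outlier_rows = [{"load": l, "temp": t, "txn": x}
                for l, t, x in [(8, 8, 4), (9, 6, 3), (10, 7, 2), (11, 5, 1)]]
diffs = correlation_differences(outlier_rows, normal_rows, "txn", ["load", "temp"])
```

A variable whose relationship with the target changes sharply between the two datasets gets a large difference, which is what marks it as a candidate cause in this sketch.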

The method of claim 1, wherein the one or more statistical analyses include at least one of correlation analysis (801), regression, and a learning algorithm.

The method of claim 1, wherein the ranking of the variations of the target variable (102) with respect to each of the identified one or more variables (805) is based on a calculated difference between the results of a statistical analysis for an aggregated outlier dataset (106) and for an aggregated normal value dataset (205) of the plurality of variables without the outliers, and wherein a graph display (111) is selected from the one or more graph displays (111) to display the variations.
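The ranking and graph selection described in this claim might look like the following sketch. It assumes that a larger difference ranks higher and that the metadata lists which data types each graph display supports; both the ordering rule and the sample graph names are assumptions made for illustration only.

```python
def rank_variables(differences):
    """Order variable names by their calculated difference, largest first.

    The descending order is an assumption of this sketch.
    """
    return sorted(differences, key=differences.get, reverse=True)


def choose_graph(data_type, graph_support):
    """Return the first graph display whose supported types include data_type."""
    for graph, supported_types in graph_support.items():
        if data_type in supported_types:
            return graph
    return None


# Hypothetical differences and metadata.
differences = {"load": 2.0, "temp": 0.1, "region": 0.7}
graph_support = {
    "scatter plot": {"numeric"},
    "line chart": {"temporal"},
    "bar chart": {"categorical"},
}
ranking = rank_variables(differences)
graph_for_time = choose_graph("temporal", graph_support)
```

Keeping the graph lookup separate from the ranking lets the same ranked list be shown with whichever display the metadata supports for each variable's data type.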

A ranking system (200) comprising:
a processor (203); and
a memory (202) communicatively coupled to the processor (203) and storing processor instructions which, when executed, cause the processor (203) to:
obtain a plurality of variables from a dataset (101) and a target variable (102) in the dataset (101);
identify variations of the target variable (102) with respect to the plurality of variables, and detect one or more anomalies in the variations of the target variable (102) with respect to the plurality of variables based on outliers present in the variations;
identify one or more variables (805), from the plurality of variables, that cause the detected one or more anomalies, based on one or more statistical analyses performed on the plurality of variables; and
rank the variations of the target variable (102) with respect to each of the identified one or more variables (805), wherein the variations are each displayed based on the ranking in order to detect the root cause of the anomalies in the dataset (101).

The ranking system (200) of claim 11, wherein the processor (203) is configured to aggregate the obtained plurality of variables into one or more data types based on metadata (103) of the plurality of variables.

The ranking system (200) of claim 12, wherein the one or more data types include at least one of a numerical data type, a categorical data type, a temporal data type, and a spatial data type.

The ranking system (200) of claim 12, wherein the metadata (103) includes at least one of: a rule set for identifying the data types of the plurality of variables; one or more graph displays (111) for displaying the variation of the target variable (102) with respect to each of the identified one or more variables (805), together with the one or more data types supported by each graph display (111); and one or more data analysis methods for analyzing the variations of the target variable (102) with respect to the plurality of variables, together with the one or more data types supported by the corresponding data analysis method.

The ranking system (200) of claim 14, wherein the one or more data analysis methods include at least one of cluster analysis (601), quartile analysis (604), and time series analysis (606).

The ranking system (200) of claim 11, wherein the processor (203) is configured to detect the one or more anomalies by:
applying, to each variation, at least one data analysis method from one or more data analysis methods based on the data types of the plurality of variables; and
identifying the outliers present in the plurality of variables based on a result of the applied at least one data analysis method.

The ranking system (200) of claim 11, wherein the processor (203) is configured to generate, based on the outliers identified in the plurality of variables, an aggregated outlier dataset (106) and an aggregated normal value dataset (205) of the plurality of variables without the outliers, and wherein the outliers indicate at least one data point deviating from the plurality of data points of each variation.

The ranking system (200) of claim 11, wherein the processor (203) is configured to identify the one or more variables (805) by:
applying at least one statistical analysis, from the one or more statistical analyses, to an aggregated outlier dataset (106) and to an aggregated normal value dataset (205) of the plurality of variables without the outliers; and
identifying the one or more variables (805) by calculating a difference between the results of the statistical analysis for the aggregated outlier dataset (106) and for the aggregated normal value dataset (205).

The ranking system (200) of claim 11, wherein the one or more statistical analyses include at least one of correlation analysis (801), regression, and a learning algorithm.

The ranking system (200) of claim 11, wherein the processor (203) is configured to rank the variations of the target variable (102) with respect to each of the identified one or more variables (805) based on a calculated difference between the results of a statistical analysis for an aggregated outlier dataset (106) and for an aggregated normal value dataset (205) of the plurality of variables without the outliers, and wherein a graph display (111) is selected from the one or more graph displays (111) to display the variations.