JP6775452B2

JP6775452B2 - Monitoring system, program and monitoring method

Info

Publication number: JP6775452B2
Application number: JP2017055882A
Authority: JP
Inventors: 朝信丹羽; 雅典宮澤; 通秋林
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-03-22
Filing date: 2017-03-22
Publication date: 2020-10-28
Anticipated expiration: 2037-03-22
Also published as: JP2018160020A; WO2018173698A1

Description

The present invention relates to a technique for monitoring a plurality of components configured on a physical computer and the correlations between those components.

A technology called cloud computing is conventionally known. In this technology, virtualization is applied to a physical computer (physical machine or physical server) to build a virtualization infrastructure (cloud infrastructure), and virtual computers (virtual machines or virtual servers) run on that infrastructure. Services are then provided by executing applications on the virtual computers.

In such cloud computing, virtual computers can be created, destroyed, and migrated dynamically, so resources such as computing, storage, and network can be allocated to virtual computers flexibly according to how the services running on them are used. In addition, when an abnormality or failure occurs in a physical computer, the virtual computers running on it can easily be moved to another healthy physical computer, which makes it possible to ensure high availability.

A virtualization infrastructure realizes a cloud computing service through the cooperation of various functions. Examples include an authentication function that manages access rights to infrastructure operations, a compute function that manages the creation and destruction of virtual computers, an image management function that manages the boot images of virtual computers, a storage function that provides storage to virtual computers, a networking function that provides networks to virtual computers, and a dashboard function that provides a web interface to the virtualization infrastructure control system. Each of these functions also operates in cooperation with middleware that provides a database, a message queue, an HTTP service, an NTP service, and the like.

FIG. 6 shows an example of the configuration of a virtualization infrastructure. In FIG. 6, the virtualization infrastructure that runs virtual computers is called the "compute node" and the virtualization infrastructure that controls the compute node is called the "controller node", and the figure shows how their functions and middleware cooperate. Hereinafter, the functions constituting the virtualization infrastructure and the middleware cooperating with those functions are collectively called "constituent elements of the virtualization infrastructure" or "components".

To provide a stable cloud computing service, the virtualization infrastructure must be highly fault tolerant. In particular, promptly discovering abnormalities and failures in the virtualization infrastructure is important for improving the quality of the cloud computing service. The most direct approach is for an administrator to analyze the logs of each component when an abnormality or failure occurs in the virtualization infrastructure and to take measures according to the result of the analysis.

Patent Documents 1 and 2 disclose techniques for detecting abnormalities and failures of individual components. The technique described in Patent Document 1 continuously monitors an application's log and regards it as a failure when a predetermined log message appears a predetermined number of times or more, or when the log has not been updated for a predetermined time interval or longer.

The technique described in Patent Document 2 monitors the number of context switches caused voluntarily by the application and the number of context switches caused by the operating system in order to control the application, and detects application abnormalities by associating the degree of change in these context switch counts with the process state of the application.

Patent Document 3 and Non-Patent Document 1 disclose techniques for detecting abnormalities and failures of a virtualization infrastructure. The technique described in Patent Document 3 collects performance information such as the CPU usage rate and memory usage rate of the virtualization infrastructure and uses a clustering algorithm to detect deviation from the normal state, thereby detecting abnormalities of the virtualization infrastructure.

The technique described in Non-Patent Document 1 focuses on "OpenStack", an open source implementation of a virtualization infrastructure, and identifies bugs and failure factors in advance by intentionally injecting faults.

Patent Document 1: Japanese Patent No. 4230946
Patent Document 2: Japanese Patent No. 4562568
Patent Document 3: Japanese Unexamined Patent Publication No. 2015-070528

Non-Patent Document 1: Xiaoen Ju et al., On Fault Resilience of OpenStack, SOCC 2013, DOI:10.1145/2523616.2523622

However, the approach in which an administrator analyzes the logs of each component and takes measures according to the result when an abnormality or failure occurs in the virtualization infrastructure requires sufficient knowledge of each component, because the components cooperate in complex ways. In general, it is difficult for an administrator to identify the cause of an abnormality or failure at an early stage.

With the technique described in Patent Document 1, the administrator must either know in advance what logs the application outputs when a failure occurs, or modify the application's source code so that it outputs predetermined logs. Patent Document 1 thus attempts to detect failures by analyzing component logs, but this requires deep knowledge of the behavior of the virtualization infrastructure, and the monitoring system must be modified every time the log specification changes, for example when the virtualization infrastructure is upgraded.

The technique described in Patent Document 2 is effective for simple abnormal events, such as an infinite loop in which the application keeps using the CPU or the application stalling in "I/O wait" or "CPU wait", but it cannot detect abnormalities that do not involve context switches, such as memory leaks. In other words, the failures it can detect are limited.

The technique described in Patent Document 3 can detect abnormalities of a physical computer or virtual computer, but it does not detect the abnormality or failure of a component itself. It therefore cannot identify the component that is the root cause and cannot be applied to isolating and responding to abnormalities and failures.

The technique described in Non-Patent Document 1 requires log analysis and therefore demands deep knowledge of each component. Moreover, because it works by injecting faults, it cannot be applied to a virtualization infrastructure in operation and cannot detect a failure immediately when one occurs.

Various techniques have thus been proposed, but because a virtualization infrastructure consists of multiple components that cooperate in complex ways, early detection and identification of abnormalities and failures is still not easy.

The present invention has been made in view of these circumstances. Its object is to provide a monitoring system, a program, and a monitoring method that can detect, at an early stage, an abnormality of the virtualization infrastructure and the component that is its root cause, even when the administrator does not have sufficient knowledge of each component constituting the virtualization infrastructure.

(1) To achieve the above object, the present invention takes the following measures. The monitoring system of the present invention is a monitoring system that monitors a plurality of components configured on a physical computer and the correlations between the components, and comprises: a graph generation unit that acquires system resource information of each component and communication resource information between the components and, using values based on the system resource information of each component and values based on the communication resource information between the components, creates at regular time intervals a graph in which each component is a node and each correlation between components is an edge; and a graph analysis unit that applies an anomaly detection algorithm to a specific node, to other nodes whose distance from the specific node is at most a predetermined value, and to the edges connecting the specific node and the other nodes, and detects time-series changes in the graph.

In this way, system resource information of each component and communication resource information between the components are acquired, and values based on them are used to create, at regular time intervals, a graph in which each component is a node and each correlation between components is an edge. An anomaly detection algorithm is then applied to a specific node, to other nodes whose distance from the specific node is at most a predetermined value, and to the edges connecting the specific node and the other nodes, and time-series changes in the graph are detected. As a result, abnormalities of the virtualization infrastructure and its components can be detected even when the administrator of the virtualization infrastructure does not have sufficient knowledge of each component constituting it.
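The selection of "a specific node and the other nodes within a predetermined distance" can be sketched as a breadth-first search. This is a minimal illustration, not the patented implementation: the function name `neighborhood`, the choice to ignore edge direction when measuring distance, and the choice to keep every edge whose two endpoints both fall inside the neighborhood are assumptions of this sketch, and the example edge list is hypothetical.

```python
from collections import deque

def neighborhood(edges, center, max_distance):
    """Return (nodes, edges) within max_distance hops of center.

    edges is a list of (sender, receiver) pairs. Direction is ignored
    when measuring distance (an assumption of this sketch).
    """
    edges = list(edges)
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    # Breadth-first search that stops expanding at max_distance hops.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    nodes = set(distance)
    # Keep edges whose endpoints are both inside the neighborhood.
    kept = [(s, d) for s, d in edges if s in nodes and d in nodes]
    return nodes, kept

# Hypothetical topology in which node C has neighbors B, D and E.
example_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"), ("E", "F")]
nodes, kept = neighborhood(example_edges, "C", 1)
```

With a distance limit of 1, only C and its direct neighbors are selected, so an anomaly detector would see a small, fixed-size view around each component rather than the whole graph.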

(2) In the monitoring system of the present invention, the graph generation unit and the graph analysis unit are configured on a virtualization infrastructure analysis system of a physical computer, and the components are configured on a virtualization infrastructure of a physical computer.

Because the graph generation unit and the graph analysis unit are configured on the virtualization infrastructure analysis system of a physical computer and the components are configured on the virtualization infrastructure of a physical computer, the virtualization infrastructure analysis system and the virtualization infrastructure can be built at physically separate locations. This makes it possible to detect abnormalities of a virtualization infrastructure and components located remotely from the virtualization infrastructure analysis system. The virtualization infrastructure analysis system and the virtualization infrastructure can also be built on the same physical computer.

(3) The monitoring system of the present invention further comprises a graph storage unit that stores the graphs generated at regular time intervals, together with a matrix containing information indicating the attributes of each node and information indicating the edges.

Because the system further comprises a graph storage unit that stores the graphs generated at regular time intervals, together with a matrix containing information indicating the attributes of each node and information indicating the edges, the time-series variation of the graph can be tracked.

(4) The program of the present invention is a program for a monitoring device that monitors a plurality of components configured on a physical computer and the correlations between the components. It causes a computer to execute a series of processes comprising: a process of acquiring system resource information of each component and communication resource information between the components; a process of creating, at regular time intervals and using values based on the system resource information of each component and values based on the communication resource information between the components, a graph in which each component is a node and each correlation between components is an edge; and a process of applying an anomaly detection algorithm to a specific node, to other nodes whose distance from the specific node is at most a predetermined value, and to the edges connecting the specific node and the other nodes, and detecting time-series changes in the graph.

(5) The monitoring method of the present invention is a method for monitoring a plurality of components configured on a physical computer and the correlations between the components. It comprises at least: a step of acquiring system resource information of each component and communication resource information between the components; a step of creating, at regular time intervals and using values based on the system resource information of each component and values based on the communication resource information between the components, a graph in which each component is a node and each correlation between components is an edge; and a step of applying an anomaly detection algorithm to a specific node, to other nodes whose distance from the specific node is at most a predetermined value, and to the edges connecting the specific node and the other nodes, and detecting time-series changes in the graph.

According to the present invention, abnormalities of the virtualization infrastructure and its components can be detected from the components constituting the virtualization infrastructure and their correlations, even when the administrator of the virtualization infrastructure does not have sufficient knowledge of each component.

FIG. 1 shows the schematic configuration of the virtualization infrastructure monitoring system according to the present embodiment.
FIG. 2 shows an example of a graph created by the graph generation unit 4.
FIG. 3 shows graphs generated at times t0, t1, and t2, with the graph structure changing from moment to moment.
FIG. 4 shows nodes B, D, and E, whose adjacency distance from a specific node C is N = 1, and the edges connecting them.
FIG. 5 shows an example in which the time-series graphs at times t0 to t9 are clustered and an abnormality is detected.
FIG. 6 shows an example of the configuration of a virtualization infrastructure.

The present inventors noted that early detection and identification of abnormalities and failures is not easy because a virtualization infrastructure consists of multiple components that cooperate in complex ways. They found that, by representing the components of the virtualization infrastructure and the correlations between them as a graph and detecting anomalies in the time-series change of the graph structure, abnormalities of the virtualization infrastructure and its components can be recognized even when the administrator does not have sufficient knowledge of each component, and thereby arrived at the present invention.

That is, the monitoring system of the present invention is a monitoring system that monitors a plurality of components configured on a physical computer and the correlations between the components, and comprises: a graph generation unit that acquires system resource information of each component and communication resource information between the components and, using values based on the system resource information of each component and values based on the communication resource information between the components, creates at regular time intervals a graph in which each component is a node and each correlation between components is an edge; and a graph analysis unit that applies an anomaly detection algorithm to a specific node, to other nodes whose distance from the specific node is at most a predetermined value, and to the edges connecting the specific node and the other nodes, and detects time-series changes in the graph.

The present inventors have thereby made it possible to detect abnormalities of the virtualization infrastructure and its components even when the administrator of the virtualization infrastructure does not have sufficient knowledge of each component constituting it. Embodiments of the present invention are described below with reference to the drawings.

In this embodiment, a graph is created at each point in the time series, treating the components of the virtualization infrastructure as nodes and the correlations between components as edges. Each node has as its attributes the system resource information used by the component (CPU usage time, memory usage, I/O information, and so on) or information derived from that system resource information. Each edge has as its attributes the communication resource information exchanged between components (traffic volume, number of packets, number of socket restarts, and so on) or information derived from that communication resource information. The graph structure determined by the attributes in a given time interval is then obtained, the time-series change of the graph structure is monitored, and anomalies in the graph structure are detected. In this way, abnormalities of the virtualization infrastructure system are detected.

FIG. 1 shows the schematic configuration of the virtualization infrastructure monitoring system according to this embodiment. The monitoring system consists of a virtualization infrastructure analysis system 1 configured on a physical computer and a plurality of virtualization infrastructures 20-1 to 20-n configured on a plurality of physical computers 10-1 to 10-n. Each of the virtualization infrastructures 20-1 to 20-n shown in FIG. 1 is assumed to contain components, that is, the functions constituting the virtualization infrastructure and the middleware that cooperates with them, although these are not illustrated. The present invention is not limited to the arrangement shown in FIG. 1: the virtualization infrastructure analysis system 1 and the virtualization infrastructures 20-1 to 20-n may be configured on the same physical computer, or the virtualization infrastructure analysis system 1 may be configured on one physical computer and the virtualization infrastructures 20-1 to 20-n on another single physical computer.

In each of the virtualization infrastructures 20-1 to 20-n shown in FIG. 1, the system resource information collection unit 22 collects the system resource information 21 used by each component at regular time intervals. The system resource information used here includes, for example, user CPU time, system CPU time, memory usage, swap usage, number of page faults, number of disk accesses, and number of disk writes. On Linux (registered trademark), this information can be obtained by reading files in the proc file system (files under /proc) or by executing commands. The system resource information processing unit 23 applies statistical processing (such as calculating the difference from the previously acquired value or the deviation from the mean) and scaling (such as conversion to percentages or normalization) to the information acquired by the system resource information collection unit 22.
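As a rough illustration of the collection and processing steps described above, the sketch below parses the CPU time fields from one line in the format of `/proc/<pid>/stat` and converts the difference between two samples into a percentage. The helper names, the sample lines, the process name, and the default of 100 clock ticks per second are assumptions of this sketch; on a real system the tick rate would be obtained with `os.sysconf("SC_CLK_TCK")`.

```python
def parse_proc_stat(line):
    """Extract utime and stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split after the last ')' rather than on whitespace alone.
    rest = line[line.rfind(")") + 2:].split()
    # rest[0] is field 3 (state); utime is field 14 and stime is field 15.
    return {"utime": int(rest[11]), "stime": int(rest[12])}

def delta(previous, current):
    """Difference from the previously acquired value, per metric."""
    return {key: current[key] - previous[key] for key in current}

def cpu_percent(delta_ticks, interval_seconds, ticks_per_second=100):
    """Convert consumed clock ticks into a percentage of one CPU."""
    return 100.0 * delta_ticks / (ticks_per_second * interval_seconds)

# Two hypothetical samples of the same process taken 10 seconds apart.
sample_t0 = "1234 (nova-compute) S 1 1234 1234 0 -1 4194560 500 0 3 0 150 40 0 0 20 0 1 0 100"
sample_t1 = "1234 (nova-compute) S 1 1234 1234 0 -1 4194560 620 0 3 0 250 60 0 0 20 0 1 0 100"

change = delta(parse_proc_stat(sample_t0), parse_proc_stat(sample_t1))
usage = cpu_percent(change["utime"] + change["stime"], interval_seconds=10)
```

Working with differences and percentages rather than raw counters keeps node attributes comparable across components and sampling intervals.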

The communication resource information collection unit 25 collects the communication resource information 24 used by each component at regular time intervals. The communication resource information used here includes, for example, the protocol, packet size, number of packets, and number of sockets in use. On Linux (registered trademark), this information can be obtained by associating packet capture information with the socket information used by each component. The communication resource information processing unit 26 applies statistical processing (such as calculating the difference from the previously acquired value or the deviation from the mean) and scaling (such as conversion to percentages or normalization) to the information acquired by the communication resource information collection unit 25.
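The association of packet capture information with socket information might look like the following sketch. It assumes that captured packets have already been reduced to dictionaries with address, port, and size fields, and that a table mapping each (address, port) pair to the owning component has been built from socket information. The record layout, the function name `aggregate_traffic`, and the component names are all hypothetical.

```python
from collections import defaultdict

def aggregate_traffic(packets, socket_owner):
    """Sum bytes and packet counts per (sender, receiver) component pair.

    socket_owner maps (ip, port) to a component identifier. Packets whose
    endpoints cannot both be attributed to a component are skipped.
    """
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for packet in packets:
        sender = socket_owner.get((packet["src_ip"], packet["src_port"]))
        receiver = socket_owner.get((packet["dst_ip"], packet["dst_port"]))
        if sender is None or receiver is None:
            continue
        totals[(sender, receiver)]["bytes"] += packet["size"]
        totals[(sender, receiver)]["packets"] += 1
    return dict(totals)

# Hypothetical socket table and capture records.
socket_owner = {
    ("10.0.0.1", 40000): "compute",
    ("10.0.0.2", 5672): "message-queue",
}
packets = [
    {"src_ip": "10.0.0.1", "src_port": 40000, "dst_ip": "10.0.0.2", "dst_port": 5672, "size": 120},
    {"src_ip": "10.0.0.1", "src_port": 40000, "dst_ip": "10.0.0.2", "dst_port": 5672, "size": 80},
    {"src_ip": "10.0.0.9", "src_port": 1234, "dst_ip": "10.0.0.2", "dst_port": 5672, "size": 60},
]
traffic = aggregate_traffic(packets, socket_owner)
```

Each key of the result corresponds to one directed edge, and the summed values are candidates for that edge's attributes.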

The transmission unit 27 transmits the processed system resource information and the processed communication resource information (hereinafter called "resource information") to the virtualization infrastructure analysis system 1.

In the virtualization infrastructure analysis system 1, the reception unit 2 receives the resource information transmitted from the transmission units 27 of the physical computers 10-1 to 10-n and stores it in the resource information storage unit 3. Based on the resource information in the resource information storage unit 3, the graph generation unit 4 generates a graph in which components are "nodes" and correlations between components are "edges", and stores it in the graph storage unit 5. The nodes and edges hold the resource information or converted values calculated from it.

The graph analysis unit 6 compares the graph for the current time generated by the graph generation unit 4 with the past graphs stored in the graph storage unit 5 and examines the time-series variation of the graph structure. If the time-series variation of the graph structure differs from normal, it determines that a failure has occurred in the virtualization infrastructure. The graph display unit 7 provides an interface for displaying graphs to the administrator of the virtualization infrastructure. The analysis result transmission unit 8 transmits the result of the graph analysis to an external monitoring system.
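The comparison of the current graph with stored past graphs can be illustrated with a deliberately simple stand-in. The passage above does not specify the detection algorithm, so the z-score test, the threshold of 3.0, the function name `changed_elements`, and the sample values below are assumptions chosen only to show the shape of the comparison.

```python
from statistics import mean, pstdev

def changed_elements(history, current, threshold=3.0):
    """Return keys of nodes or edges whose current value departs from history.

    history is a list of past snapshots and current is the newest snapshot.
    Each snapshot maps a node or edge key to a single numeric attribute.
    """
    flagged = []
    for key, value in current.items():
        past = [snapshot[key] for snapshot in history if key in snapshot]
        if len(past) < 2:
            continue  # not enough history to judge this element
        mu, sigma = mean(past), pstdev(past)
        if sigma == 0:
            if value != mu:
                flagged.append(key)
        elif abs(value - mu) / sigma > threshold:
            flagged.append(key)
    return flagged

# Hypothetical history for node "A" and the edge from "A" to "B".
history = [
    {"A": 10, ("A", "B"): 50},
    {"A": 11, ("A", "B"): 52},
    {"A": 9, ("A", "B"): 48},
    {"A": 10, ("A", "B"): 50},
]
current = {"A": 10, ("A", "B"): 90}
suspects = changed_elements(history, current)
```

Here the node value stays within its usual range while the edge value jumps, so only the edge would be reported.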

Next, the graph generation unit 4 according to this embodiment is described. The graph generation unit 4 generates a graph in which components are nodes and correlations between components are edges. FIG. 2 shows an example of a graph created by the graph generation unit 4, consisting of components running on two physical computers (host (1) and host (2)). A node is identified by the pair of host name and component name (or the name of the process executing the component) and has as its attributes the system resource information used by the component (for example, CPU usage time, memory usage, and disk I/O volume) or values calculated from that system resource information. An edge is identified by the pair of communicating sender node and receiver node and has as its attributes the communication resource information (for example, traffic volume, number of packets, and number of sockets in use) or values calculated from that communication resource information. The graph generation unit 4 generates a graph at regular time intervals and stores each generated graph in the graph storage unit 5.
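One possible in-memory form of such a graph is sketched below: nodes keyed by (host name, component name) and edges keyed by (sender node, receiver node), each carrying an attribute dictionary. The class name `GraphSnapshot`, the record layout, and the host and component names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class GraphSnapshot:
    """Graph generated for one time interval."""
    timestamp: float
    nodes: dict = field(default_factory=dict)  # (host, component) -> attributes
    edges: dict = field(default_factory=dict)  # (sender, receiver) -> attributes

def build_snapshot(timestamp, system_records, communication_records):
    """Assemble a snapshot from processed resource information records."""
    snapshot = GraphSnapshot(timestamp)
    for record in system_records:
        node_id = (record["host"], record["component"])
        snapshot.nodes[node_id] = record["attributes"]
    for record in communication_records:
        sender, receiver = tuple(record["sender"]), tuple(record["receiver"])
        # Make sure both endpoints exist even if no system record arrived.
        snapshot.nodes.setdefault(sender, {})
        snapshot.nodes.setdefault(receiver, {})
        snapshot.edges[(sender, receiver)] = record["attributes"]
    return snapshot

# Hypothetical records for components on two hosts.
system_records = [
    {"host": "host1", "component": "api", "attributes": {"cpu_time": 12.0, "memory": 300}},
    {"host": "host2", "component": "db", "attributes": {"cpu_time": 4.5, "memory": 900}},
]
communication_records = [
    {"sender": ("host1", "api"), "receiver": ("host2", "db"),
     "attributes": {"traffic": 2048, "packets": 16, "sockets": 2}},
]
snapshot = build_snapshot(0.0, system_records, communication_records)
```

Calling `build_snapshot` once per interval and appending the result to a list gives the time series of graphs that the later analysis compares.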

FIG. 3 shows graphs generated at times t0, t1, and t2, with the graph structure changing from moment to moment. As FIG. 3 shows, the components and the correlations between them can be held as a matrix. In the example of FIG. 3, at time t2 node (A) has an attribute value of 20 and the edge from node (A) to node (B) has an attribute value of 92. At time t0, node (A) has an attribute value of 18 and the edge from node (A) to node (B) has an attribute value of 89. Nodes and edges have multiple attribute values, such as CPU usage time and memory usage. In this example, each attribute is shown for simplicity as a single number representing system resource information or communication resource information, but it may instead be held as a vector whose elements are the individual attribute values, or as a converted value calculated from multiple attribute values.
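As an illustration of the matrix representation, the sketch below stores node attributes on the diagonal and edge attributes off the diagonal. That layout is an assumption made for this example, as is the helper name `to_matrix`. Only the values for node (A) and the edge from (A) to (B) are taken from the text; every other cell is left at a placeholder of 0.0.

```python
def to_matrix(order, node_values, edge_values):
    """Build a square matrix for one time step.

    Assumed layout: cell [i][i] holds the attribute of node i, and
    cell [i][j] holds the attribute of the edge from node i to node j.
    """
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0.0] * len(order) for _ in order]
    for name, value in node_values.items():
        matrix[index[name]][index[name]] = value
    for (src, dst), value in edge_values.items():
        matrix[index[src]][index[dst]] = value
    return matrix


order = ["A", "B"]
# Values quoted in the description of FIG. 3; node B is not given there.
m_t0 = to_matrix(order, {"A": 18}, {("A", "B"): 89})
m_t2 = to_matrix(order, {"A": 20}, {("A", "B"): 92})
```

Holding one such matrix per time step makes the time-series comparison a cell-by-cell comparison between matrices.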

Next, the graph analysis unit 6 according to the present embodiment is described. A common approach to graph analysis is to extract correlations between nodes from time-series data and apply anomaly detection to the whole graph or to subgraphs composed of strongly correlated nodes. In the present embodiment, however, packet headers are analyzed, so the connections between components are explicit, and communication sent by one component rarely passes through multiple components. The purpose of anomaly detection in this embodiment is to detect which component on which host is causing an anomaly, so analyzing chains of nodes and applying anomaly detection to the network as a whole is undesirable, not least in terms of computational cost. On the other hand, performing anomaly detection on individual nodes or edges makes the root cause easy to trace, but for many nodes and edges the inter-component communication is predominantly 0 (no communication occurs). When the attribute values carry so little information, feature extraction is difficult and detection accuracy becomes a problem.

Therefore, in the present embodiment, the analysis is performed with each node as a reference. That is, anomaly detection is applied to the data of the reference node, the nodes whose adjacency distance from the reference node is N or less, and the edges connecting the reference node with those nodes.
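Selecting the analysis target for one reference node can be sketched with a breadth-first search. The following is a minimal illustration, not the method of the specification: the function name `neighborhood` is hypothetical, edge direction is ignored when measuring adjacency distance, and an edge is kept only if it lies on a walk of at most N hops from the reference node. Both choices are assumptions, since the text does not fix them.

```python
from collections import deque


def neighborhood(edges, center, n):
    """Return (nodes within n hops of center, edges reached within n hops)."""
    adjacency = {}
    for src, dst in edges:
        # Assumption: direction is ignored for the adjacency distance.
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == n:
            continue  # do not expand beyond distance n
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)

    # Keep an edge if both ends are selected and at least one end is
    # closer than n, so that for n = 1 only edges touching the center remain.
    kept = [
        (s, d) for s, d in edges
        if s in dist and d in dist and min(dist[s], dist[d]) < n
    ]
    return set(dist), kept


# Hypothetical topology; only the node names echo the description.
example_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "C"), ("B", "D")]
nodes_n1, edges_n1 = neighborhood(example_edges, "C", 1)
```

With N = 1 and node C as the reference, this selects nodes B, C, D, and E and the three edges that touch C. The edge between B and D is excluded because neither endpoint is the reference node.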

FIG. 4 shows nodes B, D, and E, whose adjacency distance from a specific node C is N = 1, together with the edges connecting them. That is, in FIG. 4, with node C as the reference, the group of nodes adjacent to node C within a certain period (nodes B, D, and E) and the associated edges are the targets of anomaly detection. In the matrix on the right-hand side of FIG. 4, the hatched values are the targets. An existing anomaly detection algorithm can be used for the detection. For example, anomalies in the graph are detected by applying a clustering algorithm such as the K-nearest neighbor method and detecting outliers. FIG. 5 shows an example in which the time-series graphs at times t0 to t9 are clustered and an anomaly is detected. In this example, the nearest-neighbor distance of each time-series graph is compared with a threshold, and the graph is judged to be an outlier when that distance exceeds the threshold. In this way, anomalies in a component can be detected by applying an anomaly detection algorithm to each node, the nodes whose adjacency distance from that node is N or less, and the edges connecting the node with those nodes.
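The threshold test on the nearest-neighbor distance can be sketched as follows. This is one simple reading of the example, not the algorithm of the specification. It assumes that each time step has already been reduced to a numeric feature vector (for instance, the selected node and edge attributes) and that Euclidean distance is used. The function name, the sample values, and the threshold are all hypothetical.

```python
import math


def nearest_neighbor_outliers(vectors, threshold):
    """Flag each vector whose nearest other vector is farther than threshold.

    Requires at least two vectors.
    """
    flags = []
    for i, v in enumerate(vectors):
        nearest = min(
            math.dist(v, w) for j, w in enumerate(vectors) if j != i
        )
        flags.append(nearest > threshold)
    return flags


# Ten hypothetical feature vectors for t0 to t9:
# (attribute of node C, attribute of edge C-B, attribute of edge C-D).
series = [
    (20, 90, 5), (21, 92, 5), (19, 89, 6), (20, 91, 5), (22, 90, 4),
    (20, 92, 6), (21, 89, 5), (60, 300, 40), (19, 90, 5), (20, 91, 6),
]
flags = nearest_neighbor_outliers(series, threshold=10.0)
```

In this made-up series only the vector at index 7 lies far from all the others, so it is the only one flagged. In practice the threshold would have to be tuned to the scale of the collected attributes.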

As described above, according to the present embodiment, even when the administrator of the virtualization platform lacks sufficient knowledge of each component constituting the platform, anomalies in the virtualization platform and its components can be detected from the components and their correlations.

1 Virtualization platform analysis system
2 Receiving unit
3 Resource information storage unit
4 Graph generation unit
5 Graph storage unit
6 Graph analysis unit
7 Graph display unit
8 Analysis result transmission unit
10-1 to 10-n Physical computers
20-1 to 20-n Virtualization platforms
21 System resource information for each component
22 System resource information collection unit
23 System resource information processing unit
24 Communication resource information for each component
25 Communication resource information collection unit
26 Communication resource information processing unit
27 Transmission unit

Claims

A monitoring system that monitors a plurality of components configured on one or more physical computers and the correlations between the components, the monitoring system comprising:
a graph generation unit that acquires system resource information of each component and communication resource information between the components and, using values based on the system resource information of each component and values based on the communication resource information between the components, creates, at regular time intervals, a graph in which each component is a node and the correlations between the components are edges;
a graph storage unit that stores the graphs generated at the regular time intervals, together with a matrix containing information indicating the attributes of each node, which have values based on the system resource information of the components, and information indicating the edges, which have values based on the communication resource information between the components; and
a graph analysis unit that acquires a graph structure determined from the attributes in a time section corresponding to the regular time interval, monitors time-series changes in the graph structure, applies an anomaly detection algorithm to a specific node, other nodes whose distance from the specific node is equal to or less than a predetermined value, and the edges connecting the specific node with the other nodes, and detects time-series changes in the graph.

The monitoring system according to claim 1, wherein
the graph generation unit and the graph analysis unit are configured on a virtualization platform analysis system of a physical computer, and
each of the components is configured on a virtualization platform of a physical computer.

A program for a monitoring device that monitors a plurality of components configured on one or more physical computers and the correlations between the components, the program causing a computer to execute a series of processes comprising:
a process of acquiring system resource information of each component and communication resource information between the components;
a process of creating, at regular time intervals, a graph in which each component is a node and the correlations between the components are edges, using values based on the system resource information of each component and values based on the communication resource information between the components;
a process of storing the graphs generated at the regular time intervals, together with a matrix containing information indicating the attributes of each node, which have values based on the system resource information of the components, and information indicating the edges, which have values based on the communication resource information between the components; and
a process of acquiring a graph structure determined from the attributes in a time section corresponding to the regular time interval, monitoring time-series changes in the graph structure, applying an anomaly detection algorithm to a specific node, other nodes whose distance from the specific node is equal to or less than a predetermined value, and the edges connecting the specific node with the other nodes, and detecting time-series changes in the graph.

A monitoring method for monitoring a plurality of components configured on one or more physical computers and the correlations between the components, the monitoring method comprising at least:
a step of acquiring system resource information of each component and communication resource information between the components;
a step of creating, at regular time intervals, a graph in which each component is a node and the correlations between the components are edges, using values based on the system resource information of each component and values based on the communication resource information between the components;
a step of storing the graphs generated at the regular time intervals, together with a matrix containing information indicating the attributes of each node, which have values based on the system resource information of the components, and information indicating the edges, which have values based on the communication resource information between the components; and
a step of acquiring a graph structure determined from the attributes in a time section corresponding to the regular time interval, monitoring time-series changes in the graph structure, applying an anomaly detection algorithm to a specific node, other nodes whose distance from the specific node is equal to or less than a predetermined value, and the edges connecting the specific node with the other nodes, and detecting time-series changes in the graph.