JP2004228828A - Network failure analysis support system

Info

Publication number: JP2004228828A
Application number: JP2003012984A
Authority: JP
Inventors: Yukio Ogawa; 祐紀雄小川; Eiji Ohira; 栄二大平; Satoshi Hasegawa; 聡長谷川; Naoteru Ishii; 直輝石井
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-01-22
Filing date: 2003-01-22
Publication date: 2004-08-12

Abstract

PROBLEM TO BE SOLVED: To provide a method of measuring network response time and the network transport factor so that the spot causing degradation can be located, as part of a network system failure analysis support method.

SOLUTION: In a network connecting a plurality of clients and a plurality of servers, the response time and/or the transport factor on a path from a branch line to the opposite branch line is measured. The spot that causes degradation in the response time or transport factor is obtained automatically by comparing a plurality of pieces of path information. Operation information is collected from the network instrument located at the degradation-causing spot, and on the basis of that information it is determined whether the degradation is induced by a lack of performance of the network instrument or a lack of line bandwidth. The interface of the network instrument on the degraded path is then blocked, and communication is detoured to another path where no degradation is detected.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ネットワークシステム運用管理方法に関し、さらに詳しくは、ネットワークの通信経路における応答時間および／または到達率を監視することによりネットワークシステムの性能監視や障害分析を行う方法に関する。
【０００２】
【従来の技術】
ネットワークにおいて応答時間の劣化などの性能障害が発生した場合、その原因部位を特定するために、ＩＣＭＰ（ＩｎｔｅｒｎｅｔＣｏｎｔｒｏｌＭｅｓｓａｇｅＰｒｏｔｏｃｏｌ）エコーの要求／応答時間（ｐｉｎｇコマンド）を利用しサーバとクライアントやその他ネットワーク機器間でのＩＰパケットの応答時間や到達度を調査する方法が広く用いられている。また、プローブを利用してパケットを解析し応答時間を算出する方法も利用されている。
【０００３】
特許文献１には、ネットワーク上の様様な位置に設置したプローブを利用してパケットを解析し機器間の応答時間を調査することにより遅延の原因部位を分離する方法が開示されている。
【０００４】
また、特許文献２には、クライアントからサーバまでの遅延時間を測定し、遅延時間が閾値を超えた場合にクライアントからサーバに至るまでに経由する各ルータまでの遅延時間を調査することにより遅延の原因部位を分離する方法が開示されている。
【０００５】
【特許文献１】
特開平１１−３４６２３８号公報
【特許文献２】
特開２００２−１５２２０３号公報
【０００６】
【発明が解決しようとする課題】
特許文献１では、遅延の原因部位を分離するために、クライアントとサーバといった二つの機器を接続する経路において、両端だけでなく様様な位置にプローブを設置する必要がある。
【０００７】
また、特許文献２では、遅延の原因部位を分離するために、クライアントとサーバといった二つの機器を接続する通信経路において、両端間で遅延測定を定常的に行い、かつ、遅延発生時に新たに経路上の各機器への遅延測定を行うというように二段階の遅延測定を行う必要がある。
【０００８】
これら従来方法では、ネットワークの定期的な性能測定のためには各通信経路の両端の機器間での応答時間を測定し、応答時間の劣化時における原因部位の分離のためには各通信経路で経由する機器における応答時間を測定するというように目的別に応答時間測定を実施していた。これらの方法は、数千台以上の機器からなる大規模なネットワークシステムにおいて全体をカバーする応答時間の測定を行い、かつ、応答時間の劣化時に原因部位を求めるためには効率的な方法ではない。
【０００９】
【課題を解決するための手段】
本発明では、ネットワークの定期的な性能測定のために、各通信経路の両端の機器間での応答時間を測定しつつ、かつ、その情報を利用することにより応答時間および／または到達率の劣化時における原因部位の分離を効率的に行うことができるネットワーク障害分析支援システムを提供する。また、本発明は、原因部位の分離を行った上で、応答時間および／または到達率の劣化の原因が、機器の性能不足や回線帯域の不足にあるか否かを判断することができるネットワーク障害分析支援システムを提供する。また、本発明は、原因部位の分離を行った上で、通信経路を、原因部位を経由しない経路に迂回させることができるネットワーク障害分析支援システムを提供する。
【００１０】
具体的には、本発明は、ネットワークにおいて支線部の機器に組み込まれた応答時間測定エージェントを用いて支線部から対向の支線部に至る経路におけるＩＰパケットの応答時間および／または到達率を網羅的に測定する応答時間測定手段と、応答時間および／または到達率の劣化を検知する応答時間劣化検知手段と、各経路にて経由するネットワーク機器のＩＰアドレス情報から経路情報を作成する経路情報作成手段と、複数の経路情報の比較により応答時間および／または到達率の劣化原因部位の部分ＩＰアドレスを特定することにより劣化発生の原因部位を求める劣化原因部位絞込み手段を設けている。
【００１１】
また、本発明は、検出された応答時間および／または到達率の劣化の原因部位に位置するネットワーク機器に対して稼動情報を収集するための設定ファイルを作成する稼動情報収集設定ファイル作成手段と、作成した設定ファイルをもとに前記ネットワーク機器から稼動情報を収集する稼動情報収集手段と、収集した稼動情報をもとに機器性能や回線帯域の不足のために劣化が発生したか否かを判断する劣化原因判定手段を設けている。
【００１２】
さらに、本発明は、クライアントからサーバに対し複数の経路が設定されており、かつ、その中のひとつの経路において応答時間および／または到達率の劣化が検知された場合に、当該経路上のネットワーク機器のインターフェースを閉塞させ劣化が検知されていない経路に迂回させる劣化経路迂回設定手段を設けている。
【００１３】
本発明は以上の構成を備えているので、ネットワークの定期的な性能測定のために各通信経路の両端の機器間での応答時間を測定しつつ、かつ、その情報を利用することにより応答時間および／または到達率の劣化時における原因部位の分離を効率的に行うことができる。また、応答時間および／または到達率の劣化の原因が、機器の性能不足や回線帯域の不足にあるか否かを判断することができる。また、通信経路を、原因部位を経由しない経路に迂回させることができる。
【００１４】
【発明の実施の形態】
以下、図を参照して本発明の実施形態を説明する。
【００１５】
図１は、本発明の一実施形態にかかるネットワーク障害分析支援システムの機能構成例である。図１を参照しながら、ネットワーク障害分析支援システムのハードウェア構成および機能構成を説明する。
【００１６】
ネットワーク機器１００、および、ネットワーク機器１０７は、ルータ、ＡＴＭ交換機、スイッチングハブ、インテリジェントハブなどの機器である。ネットワーク機器１００は、通信経路の両端に位置する機器である。ネットワーク機器１０７は通信経路が経由する機器であり、ネットワーク機器１００と同一の機器であってもよい。
【００１７】
応答時間測定処理部１０１は、ネットワーク機器１００内にあり、ＩＰパケットの応答時間および／または到達率測定処理を行い、応答時間／到達率情報１０３を出力する。ＩＰパケットの応答時間および／または到達率の測定は、ＩＣＭＰ（ＩｎｔｅｒｎｅｔＣｏｎｔｒｏｌＭｅｓｓａｇｅＰｒｏｔｏｃｏｌ）エコー要求／応答機能（ｐｉｎｇコマンド）により実装されている。応答時間測定処理部１０１を応答時間測定エージェントとする。
【００１８】
経路情報収集処理部１０２は、ネットワーク機器１００内にあり、ＩＰパケットが宛先アドレスに到達するまでに経由するルータのＩＰアドレスを収集し、片方向経路情報１０４として出力する。宛先アドレスに到達するまでに経由する経路上のＩＰアドレスの収集は、ｔｒａｃｅｒｏｕｔｅコマンドにより実装されている。経路情報収集処理部１０２を経路情報収集エージェントとする。
【００１９】
稼動情報測定処理部１０５は、ネットワーク機器１０７内にあり、稼動情報測定のためのＳＮＭＰ（ＳｉｍｐｌｅＮｅｔｗｏｒｋＭａｎａｇｅｍｅｎｔＰｒｏｔｏｃｏｌ）エージェントであり、ＣＰＵ利用率やトラフィック量、パケット廃棄数などの稼動情報を出力する。
【００２０】
サーバやクライアントについても、応答時間測定エージェント機能およびＳＮＭＰエージェント機能を備えている場合は、ネットワーク機器に含める。
【００２１】
ネットワーク監視装置１１０は、一般的なパーソナルコンピュータにより構成することができる。
【００２２】
応答時間測定処理起動処理部１１１は、ネットワーク監視装置１１０内にあり、ネットワーク機器１００内の応答時間測定処理部１０１を起動して応答時間／到達率情報１０３を測定させ、測定結果を自身に入力する。
【００２３】
応答時間／到達率情報格納処理部１１３は、ネットワーク監視装置１１０内にあり、応答時間測定処理部１０１により測定された応答時間／到達率情報１０３をハードディスクなどの記憶装置に格納、蓄積する。
【００２４】
応答時間表示処理部１１４は、ネットワーク監視装置１１０内にあり、応答時間／到達率情報１１５を、ネットワーク情報表示装置１４０を通じて表示する。
【００２５】
応答時間劣化検知処理部１１２は、ネットワーク監視装置１１０内にあり、応答時間／到達率の監視経路において測定した応答時間／到達率情報１０３が、各経路毎に設定した閾値以上であるか判定する。
【００２６】
経路情報収集処理起動処理部１１６は、ネットワーク監視装置１１０内にあり、ネットワーク機器１００内の経路情報収集処理部１０２を起動して片方向経路情報１０４を収集させ、収集結果を入力する。
【００２７】
経路情報作成処理部１１７は、ネットワーク監視装置１１０内にあり、入力された片方向経路情報１０４から両方向の経路情報１１８を作成する。経路情報１１８の作成方法の詳細については後述する。
【００２８】
劣化原因部位絞込み処理部１１９は、ネットワーク監視装置１１０内にあり、応答時間／到達率の監視経路において閾値以上の応答時間／到達率が観測された場合に、経路情報１１８を利用して応答時間および／または到達率の劣化の原因となるネットワーク部位を自動的に求める。絞り込み方法の詳細については後述する。
【００２９】
稼動情報収集設定ファイル作成処理部１２０は、ネットワーク監視装置１１０内にあり、自動的に求めた劣化原因部位に位置するネットワーク機器１０７を稼動情報収集対象機器とし、収集する稼動情報の種別、収集周期、収集期間を決定して稼動情報収集のための設定ファイルを作成する。
【００３０】
稼動情報収集処理起動処理部１２１は、ネットワーク監視装置１１０内にあり、稼動情報収集処理部１２６を起動する。
【００３１】
稼動情報収集処理部１２６は、ネットワーク監視装置１１０内にあり、稼動情報収集設定ファイル作成処理部１２０が作成した稼動情報収集設定ファイルに従って、ネットワーク機器１０７が稼動情報測定処理部１０５により測定したネットワーク稼動情報１０６を収集し、収集結果を自身に入力する。
【００３２】
稼動情報格納処理部１２７は、ネットワーク監視装置１１０内にあり、稼動情報収集処理部１２６により収集、入力された稼動情報１０６をハードディスクなどの記憶装置に格納、蓄積する。
【００３３】
稼動情報表示処理部１２８は、ネットワーク監視装置１１０内にあり、稼動情報１２９を、ネットワーク情報表示装置１４０を通じて表示する。
【００３４】
劣化原因判定処理部１２２は、ネットワーク監視装置１１０内にあり、稼動情報収集処理部１２６により収集、入力された稼動情報１０６をもとに応答時間および／または到達率の劣化の原因が機器の性能不足や回線帯域の不足にあるか否かを判定する。判定方法の詳細については後述する。
【００３５】
劣化経路迂回設定処理部１２３は、ネットワーク監視装置１１０内にあり、応答時間および／または到達率の劣化原因部位を経由しない経路に通信経路を迂回させる。迂回設定方法の詳細については後述する。
【００３６】
障害分析支援処理表示処理部１２４は、ネットワーク監視装置１１０内にあり、障害分析支援処理情報１２５を、ネットワーク情報表示装置１４０を通じて表示する。障害分析支援処理情報１２５とは、劣化原因部位絞込み処理部１１９が求めた応答時間および／または到達率の劣化の原因となるネットワーク部位の情報、稼動情報収集処理起動処理部１２１が稼動情報収集処理部１２６を起動したという情報、劣化原因判定処理部１２２が判定した応答時間および／または到達率の劣化原因の情報、および、劣化経路迂回設定処理部１２３が通信経路を迂回させたという情報および迂回経路の情報である。
【００３７】
ネットワーク情報表示装置１４０も、ネットワーク監視装置１１０と同じく一般的なパーソナルコンピュータにより構成することができる。
【００３８】
表示処理呼び出し処理部１４１は、ネットワーク情報表示装置１４０内にあり、ネットワーク監視装置１１０内にある応答時間表示処理部１１４を呼び出すことにより応答時間／到達率情報１１５をグラフ等により表示する。また、ネットワーク監視装置１１０にある稼動情報表示処理部１２８を呼び出すことによりネットワーク稼動情報１２９をグラフ等により表示する。さらに、ネットワーク監視装置１１０内にある障害分析支援処理表示処理部１２４を呼び出すことにより障害分析支援処理情報１２５を表示する。
【００３９】
上記各装置の各処理部は、上記各装置内のＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）がプログラムを実行することにより具現化される。プログラムは、予め各装置内の記憶装置に格納されていても良いし、着脱可能な記憶媒体または通信媒体を介して他の装置から導入されても良い。
【００４０】
図２は、本発明の一実施形態において監視対象となるネットワークシステムの論理構成の例である。数千台以上のネットワーク機器からなるような大規模なネットワークシステムでは、ネットワークの拡張性や回線コストの観点から、データセンタ２００と各支店２０４〜２０７の間にネットワークのハブとなる中継拠点２０１、２０２を設置し、回線を集約するトポロジーとすることが多いが、中継拠点を設置せずデータセンタ２００と各支店２０４〜２０７を物理回線あるいは論理回線でメッシュ状に接続するネットワークトポロジーであってもかまわない。また、信頼性の観点から、クライアント２１２〜２１５からサーバ２１０、２１１へ至る通信経路を複数設置することが多い。
【００４１】
サーバ２１０、２１１、クライアント２１２〜２１５、ルータ２２０〜２３７は、図１におけるネットワーク機器１００に相当し、応答時間測定処理部１０１、経路情報収集処理部１０２、稼動情報測定処理部１０５の各処理部を有する。
【００４２】
監視センタ２０３に設置された監視装置２１６は、図１におけるネットワーク監視装置１１０に相当し、応答時間測定処理起動処理部１１１、応答時間／到達率情報格納処理部１１３、応答時間表示処理部１１４、応答時間劣化検知処理部１１２、経路情報収集処理起動処理部１１６、経路情報作成処理部１１７、劣化原因部位絞込み処理部１１９、稼動情報収集設定ファイル作成処理部１２０、稼動情報収集処理起動処理部１２１、稼動情報収集処理部１２６、稼動情報格納処理部１２７、稼動情報表示処理部１２８、劣化原因判定処理部１２２、劣化経路迂回設定処理部１２３、障害分析支援処理表示処理部１２４の各処理部を有する。監視対象数が多いために１台の監視装置２１６で全監視対象機器をカバーできない場合は、複数台の監視装置２１６で分担することも可能である。
【００４３】
次に、図１の機能構成を持つネットワーク監視装置１１０による、図２の構成を持つネットワークシステムを対象とした障害分析支援処理で利用する経路情報の作成処理の例を、図３のフローチャートに従い、図４を用いて説明する。
【００４４】
（ｓｔｅｐ３００）応答時間／到達率測定を実施している各通信経路について、経由するネットワーク機器のＩＰアドレス情報を経路情報として定期的に作成、更新する。更新周期は、ネットワークのトポロジーチェンジの頻度に従うこととし一日に数回というように設定する。経路情報の取得のために各通信経路の両端のネットワーク機器から双方向にｔｒａｃｅｒｏｕｔｅコマンドを実行する。ｔｒａｃｅｒｏｕｔｅコマンドは、経路上で経由する各機器毎に一アドレスを出力する。経由する各機器の入力インターフェースと出力インターフェースの両方のアドレスを取得するために、経路上で双方向にコマンドを実行し出力結果を足し合わせる。以下、図２の経路Ａ（２４０）について経路情報を作成する場合を例として説明する。
【００４５】
（ｓｔｅｐ３０１）ネットワーク監視装置２１６内の経路情報収集処理起動処理部１１６は、定期的に経路Ａ（２４０）の両端に位置するルータ２３０およびルータ２２０にリモートログインして経路情報収集処理部１０２を起動する。ここでは、経路情報収集エージェントとしてルータ２３０およびルータ２２０内の経路情報収集処理部１０２を利用するが、クライアント２１２およびサーバ２１０内の経路情報収集処理部１０２を利用してもよい。
【００４６】
ルータ２３０内の経路情報収集処理部１０２は、サーバ２１０のアドレスｊ１（２７１）をターゲットとしてｔｒａｃｅｒｏｕｔｅコマンドを実行し、経路Ａ（２４０）についてのクライアント２１２からサーバ２１０への片方向経路情報４００を出力する。ネットワーク監視装置２１６内の経路情報収集処理起動処理部１１６は、片方向経路情報４００を経路情報作成処理部１１７に入力する。
【００４７】
同様に、ルータ２２０内の経路情報収集処理部１０２は、クライアント２１２のアドレスａ１（２５０）をターゲットとしてｔｒａｃｅｒｏｕｔｅコマンドを実行し、経路Ａ（２４０）についてのサーバ２１０からクライアント２１２への片方向経路情報４０１を出力する。ネットワーク監視装置２１６内の経路情報収集処理起動処理部１１６は、片方向経路情報４０１を経路情報作成処理部１１７に入力する。
【００４８】
なお、経路情報収集に利用するｔｒａｃｅｒｏｕｔｅコマンドは、ルータなどのネットワーク機器には、通常、実装されており、特別なソフトウェアやハードウェアを組み込む必要はない。
【００４９】
（ｓｔｅｐ３０２）ネットワーク監視装置２１６内の経路情報作成処理部１１７は、経路Ａ（２４０）についてのクライアント２１２からサーバ２１０への片方向経路情報４００とサーバ２１０からクライアント２１２への片方向経路情報４０１を図４に示すように相互に組み合わせて、経路Ａ（２４０）についての経路情報４０２を作成する。
【００５０】
次に、図１の機能構成を持つネットワーク監視装置１１０による、図２の構成を持つネットワークシステムを対象としたネットワーク性能測定、障害分析支援処理の例を、図５のフローチャートに従い、図６を用いて説明する。
【００５１】
（ｓｔｅｐ５００）監視対象経路として設定している各通信経路について、定期的に応答時間／到達率測定を実施し、通信品質の劣化が検知された場合に、障害分析支援処理を実施する。測定周期は、１０分毎や５分毎というように数分毎に設定する。監視経路は、データセンタにおけるブロードキャストドメインとしてのネットワークセグメントと、各支店におけるブロードキャストドメインとしてのネットワークセグメントをメッシュ状に接続した通信経路とする。図２の例では、各クライアント２１２〜２１５と各サーバ２１０、２１１をそれぞれ接続する通信経路を監視経路とする。監視トラフィック量の回線帯域に占める割合が大きく通常の業務トラフィックの妨げになる恐れがある場合は、応答時間要件がある業務サーバが設置されたネットワークセグメントと代表的な支店のネットワークセグメントを接続する通信経路というように監視経路を選び出すこととする。以下、図２の経路Ａ（２４０）について応答時間／到達率を測定する場合を例として説明する。
【００５２】
（ｓｔｅｐ５０１）ネットワーク監視装置２１６内の応答時間測定処理起動処理部１１１は、定期的に監視経路の支店側（クライアント側）に位置するルータ２３０にリモートログインして応答時間測定処理部１０１を起動する。ここでは、応答時間測定エージェントとしてルータ２３０内の応答時間測定処理部１０１を利用するが、クライアント２１２内の応答時間測定処理部１０１を利用してもよい。また、クライアント側でなくサーバ側に位置するルータ２２０やサーバ２１０にリモートログインし、それぞれの応答時間測定処理部１０１を利用することも可能である。
【００５３】
ルータ２３０内の応答時間測定処理部１０１は、サーバ２１０のアドレスｊ１（２７１）をターゲットとしてｐｉｎｇコマンドを実行し、経路Ａ（２４０）についてのクライアント２１２からサーバ２１０への往復の応答時間およびサーバ２１０への到達率（パケットロス情報）を出力する。ネットワーク監視装置２１６内の応答時間測定処理起動処理部１１１は、応答時間／到達率情報を応答時間劣化検知処理部１１２に入力する。
【００５４】
なお、応答時間／到達率測定に利用するｐｉｎｇコマンドは、ルータなどのネットワーク機器には、通常、実装されており、特別なソフトウェアやハードウェアを組み込む必要はない。
【００５５】
（ｓｔｅｐ５０２）ネットワーク監視装置２１６内の応答時間劣化検知処理部１１２は、各監視経路における応答時間およびＩＰパケット到達率が、それぞれの監視経路に対して設定した閾値を設定した一定期間以上超えているかどうか判定する。閾値の設定基準は、以下の通りである。
【００５６】
・ネットワークの各経路における応答時間／到達率の設計値
・過去の測定結果における同一の時間帯の平均値に同時間帯の標準偏差値をｎ倍して加えた値
・過去の測定結果における同一の曜日、時間帯の平均値に同時間帯の標準偏差値をｎ倍して加えた値
・過去の測定結果における同一の週、曜日、時間帯の平均値に同時間帯の標準偏差値をｎ倍して加えた値
・過去の測定結果における同一の日付、時間帯の平均値に同時間帯の標準偏差値をｎ倍して加えた値
ここでｎは２から３ぐらいの値とし、過去の観測結果より適切な値を定める。監視経路において、観測された応答時間および／または到達率が閾値を超えている場合は、通信品質の劣化が発生していると判断する。
【００５７】
（ｓｔｅｐ５０３）ネットワーク監視装置２１６内の劣化原因部位絞込み処理部１１９は、少なくとも一つの監視経路において応答時間／到達率がそれぞれの経路に対して設定された閾値を超えている場合には、通信品質劣化の原因部位を自動的に求める。以下、ルータ２２６のインターフェース（ＩＰアドレスｄ１）２６２が原因で、通信品質の劣化が起こった場合を例にとり説明する。この場合、監視経路Ａ（２４０）、監視経路Ｅ（２４４）、監視経路Ｂ（２４１）において品質劣化が検知されている。図６を利用して絞り込み方法を説明する。
【００５８】
（ｓｔｅｐ６−１）：劣化が検知された監視経路についての経路情報（経路Ａ（２４０）の経路情報（６００）、経路Ｅ（２４４）の経路情報（６０１）、経路Ｂ（２４１）の経路情報（６０２））、および、それらと一部でも重なっている正常状態の経路についての経路情報（経路Ｃ（２４２）の経路情報（６０３）、経路Ｄ（２４３）の経路情報（６０４））を全て検索する。
【００５９】
（ｓｔｅｐ６−２）：劣化が検知された経路の経路情報（経路Ａ（２４０）の経路情報（６００）、経路Ｅ（２４４）の経路情報（６０１）、経路Ｂ（２４１）の経路情報（６０２））の積集合（共通部分）６０５を検索する。
【００６０】
（ｓｔｅｐ６−３）：（ｓｔｅｐ６−２）で得られた集合６０５と、各正常経路の経路情報（経路Ｃ（２４２）の経路情報（６０３）、経路Ｄ（２４３）の経路情報（６０４））の積集合６０６、６０７を検索する。
【００６１】
（ｓｔｅｐ６−４）：（ｓｔｅｐ６−３）で得られたそれぞれの集合の和集合６０８を検索する。
【００６２】
（ｓｔｅｐ６−５）：（ｓｔｅｐ６−２）で得られた集合６０５と（ｓｔｅｐ６−４）で得られた集合６０８の差集合（（ｓｔｅｐ６−４）で得られた集合６０８の（ｓｔｅｐ６−２）で得られた集合６０５に対する補集合）を検索する。図６の場合、ルータ２２６のインターフェース（ＩＰアドレスｄ１）２６２が品質劣化の原因部位であると判断する。
【００６３】
最終的な集合６０８を算出する過程での演算方法は、集合演算の法則に従って入れ替えても差し支えない。
【００６４】
監視経路が少ない場合は、絞り込み結果がより広範囲になる。例えば、図６の例で経路Ｂ（６０２）を監視していない場合、最終的に求まる原因部位はｃ１およびｄ１となる。ただし、監視経路数に応じて絞込み範囲は変化するが、アルゴリズムは監視経路数に関係なく適用可能である。
【００６５】
（ｓｔｅｐ５０４）ネットワーク監視装置２１６内の稼動情報収集設定ファイル作成処理部１２０は、（ｓｔｅｐ５０３）にて絞り込まれたアドレスを持つネットワーク機器に対して稼動情報収集を行うために、収集情報項目、収集周期、収集期間を決定し、稼動情報収集処理部１２６の設定ファイルを作成する。ネットワーク稼動情報の収集情報項目は、ルータやレイヤー３スイッチなどのネットワーク機器に対しては、ＣＰＵ利用率、空きメモリ量などとする。ネットワーク機器のインターフェースに対しては、入出力トラフィック量、入出力パケット数、入出力パケット廃棄数、入出力エラーパケット数、コリジョン数などとする。ネットワーク稼動情報の収集周期は、１分や３０秒というように予め設定した値を利用するか、通常の定期的な稼動情報収集の周期の１０分の１というように設定する。ネットワーク稼動情報の収集期間は、１時間や３時間というように予め設定した値を利用するか、応答時間／到達率が閾値を超えていた監視経路において、その後の応答時間／到達率測定結果が閾値以下になるまでとする。
【００６６】
（ｓｔｅｐ５０５）ネットワーク監視装置２１６内の稼動情報収集処理起動処理部１２１は、稼動情報収集処理部１２６を起動する。稼動情報収集処理部１２６は、（ｓｔｅｐ５０４）にて作成した設定ファイルに従い、劣化原因部位に位置するネットワーク機器から稼動情報を収集し、稼動情報を劣化原因判定処理部１２２に入力する。
【００６７】
（ｓｔｅｐ５０６）ネットワーク監視装置２１６内の劣化原因判定処理部１２２は、入力された稼動情報をもとに通信品質の劣化原因を推定する。稼動情報がそれぞれに対して設定された閾値を超えている状態が持続している場合、例えば、
・ＣＰＵ利用率が閾値を超えている状態が持続している。
【００６８】
・回線利用率が閾値を超えている状態が持続している。
【００６９】
・パケット廃棄量が閾値を超えている状態が持続している。
【００７０】
・コリジョン数が閾値を超えている状態が持続している。
といった場合は、それらの状態を示している当該ネットワーク機器や回線の性能不足に起因して通信品質の劣化が発生したと判断する。閾値の決定方法は、（ｓｔｅｐ５０２）での応答時間／到達率での閾値決定方法と同じとする。劣化原因部位のネットワーク機器の稼動情報が閾値以下であり、稼動状態が正常であると判断された場合は、劣化原因部位のネットワーク機器のソフトウェアやハードウェアの不具合に起因している、或いは、劣化原因部位のネットワーク機器に隣接しているＡＴＭ交換機やスイッチングハブ等の経路情報としてのＩＰアドレスを持っていない機器の性能不足や不具合に起因していると判断する。
【００７１】
（ｓｔｅｐ５０７）ネットワーク監視装置２１６内の劣化経路迂回設定処理部１２３は、監視経路において応答時間およびＩＰパケット到達率が閾値を超えている状態が一定期間以上持続しており、かつ、劣化が検知されている経路に対する迂回経路では劣化が検知されていない場合、劣化が検知されている経路上のネットワーク機器のインターフェースを閉塞させ、ダイナミックルーティングプロトコルの作用により劣化経路から正常経路へと経路を迂回させる。図２を用いてこの動作を説明する。クライアント２１２からサーバ２１０への経路Ａ（２４０）で劣化が検知され続けており、かつ、その迂回経路Ｅ（２４４）は正常状態である場合、ネットワーク監視装置２１６内の劣化経路迂回設定処理部１２３はルータ２３０にリモートログインし、ルータ２３０のインターフェース（ＩＰアドレスＣ１）２５８を閉塞させることにより、経路Ａ（２４０）から、経路Ｅ（２４４）への迂回を実行する。閉塞させるインターフェースは、劣化経路上に在り、かつ、自身の閉塞により正常経路への迂回を導くことが可能であれば、いずれのインターフェースでもよい。
【００７２】
本実施例は、以上の構成を備え、以上のｓｔｅｐをネットワーク監視装置において実施することにより、ネットワークの定期的な性能測定のために各通信経路の両端の機器間での応答時間を測定しつつ、かつ、その情報を利用することにより応答時間および／または到達率の劣化時における原因部位の分離を効率的に行うことが可能である。また、応答時間および／または到達率の劣化の原因が、機器の性能不足や回線帯域の不足にあるか否かを判断することが可能である。また、通信経路を、原因部位を経由しない経路に迂回させることが可能である。
【００７３】
【発明の効果】
本発明によれば、ネットワークシステムにおいて、応答時間／到達率の測定や劣化原因部位の絞込み、劣化原因の推定、劣化経路の回避を効率的に行うことが可能になる。
【図面の簡単な説明】
【図１】本実施形態のシステム構成図である。
【図２】本実施形態のネットワーク論理構成図および応答時間監視経路の例である。
【図３】本実施形態の経路情報作成処理の流れである。
【図４】本実施形態の経路情報の作成方法である。
【図５】本実施形態の障害分析支援処理の流れである。
【図６】本実施形態の劣化原因部位の絞込み方法である。
【符号の説明】
１００……ネットワーク機器、１０１……応答時間測定処理部、１０２……経路情報収集処理部、１０３……応答時間／到達率情報、１０４……片方向経路情報、１０５……稼動情報測定処理部、１０６……稼動情報、１０７……ネットワーク機器、１１０……ネットワーク監視装置、１１１……応答時間測定処理起動処理部、１１２……応答時間劣化検知処理部、１１３……応答時間／到達率情報格納処理部、１１４……応答時間表示処理部、１１５……応答時間／到達率情報、１１６……経路情報収集処理起動処理部、１１７……経路情報作成処理部、１１８……経路情報、１１９……劣化原因部位絞込み処理部、１２０……稼動情報収集設定ファイル作成処理部、１２１……稼動情報収集処理起動処理部、１２２……劣化原因判定処理部、１２３……劣化経路迂回設定処理部、１２４……障害分析支援処理表示処理部、１２５……障害分析支援処理情報、１２６……稼動情報収集処理部、１２７……稼動情報格納処理部、１２８……稼動情報表示処理部、１２９……稼動情報、１４０……ネットワーク情報表示装置、１４１……表示処理呼び出し処理部
[0001]
[Technical field of the invention]
The present invention relates to a network system operation management method, and more particularly to a method of monitoring the performance of a network system and analyzing failures by monitoring the response time and/or arrival rate on the communication paths of a network.
[0002]
[Prior art]
When a performance failure such as degraded response time occurs in a network, a widely used way to identify the cause location is to use the ICMP (Internet Control Message Protocol) echo request/response time (the ping command) to examine the response time and reachability of IP packets between servers, clients, and other network devices. Another method in use analyzes packets with a probe and calculates the response time.
[0003]
Patent Document 1 discloses a method that isolates the cause of a delay by analyzing packets with probes installed at various positions on the network and examining the response time between devices.
[0004]
Patent Document 2 discloses a method that measures the delay time from a client to a server and, when the delay time exceeds a threshold, examines the delay time to each router traversed between the client and the server, thereby isolating the cause of the delay.
[0005]
[Patent Document 1]
JP-A-11-346238
[Patent Document 2]
JP-A-2002-152203
[0006]
[Problems to be solved by the invention]
In Patent Document 1, isolating the part that causes a delay requires installing probes not only at both ends but also at various positions along the path connecting two devices such as a client and a server.
[0007]
In Patent Document 2, isolating the part that causes a delay requires a two-stage delay measurement on the communication path connecting two devices such as a client and a server: delay is measured continuously between the two ends, and when a delay occurs, the delay to each device on the path is newly measured.
[0008]
In these conventional methods, response time is measured separately for each purpose: between the devices at both ends of each communication path for periodic network performance measurement, and at the devices each path passes through for isolating the cause when the response time degrades. These methods are not efficient for measuring response time across an entire large-scale network system of thousands of devices or more while also finding the cause location when the response time degrades.
[0009]
[Means for Solving the Problems]
The present invention provides a network failure analysis support system that measures the response time between the devices at both ends of each communication path for periodic network performance measurement and uses that same information to efficiently isolate the cause location when the response time and/or arrival rate degrades. The present invention also provides a network failure analysis support system that, after isolating the cause location, can determine whether the degradation of the response time and/or arrival rate is due to insufficient device performance or insufficient line bandwidth. The present invention further provides a network failure analysis support system that, after isolating the cause location, can divert the communication path to a route that does not pass through the cause location.
[0010]
More specifically, the present invention provides: response time measuring means that uses response time measurement agents built into branch-line devices in the network to comprehensively measure the response time and/or arrival rate of IP packets on the path from a branch line to the opposite branch line; response time degradation detecting means that detects degradation of the response time and/or arrival rate; route information creating means that creates route information from the IP address information of the network devices each path passes through; and degradation cause narrowing means that finds the cause location of the degradation by comparing multiple pieces of route information and identifying the partial IP address of the location causing the degradation of the response time and/or arrival rate.
[0011]
The present invention also provides: operation information collection setting file creating means that creates a setting file for collecting operation information from the network device located at the detected cause location of the degradation of the response time and/or arrival rate; operation information collecting means that collects operation information from that network device based on the created setting file; and degradation cause determining means that determines, based on the collected operation information, whether the degradation occurred because of insufficient device performance or insufficient line bandwidth.
[0012]
Furthermore, the present invention provides degraded route detour setting means that, when multiple routes are set from a client to a server and degradation of the response time and/or arrival rate is detected on one of them, blocks the interface of a network device on that route and diverts communication to a route on which no degradation has been detected.
[0013]
With the above configuration, the present invention measures the response time between the devices at both ends of each communication path for periodic network performance measurement and uses that information to efficiently isolate the cause location when the response time and/or arrival rate degrades. It can also determine whether the degradation of the response time and/or arrival rate is due to insufficient device performance or insufficient line bandwidth. Further, it can divert the communication path to a route that does not pass through the cause location.
[0014]
[Best mode for carrying out the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0015]
FIG. 1 shows an example of the functional configuration of a network failure analysis support system according to an embodiment of the present invention. The hardware configuration and functional configuration of the system are described with reference to FIG. 1.
[0016]
The network device 100 and the network device 107 are devices such as routers, ATM switches, switching hubs, and intelligent hubs. The network devices 100 are located at both ends of a communication path. The network device 107 is a device that the communication path passes through, and may be the same device as the network device 100.
[0017]
The response time measurement processing unit 101 is located in the network device 100, measures the response time and/or arrival rate of IP packets, and outputs response time/arrival rate information 103. The measurement is implemented with the ICMP (Internet Control Message Protocol) echo request/response function (the ping command). The response time measurement processing unit 101 is referred to as the response time measurement agent.
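As an illustration of what one measurement cycle of the agent might report, the following is a minimal sketch that turns a list of echo results into a response time and an arrival rate. The function name `summarize_echo_results`, the dictionary keys, and the convention of using `None` for a lost packet are assumptions made for this example and do not appear in the specification.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_echo_results(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one measurement cycle of ICMP echo results.

    Each entry is a round-trip time in milliseconds, or None when no
    reply arrived (a lost packet). Hypothetical helper, not from the patent.
    """
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    # Arrival rate: fraction of echo requests that were answered.
    arrival_rate = len(replies) / sent if sent else 0.0
    # Response time: average round-trip time over the answered requests.
    avg_rtt_ms = mean(replies) if replies else None
    return {
        "sent": sent,
        "received": len(replies),
        "arrival_rate": arrival_rate,
        "avg_rtt_ms": avg_rtt_ms,
    }
```

A monitoring device could store one such summary per route and per cycle as the response time/arrival rate information.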
[0018]
The route information collection processing unit 102 is located in the network device 100, collects the IP addresses of the routers an IP packet passes through before reaching the destination address, and outputs them as one-way route information 104. The collection of IP addresses along the route to the destination address is implemented with the traceroute command. The route information collection processing unit 102 is referred to as the route information collection agent.
[0019]
The operation information measurement processing unit 105 is located in the network device 107. It is an SNMP (Simple Network Management Protocol) agent for measuring operation information, and outputs operation information such as CPU utilization, traffic volume, and the number of discarded packets.
[0020]
A server or client that also has the response time measurement agent function and the SNMP agent function is treated as a network device.
[0021]
The network monitoring device 110 can be configured by a general personal computer.
[0022]
The response time measurement processing activation processing unit 111 is located in the network monitoring device 110. It activates the response time measurement processing unit 101 in the network device 100, has it measure the response time/arrival rate information 103, and receives the measurement result as its own input.
[0023]
The response time/arrival rate information storage processing unit 113 is located in the network monitoring device 110, and stores and accumulates the response time/arrival rate information 103 measured by the response time measurement processing unit 101 in a storage device such as a hard disk.
[0024]
The response time display processing unit 114 is provided in the network monitoring device 110 and displays the response time / arrival rate information 115 through the network information display device 140.
[0025]
The response time degradation detection processing unit 112 is located in the network monitoring device 110, and determines whether the response time/arrival rate information 103 measured on each monitored route is equal to or greater than the threshold set for that route.
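The threshold check can be sketched as follows. The description of step 502 lists, as one basis for the threshold, the mean of past measurements for the same time slot plus n times their standard deviation, with n around 2 to 3, and requires the threshold to be exceeded for a set period. The function names, the use of the population standard deviation, and expressing the "set period" as a count of consecutive samples are assumptions of this sketch. It covers response time only; for arrival rate, where a lower value is worse, the comparison would need to be reversed.

```python
from statistics import mean, pstdev
from typing import Sequence


def baseline_threshold(history_ms: Sequence[float], n: float = 2.0) -> float:
    """Threshold = mean of past samples + n * standard deviation.

    history_ms holds past response times for the same time slot.
    """
    return mean(history_ms) + n * pstdev(history_ms)


def is_degraded(recent_ms: Sequence[float], threshold_ms: float,
                min_consecutive: int = 3) -> bool:
    """True when the last min_consecutive samples all exceed the threshold."""
    if len(recent_ms) < min_consecutive:
        return False
    return all(v > threshold_ms for v in recent_ms[-min_consecutive:])
```

Requiring several consecutive samples above the threshold keeps a single slow reply from being reported as degradation.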
[0026]
The route information collection processing activation processing unit 116 is located in the network monitoring device 110, activates the route information collection processing unit 102 in the network device 100 to collect the one-way route information 104, and inputs the collection result.
[0027]
The route information creation processing unit 117 is located in the network monitoring device 110 and creates route information 118 in both directions from the input one-way route information 104. Details of a method of creating the route information 118 will be described later.
[0028]
The degradation cause part narrowing processing unit 119 is located in the network monitoring device 110. When a response time/arrival rate equal to or greater than the threshold is observed on a monitored route, it uses the route information 118 to automatically find the network part that causes the degradation of the response time and/or arrival rate. Details of the narrowing method will be described later.
[0029]
The operation information collection setting file creation processing unit 120 is located in the network monitoring device 110. It takes the network device 107 located at the automatically determined cause part as the target of operation information collection, decides the types of operation information to collect, the collection cycle, and the collection period, and creates a setting file for collecting the operation information.
[0030]
The operation information collection processing activation processing unit 121 is located in the network monitoring device 110 and activates the operation information collection processing unit 126.
[0031]
The operation information collection processing unit 126 is located in the network monitoring device 110. Following the operation information collection setting file created by the operation information collection setting file creation processing unit 120, it collects the network operation information 106 that the network device 107 measured with the operation information measurement processing unit 105, and receives the collected result as its own input.
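The narrowing steps given later in the description (step 6-1 through step 6-5) are plain set operations on the interface addresses of each route, so they can be sketched compactly. The function name `narrow_cause` is an assumption of this sketch and does not come from the specification. Passing every normal route is equivalent to first selecting only the overlapping ones in step 6-1, because a normal route that does not overlap contributes an empty intersection.

```python
from typing import Iterable, Set


def narrow_cause(degraded: Iterable[Set[str]],
                 normal: Iterable[Set[str]]) -> Set[str]:
    """Narrow down the addresses that may cause the degradation.

    degraded: address sets of routes on which degradation was detected.
    normal:   address sets of routes that are currently healthy.
    """
    degraded = list(degraded)
    if not degraded:
        return set()
    # Step 6-2: addresses shared by every degraded route.
    common = set(degraded[0]).intersection(*degraded[1:])
    # Step 6-3: overlap of that shared part with each normal route.
    overlaps = [common & set(route) for route in normal]
    # Step 6-4: union of those overlaps (addresses known to be healthy).
    cleared = set().union(*overlaps)
    # Step 6-5: shared addresses that no normal route vouches for.
    return common - cleared
```

With fewer monitored routes the result becomes broader, as the description notes, but the same function applies regardless of how many routes are monitored.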
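A minimal sketch of what such a setting file might contain is shown below. The specification does not define a file format, so the JSON layout, the key names, and the function name `make_collection_config` are all assumptions. The collected items and the example values (a collection cycle of 30 seconds or 1 minute, a collection period of 1 or 3 hours) follow the description of step 504.

```python
import json
from typing import Iterable


def make_collection_config(cause_addresses: Iterable[str],
                           interval_s: int = 60,
                           duration_s: int = 3600) -> str:
    """Build a hypothetical setting file for operation information collection."""
    config = {
        # Devices located at the narrowed-down cause part.
        "targets": sorted(cause_addresses),
        # Items collected per device.
        "device_items": ["cpu_utilization", "free_memory"],
        # Items collected per interface.
        "interface_items": [
            "traffic_in_out",
            "packets_in_out",
            "discards_in_out",
            "errors_in_out",
            "collisions",
        ],
        "interval_s": interval_s,   # collection cycle
        "duration_s": duration_s,   # collection period
    }
    return json.dumps(config, indent=2)
```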
[0032]
The operation information storage processing unit 127 is provided in the network monitoring apparatus 110, and stores and accumulates the operation information 106 collected and input by the operation information collection processing unit 126 in a storage device such as a hard disk.
[0033]
The operation information display processing unit 128 is located in the network monitoring device 110 and displays the operation information 129 through the network information display device 140.
[0034]
The degradation cause determination processing unit 122 is located in the network monitoring device 110. Based on the operation information 106 collected and input by the operation information collection processing unit 126, it determines whether the degradation of the response time and/or arrival rate is caused by insufficient device performance or insufficient line bandwidth. Details of the determination method will be described later.
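The determination described for step 506 can be sketched as below: if an operation metric such as CPU utilization, line utilization, discarded packets, or collisions stays above its threshold, the degradation is attributed to a performance shortage; otherwise a fault in the device itself, or in an adjacent device without an IP address in the route information, is suspected. The function names, the metric names, the returned strings, and the use of consecutive samples to represent a sustained state are assumptions of this sketch.

```python
from typing import Dict, Sequence


def sustained_over(samples: Sequence[float], threshold: float,
                   min_consecutive: int = 3) -> bool:
    """True when the last min_consecutive samples all exceed the threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(v > threshold for v in samples[-min_consecutive:])


def judge_cause(metrics: Dict[str, Sequence[float]],
                thresholds: Dict[str, float],
                min_consecutive: int = 3) -> str:
    """Classify the degradation cause from collected operation information."""
    over = [
        name for name, samples in metrics.items()
        if name in thresholds
        and sustained_over(samples, thresholds[name], min_consecutive)
    ]
    if over:
        # Some resource stays above its threshold: performance shortage.
        return "performance shortage: " + ", ".join(sorted(over))
    # All metrics look normal: suspect a software or hardware fault, or an
    # adjacent device (ATM switch, switching hub) that has no IP address.
    return "device fault or adjacent non-IP device suspected"
```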
[0035]
The degraded route detour setting processing unit 123 is located in the network monitoring device 110, and diverts the communication route to a route that does not pass through a part that causes a deterioration in response time and / or arrival rate. Details of the detour setting method will be described later.
[0036]
The failure analysis support processing display processing unit 124 is located in the network monitoring device 110 and displays the failure analysis support processing information 125 through the network information display device 140. The failure analysis support processing information 125 consists of: information on the network part that causes the degradation of the response time and/or arrival rate, as found by the degradation cause part narrowing processing unit 119; information that the operation information collection processing activation processing unit 121 has activated the operation information collection processing unit 126; information on the cause of the degradation of the response time and/or arrival rate, as determined by the degradation cause determination processing unit 122; and information that the degraded route detour setting processing unit 123 has diverted the communication route, together with information on the detour route.
[0037]
The network information display device 140 can also be configured by a general personal computer, like the network monitoring device 110.
[0038]
The display processing call processing unit 141 is located in the network information display device 140, and displays the response time/arrival rate information 115 as a graph or the like by calling the response time display processing unit 114 in the network monitoring device 110. By calling the operation information display processing unit 128 in the network monitoring device 110, it displays the network operation information 129 as a graph or the like. Furthermore, by calling the failure analysis support processing display processing unit 124 in the network monitoring device 110, it displays the failure analysis support processing information 125.
[0039]
Each processing unit of each of the above devices is embodied by a central processing unit (CPU) in each of the above devices executing a program. The program may be stored in advance in a storage device in each device, or may be introduced from another device via a removable storage medium or a communication medium.
[0040]
FIG. 2 is an example of a logical configuration of a network system to be monitored in one embodiment of the present invention. In a large-scale network system including thousands or more network devices, from the viewpoint of network expandability and line cost, a relay hub 201 serving as a network hub between the data center 200 and each of the branch offices 204 to 207, In many cases, the network topology is a topology in which the circuit 202 is installed and the lines are aggregated. However, a network topology in which the data center 200 and each of the branches 204 to 207 are connected in a mesh form by a physical line or a logical line without a relay base is often used. I don't care. In addition, from the viewpoint of reliability, a plurality of communication paths from the clients 212 to 215 to the servers 210 and 211 are often provided.
[0041]
The servers 210 and 211, the clients 212 to 215, and the routers 220 to 237 correspond to the network device 100 in FIG. 1, and each processing unit of the response time measurement processing unit 101, the path information collection processing unit 102, and the operation information measurement processing unit 105 Having.
[0042]
The monitoring device 216 installed in the monitoring center 203 corresponds to the network monitoring device 110 in FIG. 1, and includes a response time measurement processing activation processing unit 111, a response time / arrival ratio information storage processing unit 113, a response time display processing unit 114, Response time deterioration detection processing section 112, path information collection processing start processing section 116, path information creation processing section 117, deterioration cause part narrowing down processing section 119, operation information collection setting file creation processing section 120, operation information collection processing start processing section 121 , An operation information collection processing unit 126, an operation information storage processing unit 127, an operation information display processing unit 128, a deterioration cause determination processing unit 122, a deterioration route detour setting processing unit 123, and a failure analysis support processing display processing unit 124. Have. When one monitoring device 216 cannot cover all the monitoring target devices due to a large number of monitoring targets, the plurality of monitoring devices 216 can share the monitoring target devices.
[0043]
Next, an example of the process by which the network monitoring device 110 having the functional configuration of FIG. 1 creates the route information used in the failure analysis support process for the network system having the configuration of FIG. 2 will be described with reference to the flowchart of FIG. 3 and to FIG. 4.
[0044]
(Step 300) For each communication path on which the response time / arrival rate is being measured, the IP address information of the network devices traversed is periodically created and updated as the route information. The update cycle is set to several times a day, according to how frequently the network topology changes. The route information is obtained by executing the traceroute command in both directions from the network devices at both ends of each communication path. The traceroute command outputs one address for each device traversed on the route. To obtain the addresses of both the input interface and the output interface of each device, the command is executed in both directions on the path and the two outputs are combined. Hereinafter, a case where the route information is created for the route A (240) in FIG. 2 will be described as an example.
[0045]
(Step 301) The path information collection processing activation processing unit 116 in the network monitoring device 216 periodically logs in remotely to the routers 230 and 220 located at both ends of the path A (240) and starts the path information collection processing unit 102 in each. Here, the route information collection processing units 102 in the router 230 and the router 220 are used as the route information collection agents, but the route information collection processing units 102 in the client 212 and the server 210 may be used instead.
[0046]
The route information collection processing unit 102 in the router 230 executes the traceroute command with the address j1 (271) of the server 210 as the target, and outputs the one-way route information 400 from the client 212 to the server 210 for the route A (240). The route information collection processing activation processing unit 116 in the network monitoring device 216 inputs the one-way route information 400 to the route information creation processing unit 117.
[0047]
Similarly, the route information collection processing unit 102 in the router 220 executes the traceroute command with the address a1 (250) of the client 212 as the target, and outputs the one-way route information 401 from the server 210 to the client 212 for the route A (240). The route information collection processing activation processing unit 116 in the network monitoring device 216 inputs the one-way route information 401 to the route information creation processing unit 117.
[0048]
Note that the traceroute command used for collecting the route information is normally implemented in network devices such as routers, so no special software or hardware needs to be incorporated.
[0049]
(Step 302) The route information creation processing unit 117 in the network monitoring device 216 combines the one-way route information 400 from the client 212 to the server 210 and the one-way route information 401 from the server 210 to the client 212 for the route A (240), as shown in FIG. 4, to create the route information 402 for the route A (240).
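The combination in (Step 302) can be illustrated with a minimal sketch. It assumes that the two one-way results are simply merged into one set of interface addresses; the exact layout of FIG. 4 is not reproduced here, and the function name and all addresses are hypothetical.

```python
def create_route_info(forward_hops, reverse_hops):
    """Merge two one-way hop lists into one set of interface addresses.

    forward_hops: addresses reported by traceroute from the client side.
    reverse_hops: addresses reported by traceroute from the server side.
    Assumption: route information is treated as an unordered set.
    """
    return set(forward_hops) | set(reverse_hops)


# Hypothetical one-way results standing in for route information 400 and 401.
one_way_400 = ["b1", "d1", "f1", "j1"]
one_way_401 = ["i1", "e1", "c1", "a1"]

route_info_402 = create_route_info(one_way_400, one_way_401)
```

Using a set keeps the later comparison between routes simple, at the cost of discarding hop order.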
[0050]
Next, an example of the network performance measurement and failure analysis support process performed for the network system having the configuration of FIG. 2 by the network monitoring device 110 having the functional configuration of FIG. 1 will be described with reference to the flowchart of FIG. 5.
[0051]
(Step 500) The response time and arrival rate are measured periodically for each communication path set as a monitoring path, and when deterioration in communication quality is detected, the failure analysis support process is performed. The measurement cycle is set to a few minutes, for example every 10 minutes or every 5 minutes. The monitoring paths are the communication paths that connect, in a mesh, the network segments (broadcast domains) in the data center with the network segments (broadcast domains) in each branch. In the example of FIG. 2, the communication paths connecting each of the clients 212 to 215 with each of the servers 210 and 211 are the monitoring paths. If the monitoring traffic would occupy a large part of the line bandwidth and might interfere with normal business traffic, only the communication paths between the network segment where business servers with response time requirements are installed and the network segments of representative branches are selected as monitoring paths. Hereinafter, a case where the response time / arrival rate is measured for the route A (240) in FIG. 2 will be described as an example.
[0052]
(Step 501) The response time measurement processing activation processing unit 111 in the network monitoring device 216 periodically logs in remotely to the router 230 located on the branch side (client side) of the monitoring path and activates the response time measurement processing unit 101. Here, the response time measurement processing unit 101 in the router 230 is used as the response time measurement agent, but the response time measurement processing unit 101 in the client 212 may be used instead. It is also possible to log in remotely to the router 220 or the server 210 located on the server side instead of the client side and use their response time measurement processing units 101.
[0053]
The response time measurement processing unit 101 in the router 230 executes the ping command with the address j1 (271) of the server 210 as the target, and outputs the round-trip response time from the client 212 to the server 210 on the route A (240) and the arrival rate to the server 210 (packet loss information). The response time measurement processing activation processing unit 111 in the network monitoring device 216 inputs the response time / arrival rate information to the response time deterioration detection processing unit 112.
[0054]
The ping command used for measuring the response time / arrival rate is normally implemented in network devices such as routers, so no special software or hardware needs to be incorporated.
[0055]
(Step 502) The response time degradation detection processing unit 112 in the network monitoring device 216 determines, for each monitoring path, whether the response time and the IP packet arrival rate have exceeded the threshold set for that path for a certain period or longer. The criteria for setting the threshold are as follows.
[0056]
・ The design value of the response time / arrival rate for each route of the network
・ The average value for the same time slot in past measurement results, plus n times the standard deviation for the same time slot
・ The average value for the same day and time slot in past measurement results, plus n times the standard deviation for the same time slot
・ The average value for the same week, day, and time slot in past measurement results, plus n times the standard deviation for the same time slot
・ The average value for the same date and time slot in past measurement results, plus n times the standard deviation for the same time slot
Here, n is a value of about 2 to 3, and a more appropriate value is determined based on past observation results. If the observed response time and / or arrival rate exceeds the threshold value on the monitoring path, it is determined that the communication quality has deteriorated.
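The statistical criteria above can be sketched as follows for the response time. This is a minimal illustration, not the patented implementation: the function names, the use of the population standard deviation, the sample values, and the choice of three consecutive measurements as the "certain period" are all assumptions.

```python
from statistics import mean, pstdev


def threshold(past_samples, n=2.0):
    """Average of past measurements plus n times their standard deviation.

    Assumption: past_samples already holds only the values for the matching
    time slot, and the population standard deviation is used.
    """
    return mean(past_samples) + n * pstdev(past_samples)


def is_degraded(recent, limit, min_consecutive=3):
    """True if the last min_consecutive measurements all exceed the limit."""
    if len(recent) < min_consecutive:
        return False
    return all(value > limit for value in recent[-min_consecutive:])


# Hypothetical response times in milliseconds for one monitoring path.
past = [10, 12, 14, 12, 12]
limit = threshold(past, n=2.0)

sustained = is_degraded([11, 15, 16, 17], limit)
momentary = is_degraded([15, 16, 12], limit)
```

For the arrival rate, where deterioration means a drop, the same check could be applied to the packet loss rate instead; that mapping is also an assumption.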
[0057]
(Step 503) If the response time / arrival rate exceeds the threshold on at least one monitoring path, the deterioration cause part narrowing down processing unit 119 in the network monitoring device 216 automatically determines the part causing the deterioration in communication quality. Hereinafter, a case where the communication quality is deteriorated due to the interface (IP address d1) 262 of the router 226 will be described as an example. In this case, quality deterioration is detected in the monitoring route A (240), the monitoring route E (244), and the monitoring route B (241). The narrowing-down method will be described with reference to FIG. 6.
[0058]
(Step 6-1): All route information is retrieved for the monitoring routes in which deterioration is detected (route information (600) of route A (240), route information (601) of route E (244), and route information (602) of route B (241)) and for the routes in a normal state that partially overlap with them (route information (603) of route C (242) and route information (604) of route D (243)).
[0059]
(Step 6-2): The intersection 605 of the route information of the routes in which deterioration is detected (route information (600) of route A (240), route information (601) of route E (244), and route information (602) of route B (241)) is obtained.
[0060]
(Step 6-3): The intersection of the set 605 obtained in (Step 6-2) with the route information of each normal route (route information (603) of route C (242) and route information (604) of route D (243)) is obtained.
[0061]
(Step 6-4): The union 608 of the sets obtained in (Step 6-3) is obtained.
[0062]
(Step 6-5): The difference between the set 605 obtained in (Step 6-2) and the set 608 obtained in (Step 6-4), that is, the complement of the set 608 relative to the set 605, is obtained. In the case of FIG. 6, it is determined that the interface (IP address d1) 262 of the router 226 is the cause of the quality degradation.
[0063]
The operations used in calculating the final set 608 may be rearranged in accordance with the rules of set operations.
[0064]
When the number of monitoring paths is small, the narrowed-down result is broader. For example, if the route B (241) is not monitored in the example of FIG. 6, the cause parts finally determined are c1 and d1. Although the narrowed-down range changes with the number of monitoring paths, the algorithm can be applied regardless of that number.
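Steps 6-2 to 6-5 can be written directly as set operations. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the route contents are invented for illustration because the actual addresses of FIG. 6 are not reproduced here. Only d1 and c1 are chosen to mirror the outcome described in the text.

```python
def narrow_down(degraded_routes, normal_routes):
    """Return the interface addresses suspected of causing the degradation.

    Each argument is a list of sets of interface addresses (route information).
    degraded_routes must contain at least one route.
    """
    # Step 6-2: addresses shared by every degraded route (set 605).
    common = set.intersection(*degraded_routes)
    # Step 6-3: the part of that set that also lies on each normal route.
    overlaps = [common & route for route in normal_routes]
    # Step 6-4: union of those overlaps (set 608).
    cleared = set().union(*overlaps)
    # Step 6-5: addresses on all degraded routes and on no normal route.
    return common - cleared


# Hypothetical route information (sets of interface addresses).
route_a = {"a1", "b1", "c1", "d1", "i1", "j1"}  # degraded
route_e = {"a1", "b2", "c1", "d1", "i1", "j1"}  # degraded
route_b = {"e1", "f1", "d1", "i1", "j1"}        # degraded
route_c = {"a1", "b1", "h1", "i1", "j1"}        # normal
route_d = {"e1", "f1", "h1", "i1", "j1"}        # normal

suspects = narrow_down([route_a, route_e, route_b], [route_c, route_d])
suspects_without_b = narrow_down([route_a, route_e], [route_c, route_d])
```

With these invented routes, `suspects` is `{"d1"}`, and dropping route B from the monitored set widens the result to `{"c1", "d1"}`, which matches the behavior described for a smaller number of monitoring paths.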
[0065]
(Step 504) The operation information collection setting file creation processing unit 120 in the network monitoring device 216 determines the items to be collected, the collection cycle, and the collection period for the network devices having the addresses narrowed down in (Step 503), and creates a setting file for the operation information collection processing unit 126. For network devices such as routers and layer 3 switches, the collected items include CPU utilization and available memory. For the interfaces of the network devices, they include the input/output traffic volume, the number of input/output packets, the number of discarded input/output packets, the number of input/output error packets, and the number of collisions. The collection cycle is set to a preset value such as 1 minute or 30 seconds, or to 1/10 of the cycle of normal periodic operation information collection. The collection period is either a preset value such as 1 hour or 3 hours, or lasts until subsequent response time / arrival rate measurements on the monitoring route that exceeded the threshold fall to or below the threshold.
[0066]
(Step 505) The operation information collection processing activation processing unit 121 in the network monitoring device 216 activates the operation information collection processing unit 126. The operation information collection processing unit 126 collects operation information from the network device located at the deterioration cause part according to the setting file created in (Step 504), and inputs the operation information to the deterioration cause determination processing unit 122.
[0067]
(Step 506) The deterioration cause determination processing unit 122 in the network monitoring device 216 estimates the cause of the deterioration in communication quality based on the input operation information. When an item of operation information remains above the threshold set for it, for example:
- The CPU utilization remains above the threshold.
[0068]
- The line utilization rate remains above the threshold.
[0069]
- The number of discarded packets remains above the threshold.
[0070]
- The number of collisions remains above the threshold.
In such a case, it is determined that the communication quality has deteriorated due to insufficient performance of the network device or line showing that state. The threshold is determined in the same way as the threshold for the response time / arrival rate in (Step 502). If the operation information of the network device at the deterioration cause part is at or below the threshold and the operation status is judged to be normal, it is determined that the deterioration is due either to a software or hardware defect of that network device, or to insufficient performance or malfunction of a device that has no IP address in the route information, such as an ATM switch or a switching hub adjacent to that network device.
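The decision in (Step 506) can be sketched as a simple classification. This is an illustrative assumption rather than the patented logic: the function name, metric names, and result strings are hypothetical, and the input is assumed to already say whether each item has stayed above its threshold.

```python
def judge_cause(sustained_over_threshold):
    """Classify the likely cause from per-item threshold results.

    sustained_over_threshold maps an item name to True when that item has
    remained above its threshold, and to False otherwise.
    """
    exceeded = sorted(
        name for name, over in sustained_over_threshold.items() if over
    )
    if exceeded:
        # At least one item stayed above its threshold: performance shortage.
        return "insufficient performance: " + ", ".join(exceeded)
    # All items normal: suspect a defect, or an adjacent device that does
    # not appear in the route information (for example a switching hub).
    return "device defect or adjacent device without IP address"


# Hypothetical inputs for two devices at the narrowed-down cause part.
busy_router = judge_cause(
    {"cpu_utilization": True, "line_utilization": False, "collisions": False}
)
quiet_router = judge_cause(
    {"cpu_utilization": False, "line_utilization": False, "collisions": False}
)
```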
[0071]
(Step 507) When the response time and the IP packet arrival rate on a monitoring route have exceeded the threshold for a certain period or longer, and no deterioration is detected on the detour route for that route, the degraded route detour setting processing unit 123 in the network monitoring device 216 closes an interface of a network device on the route in which the deterioration is detected, so that traffic is detoured from the degraded route to the normal route by the action of the dynamic routing protocol. This operation will be described with reference to FIG. 2. When deterioration is continuously detected on the route A (240) from the client 212 to the server 210 and the detour route E (244) is in a normal state, the degraded route detour setting processing unit 123 in the network monitoring device 216 logs in remotely to the router 230 and closes the interface (IP address c1) 258 of the router 230, thereby detouring traffic from the route A (240) to the route E (244). The interface to be closed may be any interface that is on the degraded route and whose closure causes traffic to detour to the normal route.
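The conditions for the detour can be sketched as below. The function names and addresses are hypothetical, and the candidate filter rests on an assumption: an interface shared with the detour route is excluded because closing it would also cut the detour. Whether closing a given candidate actually shifts traffic depends on the routing protocol configuration, which this sketch does not model.

```python
def should_detour(degraded_for_set_period, detour_route_is_normal):
    """Detour only when degradation persists and the detour route is healthy."""
    return degraded_for_set_period and detour_route_is_normal


def candidate_interfaces(degraded_route, detour_route):
    """Addresses on the degraded route that the detour route does not use."""
    return [addr for addr in degraded_route if addr not in detour_route]


# Hypothetical ordered route information for a degraded route and its detour.
degraded = ["a1", "c1", "d1", "j1"]
detour = ["a1", "e1", "f1", "j1"]

candidates = []
if should_detour(degraded_for_set_period=True, detour_route_is_normal=True):
    candidates = candidate_interfaces(degraded, detour)
```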
[0072]
In the present embodiment, with the above configuration and with the above steps performed in the network monitoring device, the response time between the devices at both ends of each communication path is measured as part of the periodic performance measurement of the network, and this information makes it possible to efficiently isolate the cause part when the response time and/or the arrival rate deteriorates. It is also possible to determine whether the deterioration of the response time and/or the arrival rate is due to insufficient performance of a device or insufficient line bandwidth. Further, the communication route can be detoured to a route that does not pass through the cause part.
[0073]
[Effect of the invention]
According to the present invention, in a network system, it becomes possible to efficiently measure the response time / arrival rate, narrow down the deterioration cause part, estimate the deterioration cause, and avoid the degraded route.
[Brief description of the drawings]
FIG. 1 is a system configuration diagram of the present embodiment.
FIG. 2 is an example of a network logical configuration diagram and response time monitoring paths according to the present embodiment.
FIG. 3 is a flowchart of a route information creation process according to the present embodiment.
FIG. 4 is a diagram illustrating a method of creating route information according to the present embodiment.
FIG. 5 is a flowchart of a failure analysis support process according to the present embodiment.
FIG. 6 is a diagram illustrating a method of narrowing down a deterioration cause part according to the present embodiment.
[Explanation of symbols]
100: network device, 101: response time measurement processing unit, 102: path information collection processing unit, 103: response time / arrival rate information, 104: one-way path information, 105: operation information measurement processing unit, 106: operation information, 107: network device, 110: network monitoring device, 111: response time measurement processing activation processing unit, 112: response time deterioration detection processing unit, 113: response time / arrival rate information storage processing unit, 114: response time display processing unit, 115: response time / arrival rate information, 116: path information collection processing activation processing unit, 117: path information creation processing unit, 118: path information, 119: deterioration cause part narrowing down processing unit, 120: operation information collection setting file creation processing unit, 121: operation information collection processing activation processing unit, 122: deterioration cause determination processing unit, 123: deterioration route detour setting processing unit, 124: failure analysis support processing display processing unit, 125: failure analysis support processing information, 126: operation information collection processing unit, 127: operation information storage processing unit, 128: operation information display processing unit, 129: operation information, 140: network information display device, 141: display processing call processing unit

Claims

A network failure analysis support method for a network in which a plurality of clients and a plurality of servers are connected via a plurality of network devices, the method comprising:
measuring the response time and/or arrival rate of IP packets on a path from a branch line to the opposite branch line using a response time measurement agent incorporated in a device on the branch line; and
when deterioration of the response time and/or arrival rate is detected on any of the paths, obtaining the part causing the deterioration of the response time and/or arrival rate by comparing a plurality of pieces of route information, the route information being the IP address information of the network devices traversed on each path.

A network failure analysis support system for a network in which a plurality of clients and a plurality of servers are connected via a plurality of network devices, the system comprising:
response time measuring means for measuring the response time and/or arrival rate of IP packets on a path from a branch line to the opposite branch line using a response time measurement agent incorporated in a device on the branch line;
response time deterioration detecting means for detecting deterioration of the response time and/or arrival rate;
route information creating means for creating route information from the IP address information of the network devices traversed on each path; and
deterioration cause part narrowing means for obtaining the part causing the deterioration by identifying, through comparison of a plurality of pieces of route information, the partial IP addresses of the part causing the deterioration of the response time and/or arrival rate.

The network failure analysis support system according to claim 2, further comprising:
operation information collection setting file creating means for creating a setting file for collecting operation information from the network device located at the part causing the detected deterioration of the response time and/or arrival rate;
operation information collecting means for collecting operation information from the network device based on the created setting file; and
deterioration cause determining means for determining, based on the collected operation information, whether the deterioration has occurred due to insufficient device performance or insufficient line bandwidth.

The network failure analysis support system according to claim 2, further comprising:
degraded route detour setting means for, when a plurality of routes are set from a client to a server and deterioration of the response time and/or arrival rate is detected on one of those routes, closing an interface of a network device on that route so that communication is detoured to a route on which no deterioration has been detected.