JP5767617B2

JP5767617B2 - Network failure detection system and network failure detection device

Info

Publication number: JP5767617B2
Application number: JP2012213349A
Authority: JP
Inventors: 直規立石; 光穂田原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-09-27
Filing date: 2012-09-27
Publication date: 2015-08-19
Anticipated expiration: 2032-09-27
Also published as: JP2014068283A

Description

The present invention relates to a network failure detection system and a network failure detection device that detect failures occurring in a network.

In network monitoring, a commonly used technique is to send a monitoring message from a monitoring base (monitoring terminal) to a monitoring target device and to determine whether the device is normal or abnormal from the presence or absence of a response, the content of the response, the time required for the response (RTT: Round Trip Time), and the like (see Non-Patent Documents 1 and 2).

The time required for this response (RTT) is also used as a general index for evaluating the state of a system. For example, when the system load increases, the processing of responses to externally transmitted messages is delayed and the response time grows. An increase in system load can therefore be detected by storing the normal response time and monitoring how far the current response time deviates from it. The deviation is often monitored by setting a predetermined threshold in advance and detecting when it is exceeded (see, for example, Non-Patent Document 3).

FIG. 7 is a diagram showing an example of a conventional technique that uses RTT to determine whether a monitoring target device 200 is normal or abnormal.
As shown in FIG. 7(a), in the normal state the monitoring terminal 10 receives the response to a transmitted monitoring message (for example, a ping) from the monitoring target device 200 within a time (RTT value) at or below a predetermined threshold. On the other hand, as shown in FIG. 7(b), when the monitoring target device 200 takes longer to respond because of increased load or the like, the RTT value of the response received by the monitoring terminal 10 increases and exceeds the predetermined threshold. The monitoring terminal 10 thereby determines that an abnormality has occurred in the monitoring target device 200.
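
The conventional check described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited product: the names `measure_rtt_ms`, `is_abnormal`, and the `probe` callable (which stands in for sending a monitoring message such as a ping and waiting for the reply) are assumptions introduced here.

```python
import time
from typing import Callable, Optional


def measure_rtt_ms(probe: Callable[[], bool],
                   clock: Callable[[], float] = time.perf_counter) -> Optional[float]:
    """Time one request/response exchange.

    `probe` is a hypothetical callable that sends the monitoring message and
    returns True if a response arrived. Returns the RTT in milliseconds, or
    None when there was no response.
    """
    start = clock()
    answered = probe()
    elapsed_ms = (clock() - start) * 1000.0
    return elapsed_ms if answered else None


def is_abnormal(rtt_ms: Optional[float], threshold_ms: float) -> bool:
    """Conventional rule: a missing response, or an RTT above the single
    predetermined threshold, is treated as an abnormality."""
    return rtt_ms is None or rtt_ms > threshold_ms
```

The clock is passed in as a parameter only so that the sketch can be exercised without real network traffic.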

Non-Patent Document 1: "INTERNET CONTROL MESSAGE PROTOCOL", [online], September 1981, IETF RFC792, [retrieved September 7, 2012], Internet <URL:http://www.ietf.org/rfc/rfc792.txt>
Non-Patent Document 2: "An Architecture for Describing Simple Network Management Protocol (SNMP) Management Frameworks", [online], December 2002, IETF RFC3411, [retrieved September 7, 2012], Internet <URL:http://www.ietf.org/rfc/rfc3411.txt>
Non-Patent Document 3: "NetKids iMark V3 Ping Edition", [online], 株式会社アイ・エス・ティ, [retrieved September 7, 2012], Internet <URL:http://www.istinc.co.jp/product/net/NKI3Ping_pre120201.pdf>

The RTT of a monitoring message corresponds to the sum of the message processing time of the monitoring target device and the propagation delay between the monitoring base and the monitoring target device. When a wide-area network is monitored from a single monitoring base, this propagation delay can differ greatly from one monitoring target device to another. If only one RTT threshold is then used for the entire network, a device close to the monitoring base does not exceed the threshold even when its load rises sharply and its message processing time grows considerably, whereas a device far from the monitoring base exceeds the threshold with only a slight increase in load. The problem is that the event actually detected differs depending on the monitoring target.

As shown in FIG. 8, suppose that the monitoring terminal 10 determines whether a nearby monitoring target device 200 (210) and a remote monitoring target device 200 (220) are normal or abnormal using a single RTT threshold, and that the two devices have the same CPU (Central Processing Unit), memory, and other performance characteristics and execute the same processing with the same program. In this case, the nearby monitoring target device 200 (210) has a small RTT value when normal, so its margin to the threshold is large; even if a serious abnormality occurs, the RTT value may not exceed the threshold and the abnormality may go undetected. The remote monitoring target device 200 (220), by contrast, has a large RTT value when normal, so its margin to the threshold is small; even a minor abnormality may push the RTT value over the threshold and cause the device to be judged abnormal. Consequently, even when the same abnormality occurs in both devices, the normal/abnormal determination can yield different results.
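
The effect can be illustrated with numbers. The following minimal sketch models RTT as message processing time plus propagation delay, as described above. The threshold, all delay values, and the names `rtt_ms` and `margin_ms` are hypothetical and chosen only for illustration.

```python
def rtt_ms(processing_ms: float, propagation_ms: float) -> float:
    """RTT modeled as message processing time plus propagation delay."""
    return processing_ms + propagation_ms


# All values below are assumptions made for this illustration.
SINGLE_THRESHOLD_MS = 100.0   # one threshold for the whole network
NEAR_PROPAGATION_MS = 2.0     # nearby monitoring target device 200 (210)
FAR_PROPAGATION_MS = 90.0     # remote monitoring target device 200 (220)
NORMAL_PROCESSING_MS = 5.0    # identical hardware and program on both devices


def margin_ms(propagation_ms: float) -> float:
    """Headroom between the normal-state RTT and the single threshold."""
    return SINGLE_THRESHOLD_MS - rtt_ms(NORMAL_PROCESSING_MS, propagation_ms)


near_margin = margin_ms(NEAR_PROPAGATION_MS)   # large margin
far_margin = margin_ms(FAR_PROPAGATION_MS)     # small margin

# The same abnormality on both devices: processing time grows by 20 ms.
overloaded_ms = NORMAL_PROCESSING_MS + 20.0
near_flagged = rtt_ms(overloaded_ms, NEAR_PROPAGATION_MS) > SINGLE_THRESHOLD_MS
far_flagged = rtt_ms(overloaded_ms, FAR_PROPAGATION_MS) > SINGLE_THRESHOLD_MS
```

With these assumed values, the identical 20 ms increase in processing time is flagged only for the remote device, which is the inconsistency the single threshold produces.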

Besides increased load on the monitoring target device, an increase in RTT may be caused by an abnormality such as a change of network route (hereinafter, "network abnormality"). Examples of network abnormalities include route changes accompanying the failure or inspection of devices in the network, and delays accompanying increased processing load on relay devices. With the conventional RTT-based monitoring technique, however, it is not possible to distinguish whether an increase in the RTT value is due to increased load or the like on the monitoring target device or due to a network abnormality.

FIG. 9(a) shows an example in which the RTT value exceeds the predetermined threshold because response processing took longer owing to increased load or the like on the monitoring target device 200. FIG. 9(b), on the other hand, shows an example in which the RTT value exceeds the predetermined threshold because a network abnormality delayed the transmission of the monitoring message between the monitoring terminal 10 and the monitoring target device 200 relative to the normal state. In both cases the RTT value exceeds the predetermined threshold, so it can be determined that an abnormality has occurred. However, with a technique in which the monitoring terminal 10 monitors RTT one monitoring target device 200 at a time, it has not been possible to distinguish whether the abnormality is due to increased load or the like on the monitoring target device 200 or due to a network abnormality.

The present invention has been made in view of this background, and its object is to provide a network failure detection system and a network failure detection device that can set an appropriate RTT threshold that takes propagation delay into account.

To solve the above problem, the invention according to claim 1 is a network failure detection system comprising a plurality of monitoring target devices constituting a network, and a network failure detection device that transmits monitoring messages to the plurality of monitoring target devices and detects a failure of the network using an RTT (Round Trip Time) based on the response message received from each of the monitoring target devices, wherein the network failure detection device comprises: a storage unit that stores distribution information indicating the distribution used to analyze the variation in RTT of the monitoring target devices, threshold determination logic that determines, as the threshold for judging a monitoring target device abnormal, the RTT value corresponding to a predetermined value of the degree of variation in the distribution, and test result information indicating, as a monitoring result based on the response messages received from each of the plurality of monitoring target devices in the normal state, the identification information of each monitoring target device and the RTT of that device; a grouping processing unit that acquires the normal-state test result information of each of the plurality of monitoring target devices, groups monitoring target devices with similar RTTs using a predetermined grouping method, and, for each of the plurality of groups produced by the grouping, generates the RTT distribution of the monitoring target devices belonging to that group according to the distribution indicated in the distribution information; and a threshold determination unit that, for each distribution generated for a group, determines as the threshold of that group the RTT value corresponding to the predetermined value of the degree of variation in the distribution, based on the threshold determination logic.

The invention according to claim 5 is the network failure detection device of a network failure detection system comprising a plurality of monitoring target devices constituting a network, and a network failure detection device that transmits monitoring messages to the plurality of monitoring target devices and detects a failure of the network using an RTT (Round Trip Time) based on the response message received from each of the monitoring target devices, the network failure detection device comprising: a storage unit that stores distribution information indicating the distribution used to analyze the variation in RTT of the monitoring target devices, threshold determination logic that determines, as the threshold for judging a monitoring target device abnormal, the RTT value corresponding to a predetermined value of the degree of variation in the distribution, and test result information indicating, as a monitoring result based on the response messages received from each of the plurality of monitoring target devices in the normal state, the identification information of each monitoring target device and the RTT of that device; a grouping processing unit that acquires the normal-state test result information of each of the plurality of monitoring target devices, groups monitoring target devices with similar RTTs using a predetermined grouping method, and, for each of the plurality of groups produced by the grouping, generates the RTT distribution of the monitoring target devices belonging to that group according to the distribution indicated in the distribution information; and a threshold determination unit that, for each distribution generated for a group, determines as the threshold of that group the RTT value corresponding to the predetermined value of the degree of variation in the distribution, based on the threshold determination logic.

In this way, a network failure detection system equipped with the network failure detection device can group monitoring target devices that have similar RTTs in the normal state and determine a threshold for each group. An appropriate RTT threshold that takes the propagation delay in the network into account can therefore be set.

The invention according to claim 2 is the network failure detection system according to claim 1, wherein the storage unit of the network failure detection device further stores group threshold information, in which the threshold determined by the threshold determination unit for each of the plurality of groups generated by the grouping processing unit is stored in association with that group, and the test result information obtained when monitoring is executed for each of the plurality of monitoring target devices, and wherein the network failure detection device further comprises a threshold judgment unit that acquires the monitoring-time test result information of each of the plurality of monitoring target devices, refers to the group threshold information using the identification information of the monitoring target device contained in the monitoring-time test result information, extracts the threshold of the group to which that monitoring target device belongs, judges whether the RTT of that monitoring target device contained in the monitoring-time test result information exceeds the extracted group threshold, and outputs the judgment result as threshold judgment result information.

The invention according to claim 6 is the network failure detection device according to claim 5, wherein the storage unit further stores group threshold information, in which the threshold determined by the threshold determination unit for each of the plurality of groups generated by the grouping processing unit is stored in association with that group, and the test result information obtained when monitoring is executed for each of the plurality of monitoring target devices, and wherein the network failure detection device further comprises a threshold judgment unit that acquires the monitoring-time test result information of each of the plurality of monitoring target devices, refers to the group threshold information using the identification information of the monitoring target device contained in the monitoring-time test result information, extracts the threshold of the group to which that monitoring target device belongs, judges whether the RTT of that monitoring target device contained in the monitoring-time test result information exceeds the extracted group threshold, and outputs the judgment result as threshold judgment result information.

In this way, the network failure detection device can judge whether a monitoring target device is normal or abnormal at monitoring time using the threshold set for the group to which that device belongs. The normality or abnormality of each monitoring target device can therefore be judged appropriately, taking the propagation delay in the network into account.
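
The lookup-and-compare step performed at monitoring time can be sketched as follows. This is an illustration under assumptions: the record types `RttRecord` and `ThresholdJudgment`, the function `judge_thresholds`, and the use of two plain dictionaries to hold group membership and group threshold information are all hypothetical and are not taken from the original.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RttRecord:
    """Monitoring-time test result: device identification and measured RTT."""
    device_id: str
    rtt_ms: float


@dataclass(frozen=True)
class ThresholdJudgment:
    """One entry of the threshold judgment result information."""
    device_id: str
    group_id: str
    rtt_ms: float
    exceeded: bool


def judge_thresholds(records: Iterable[RttRecord],
                     device_to_group: Dict[str, str],
                     group_threshold_ms: Dict[str, float]) -> List[ThresholdJudgment]:
    """Compare each device's RTT with the threshold of the group it belongs to."""
    judgments = []
    for record in records:
        group_id = device_to_group[record.device_id]   # group the device belongs to
        threshold = group_threshold_ms[group_id]       # threshold stored for that group
        judgments.append(ThresholdJudgment(record.device_id, group_id,
                                           record.rtt_ms, record.rtt_ms > threshold))
    return judgments
```

For example, with assumed group thresholds of 12 ms for a nearby group and 110 ms for a remote group, a nearby device measured at 30 ms is judged to exceed its threshold, while a remote device measured at 95 ms is not.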

The invention according to claim 3 is the network failure detection system according to claim 2, wherein the network failure detection device further comprises a threshold excess cause discrimination unit that refers to the threshold judgment result information, extracts the judgment results for each group, counts the monitoring target devices whose RTT exceeds the threshold of the group, calculates the ratio of that count to the total number of monitoring target devices belonging to the group, and outputs threshold excess cause information indicating the calculated ratio for each group.

The invention according to claim 7 is the network failure detection device according to claim 6, further comprising a threshold excess cause discrimination unit that refers to the threshold judgment result information, extracts the judgment results for each group, counts the monitoring target devices whose RTT exceeds the threshold of the group, calculates the ratio of that count to the total number of monitoring target devices belonging to the group, and outputs threshold excess cause information indicating the calculated ratio for each group.

In this way, the network failure detection device can, for each group, count the monitoring target devices whose RTT exceeds the threshold of the group, calculate the ratio of that count to the total number of monitoring target devices belonging to the group, and output the result as threshold excess cause information.
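
The ratio described above can be written compactly as follows. The symbols are introduced here for illustration and do not appear in the original: G_g is the set of monitoring target devices in group g, \theta_g is the threshold of that group, and RTT_d is the RTT of device d measured at monitoring time.

```latex
r_g = \frac{\left|\{\, d \in G_g : \mathrm{RTT}_d > \theta_g \,\}\right|}{\left|G_g\right|}
```

For a group of four devices in which exactly one device exceeds the group threshold, r_g = 1/4 = 0.25.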

The invention according to claim 4 is the network failure detection system according to claim 3, wherein the threshold excess cause discrimination unit determines, for each group, whether the calculated ratio exceeds a predetermined ratio and, when it does, outputs warning information indicating a network abnormality.

The invention according to claim 8 is the network failure detection device according to claim 7, wherein the threshold excess cause discrimination unit determines, for each group, whether the calculated ratio exceeds a predetermined ratio and, when it does, outputs warning information indicating a network abnormality.

In this way, when the ratio of monitoring target devices whose RTT exceeds the threshold of a group exceeds the predetermined ratio, the network failure detection device can output warning information indicating a network abnormality, attached to the threshold excess cause information.

According to the present invention, it is possible to provide a network failure detection system and a network failure detection device that set an appropriate RTT threshold that takes propagation delay into account.

FIG. 1 is a diagram for explaining an overview of the RTT threshold determination processing performed by the network failure detection system according to the present embodiment.
FIG. 2 is a diagram for explaining the processing, performed by the network failure detection system according to the present embodiment, that discriminates the cause of an abnormal increase in the RTT value.
FIG. 3 is a functional block diagram showing a configuration example of the network failure detection device according to the present embodiment.
FIG. 4 is a diagram showing an example of the data structure of the group threshold information according to the present embodiment.
FIG. 5 is a flowchart showing the flow of the grouping and threshold determination processing performed by the network failure detection device according to the present embodiment.
FIG. 6 is a flowchart showing the flow of the threshold judgment processing and the processing that discriminates the cause of an abnormal increase in the RTT value, performed by the network failure detection device according to the present embodiment.
FIG. 7 is a diagram showing an example of a conventional technique that uses RTT to determine whether a monitoring target device is normal or abnormal.
FIG. 8 is a diagram for explaining the problem that arises when a single threshold is used for the entire network.
FIG. 9 is a diagram for explaining the causes of an increase in the RTT value.

Next, a network failure detection system 1 and related elements according to a mode for carrying out the present invention (hereinafter, "the present embodiment") will be described.

<Overview>
First, an overview of the processing executed by the network failure detection system 1 according to the present embodiment will be described.

FIG. 1 is a diagram for explaining an overview of the RTT threshold determination processing performed by the network failure detection system 1 according to the present embodiment.
As shown in FIG. 1, the network failure detection system 1 according to the present embodiment comprises a network failure detection device 100 and a plurality of monitoring target devices 200 connected to the network failure detection device 100 via a network. In FIG. 1, as an example, the monitoring target devices 200 comprise monitoring target devices 200 (211, 212, 213) installed near the network failure detection device 100 and monitoring target devices 200 (221, 222, 223) installed at a location remote from it. The network failure detection device 100 transmits a monitoring message to each monitoring target device 200, receives the response message, and judges whether each monitoring target device 200 is normal or abnormal according to whether the RTT exceeds a predetermined threshold.

First, based on the RTT values of the monitoring target devices 200 acquired under ordinary (normal) conditions, the network failure detection device 100 groups together monitoring target devices 200 that have similar RTT values (step S1). As a result, as shown in FIG. 1 for example, the monitoring target devices 200 are grouped so that the members of each group have similar RTT values, such as a group of the monitoring target devices 200 (211, 212, 213) installed nearby and a group of the monitoring target devices 200 (221, 222, 223) installed at the remote location.
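
Step S1 can be sketched as follows. The text here speaks only of grouping devices with similar RTT values; the gap-based rule below (sort by normal-state RTT and start a new group whenever the gap to the previous device exceeds `max_gap_ms`) is one simple possibility assumed for illustration, and the function name, parameter, and sample RTT values are hypothetical.

```python
from typing import Dict, List, Optional


def group_by_rtt(normal_rtt_ms: Dict[str, float], max_gap_ms: float) -> List[List[str]]:
    """Group device IDs whose normal-state RTT values lie close together.

    Assumed rule (not from the original): devices are sorted by RTT, and a new
    group starts whenever the difference to the previous RTT exceeds max_gap_ms.
    """
    ordered = sorted(normal_rtt_ms.items(), key=lambda item: item[1])
    groups: List[List[str]] = []
    previous_rtt: Optional[float] = None
    for device_id, rtt in ordered:
        if previous_rtt is None or rtt - previous_rtt > max_gap_ms:
            groups.append([])            # gap too large: open a new group
        groups[-1].append(device_id)
        previous_rtt = rtt
    return groups


# Hypothetical normal-state RTT values for the devices shown in FIG. 1.
normal = {"211": 4.0, "212": 5.0, "213": 6.0, "221": 80.0, "222": 83.0, "223": 85.0}
groups = group_by_rtt(normal, max_gap_ms=10.0)
```

With these assumed values the nearby devices 211 to 213 and the remote devices 221 to 223 fall into two separate groups. Any other clustering method that places similar RTT values together could be substituted.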

Next, the network failure detection device 100 generates a distribution of RTT values for each group (step S2). The network failure detection device 100 then sets, from the generated distribution of RTT values, the value lying on a predetermined probability line as the threshold of the group (step S3). For this probability line, for example, the RTT value corresponding to 3σ (99.7%) (σ: standard deviation), taken as the predetermined value of the degree of variation in the distribution of RTT values (a normal distribution), is set as the predetermined threshold.
In FIG. 1, "a" is set as the threshold of the group of monitoring target devices 200 (211, 212, 213) installed nearby. For the monitoring target devices 200 (221, 222, 223) installed at the remote location, a threshold "b", an RTT value longer than threshold "a", is set as the RTT threshold.
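
Steps S2 and S3 can be sketched as follows, assuming that the RTT values of a group are modeled as a normal distribution and that the threshold is taken at the upper 3σ point, that is, mean + 3σ. The function name `group_threshold_ms`, the use of the population standard deviation, and the sample values are assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def group_threshold_ms(samples_ms: Sequence[float], k_sigma: float = 3.0) -> float:
    """Threshold for one group: mean + k_sigma * standard deviation.

    Assumes the group's normal-state RTT values are roughly normally
    distributed. The population standard deviation is used here; using the
    sample standard deviation instead would be an equally valid choice.
    """
    if len(samples_ms) < 2:
        raise ValueError("at least two RTT samples are needed to estimate the spread")
    return mean(samples_ms) + k_sigma * pstdev(samples_ms)


# Hypothetical normal-state RTT values for the two groups of FIG. 1.
threshold_a = group_threshold_ms([4.0, 5.0, 6.0])      # nearby group
threshold_b = group_threshold_ms([80.0, 83.0, 85.0])   # remote group
```

With these assumed samples, `threshold_b` is larger than `threshold_a`, matching the relation between thresholds "a" and "b" in FIG. 1.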

By setting an RTT threshold for each group of monitoring target devices 200 that have similar RTT values under ordinary (normal) conditions, and judging whether the RTT value of a monitoring target device 200 exceeds the threshold of the group to which it belongs, an abnormality of each monitoring target device 200 can be detected appropriately.

FIG. 2 is a diagram for explaining the processing, performed by the network failure detection system 1 according to the present embodiment, that discriminates the cause of an abnormal increase in the RTT value.
In FIG. 2, it is assumed that monitoring target devices 200 (201, 202, 203, 204) in one and the same group are monitored by the network failure detection device 100.

Among the monitoring target devices 200 (201, 202, 203, 204) belonging to the same group, whose RTT values under ordinary (normal) conditions are similar, suppose that some monitoring target device 200 has an RTT value exceeding the threshold set for the group. If the cause lies in that monitoring target device 200 itself, only that device exceeds the threshold; if the cause is a network abnormality (a route change or the like), many of the grouped monitoring target devices 200 exceed the threshold. This is because the response message (response packet) to the monitoring message is delayed in the same way for every monitoring target device 200 whose traffic passes through the abnormal part of the network. By judging whether few or many monitoring target devices 200 exceed the threshold, the system thus discriminates whether the increase in the RTT value is due to increased load or the like on the monitoring target device 200 itself or due to a network abnormality.

Specifically, the network failure detection system 1 extracts, for each group, the monitoring target devices 200 belonging to the group, counts the monitoring target devices 200 whose RTT value exceeds the threshold set for the group, and calculates the ratio of that count to the total number of devices in the group. As shown in FIG. 2(a), when, for example, only one monitoring target device 200 (204) exceeds the threshold set for the group, the condition that more than a predetermined ratio of the monitoring target devices 200 in the group exceed the RTT threshold is not met, and the network failure detection device 100 determines that the abnormality originates on the monitoring target device 200 side. On the other hand, as shown in FIG. 2(b), when more than the predetermined ratio of the monitoring target devices 200 in the group exceed the RTT threshold, the network failure detection device 100 determines that the abnormality originates on the network side.
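
The discrimination for one group can be sketched as follows. The function name `classify_cause`, the returned labels, and the value 0.5 used as the predetermined ratio are assumptions made for illustration; the original does not give a concrete value for the ratio.

```python
from typing import Dict


def classify_cause(exceeded_by_device: Dict[str, bool], ratio_limit: float) -> str:
    """Classify the cause of threshold excess within one group.

    exceeded_by_device maps each device ID in the group to whether its RTT
    exceeded the group threshold. Returns "normal" when no device exceeded it,
    "network" when the exceeding share is above ratio_limit, else "device".
    """
    total = len(exceeded_by_device)
    if total == 0:
        raise ValueError("the group contains no devices")
    exceeded = sum(1 for flag in exceeded_by_device.values() if flag)
    if exceeded == 0:
        return "normal"
    ratio = exceeded / total
    return "network" if ratio > ratio_limit else "device"


# Case like FIG. 2(a): only device 204 exceeds the group threshold.
case_a = classify_cause({"201": False, "202": False, "203": False, "204": True},
                        ratio_limit=0.5)
# Case like FIG. 2(b), assumed here as all four devices exceeding the threshold.
case_b = classify_cause({"201": True, "202": True, "203": True, "204": True},
                        ratio_limit=0.5)
```

In the first case the ratio is 1/4, which does not exceed the assumed limit, so the cause is attributed to the device side. In the second case the ratio is 4/4, so the cause is attributed to the network side.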

In this way, the network failure detection device 100 can determine whether an abnormality in which the RTT threshold is exceeded is caused by an increased load or the like on the monitoring target device 200 or by a network abnormality.
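The discrimination rule described above can be sketched for a single group as follows. This is a minimal illustration, not the patented implementation: the function name `classify_cause`, the three labels, and the default ratio of 0.3 (the "30%" example given later in the description) are assumptions made for the example.

```python
def classify_cause(rtts_ms, group_threshold_ms, ratio_limit=0.3):
    """Classify the likely cause of RTT threshold excess within one group.

    rtts_ms maps a device ID to its measured RTT (hypothetical layout).
    Returns "none", "device", or "network" (labels chosen for illustration).
    """
    if not rtts_ms:
        raise ValueError("group has no devices")
    # Count the devices whose RTT exceeds the group threshold.
    exceeded = sum(1 for rtt in rtts_ms.values() if rtt > group_threshold_ms)
    if exceeded == 0:
        return "none"
    # Share of the group that exceeds the threshold.
    ratio = exceeded / len(rtts_ms)
    # Many devices over the threshold suggests a network-side cause.
    return "network" if ratio > ratio_limit else "device"
```

With four devices and a ratio limit of 0.3, a single slow device gives a ratio of 0.25 and is attributed to the device, while four slow devices are attributed to the network.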

<Network failure detection device>
FIG. 3 is a functional block diagram illustrating a configuration example of the network failure detection device 100 according to the present embodiment.
The network failure detection device 100 detects failures of the network, or of the monitoring target devices 200 in that network, based on the RTT values of the response messages received from the monitoring target devices 200 in reply to monitoring messages. It includes an input/output unit 110, a control unit 120, and a storage unit 130.

The input/output unit 110 (output unit) inputs and outputs information to and from each monitoring target device 200, a network management device (not shown), and the like. The input/output unit 110 consists of a communication interface that transmits and receives information via a communication line, and an input/output interface that exchanges data with input means such as a keyboard (not shown) and output means such as a monitor.

The control unit 120 controls the network failure detection device 100 as a whole, and includes an input processing unit 121, a monitoring processing unit 122, a grouping processing unit 123, a threshold determination unit 124 (which decides the thresholds), a threshold judgment unit 125 (which compares RTT values against the thresholds), a threshold excess cause determination unit 126, and an output processing unit 127.

The input processing unit 121 receives input of information from each monitoring target device 200, a network management device (not shown), and the like via the input/output unit 110.
Specifically, the input processing unit 121 acquires the information on the distribution to be used (hereinafter, "distribution information"), the threshold determination logic, and other items needed for the grouping processing described later, and stores them in the storage unit 130 as parameter information 131.
Here, the distribution information is information (a probability density function) indicating the distribution, such as a normal distribution, a log-normal distribution, or a gamma distribution, that the grouping processing unit 123 uses to analyze the variation in the RTT values of the monitoring target devices 200 and to group them. The description here uses a normal distribution as an example. The threshold determination logic is information specifying that the RTT value corresponding to a predetermined degree of variation in the distribution is to be set as the threshold. For example, for a normal distribution, it specifies that the RTT value corresponding to 3σ (99.7%) is set as the threshold.
Further, the input processing unit 121 acquires information on each monitoring target device 200 (identification information unique to each monitoring target device 200, its address information, and so on) from a network management device (not shown) or the like via the input/output unit 110, and stores it in the storage unit 130 as monitoring target device information 132.
The input processing unit 121 also receives, via the input/output unit 110, the response message that each monitoring target device 200 returns to a monitoring message, and passes it to the monitoring processing unit 122.

The monitoring processing unit 122 refers to the monitoring target device information 132 stored in the storage unit 130, transmits a monitoring message to each monitoring target device 200, and receives the response message, thereby executing a test that measures the RTT. The monitoring processing unit 122 then acquires, as the test result, the pair of the identification information of the monitoring target device 200 and its RTT value, and stores it in the storage unit 130 as test result information 133.
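The test performed by the monitoring processing unit 122 can be sketched as timing one request/response exchange per device. The description does not specify the monitoring protocol, so the sketch below takes the exchange as an injected callable; `probe`, `measure_rtt_ms`, `run_test`, and the choice to record `None` for an unanswered probe are all assumptions made for illustration.

```python
import time


def measure_rtt_ms(probe, device_id, address):
    """Time one monitoring exchange and return (device_id, rtt_ms).

    probe is a hypothetical callable that sends a monitoring message to
    address and returns once the response arrives, raising OSError on failure.
    """
    start = time.perf_counter()
    try:
        probe(address)
    except OSError:
        # Assumption: an unanswered probe is recorded as None.
        return (device_id, None)
    return (device_id, (time.perf_counter() - start) * 1000.0)


def run_test(probe, devices):
    """Build test results (device ID -> RTT in ms) for all devices.

    devices maps a device ID to its address, mirroring the monitoring
    target device information 132.
    """
    return dict(
        measure_rtt_ms(probe, device_id, address)
        for device_id, address in devices.items()
    )
```

Injecting the probe keeps the timing logic independent of whether the monitoring message is an ICMP echo, a TCP connection, or an application-level request.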

The grouping processing unit 123 acquires the test result information 133 obtained when the monitoring processing unit 122 tested each monitoring target device 200 in normal operation, groups the monitoring target devices 200 with similar RTT values using an arbitrary grouping method, and generates group information made up of a plurality of groups, each containing a set of monitoring target devices 200. The group information consists of identification information unique to each group within the network failure detection system 1 (group ID) and the identification information of each monitoring target device 200 belonging to that group (monitoring target device ID).
As the grouping method executed by the grouping processing unit 123, for example, hierarchical clustering (the shortest distance method or the group average method) or partitioning-optimization clustering (the K-means method) can be applied.
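As one concrete possibility for the grouping step, the sketch below applies the shortest distance (single-linkage) method to one-dimensional RTT values. For values on a line, cutting a single-linkage hierarchy at a distance `max_gap_ms` is equivalent to sorting the values and starting a new group wherever two neighbours are further apart than that distance. The function name, the `max_gap_ms` parameter, and the `G001`-style IDs (modelled on FIG. 4) are assumptions made for illustration.

```python
def group_by_rtt(baseline_rtts_ms, max_gap_ms):
    """Group devices whose normal-time RTT values are close together.

    baseline_rtts_ms maps a device ID to its RTT measured in normal operation.
    Returns a dict mapping a group ID to the list of device IDs in the group.
    """
    # Sort devices by RTT so that neighbours in the list are nearest in value.
    ordered = sorted(baseline_rtts_ms.items(), key=lambda item: item[1])
    groups = []
    previous_rtt = None
    for device_id, rtt in ordered:
        # A gap larger than max_gap_ms starts a new group.
        if previous_rtt is None or rtt - previous_rtt > max_gap_ms:
            groups.append([])
        groups[-1].append(device_id)
        previous_rtt = rtt
    return {f"G{index:03d}": members for index, members in enumerate(groups, start=1)}
```

A K-means implementation could be substituted without changing the shape of the returned group information.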

The threshold determination unit 124 acquires the distribution information and threshold determination logic stored in the storage unit 130 as parameter information 131, together with the group information generated by the grouping processing unit 123. The threshold determination unit 124 then generates the distribution of RTT values for each group indicated in the group information and determines the threshold of that group based on the threshold determination logic. For example, the threshold determination unit 124 generates a normal distribution of the RTT values for each group and determines, as the threshold of the group, the RTT value corresponding to 3σ (99.7%) in that distribution.
The threshold determination unit 124 then generates group threshold information 134, in which the threshold of each group is associated with the group information, and stores it in the storage unit 130.
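The 3σ example can be sketched as fitting a normal distribution to a group's baseline RTT values and taking the mean plus three standard deviations. Reading "the RTT value corresponding to 3σ" as the upper bound μ + 3σ, and estimating σ with the sample standard deviation, are assumptions of this sketch; the description does not fix either choice. The name `group_threshold_ms` is likewise hypothetical.

```python
import statistics


def group_threshold_ms(rtt_samples_ms, k_sigma=3.0):
    """Return mean + k_sigma * standard deviation of a group's baseline RTTs."""
    if len(rtt_samples_ms) < 2:
        raise ValueError("need at least two baseline RTT samples")
    mean = statistics.mean(rtt_samples_ms)
    # Sample standard deviation (n - 1 denominator); an assumption of this sketch.
    sigma = statistics.stdev(rtt_samples_ms)
    return mean + k_sigma * sigma
```

For baseline values of 10, 12, and 14 ms, the mean is 12 ms and the sample standard deviation is 2 ms, giving a group threshold of 18 ms.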

FIG. 4 is a diagram illustrating a data configuration example of the group threshold information 134 according to the present embodiment.
As shown in FIG. 4, the group threshold information 134 stores, in association with each group ID, the monitoring target device IDs of the monitoring target devices 200 belonging to that group and the RTT threshold of the group (group threshold) determined by the threshold determination unit 124.
For example, the first row of FIG. 4 indicates that the group with group ID "G001" consists of the monitoring target devices 200 with monitoring target device IDs "211", "212", "213", and so on, and that the RTT threshold of this group (group threshold) is "a".
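The record layout of FIG. 4 can be sketched as a small data structure plus a reverse index from device ID to group record, which is what the later threshold judgment needs. The class and field names are assumptions; FIG. 4 gives the threshold only symbolically ("a"), so any numeric value used with this sketch is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class GroupThreshold:
    """One row of the group threshold information (field names are hypothetical)."""
    group_id: str
    device_ids: Tuple[str, ...]
    threshold_ms: float


def index_by_device(records: Iterable[GroupThreshold]) -> Dict[str, GroupThreshold]:
    """Map each device ID to the record of the group it belongs to."""
    index: Dict[str, GroupThreshold] = {}
    for record in records:
        for device_id in record.device_ids:
            index[device_id] = record
    return index
```

Building the index once lets each device's group threshold be looked up directly, without scanning every group per test result.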

Returning to FIG. 3, the threshold judgment unit 125 acquires the test result information 133 produced by the test executed by the monitoring processing unit 122. As described above, the test result information 133 contains the identification information of each monitoring target device 200 and its RTT value. The processing of the threshold judgment unit 125 described below is executed at monitoring time, after the group threshold information 134 has been stored in the storage unit 130 as a result of the processing by the grouping processing unit 123 and the threshold determination unit 124.
On acquiring the test result information 133, the threshold judgment unit 125 refers to the group threshold information 134 and judges, based on the threshold (group threshold) set for the group to which the monitoring target device 200 belongs, whether the RTT value of that device exceeds the threshold. The threshold judgment unit 125 then stores the judgment result for the RTT value of each monitoring target device 200 in the storage unit 130 as threshold judgment result information 135, and outputs the information on the monitoring target devices 200 that exceeded the threshold to a network management device (not shown) or the like via the output processing unit 127.

The threshold excess cause determination unit 126 refers to the threshold judgment result information 135, extracts for each group the monitoring target devices 200 belonging to it, counts the devices whose RTT value exceeds the threshold set for the group (group threshold), and calculates the ratio of that count to the total number of devices in the group. The threshold excess cause determination unit 126 then stores the calculation result for each group in the storage unit 130 as threshold excess cause information 136, and outputs it to a network management device (not shown) or the like via the output processing unit 127.
At this time, the threshold excess cause determination unit 126 may attach a warning indicating the occurrence of a network abnormality to any group whose ratio exceeds the predetermined ratio before outputting the result to the network management device or the like.
The predetermined ratio is set, for example, as N tenths (for example, 30%) of the total number of monitoring target devices 200 belonging to the group. The set value of the predetermined ratio is included in the parameter information 131 and stored in the storage unit 130 in advance.

The output processing unit 127 outputs information to each monitoring target device 200, a network management device (not shown), and the like via the input/output unit 110.
For example, under the control of the monitoring processing unit 122, the output processing unit 127 transmits a monitoring message to each monitoring target device 200. The output processing unit 127 also outputs the threshold judgment result information 135 generated by the threshold judgment unit 125 and the threshold excess cause information 136 generated by the threshold excess cause determination unit 126 to the network management device or the like.

The storage unit 130 consists of a storage medium such as RAM (Random Access Memory), an HDD (Hard Disk Drive), or flash memory, and stores the parameter information 131, the monitoring target device information 132, the test result information 133, the group threshold information 134, the threshold judgment result information 135, and the threshold excess cause information 136 described above.

When the network failure detection device 100 is implemented by program execution, the storage unit 130 stores a program for realizing the functions of the control unit 120 of the network failure detection device 100. The control unit 120 is then realized by a CPU (not shown) loading the program stored in the storage unit 130 into RAM or the like and executing it.

The network failure detection device 100 may also be implemented as a separate device connected to a conventional monitoring terminal 10 (see FIGS. 7 to 9) that has a monitoring function. In that case, the network failure detection device 100 does not include the monitoring processing unit 122 (see FIG. 3). Instead, it receives from the monitoring terminal 10 the test result information 133 resulting from the monitoring that the monitoring terminal 10 performed on each monitoring target device 200, and stores it in the storage unit 130 of the network failure detection device 100.

<Process flow>
Next, the processing flow of the network failure detection system 1 according to the present embodiment will be described. The network failure detection system 1 (network failure detection device 100) executes (1) processing that groups the monitoring target devices 200 in normal operation and determines the RTT threshold of each group (hereinafter, "grouping threshold determination processing"), and (2) processing that judges whether each monitoring target device 200 is normal or abnormal using the RTT threshold of its group (hereinafter, "threshold judgment processing"). Specifically, the network failure detection device 100 executes the grouping threshold determination processing in advance in normal operation, and thereafter detects network failures by executing the threshold judgment processing for each monitoring target device 200 at monitoring time, for example at predetermined time intervals.
Furthermore, the network failure detection system 1 (network failure detection device 100) executes processing that determines the cause of an abnormal increase in RTT value (hereinafter, "abnormality cause determination processing for RTT value increase") by calculating, within each group, the ratio of monitoring target devices 200 that exceed the RTT threshold.

<<Grouping threshold determination processing>>
FIG. 5 is a flowchart showing the flow of the grouping threshold determination processing performed by the network failure detection device 100 according to the present embodiment.
It is assumed here that the input processing unit 121 of the network failure detection device 100 has already stored the following in the storage unit 130 as parameter information 131: that a normal distribution is used as the distribution information, and that the threshold determination logic sets as the threshold the RTT value corresponding to 3σ (99.7%), a predetermined degree of variation in the normal distribution. It is also assumed that the monitoring target device information 132 on each monitoring target device 200 is stored in the storage unit 130, and that the monitoring processing unit 122 has tested each monitoring target device 200 in normal operation and stored the resulting test result information 133 in the storage unit 130.

First, the grouping processing unit 123 of the network failure detection device 100 acquires the test result information 133 for normal operation from the storage unit 130 (step S10). The test result information 133 stores, for each monitoring target device 200 subject to monitoring, the pair of its identification information and its RTT value.

Next, the grouping processing unit 123 refers to the RTT value of each monitoring target device 200 stored in the test result information 133 for normal operation, groups the monitoring target devices 200 with similar RTT values using a predetermined grouping method (step S11), and generates group information made up of a plurality of groups of monitoring target devices 200.

Subsequently, the threshold determination unit 124 of the network failure detection device 100 initializes to i = 1 the group number of the i-th group (hereinafter, "group i") for which the threshold is to be calculated (step S12).

The threshold determination unit 124 then generates the distribution (normal distribution) of the RTT values of the monitoring target devices 200 belonging to group i (step S13).

Subsequently, the threshold determination unit 124 determines the threshold of group i based on the threshold determination logic set as parameter information 131, namely that the RTT value corresponding to 3σ (99.7%) in the normal distribution is set as the threshold (step S14).

Next, the threshold determination unit 124 associates the determined threshold (group threshold) with the corresponding group in the group information generated in step S11, and stores the result in the storage unit 130 as group threshold information 134 (see FIG. 4) (step S15).

The threshold determination unit 124 then determines whether all the groups generated in step S11 have been processed (step S16). If not all groups have been processed (step S16 → No), it adds 1 to i (step S17) and returns to step S13.
If all groups have been processed (step S16 → Yes), the threshold determination unit 124 ends the grouping threshold determination processing.

In this way, the network failure detection device 100 can determine a threshold (group threshold) for each group of monitoring target devices 200 having similar RTT values.

<<Threshold judgment processing and abnormality cause determination processing for RTT value increase>>
FIG. 6 is a flowchart showing the flow of the threshold judgment processing and the abnormality cause determination processing for RTT value increase performed by the network failure detection device 100 according to the present embodiment.
It is assumed here that, after the grouping threshold determination processing shown in FIG. 5, the monitoring processing unit 122 has tested each monitoring target device 200 and that the test result information 133 at monitoring time, consisting of pairs of the identification information of a monitoring target device 200 and its RTT value, is stored in the storage unit 130.

First, the threshold judgment unit 125 of the network failure detection device 100 acquires the test result information 133 at monitoring time stored in the storage unit 130 (step S20).

Next, the threshold judgment unit 125 acquires the group threshold information 134 stored in the storage unit 130 (step S21).

Subsequently, the threshold judgment unit 125 initializes to j = 1 the index of the j-th monitoring target device 200 to be judged in the test result information 133 acquired in step S20 (step S22).

For the j-th monitoring target device 200, the threshold judgment unit 125 then extracts the threshold (group threshold) of the group to which that device belongs based on the group threshold information 134 (see FIG. 4), and compares it with the RTT value of that device in the test result (step S23).

Next, based on the comparison in step S23, the threshold judgment unit 125 judges whether the RTT value in the test result exceeds the threshold (group threshold) of the group to which the monitoring target device 200 belongs (step S24).

If the RTT value in the test result exceeds the threshold (group threshold) (step S24 → Yes), the threshold judgment unit 125 stores the judgment result in the storage unit 130 as threshold judgment result information 135 and outputs it to a network management device (not shown) or the like via the output processing unit 127 (step S25). The process then proceeds to step S27.
If the RTT value in the test result is equal to or less than the threshold (step S24 → No), the threshold judgment unit 125 stores the judgment result in the storage unit 130 as threshold judgment result information 135 (step S26). The process then proceeds to step S27.

In step S27, the threshold judgment unit 125 determines whether the threshold judgment processing has been completed for all monitoring target devices 200. If there is a monitoring target device 200 for which the threshold judgment processing has not yet been executed (step S27 → No), it adds 1 to j (step S28) and returns to step S23. If the threshold judgment processing has been completed for all monitoring target devices 200 (step S27 → Yes), the process proceeds to step S29.
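Steps S20 to S28 can be sketched as a single pass over the test results. This is an illustrative sketch under assumed data layouts: both arguments are plain dicts keyed by device ID, and the function name `judge_thresholds` is hypothetical. The step numbers in the comments refer to FIG. 6.

```python
def judge_thresholds(test_results_ms, device_threshold_ms):
    """Compare each device's RTT with the threshold of its group.

    test_results_ms: device ID -> RTT measured at monitoring time (S20).
    device_threshold_ms: device ID -> threshold of the device's group (S21).
    Returns (judgments, exceeded): judgments maps a device ID to True when the
    threshold is exceeded; exceeded lists the devices to report.
    """
    judgments = {}
    exceeded = []
    # Iterating over the dict plays the role of the counter j (S22, S27, S28).
    for device_id, rtt in test_results_ms.items():
        threshold = device_threshold_ms[device_id]  # S23: look up group threshold
        over = rtt > threshold                       # S24: strict comparison
        judgments[device_id] = over                  # S25 / S26: store the result
        if over:
            exceeded.append(device_id)               # S25: also report the device
    return judgments, exceeded
```

An RTT exactly equal to the threshold counts as normal, matching the "equal to or less than the threshold" branch of step S24.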

Steps S20 to S28 constitute the threshold judgment processing, and steps S29 to S33 described below constitute the abnormality cause determination processing for RTT value increase.

In step S29, the threshold excess cause determination unit 126 initializes to i = 1 the group number of the i-th group (group i) whose cause of threshold excess is to be determined.

The threshold excess cause determination unit 126 then executes the threshold excess cause determination for group i (step S30). Specifically, among all the monitoring target devices 200 belonging to group i, it counts the devices that exceeded the RTT threshold of the group (group threshold), and calculates the ratio of that count to the total number of devices in the group.

Subsequently, the threshold excess cause determination unit 126 generates threshold excess cause information 136 indicating the ratio, calculated in step S30, of monitoring target devices 200 in the group that exceeded the RTT threshold, stores it in the storage unit 130, and outputs it to a network management device (not shown) or the like via the output processing unit 127 (step S31). The process then proceeds to step S32.
When generating the threshold excess cause information 136, the threshold excess cause determination unit 126 determines whether the calculated ratio exceeds the predetermined ratio (N tenths). For a group that exceeds the predetermined ratio, it may attach alarm information indicating a network abnormality to the output. It may also output determination results only for groups that exceed the predetermined ratio, without outputting to the network management device or the like the results for groups at or below it.

In step S32, the threshold excess cause determination unit 126 determines whether all groups have been processed. If not all groups have been processed (step S32 → No), it adds 1 to i (step S33) and returns to step S30.
If all groups have been processed (step S32 → Yes), the threshold excess cause determination unit 126 ends the processing.
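Steps S29 to S33, including the two optional output behaviours (attaching a network alarm, and reporting only groups over the predetermined ratio), can be sketched as follows. The function name, the dict-based record layout, and the `only_alerts` flag are assumptions made for illustration.

```python
def build_cause_report(groups, judgments, ratio_limit=0.3, only_alerts=False):
    """Build per-group threshold excess cause information.

    groups: group ID -> list of device IDs in the group.
    judgments: device ID -> True if the device exceeded its group threshold.
    ratio_limit: the predetermined ratio (0.3 corresponds to "30%").
    only_alerts: when True, omit groups at or below the predetermined ratio.
    """
    report = []
    for group_id, device_ids in groups.items():  # loop over groups (S29, S32, S33)
        if not device_ids:
            continue  # assumption: empty groups are skipped
        over = sum(1 for device_id in device_ids if judgments.get(device_id, False))
        ratio = over / len(device_ids)            # S30: share of devices over threshold
        alert = ratio > ratio_limit               # optional network abnormality alarm
        if only_alerts and not alert:
            continue
        report.append(
            {"group_id": group_id, "exceed_ratio": ratio, "network_alert": alert}
        )                                          # S31: record for storage and output
    return report
```

Setting `only_alerts=True` corresponds to the variant in which results for groups at or below the predetermined ratio are not sent to the network management device.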

このように、ネットワーク障害検出装置１００は、閾値判定処理により、監視実行時における監視対象装置２００が正常か異常かの判定を、当該監視対象装置２００が属するグループに設定された閾値（グループ閾値）を用いて実行することができる。よって、ネットワークにおける伝搬遅延を考慮した適切な監視対象装置２００の正常、異常の判定を行うことができる。
また、ネットワーク障害検出装置１００は、ＲＴＴ値増大の異常原因判別処理により、グループ毎に、そのグループの閾値を超えたＲＴＴをもつ監視対象装置２００の数を計算し、当該グループに属する監視対象装置２００の全体数に対する割合を計算し、閾値超過原因情報１３６として出力することができる。この閾値超過原因情報１３６に基づき、閾値を超える異常が発生した原因が、監視対象装置２００の負荷増大等に起因するものか、ネットワーク異常に起因するものなのかを判別することが可能となる。 As described above, the network failure detection device 100 determines whether the monitoring target device 200 is normal or abnormal at the time of monitoring execution by the threshold determination processing. The threshold (group threshold) set for the group to which the monitoring target device 200 belongs. Can be used. Therefore, it is possible to determine whether the monitoring target device 200 is normal or abnormal considering the propagation delay in the network.
Further, the network failure detection device 100 calculates the number of monitoring target devices 200 having an RTT exceeding the threshold of the group for each group by an abnormality cause determination process for increasing the RTT value, and the monitoring target devices belonging to the group A ratio to the total number of 200 can be calculated and output as threshold excess cause information 136. Based on the threshold value excess cause information 136, it is possible to determine whether the cause of the abnormality exceeding the threshold value is caused by an increase in the load of the monitoring target device 200 or the network abnormality.

以上説明したように、本実施形態に係る、ネットワーク障害検出システム１およびネットワーク障害検出装置１００によれば、伝搬遅延を考慮した適切なＲＴＴの閾値を設定することができる。そして、閾値を超える異常が発生した原因が、監視対象装置２００の負荷増大等に起因するものか、ネットワーク異常に起因するものなのかを判別することが可能となる。 As described above, according to the network failure detection system 1 and the network failure detection apparatus 100 according to the present embodiment, it is possible to set an appropriate RTT threshold considering propagation delay. Then, it is possible to determine whether the cause of the abnormality exceeding the threshold is due to an increase in the load of the monitoring target device 200 or the network abnormality.

Note that, because the network failure detection device 100 determines a threshold for each group of monitoring target devices 200 having similar RTT values, it can set appropriate RTT thresholds with less operating effort (processing load) than setting a threshold individually for each monitoring target device 200.
Further, the network failure detection device 100 stores, in the storage unit 130, distribution information indicating the distribution used for threshold determination and threshold determination logic that determines, as the threshold, the RTT value corresponding to a predetermined value of the degree of variation in that distribution. This allows the threshold determination unit 124 of the network failure detection device 100 to determine the threshold of each group automatically for the plurality of groups generated by the grouping processing unit 123. The network administrator therefore does not need to set the threshold of each group manually.
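As a rough illustration of automatic per-group threshold determination, the following Python sketch groups devices by similar normal-time RTT and derives one threshold per group. It rests on assumptions that this passage does not state: the grouping method is a simple gap-based split of sorted RTTs (`gap_ms`), the distribution is treated as normal, and the "predetermined value of the degree of variation" is taken to be `k` standard deviations above the mean. The function names are hypothetical.

```python
from statistics import mean, pstdev

def group_by_rtt(baseline, gap_ms=5.0):
    """Group device ids whose normal-time RTTs are close to each other.

    baseline: {device_id: normal-time RTT in milliseconds}
    A new group starts whenever the gap to the previous (sorted) RTT
    exceeds gap_ms. This is an assumed grouping method.
    """
    groups = []
    for dev in sorted(baseline, key=baseline.get):
        if groups and baseline[dev] - baseline[groups[-1][-1]] <= gap_ms:
            groups[-1].append(dev)
        else:
            groups.append([dev])
    return groups

def determine_threshold(rtts, k=3.0):
    """Assumed logic: threshold = mean + k * population standard deviation."""
    return mean(rtts) + k * pstdev(rtts)

# Hypothetical normal-time measurements in milliseconds.
baseline = {"a": 10.0, "b": 11.0, "c": 12.0, "d": 50.0, "e": 52.0}

groups = group_by_rtt(baseline)
thresholds = [determine_threshold([baseline[dev] for dev in group]) for group in groups]
```

With this sample data the devices near 10 ms and the devices near 50 ms end up in separate groups, so each group gets a threshold that reflects its own propagation delay.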

DESCRIPTION OF SYMBOLS
1 Network failure detection system
10 Monitoring terminal
100 Network failure detection device
110 Input/output unit (output unit)
120 Control unit
121 Input processing unit
122 Monitoring processing unit
123 Grouping processing unit
124 Threshold determination unit
125 Threshold judgment unit
126 Threshold excess cause determination unit
127 Output processing unit
130 Storage unit
131 Parameter information
132 Monitoring target device information
133 Test result information
134 Group threshold information
135 Threshold judgment result information
136 Threshold excess cause information
200 Monitoring target device

Claims

A network failure detection system comprising: a plurality of monitoring target devices constituting a network; and a network failure detection device that transmits a monitoring message to the plurality of monitoring target devices and detects a failure of the network by using an RTT (Round Trip Time) based on a response message received from each of the monitoring target devices,
wherein the network failure detection device comprises:
a storage unit that stores distribution information indicating a distribution used to analyze variation in the RTTs of the monitoring target devices, threshold determination logic that determines, as a threshold for judging a monitoring target device to be abnormal, the RTT value corresponding to a predetermined value of the degree of variation in the distribution, and test result information indicating, as a monitoring result based on the response messages received from each of the plurality of monitoring target devices under normal conditions, identification information of each monitoring target device and the RTT of that monitoring target device;
a grouping processing unit that acquires the normal-time test result information of each of the plurality of monitoring target devices, groups monitoring target devices having similar RTTs by using a predetermined grouping method, and generates, for each of the plurality of groups produced by the grouping, a distribution of the RTTs of the monitoring target devices belonging to that group in accordance with the distribution indicated by the distribution information; and
a threshold determination unit that, for each of the distributions generated for the groups, determines, as the threshold for that group, the RTT value corresponding to the predetermined value of the degree of variation in the distribution on the basis of the threshold determination logic.

The network failure detection system according to claim 1,
wherein the storage unit of the network failure detection device further stores group threshold information, which holds, in association with each of the plurality of groups generated by the grouping processing unit, the threshold of that group determined by the threshold determination unit, and the test result information obtained at monitoring time for each of the plurality of monitoring target devices, and
wherein the network failure detection device further comprises a threshold judgment unit that acquires the monitoring-time test result information of each of the plurality of monitoring target devices, refers to the group threshold information by using the identification information of the monitoring target device included in the monitoring-time test result information, extracts the threshold of the group to which the monitoring target device belongs, judges whether the RTT of the monitoring target device included in the monitoring-time test result information exceeds the extracted threshold of the group, and outputs the judgment result as threshold judgment result information.

The network failure detection system according to claim 2,
wherein the network failure detection device further comprises a threshold excess cause determination unit that refers to the threshold judgment result information, extracts the judgment results for each group, counts the monitoring target devices whose RTT exceeds the threshold of the group, calculates the ratio of that count to the total number of monitoring target devices belonging to the group, and outputs threshold excess cause information indicating the calculated ratio for each group.

The network failure detection system according to claim 3,
wherein the threshold excess cause determination unit determines, for each group, whether the calculated ratio exceeds a predetermined ratio and, when the calculated ratio exceeds the predetermined ratio, outputs warning information indicating a network abnormality.

A network failure detection device in a network failure detection system that comprises a plurality of monitoring target devices constituting a network and the network failure detection device, which transmits a monitoring message to the plurality of monitoring target devices and detects a failure of the network by using an RTT (Round Trip Time) based on a response message received from each of the monitoring target devices, the network failure detection device comprising:
a storage unit that stores distribution information indicating a distribution used to analyze variation in the RTTs of the monitoring target devices, threshold determination logic that determines, as a threshold for judging a monitoring target device to be abnormal, the RTT value corresponding to a predetermined value of the degree of variation in the distribution, and test result information indicating, as a monitoring result based on the response messages received from each of the plurality of monitoring target devices under normal conditions, identification information of each monitoring target device and the RTT of that monitoring target device;
a grouping processing unit that acquires the normal-time test result information of each of the plurality of monitoring target devices, groups monitoring target devices having similar RTTs by using a predetermined grouping method, and generates, for each of the plurality of groups produced by the grouping, a distribution of the RTTs of the monitoring target devices belonging to that group in accordance with the distribution indicated by the distribution information; and
a threshold determination unit that, for each of the distributions generated for the groups, determines, as the threshold for that group, the RTT value corresponding to the predetermined value of the degree of variation in the distribution on the basis of the threshold determination logic.

The network failure detection device according to claim 5,
wherein the storage unit further stores group threshold information, which holds, in association with each of the plurality of groups generated by the grouping processing unit, the threshold of that group determined by the threshold determination unit, and the test result information obtained at monitoring time for each of the plurality of monitoring target devices, and
wherein the network failure detection device further comprises a threshold judgment unit that acquires the monitoring-time test result information of each of the plurality of monitoring target devices, refers to the group threshold information by using the identification information of the monitoring target device included in the monitoring-time test result information, extracts the threshold of the group to which the monitoring target device belongs, judges whether the RTT of the monitoring target device included in the monitoring-time test result information exceeds the extracted threshold of the group, and outputs the judgment result as threshold judgment result information.

The network failure detection device according to claim 6, further comprising a threshold excess cause determination unit that refers to the threshold judgment result information, extracts the judgment results for each group, counts the monitoring target devices whose RTT exceeds the threshold of the group, calculates the ratio of that count to the total number of monitoring target devices belonging to the group, and outputs threshold excess cause information indicating the calculated ratio for each group.

The network failure detection device according to claim 7,
wherein the threshold excess cause determination unit determines, for each group, whether the calculated ratio exceeds a predetermined ratio and, when the calculated ratio exceeds the predetermined ratio, outputs warning information indicating a network abnormality.