KR20170084445A - Method and apparatus for detecting abnormality using time-series data - Google Patents

Method and apparatus for detecting abnormality using time-series data

Info

Publication number
KR20170084445A
Authority
KR
South Korea
Prior art keywords
series data
time series
cluster
clusters
monitoring object
Prior art date
Application number
KR1020160003500A
Other languages
Korean (ko)
Inventor
김정경
Original Assignee
삼성에스디에스 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 삼성에스디에스 주식회사 filed Critical 삼성에스디에스 주식회사
Priority to KR1020160003500A priority Critical patent/KR20170084445A/en
Publication of KR20170084445A publication Critical patent/KR20170084445A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F17/30705

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

According to an aspect of the present invention, there is provided an abnormality detection method using time series data, comprising: collecting time series data from a monitoring object; classifying the monitoring object into clusters using the time series data; dynamically calculating a threshold value of the cluster; and detecting an abnormality of the monitoring object belonging to the cluster using the dynamically calculated threshold value.

Description

[0001] The present invention relates to a method and apparatus for detecting abnormality using time series data. More particularly, the present invention relates to a method of classifying monitored objects into a plurality of groups using characteristics of time series data and dynamically calculating a threshold value based on past time series data of each group, and to an apparatus for performing the method.

Recently, due to the development of parallel processing technology, the construction and monitoring of a distributed processing environment using a plurality of servers have become important. For example, infrastructure is built using servers, networks, and applications with various roles, and infrastructure is monitored periodically through separate management systems or solutions to improve the performance of the infrastructure that is built.

In order to secure the availability and performance of the infrastructure composed of a plurality of servers, a threshold value of the time series data generated by the monitoring object must be set in advance. That is, if time-series data higher than or equal to a preset threshold value is generated, it is detected as abnormal.

Conventionally, a static threshold is mainly used. However, monitoring with a fixed threshold value that does not take into account the characteristics of the server or the time of day often produces false detections: a normal state is determined to be abnormal, or an abnormal state is determined to be normal.

In the past, a fixed threshold has been set on the implicit or explicit assumption that the time series data to be monitored follows a normal distribution. In other words, in order to detect an abnormality, the mean (μ) and the standard deviation (σ) are calculated, and it is checked whether the data deviates from the range μ ± nσ (n = 1, 2, 3, ...). However, actual time series data often does not follow a normal distribution.
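The conventional μ ± nσ check described above can be sketched as follows. This is a minimal illustration, not code from the patent; the function name `static_sigma_flags` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def static_sigma_flags(values, n=3):
    """Flag each value that falls outside mu +/- n*sigma of the sample."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the sample
    lower, upper = mu - n * sigma, mu + n * sigma
    return [not (lower <= v <= upper) for v in values]

# Hypothetical CPU utilization readings (percent): one obvious spike.
cpu = [10.0] * 19 + [90.0]
flags = static_sigma_flags(cpu, n=3)
```

Because μ and σ are themselves pulled toward the outlier, the range widens as the data becomes more skewed, which is one reason the check is unreliable for data that is not normally distributed.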

In order to solve the problems that may occur when a fixed threshold value is used and when the data does not follow a normal distribution, an anomaly detection method is required that uses a dynamic threshold and can be applied to time series data that is not normally distributed.

Korean Patent Publication No. 2013-0020265 (Publication date 2013.02.27)

The present invention provides a method and apparatus for detecting abnormalities using time series data.

The technical problems of the present invention are not limited to the above-mentioned technical problems, and other technical problems which are not mentioned can be clearly understood by those skilled in the art from the following description.

According to an aspect of the present invention, there is provided an abnormality detection method using time series data, the method comprising: collecting time series data from a monitoring object; classifying the monitoring object into clusters using the time series data; dynamically calculating a threshold value of the cluster; and detecting an abnormality of the monitoring object belonging to the cluster using the dynamically calculated threshold value.

In one embodiment, collecting the time series data may include collecting only time series data within a predetermined period from the present time.

In another embodiment, the step of classifying the monitoring objects into clusters may include classifying the monitoring objects into clusters using a K-means algorithm.

In yet another embodiment, the step of classifying the monitoring objects into clusters may include classifying the monitoring objects into clusters using at least one of the variance, maximum value, and seasonality of the time series data.

In yet another embodiment, dynamically computing the threshold of the cluster may include dynamically computing the threshold of the cluster using a non-parametric bootstrap method.

In another embodiment, the step of detecting an abnormality of the monitoring object belonging to the cluster may include determining that the monitoring object is abnormal only when time series data exceeding the threshold occurs consecutively a predetermined number of times or more.
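The consecutive-exceedance rule in this embodiment can be sketched as below. The function name `detect_consecutive_exceedance`, the parameter `min_run`, and the sample values are assumptions introduced for illustration only.

```python
def detect_consecutive_exceedance(values, threshold, min_run):
    """Return the indices at which a run of values above `threshold`
    reaches `min_run` consecutive points (one alert per run)."""
    alerts = []
    run = 0
    for i, v in enumerate(values):
        run = run + 1 if v > threshold else 0  # reset on any normal point
        if run == min_run:
            alerts.append(i)
    return alerts

# Hypothetical CPU readings: a single spike, then a sustained excursion.
readings = [50, 95, 60, 96, 97, 98, 99, 40]
alerts = detect_consecutive_exceedance(readings, threshold=90, min_run=3)
```

With `min_run=3`, the isolated spike at index 1 is ignored and only the sustained excursion raises an alert, which is how this rule suppresses one-off false positives.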

According to another aspect of the present invention, there is provided a method for detecting a CPU utilization abnormality in a server, comprising: collecting time series data of a CPU usage rate from a plurality of servers; classifying the plurality of servers into clusters using the time series data; dynamically computing an upper limit threshold of each cluster; and monitoring whether the upper limit threshold is exceeded, thereby detecting an abnormality in CPU utilization of a server belonging to the cluster.

In one embodiment, the clusters comprise a first cluster having a small variance and a small maximum value, a second cluster having a small variance and a large maximum value, a third cluster having a large variance and seasonality, and a fourth cluster having a large variance and no seasonality.

According to another aspect of the present invention, there is provided an abnormality detection apparatus using time series data, the apparatus comprising: a network interface; at least one processor; and a memory for loading a computer program executed by the processor. The computer program may comprise an operation of collecting the time series data from the monitoring object, an operation of classifying the monitoring object into clusters using the time series data, an operation of dynamically calculating a threshold value of the cluster, and an operation of detecting an abnormality of the monitoring object belonging to the cluster using the dynamically calculated threshold value.

In one embodiment, the operation of collecting the time series data may include an operation of collecting only time series data within a predetermined period from the present time.

In another embodiment, the operation of classifying the monitoring object into clusters may include an operation of classifying the monitoring object into clusters using a K-means algorithm.

In another embodiment, the operation of classifying the monitoring object into clusters may include an operation of classifying the monitoring object into clusters using at least one of the variance, maximum value, and seasonality of the time series data.

In another embodiment, the operation of dynamically computing the threshold of the cluster may include an operation of dynamically computing the threshold of the cluster using a non-parametric bootstrap method.

In another embodiment, the operation for detecting an abnormality of the monitoring object belonging to the cluster may include an operation for determining abnormality only when the time series data exceeding the threshold occurs consecutively a predetermined number of times or more.

According to another aspect of the present invention, there is provided a computer program product for causing a computer to perform the steps of: collecting time series data from a monitoring object; classifying the monitoring object into clusters using the time series data; dynamically calculating a threshold value of the cluster; and detecting an abnormality of the monitoring object belonging to the cluster using the dynamically calculated threshold value.

According to the present invention, monitoring is performed using a dynamic threshold value, so that false detections, in which a normal state is determined to be abnormal or an abnormal state is determined to be normal, can be reduced. Also, by dividing the monitoring objects into a plurality of groups and calculating the dynamic threshold value of each group, monitoring is more efficient than when a dynamic threshold value is calculated for each monitoring object individually.

By using the non-parametric bootstrapping method, time series data not corresponding to the normal distribution can also be monitored. In addition, the accuracy of the abnormality detection can be improved by calculating the dynamic threshold value based on the past time series data.

The effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned can be clearly understood to those of ordinary skill in the art from the following description.

FIG. 1 is a flowchart of a method of monitoring time series data according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram for explaining a method of monitoring time series data according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a process of classifying servers into groups according to characteristics of time series data in an embodiment of the present invention.
FIGS. 4A and 4B are conceptual diagrams for explaining the K-means algorithm used in an embodiment of the present invention.
FIG. 5 is a conceptual diagram for explaining the X-12-ARIMA algorithm used for determining the seasonality in an embodiment of the present invention.
FIGS. 6A and 6B are conceptual diagrams illustrating a non-parametric bootstrap algorithm and an ROC curve used in an embodiment of the present invention.
FIG. 7 is an exemplary diagram for explaining a monitoring process of an abnormality detection method according to an embodiment of the present invention.
FIG. 8 is a conceptual diagram for explaining an anomaly detection apparatus using time series data according to an embodiment of the present invention.
FIG. 9 is a hardware block diagram of an anomaly detection apparatus using time series data according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention and the manner of achieving them will become apparent with reference to the embodiments described in detail below with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art, and the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless defined otherwise, all terms (including technical and scientific terms) used herein have the meaning commonly understood by one of ordinary skill in the art to which this invention belongs. Also, commonly used predefined terms are not to be interpreted ideally or excessively unless explicitly defined otherwise. The terminology used herein is for the purpose of illustrating embodiments and is not intended to limit the present invention. In the present specification, the singular form includes the plural form unless otherwise specified.

The terms "comprises" and/or "comprising" used in the specification do not exclude the presence or addition of one or more other components, steps, operations, and/or elements in addition to those stated.

In the following, the description will be continued on the assumption that the CPU utilization of several servers is monitored in order to facilitate understanding. However, this is for the sake of convenience of understanding, and is not intended to be limited to such a configuration.

That is, data other than the CPU utilization of the server may be monitored. For example, the network traffic of a server or the storage usage of a cloud server may be monitored. Various types of time series data can be monitored in this way; that is, any data generated at a predetermined time interval can be monitored for abnormality detection.

Also, the description will be continued on the assumption that multiple servers are monitored at the same time. That is, it is assumed that the CPU utilization of one or more servers is monitored, or network traffic of one or more servers is monitored. A method for increasing the efficiency when monitoring a plurality of servers at the same time will be described together with an embodiment of the present invention.

Hereinafter, the present invention will be described in more detail with reference to the accompanying drawings.

FIG. 1 is a flowchart of a method of monitoring time series data according to an embodiment of the present invention.

Referring to FIG. 1, CPU utilization data of several servers are extracted and refined (S1000). The process of extracting CPU utilization data of several servers is a process of generating time series data. That is, CPU utilization data is extracted at regular intervals and monitoring is performed using the data. For example, the average CPU utilization rate is generated as time series data in units of one minute.

In addition, in order to reflect the recent state of the server, not all of the time series data is used; only the time series data within a certain period back from the present time is used. For example, time series data is extracted for the CPU usage rate of the last three weeks.

In summary, the data extraction process generates time series data in consideration of both the sampling interval and the collection period of the data. Generating the time series data in this way reflects the latest state of the server and thereby improves the accuracy of the abnormality detection.

Next, the time series data is refined. The refinement of the time series data means the selection of the time series data generated at a constant time interval. For example, if the physical expansion of the server has been performed within the last 3 weeks during the monitoring of the CPU usage of the server, the time series data prior to the expansion operation should be excluded from the monitoring process. Because the environment of the server has changed, the threshold for detecting anomaly also needs to be changed.

If the time series data is missing for a predetermined time or longer, data refinement can be performed automatically by regarding the gap as the result of a physical operation and ignoring the time series data before the gap. For example, if about 5 hours of CPU utilization data is missing from the one-minute time series data of the last three weeks, it is presumed that the physical environment has changed, and the time series data before the gap is automatically excluded from monitoring.

If the time series data of the CPU utilization rate to be monitored is generated, the plurality of servers are classified into groups using the characteristics of the time series data (S2000). The process of classifying servers into groups using time series data of CPU utilization refers to a process of performing clustering using characteristics of time series data. Various properties and algorithms can be used at this time.

For example, clustering can be performed using variance, maximum value, and seasonality among various characteristics of time series data. Of course, clustering may be performed using other characteristics than these characteristics. In addition, among various clustering algorithms, clustering can be performed using a K-means algorithm. The process of dividing a plurality of servers into groups will be described in more detail with reference to FIGS. 4A to 4B.

By grouping several servers into clusters, the efficiency of monitoring can be increased. In other words, rather than setting and monitoring thresholds for each server separately, it is possible to perform monitoring more quickly and easily by grouping similar types of servers and setting and monitoring group thresholds. This may be more effective when a distributed environment is configured using multiple servers of similar performance.

After dividing a plurality of servers into groups, a threshold value of each group is calculated (S3000). That is, instead of calculating the threshold value of each server separately, it divides each server into several groups and generates a threshold value of the group to perform monitoring. This can increase the monitoring efficiency. After the threshold value of the group is set, an abnormality is detected by using it (S4000).

FIG. 2 is a conceptual diagram for explaining a method of monitoring time series data according to an embodiment of the present invention.

Referring to FIG. 2, CPU utilization data generated every second by the servers to be monitored is transmitted to a collection server. The collection server is a system for processing such big data. The collection server averages the collected per-second CPU utilization data over one-minute intervals and generates time series data of CPU utilization. Of course, the one-minute interval is only for convenience of understanding, and time series data may be generated at any other interval.

The raw per-second CPU usage data may be used directly as time series data, or it may be averaged in 5-minute or 10-minute increments. The generation period of the time series data can be changed depending on the monitoring target. Because CPU utilization is an important item that must be monitored urgently, a small value such as 1 minute may be used; conversely, because data such as storage usage is relatively less urgent, a longer period may be used.
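The averaging step performed by the collection server can be sketched as follows, assuming each sample arrives as a pair of integer epoch seconds and a CPU percentage. The function name `average_by_interval` and this input layout are hypothetical.

```python
from collections import defaultdict

def average_by_interval(samples, interval_seconds=60):
    """Average (epoch_seconds, value) samples into fixed-width buckets.

    Returns a list of (bucket_start, mean_value) sorted by time.
    Buckets with no samples are simply absent from the result.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % interval_seconds].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]

# Two minutes of hypothetical per-second readings.
per_second = [(t, 10.0) for t in range(0, 60)] + [(t, 30.0) for t in range(60, 120)]
per_minute = average_by_interval(per_second, interval_seconds=60)
```

Changing `interval_seconds` to 300 or 600 gives the 5-minute or 10-minute series mentioned above.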

After the collection server generates the one-minute time series data for the CPU usage rate, the data refining operation proceeds. The refining operation can be performed in two steps. First, as described briefly above, the time series data is analyzed to determine whether any physical operation was performed on the server. Then, when it is determined that there was a physical operation such as expansion or contraction of the server, only the time series data after that point is used for the threshold calculation.

The criterion for determining whether the server has a physical operation can be determined automatically using the time of the missing time series data. For example, if there is a physical operation such as server expansion, CPU utilization data can not be generated. Therefore, after extracting the time series data, the CPU utilization data will be missing for a certain period of time. In this case, if the missing time exceeds the reference value, it can be automatically determined that there is a physical operation. Of course, in addition to automatically determining this, it is also possible to refine the time series data by receiving an input from the manager to exclude the time series data prior to a specific point in time.

If the missing time is below the reference value, only the missing time series data is excluded from the threshold calculation. For example, suppose the rule is that when one-minute time series data is missing for one hour or more, a physical operation is deemed to have occurred and the data before the gap is excluded. If only two minutes of time series data are missing, only the two missing data points are excluded.

In other words, it is not necessary to generate substitute values, for example by averaging, to fill in the missing time series data. This is because the threshold value is calculated using the time series data of a past period; even if some values are missing, the dynamic threshold calculation is not affected.

In summary, if missing time series data is found after the time series data is extracted, the missing duration is compared with the predetermined reference value. If the missing duration is below the reference value, only the missing time series data is excluded; if it is equal to or above the reference value, all time series data before the gap is excluded. The time series data is refined in this way.
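The refinement rule summarized above can be sketched as below, assuming a regularly sampled series in which a missing point is represented by `None`. The function name `refine_series` and the parameter `max_missing_points` are hypothetical, and treating a gap exactly equal to the reference value as a long gap is an assumption, since the text does not fix the boundary case.

```python
def refine_series(values, max_missing_points):
    """Drop missing points; if a gap of at least `max_missing_points`
    consecutive missing points is found, also drop everything before it."""
    start = 0
    run = 0
    for i, v in enumerate(values):
        if v is None:
            run += 1
        else:
            if run >= max_missing_points:
                start = i  # a long gap just ended: keep only data from here on
            run = 0
    # Short gaps are excluded without imputing substitute values.
    return [v for v in values[start:] if v is not None]

# Hypothetical one-minute series with a three-point gap and a one-point gap.
series = [1.0, 2.0, None, None, None, 3.0, None, 4.0]
refined = refine_series(series, max_missing_points=3)
```

In this sketch a gap that runs to the end of the series is left alone, because no later data exists to keep.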

Next, the one-minute time series data of the CPU utilization is used to classify the servers into several groups. In this case, the variance, maximum value, and seasonality of the time series data can be used. Using these characteristics, the servers can be clustered into a total of four groups. The process of clustering servers into four groups will be described in more detail with reference to FIG. 3.

Group 1 is the group whose CPU utilization time series has a monotonous pattern. In other words, most of the time series data lies close to the mean and the mean is low; put differently, the maximum value is small. In the case of CPU utilization, the lower threshold does not need to be monitored, and monitoring is performed mainly against the upper threshold.

Group 2 is the group whose CPU utilization time series has a high-usage pattern. That is, most of the time series data lies close to the mean and the maximum value is large. For example, servers whose CPU utilization consistently stays between 80% and 85% may be included. The values 80% and 85% in this example are given only to aid understanding and are not intended to be limiting.

Servers that consistently show a high CPU usage rate are classified into a separate group because such servers are likely to reach the upper limit threshold and are candidates for future server expansion. In the present invention, each server is classified into a group according to the characteristics of its time series data, and the threshold of each group is dynamically calculated to perform monitoring. Here, the upper limit threshold of each group is also at or below 100%, which is the maximum usage rate.

In this case, group 2, which has a high CPU usage rate, is highly likely to reach the upper threshold, so it is separated from group 1 and monitored. If an abnormality occurs in another group, the response is to resolve its cause. If an abnormality occurs in group 2, however, the response is mainly a physical operation such as server expansion.

Group 3 is the group whose CPU usage time series has a time-dependent or periodic pattern, that is, a group in which the variance of the time series data is large and the data repeats with a constant period. Finally, group 4 is the group whose CPU utilization time series has a large variance. In other words, group 4 resembles group 3 in its large variance but has no periodicity.

By dividing these servers into four groups according to the characteristics of the time series data of the CPU utilization and dynamically calculating the upper threshold value of each group, the monitoring can be performed quickly. The dynamic threshold value must be constantly updated as compared with the conventional fixed threshold value, so that the load of the monitoring system is inevitably incurred. However, by dividing several servers into several groups and calculating the dynamic thresholds, the monitoring speed can be improved by reducing the load relatively more than when calculating the dynamic thresholds of the respective servers.

After dividing the monitored servers into four groups, the upper threshold of each group is dynamically calculated. In this process, the non-parametric bootstrap algorithm and the ROC curve can be used. The non-parametric bootstrap algorithm and the ROC curve will be described in more detail later with reference to FIGS. 6A and 6B. After the upper threshold value of each group is dynamically computed, abnormalities are detected based on that threshold value.

By using dynamic thresholds, it is possible to reduce false positives, and by dividing several servers into groups, it is possible to reduce the cost of dynamic threshold calculation. In addition, only recent time series data can be used for dynamic threshold calculation to improve monitoring performance. Also, by using the nonparametric bootstrap algorithm, it is possible to monitor time series data that do not follow the normal distribution.
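As a rough illustration of why a non-parametric bootstrap does not need the normality assumption, the sketch below estimates an upper threshold by resampling a group's past data with replacement and averaging a high quantile over the resamples. This is a generic percentile-style bootstrap and not the procedure of FIGS. 6A and 6B, which is not reproduced here; the ROC-curve step is omitted, and the names `bootstrap_upper_threshold`, `q`, and `n_resamples` are assumptions.

```python
import random

def quantile(sorted_values, q):
    """Simple index-based quantile on an already sorted list."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]

def bootstrap_upper_threshold(history, q=0.99, n_resamples=500, seed=0):
    """Average the q-quantile over bootstrap resamples of `history`.

    No distribution is assumed: every resample is drawn with
    replacement from the observed data itself.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(history)
    estimates = []
    for _ in range(n_resamples):
        resample = sorted(rng.choices(history, k=n))
        estimates.append(quantile(resample, q))
    return sum(estimates) / len(estimates)

# Hypothetical group history: mostly 20-29% with a few 60% readings.
history = [20.0 + (i % 10) for i in range(200)] + [60.0] * 5
threshold = bootstrap_upper_threshold(history)
```

Recomputing the threshold whenever the recent window of `history` moves forward is what makes it dynamic in the sense used above.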

FIG. 3 is a flowchart illustrating a process of classifying servers into groups according to characteristics of time series data in an embodiment of the present invention.

FIG. 3 shows a process of dividing a plurality of servers into four groups. First, the variance of the extracted time series data is calculated, and it is checked whether the variance is equal to or greater than a reference value (S2100). A variance at or above the reference value means that the time series data is widely dispersed.

If the variance of the time series data is small, most of the time series data lies close to the mean. In this case, the group is determined by whether the maximum value is large or small (S2200). If the maximum value is less than or equal to the reference value, the time series data corresponds to group 1 (110), which has a monotonous pattern. The example time series data 115 of group 1 (110) is densely distributed within a certain range.

On the other hand, if the maximum value of the time series data is equal to or larger than the reference value, the time series data corresponds to group 2 (120), which has a high-usage pattern. The example time series data 125 of group 2 (120) shows that the CPU usage rate is consistently high.

If the variance of the time series data is large, the group is determined by whether the dispersion repeats with a predetermined period (S2300). If the data has seasonality, it corresponds to the time-dependent or periodic pattern of group 3 (130). The example time series data 135 of group 3 (130) shows that the time series data repeats at regular intervals.

Conversely, if there is no seasonality, the data corresponds to the large-variance pattern of group 4 (140). That is, servers whose time series data has a large variance but does not repeat with a constant period correspond to group 4 (140). The example time series data 145 of group 4 (140) shows that the time series data is scattered with a large deviation and without any fixed period.
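The decision flow of steps S2100 to S2300 can be sketched as a small function. The reference values `var_ref` and `max_ref` are hypothetical parameters, the seasonality decision is passed in as a boolean rather than computed, and assigning a maximum exactly equal to the reference value to group 2 is an assumption, since the text leaves that boundary ambiguous.

```python
from statistics import pvariance

def classify_server(series, has_seasonality, var_ref, max_ref):
    """Return the group number (1 to 4) for one server's time series."""
    if pvariance(series) < var_ref:          # S2100: small variance
        # S2200: split on the maximum value
        return 1 if max(series) < max_ref else 2
    # S2300: large variance, split on seasonality
    return 3 if has_seasonality else 4

# Hypothetical CPU utilization series (percent) and reference values.
g1 = classify_server([10, 11, 10, 12], False, var_ref=25, max_ref=70)
g2 = classify_server([82, 84, 83, 85], False, var_ref=25, max_ref=70)
g3 = classify_server([10, 90, 10, 90], True, var_ref=25, max_ref=70)
g4 = classify_server([10, 90, 10, 90], False, var_ref=25, max_ref=70)
```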

As described above, FIG. 3 shows a plurality of servers divided into four groups. However, this is just an example, and the servers need not be divided into exactly four groups. Nor is it necessary to divide the groups by variance, maximum value, and seasonality; different characteristics may be used to classify the servers into different kinds of groups. It is sufficient that classifying the monitoring objects into groups reduces the load generated by the dynamic threshold calculation.

In most distributed environments, you configure multiple servers with similar performance. For example, a cluster of servers for storage is configured by grouping several servers with increased storage capacity. Alternatively, a parallel supercomputer may be constructed by grouping several servers that have only enhanced computing power. Alternatively, multiple servers may be grouped together to form a load balancing server cluster to ensure availability. Because servers with similar performance show similar pattern in time series data, it is less burden to the monitoring server to compute the dynamic threshold value after grouping them into one group rather than calculating the dynamic threshold value for each server.

4A and 4B are conceptual diagrams for explaining a K average algorithm used in an embodiment of the present invention.

4A is a representation of a formula used in the K-means algorithm. The K average algorithm is an algorithm that classifies given N data into K clusters that are less than or equal to N. [ In this case, the process of dividing each lump is performed by minimizing the cost function of data belonging to each lump. The cost function is the square sum of the Euclidean geometric distances from the center of each lump to each data. That is, || x-μ i || 2 (where x is each data, and μ i is the center of each lump) is minimized.

FIG. 4B shows a process of classifying data into three clusters with the K-means algorithm. Referring to FIG. 4B, data points lie on the coordinate plane and are grouped into three clusters 211a, 211b, and 211c. Because the K-means algorithm defines its cost function as the sum of squared Euclidean distances, coordinates are needed to calculate distances in one, two, three, or more dimensions. Here, the coordinates are formed from the attributes used as criteria when dividing the data into groups.

For example, in FIG. 3, the groups are divided on the basis of the variance, maximum value, and seasonality of the time series data. Therefore, taking (variance, maximum value, seasonality) of the CPU utilization time series of each server as the coordinates (x, y, z), the distances can be calculated and the servers can be divided into groups. For more information on the K-means algorithm, see the Internet page https://www.wikipedia.org/wiki/K-means_Algorithm.
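The grouping step described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the helper names `features` and `k_means`, the encoding of seasonality as 0.0 or 1.0, the fixed iteration limit, and the random initialization are all assumptions made for the example. In practice the three attributes would likely need to be scaled to comparable ranges before distances are computed.

```python
import random
from statistics import pvariance


def features(series, seasonal):
    """Map one server's CPU series to (variance, maximum, seasonality flag)."""
    return (pvariance(series), max(series), 1.0 if seasonal else 0.0)


def squared_distance(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Plain Lloyd iteration over coordinate tuples; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda i: squared_distance(p, centers[i]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

Each server contributes one point, for example `features(cpu_series, seasonal=True)`, and `k_means(points, 4)` would then yield one label per server, corresponding to a four-group split like the one in FIG. 3.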

FIG. 5 is a conceptual diagram for explaining the X-12-ARIMA algorithm used for determining the seasonality in an embodiment of the present invention.

There are various algorithms for determining the seasonality (or periodicity) of time series data. One example is the seasonality test of X-12-ARIMA. In the stable seasonality test of X-12-ARIMA, the variation of the time series data $x_{ij}$ is decomposed as shown in Equation 1 of FIG. 5, where $i$ takes values from 1 to $N$ and $j$ takes values from 1 to $l$. Here $i$ indexes the year effect and $j$ indexes the month or quarter effect.

In Equation 1 of FIG. 5, $\bar{x}_{\cdot j}$ denotes the average of the time series for month (or quarter) $j$, and $\bar{x}_{\cdot\cdot}$ denotes the average of the entire time series. Equation 1 of FIG. 5 can be rewritten as Equation 2, and the test statistic for seasonality is then given by Equation 3. $F_S$ follows an F distribution with $l - 1$ and $n - l$ degrees of freedom. A value of 7 is normally used as the rejection threshold for the F statistic: if the F statistic is larger than 7, stable seasonality is judged to be present.
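As a rough illustration of this test, the sketch below computes a one-way analysis-of-variance F statistic across the positions of a seasonal cycle, with $l - 1$ and $n - l$ degrees of freedom, and compares it with the rejection value of 7. The function names are hypothetical, and the sketch applies the test directly to the supplied values; X-12-ARIMA itself runs the test on a preprocessed component of the series, which is not reproduced here.

```python
def stable_seasonality_f(values, period):
    """One-way ANOVA F statistic across the positions of a seasonal cycle.

    values: observations in time order, covering whole cycles.
    period: number of positions per cycle (l in the text, e.g. 12 for months).
    """
    n = len(values)
    if n % period != 0 or n <= period:
        raise ValueError("need at least two whole cycles")
    grand_mean = sum(values) / n
    # Group observations by their position j within the cycle.
    groups = [values[j::period] for j in range(period)]
    # Variation between the per-position means and the overall mean.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Residual variation of each observation around its position mean.
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (period - 1)) / (within / (n - period))


def has_stable_seasonality(values, period, critical_value=7.0):
    """Apply the rejection value of 7 mentioned in the text."""
    return stable_seasonality_f(values, period) > critical_value
```

The boolean result could serve as the seasonality attribute when servers are divided into groups.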

The K-means algorithm for dividing a plurality of servers into groups has been described with reference to FIGS. 4A and 4B, and the algorithm for determining whether seasonality is present in that process has been described with reference to FIG. 5. After the groups are formed using the various characteristics of the time series data, the threshold value of each group must be calculated dynamically.

FIGS. 6A and 6B are conceptual diagrams illustrating a non-parametric bootstrap algorithm and an ROC curve used in an embodiment of the present invention.

The non-parametric bootstrap algorithm is a method of estimating a statistic from a sample when the probability distribution of the population is unknown. It can therefore be applied to time series data whose population does not follow a normal distribution. In other words, the sample is repeatedly resampled with replacement, and the statistic of each resulting bootstrap sample is computed.

To summarize, suppose random variables $X = \{X_1, \ldots, X_n\}$ are given and $F()$ is a sampling function for $X$. Drawing $t$ samples with $F()$ is expressed by Equation 1 of FIG. 6A. Applying the quantile function $Q(1-\alpha, s)$, with $\alpha \in (0, 1)$, to every bootstrap sample $(s_1, \ldots, s_t)$ yields the corresponding quantile values. Averaging these values gives the threshold value, as shown in Equation 3 of FIG. 6A.
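The procedure can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names, the default number of resamples `t`, the fixed random seed, and the linearly interpolated empirical quantile are illustrative choices, since the text does not specify how the quantile function is evaluated.

```python
import random


def quantile(sample, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(sample)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def bootstrap_threshold(data, alpha, t=1000, seed=0):
    """Average the (1 - alpha) quantile over t resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    quantiles = []
    for _ in range(t):
        # One bootstrap sample s_k: n draws with replacement from the data.
        resample = [rng.choice(data) for _ in range(n)]
        quantiles.append(quantile(resample, 1 - alpha))
    return sum(quantiles) / t
```

Calling `bootstrap_threshold(group_cpu_values, alpha=0.01)` on the pooled CPU utilization values of one group would give that group's upper threshold; `group_cpu_values` is a placeholder name.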

Here, the threshold value varies depending on the value of α. α is the Type I error rate and takes a value between 0 and 1. Normally, the user can choose it. Here, a method of determining it automatically using an ROC curve (Receiver Operating Characteristic Curve) will be described with reference to FIG. 6B.

FIG. 6B shows an example of an ROC curve. The ROC curve is a graph with the FPR on the x-axis and the TPR on the y-axis. Here, FPR stands for False Positive Rate and refers to the rate at which data that are not abnormal are detected as abnormal (= 1 - specificity). TPR stands for True Positive Rate and refers to the rate at which actual abnormalities are detected as abnormal (= sensitivity).

In other words, the false positive rate is the false detection rate, and the true positive rate is the correct detection rate. The following table summarizes the underlying terms.

False Positive (FP): not actually abnormal, but predicted to be abnormal
False Negative (FN): actually abnormal, but predicted not to be abnormal
True Positive (TP): actually abnormal, and predicted to be abnormal
True Negative (TN): not actually abnormal, and predicted not to be abnormal

Here, FPR and TPR are obtained by using FP, FN, TP, and TN.

$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}$$

As the formulas suggest, there is a trade-off between the FPR and the TPR. Referring to FIG. 6B, the ROC curve takes different shapes depending on the value of α. Because the accuracy of the prediction matters here, the closer the area under the ROC curve is to 1, the better. In the example of FIG. 6B, among α1, α2, and α3, α3 is the most preferable because the FPR decreases and the TPR increases. As described above, since the servers are divided into groups and a dynamic threshold value is calculated for each group, the value of α is also calculated for each group.
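One way to automate the choice of α is sketched below. It assumes that labeled historical data are available (each value marked abnormal or not) and that a threshold has already been computed for each candidate α. The selection rule, maximizing TPR - FPR (Youden's J statistic), is an assumption made for this example; the text states only that the candidate with lower FPR and higher TPR is preferred. The function names are hypothetical.

```python
def rates(values, is_abnormal, threshold):
    """Return (FPR, TPR) when values above the threshold are flagged abnormal."""
    tp = fp = tn = fn = 0
    for value, abnormal in zip(values, is_abnormal):
        flagged = value > threshold
        if flagged and abnormal:
            tp += 1
        elif flagged:
            fp += 1
        elif abnormal:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if fp + tn else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return fpr, tpr


def pick_alpha(values, is_abnormal, threshold_by_alpha):
    """Choose the alpha whose threshold maximizes TPR - FPR (an assumed rule)."""
    def score(alpha):
        fpr, tpr = rates(values, is_abnormal, threshold_by_alpha[alpha])
        return tpr - fpr
    return max(threshold_by_alpha, key=score)
```

Here `threshold_by_alpha` maps each candidate α to the threshold obtained for it, so the selection can be repeated separately for every group.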

With reference to FIGS. 6A and 6B, a method has been described for calculating the dynamic threshold value from past time series data through the bootstrap, even when the time series data do not follow a normal distribution. Here, using only the time series data of a recent period can further increase the accuracy of the prediction. For example, by forming the groups from only the time series data of the last three weeks and calculating the dynamic threshold value of each group, the recent trend of the servers is better reflected, which further reduces the false detection rate.
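Restricting the input to a recent period is straightforward. The sketch below assumes each sample is a `(timestamp, value)` pair; the function name and the default of three weeks (taken from the example in the text) are illustrative.

```python
from datetime import datetime, timedelta


def recent_window(samples, now, weeks=3):
    """Keep (timestamp, value) samples no older than `weeks` weeks before `now`."""
    cutoff = now - timedelta(weeks=weeks)
    return [(ts, value) for ts, value in samples if cutoff <= ts <= now]
```

Only the samples returned by this filter would then be used for grouping and for computing the dynamic threshold values.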

If CPU utilization data exceeding the computed dynamic threshold value is then received, it can be judged as abnormal. However, because of the nature of CPU utilization, the dynamic threshold may sometimes be exceeded only temporarily. Therefore, to reduce false positives in such cases, an abnormality can be declared only when CPU utilization data exceeding the dynamic threshold occur continuously for a predetermined time. For example, with CPU utilization time series data collected at one-minute intervals, if values exceeding the dynamic threshold persist for two minutes or more, that is, occur two or more times in a row, an abnormality can be determined and an alarm can be sent to the administrator.
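The consecutive-exceedance rule can be sketched as a simple run counter. The function name and the choice to raise one alarm per run (at the sample where the run first reaches the required length) are assumptions made for the example.

```python
def detect_anomalies(values, threshold, min_consecutive=2):
    """Return the indices at which an alarm fires.

    An alarm fires once per run, at the sample where the number of
    consecutive values above the threshold first reaches min_consecutive.
    """
    alarms = []
    run = 0
    for index, value in enumerate(values):
        # Extend the current run on an exceedance, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            alarms.append(index)
    return alarms
```

With one-minute samples and `min_consecutive=2`, a single spike above the threshold is ignored, while two consecutive exceedances raise an alarm at the second sample.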

FIG. 7 is an exemplary diagram for explaining a monitoring process of an abnormality detection method according to an embodiment of the present invention.

Referring to FIG. 7, the CPU utilization of server A is plotted on a coordinate plane with time on the x-axis and utilization on the y-axis. Because server A usually has a utilization within 5%, it corresponds to the type of group 1 (110). Looking at the CPU utilization of server A over the day from 00:00 to 23:25, it temporarily shows utilization of 10% to 25%, but overall the utilization stays within 5%. Conventionally, abnormalities of servers of the group 1 (110) type were also detected using a single fixed threshold value applied across the board. Generally, the fixed threshold for CPU utilization is set to 80%, and servers whose CPU utilization exceeds the fixed threshold are classified as candidates for expansion.

As shown in FIG. 7, from 14:25 to 16:40 the CPU utilization is somewhat higher than usual but still below the fixed threshold of 80%. When monitoring with the fixed threshold, this is not judged to be abnormal. However, the reason server A, which usually shows a CPU utilization within 5%, shows a CPU utilization of about 35% is that an abnormality has occurred.

In fact, although server A failed at 14:25, the monitoring system using the fixed threshold did not detect the abnormality, and the failure was recognized only at 14:45 through a user notification. With a dynamic threshold, however, server A is classified into group 1 (110) based on its CPU utilization time series data from the last three weeks. Calculating the dynamic threshold value of group 1 (110) with α set to 0.01 then gives a dynamic threshold of 15.17%. Had monitoring been based on the dynamic threshold of 15.17%, the CPU utilization reaching 35% because of the failure at 14:25 would have been judged abnormal, and the failure could have been handled promptly.

FIG. 8 is a conceptual diagram for explaining an anomaly detection apparatus using time series data according to an embodiment of the present invention.

Referring to FIG. 8, the anomaly detection apparatus 10 collects one-minute average CPU utilization data from each individual server. The anomaly detection apparatus 10 is configured as a big data platform for collecting and analyzing large amounts of data. Looking at the configuration of the big data platform in more detail, data are first collected through Apache Flume. To analyze the collected data, the data are passed to an Apache Hadoop cluster.

The Hadoop cluster consists of Ambari for managing and monitoring the Hadoop cluster; HBase, a non-relational distributed database; HDFS, a file system for storing distributed data; Hive for data summarization, query, and analysis; MapReduce (MR), a processing framework; and YARN, the resource management platform of the Hadoop cluster.

For data analysis, R, a programming language and software environment for statistics and graphics, can be used to analyze the patterns of the servers' time series data and divide the servers into groups. It can also be used to calculate the dynamic threshold value of each group. Once the preparation for detecting abnormalities from the time series data is complete in this manner, a monitoring environment can be provided on top of it. For example, the current CPU utilization data of a server can be displayed in one-minute units together with the dynamic threshold value, so that whether an abnormality has occurred can be reported in real time.

FIG. 9 is a hardware block diagram of an anomaly detection apparatus using time series data according to an embodiment of the present invention.

Referring to FIG. 9, an anomaly detection apparatus 10 using time series data may include one or more processors 510, a memory 520, a storage 560, and an interface 570. The processor 510, the memory 520, the storage 560, and the interface 570 transmit and receive data via the system bus 550.

The processor 510 executes a computer program loaded into the memory 520, and the memory 520 loads the computer program from the storage 560. The computer program may include a monitoring data management operation 521, a cluster configuration operation 523, a threshold value calculation operation 525, and an anomaly monitoring operation 527.

The monitoring data management operation 521 collects time series data for monitoring from each server via the interface 570 and stores the collected data as monitoring data 561 in the storage 560 via the system bus 550. The Hadoop cluster shown in FIG. 8 corresponds to the monitoring data management operation 521.

The cluster configuration operation 523 classifies a plurality of servers into groups using the collected time series data and stores this information in the cluster information 563 of the storage 560 through the system bus 550. The threshold value calculation operation 525 uses the cluster information 563 to calculate a dynamic threshold value for each cluster.

The threshold value calculation operation 525 computes a dynamic threshold value for each cluster and stores it as threshold data 565 in the storage 560 via the system bus 550. The stored threshold data 565 can be used for monitoring by the anomaly monitoring operation 527. The R shown in FIG. 8 corresponds to the cluster configuration operation 523 and the threshold value calculation operation 525.

The anomaly monitoring operation 527 compares the time series data collected through the interface 570 by the monitoring data management operation 521 with the threshold data 565 and monitors whether an abnormality has occurred. The result can also be visualized and provided to the user. In addition, when an abnormality occurs, the user can be notified through an alarm.

Each component in FIG. 9 may be software or hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit). However, the components are not limited to software or hardware; they may be configured to reside in an addressable storage medium and configured to execute on one or more processors. The functions provided by the components may be implemented by further subdivided components, or a plurality of components may be combined into a single component that performs a specific function.

While the present invention has been described in connection with what is presently considered to be practical exemplary embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and those skilled in the art will understand that it may be embodied in other specific forms without departing from its technical spirit or essential features. It is therefore to be understood that the above-described embodiments are illustrative in all aspects and not restrictive.

Claims (17)

1. An anomaly detection method using time series data, comprising:
collecting time series data from a monitoring object;
classifying the monitoring object into clusters using the time series data;
dynamically calculating a threshold value of the cluster; and
detecting an abnormality of a monitoring object belonging to the cluster using the dynamically calculated threshold value.
2. The method of claim 1,
wherein collecting the time series data comprises
collecting only time series data within a predetermined period from the current time.
3. The method of claim 1,
wherein classifying the monitoring object into clusters comprises
classifying the monitoring object into clusters using a K-means algorithm.
4. The method of claim 1,
wherein classifying the monitoring object into clusters comprises
classifying the monitoring object into clusters using at least one of a variance, a maximum value, and a seasonality of the time series data.
5. The method of claim 1,
wherein dynamically calculating the threshold value of the cluster comprises
dynamically calculating the threshold value of the cluster using a non-parametric bootstrap method.
6. The method of claim 5,
wherein dynamically calculating the threshold value of the cluster using the non-parametric bootstrap method comprises
automatically calculating an alpha value of the bootstrap using an ROC curve (Receiver Operating Characteristic Curve).
7. The method of claim 1,
wherein detecting an abnormality of the monitoring object belonging to the cluster comprises
determining an abnormality only when time series data exceeding the threshold value occur consecutively a predetermined number of times or more.
8. A method for detecting an abnormal CPU usage rate of a server, comprising:
collecting time series data of a CPU usage rate from a plurality of servers;
classifying the plurality of servers into clusters using the time series data of the CPU usage rate;
dynamically calculating an upper-limit threshold value of the cluster; and
detecting an abnormality of the CPU usage rate of a server belonging to the cluster by monitoring whether the upper-limit threshold value is exceeded.
9. The method of claim 8,
wherein classifying the plurality of servers into clusters comprises
classifying the servers into clusters using at least one of a variance, a maximum value, and a seasonality of the time series data of the CPU usage rate.
10. The method of claim 9,
wherein the clusters include
a first cluster having a small variance and a small maximum value, a second cluster having a small variance and a large maximum value, a third cluster having a large variance and seasonality, and a fourth cluster having a large variance and no seasonality.
11. An anomaly detection apparatus using time series data, comprising:
a network interface;
one or more processors;
a memory for loading a computer program executed by the processor; and
a storage for storing time series data,
wherein the computer program comprises:
an operation of collecting the time series data from a monitoring object;
an operation of classifying the monitoring object into clusters using the time series data;
an operation of dynamically calculating a threshold value of the cluster; and
an operation of detecting an abnormality of a monitoring object belonging to the cluster using the dynamically calculated threshold value.
12. The apparatus of claim 11,
wherein the operation of collecting the time series data comprises
an operation of collecting only time series data within a predetermined period from the current time.
13. The apparatus of claim 11,
wherein the operation of classifying the monitoring object into clusters comprises
an operation of classifying the monitoring object into clusters using a K-means algorithm.
14. The apparatus of claim 11,
wherein the operation of classifying the monitoring object into clusters comprises
an operation of classifying the monitoring object into clusters using at least one of a variance, a maximum value, and a seasonality of the time series data.
15. The apparatus of claim 11,
wherein the operation of dynamically calculating the threshold value of the cluster comprises
an operation of dynamically calculating the threshold value of the cluster using a non-parametric bootstrap method.
16. The apparatus of claim 11,
wherein the operation of detecting an abnormality of the monitoring object belonging to the cluster comprises
an operation of determining an abnormality only when time series data exceeding the threshold value occur consecutively a predetermined number of times or more.
17. A computer program that, in combination with a computing device, executes:
collecting time series data from a monitoring object;
classifying the monitoring object into clusters using the time series data;
dynamically calculating a threshold value of the cluster; and
detecting an abnormality of a monitoring object belonging to the cluster using the dynamically calculated threshold value.
KR1020160003500A 2016-01-12 2016-01-12 Method and apparatus for detecting abnormality using time-series data KR20170084445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160003500A KR20170084445A (en) 2016-01-12 2016-01-12 Method and apparatus for detecting abnormality using time-series data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160003500A KR20170084445A (en) 2016-01-12 2016-01-12 Method and apparatus for detecting abnormality using time-series data

Publications (1)

Publication Number Publication Date
KR20170084445A true KR20170084445A (en) 2017-07-20

Family

ID=59443362

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160003500A KR20170084445A (en) 2016-01-12 2016-01-12 Method and apparatus for detecting abnormality using time-series data

Country Status (1)

Country Link
KR (1) KR20170084445A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255667A (en) * 2017-12-27 2018-07-06 阿里巴巴集团控股有限公司 A kind of business monitoring method, device and electronic equipment
KR101960755B1 (en) * 2018-12-20 2019-03-21 문경훈 Method and apparatus of generating unacquired power data
CN110032495A (en) * 2019-03-28 2019-07-19 阿里巴巴集团控股有限公司 Data exception detection method and device
CN111290917A (en) * 2020-02-26 2020-06-16 深圳市云智融科技有限公司 YARN-based resource monitoring method and device and terminal equipment
KR102179290B1 (en) * 2019-11-07 2020-11-18 연세대학교 산학협력단 Method for indentifying anomaly symptom about workload data
CN112667479A (en) * 2020-12-30 2021-04-16 联想(北京)有限公司 Information monitoring method and device
CN114402575A (en) * 2020-03-25 2022-04-26 株式会社日立制作所 Action recognition server, action recognition system and action recognition method

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255667A (en) * 2017-12-27 2018-07-06 阿里巴巴集团控股有限公司 A kind of business monitoring method, device and electronic equipment
CN108255667B (en) * 2017-12-27 2021-07-06 创新先进技术有限公司 Service monitoring method and device and electronic equipment
KR101960755B1 (en) * 2018-12-20 2019-03-21 문경훈 Method and apparatus of generating unacquired power data
CN110032495A (en) * 2019-03-28 2019-07-19 阿里巴巴集团控股有限公司 Data exception detection method and device
CN110032495B (en) * 2019-03-28 2023-08-25 创新先进技术有限公司 Data anomaly detection method and device
KR102179290B1 (en) * 2019-11-07 2020-11-18 연세대학교 산학협력단 Method for indentifying anomaly symptom about workload data
CN111290917A (en) * 2020-02-26 2020-06-16 深圳市云智融科技有限公司 YARN-based resource monitoring method and device and terminal equipment
CN114402575A (en) * 2020-03-25 2022-04-26 株式会社日立制作所 Action recognition server, action recognition system and action recognition method
CN114402575B (en) * 2020-03-25 2023-12-12 株式会社日立制作所 Action recognition server, action recognition system, and action recognition method
CN112667479A (en) * 2020-12-30 2021-04-16 联想(北京)有限公司 Information monitoring method and device

Similar Documents

Publication Publication Date Title
KR20170084445A (en) Method and apparatus for detecting abnormality using time-series data
US11514354B2 (en) Artificial intelligence based performance prediction system
CN109542740B (en) Abnormality detection method and apparatus
US10055275B2 (en) Apparatus and method of leveraging semi-supervised machine learning principals to perform root cause analysis and derivation for remediation of issues in a computer environment
US20160210556A1 (en) Heuristic Inference of Topological Representation of Metric Relationships
CN112188531B (en) Abnormality detection method, abnormality detection device, electronic apparatus, and computer storage medium
Soualhia et al. Infrastructure fault detection and prediction in edge cloud environments
US10282458B2 (en) Event notification system with cluster classification
US20180211172A1 (en) Machine Discovery and Rapid Agglomeration of Similar States
US9244711B1 (en) Virtual machine capacity planning
US11060885B2 (en) Univariate anomaly detection in a sensor network
WO2018140556A1 (en) Machine discovery of aberrant operating states
CN109857618B (en) Monitoring method, device and system
CN110955586A (en) System fault prediction method, device and equipment based on log
US10268505B2 (en) Batch job frequency control
CN113992602B (en) Cable monitoring data uploading method, device, equipment and storage medium
CN109976986B (en) Abnormal equipment detection method and device
CN113283502B (en) Device state threshold determining method and device based on clustering
CN110874601B (en) Method for identifying running state of equipment, state identification model training method and device
CN108255710B (en) Script abnormity detection method and terminal thereof
Agrawal et al. Adaptive anomaly detection in cloud using robust and scalable principal component analysis
JP6226463B2 (en) Network management system, network device and control device
Alkasem et al. Utility cloud: a novel approach for diagnosis and self-healing based on the uncertainty in anomalous metrics
CN112732517B (en) Disk fault alarm method, device, equipment and readable storage medium
Jha et al. Holistic measurement-driven system assessment