CN112749035B

CN112749035B - Abnormality detection method, abnormality detection device, and computer-readable medium

Info

Publication number: CN112749035B
Application number: CN201911056820.1A
Authority: CN
Inventors: 王梦天; 王梦杰; 莫登耀
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2024-06-11
Anticipated expiration: 2039-10-31
Also published as: CN112749035A

Abstract

The application provides an abnormality detection scheme, which comprises the steps of firstly carrying out baseline fitting by utilizing historical data of a system, then detecting a test sample according to the baseline, and determining whether the running state of the system corresponding to the test sample is abnormal or not. During training, firstly, abnormal samples are removed from an initial training sample set to obtain a first training sample set, then unequal sampling is carried out on the first training sample set to obtain a second training sample set, and baseline fitting is carried out on the basis of the second training sample set.

Description

Abnormality detection method, abnormality detection device, and computer-readable medium

Technical Field

The present application relates to the field of information technologies, and in particular, to an anomaly detection method, an anomaly detection device, and a computer readable medium.

Background

Read-write delay is an important index in the operation of a cloud computing system, and problems in the overall system are often reflected in it; for example, the index rises noticeably when the system is abnormal. Accurate abnormality detection on this index is therefore very important.

Many key indexes of the system change in association with one another; for example, the read-write delay index is affected by the read-write size and the number of reads/writes per second. An increase in the read-write delay index does not necessarily mean that an abnormality has occurred; it may be a normal increase caused by an increase in indexes such as the read-write size and the number of reads/writes per second. An explainable index surge is not an anomaly, whereas an unexplainable index surge is an anomaly that requires troubleshooting and corresponding operation and maintenance handling. Because of the nonlinear and complex linkage among multiple indexes, it is difficult to raise alarms by setting fixed rules. A scheme capable of accurately detecting system anomalies is therefore currently lacking.

Content of the application

An object of the present application is to provide an abnormality detection scheme to solve the problem that there is currently a lack of a method capable of accurately detecting abnormality of a system.

To achieve the above object, some embodiments of the present application provide an anomaly detection method, including:

removing abnormal samples from an initial training sample set to obtain a first training sample set, wherein the samples in the initial training sample set comprise historical data of response indexes and pressure indexes of the system;

performing unequal sampling on the first training sample set to obtain a second training sample set, wherein, in the unequal sampling, the sampling probability of a sample taken during operation in a high-pressure interval is greater than the sampling probability of a sample taken during operation in a low-pressure interval;

performing baseline fitting based on the response indexes and pressure indexes of the samples in the second training sample set to obtain a baseline for anomaly detection; and

detecting a test sample according to the baseline, and determining whether the system running state corresponding to the test sample is abnormal, wherein the test sample comprises a response index and a pressure index to be detected when the system runs.

Some embodiments of the present application also provide an abnormality detection apparatus including:

a cleaning module, configured to remove abnormal samples from an initial training sample set to obtain a first training sample set, wherein the samples in the initial training sample set comprise historical data of response indexes and pressure indexes of the system;

a sampling module, configured to perform unequal sampling on the first training sample set to obtain a second training sample set, wherein, in the unequal sampling, the sampling probability of a sample taken during operation in a high-pressure interval is greater than the sampling probability of a sample taken during operation in a low-pressure interval;

a training module, configured to perform baseline fitting based on the response indexes and pressure indexes of the samples in the second training sample set to obtain a baseline for anomaly detection; and

a detection module, configured to detect a test sample according to the baseline and determine whether the system running state corresponding to the test sample is abnormal, wherein the test sample comprises a response index and a pressure index to be detected when the system runs.

Furthermore, some embodiments of the present application provide a computing device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the anomaly detection method.

Still further embodiments of the present application provide a computer readable medium having stored thereon computer program instructions executable by a processor to implement the anomaly detection method.

In the abnormality detection scheme provided by the embodiment of the application, first, baseline fitting is performed by utilizing historical data of a system, then, a test sample is detected according to the baseline, and whether the running state of the system corresponding to the test sample is abnormal or not is determined. During training, firstly, abnormal samples are removed from an initial training sample set to obtain a first training sample set, then unequal sampling is carried out on the first training sample set to obtain a second training sample set, and baseline fitting is carried out on the basis of the second training sample set.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:

FIG. 1 is a process flow diagram of an anomaly detection method according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a process for eliminating abnormal samples according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a clustering result in an embodiment of the present application;

FIG. 4 is a flowchart of a process for adjusting hyperparameters according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an abnormality detection apparatus according to an embodiment of the present application;

FIG. 6 is a schematic diagram of another computing device for implementing anomaly detection according to an embodiment of the present application;

The same or similar reference numbers in the drawings refer to the same or similar parts.

Detailed Description

The application is described in further detail below with reference to the accompanying drawings.

In one exemplary configuration of the application, the terminal and the devices of the service network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

The embodiment of the application provides an anomaly detection method. The method removes abnormal samples from the initial training sample set so that the samples used for training are normal samples, which avoids the influence of abnormal samples on baseline fitting. Through unequal sampling, it also retains as many as possible of the relatively few samples from high-pressure intervals, so that the training samples do not lack high-pressure-interval samples and the fitted baseline has better detection capability.

In a practical scenario, the implementation subject of the method may be a network device or an apparatus formed by integrating a user device with the network device through a network. The user equipment comprises, but is not limited to, various terminal equipment such as a personal computer, a smart phone, a tablet personal computer and the like, and the network equipment comprises, but is not limited to, a network host, a single network server, a plurality of network server sets or a computer cluster based on cloud computing and the like.

Fig. 1 shows a process flow of an anomaly detection method according to an embodiment of the present application, where the method may include the following steps:

Step S101, abnormal samples are removed from the initial training sample set, and a first training sample set is obtained. The samples in the initial training sample set include historical data of response indexes and pressure indexes of the system, for example, the response indexes and the pressure indexes of the system in the last half month of running can be used as training samples. The response index may be a key index capable of reflecting the overall operation condition of the system, for example, may be a read-write delay (IoLatency), and the pressure index may be an index which has a certain correlation with the response index and may cause a change of the response index, for example, for the response index IoLatency, the pressure index related to the response index may be a read-write size (IoSize), a read-write number per second (Iops), or a throughput (throughput).

In the actual operation of the system, an increase in the response index does not necessarily mean that an abnormality has occurred; it may be caused either by a system abnormality or by an increase in the corresponding pressure index. Because the samples in the initial training sample set are historical data of the response indexes and pressure indexes of the system, a sample with a high response index may be a normal sample (the response index rose normally because the pressure index rose) or an abnormal sample (the response index rose because of a system abnormality). Removing the abnormal samples from the initial training sample set ensures that the samples used for training are normal samples, which avoids the influence of abnormal samples on baseline fitting.

In some embodiments of the present application, the abnormal samples may be removed by a processing manner shown in fig. 2, where the processing flow includes the following processing steps:

Step S201, clustering the samples in the initial training sample set based on the response index and the pressure index to determine a plurality of sample classes. For example, the samples may be clustered using a clustering algorithm such as K-means. In an actual scenario, if the number of indexes is large, the clustering must be performed in a high-dimensional space and its complexity increases accordingly. The number of pressure indexes can therefore be reduced before clustering to lower the complexity of the clustering process.

The clustering scheme provided by the embodiment of the application is as follows. First, when the number of pressure indexes is greater than a preset value, the association index most relevant to the response index is determined from the pressure indexes. The number of association indexes is less than or equal to the preset value, and the preset value can be set according to different requirements; for example, it can be set to 1. When the pressure indexes are the read-write size, the number of reads/writes per second, and the throughput, the number of pressure indexes exceeds one, and the single association index most relevant to the response index can then be determined from the pressure indexes. K-means clustering is then performed on 1 response index and 1 association index, so the clustering only needs to be carried out in a two-dimensional space, which greatly reduces the processing complexity.

Depending on the processing mode, the association index can be selected directly from the pressure indexes, or it can be a new index calculated from the pressure indexes. For example, principal component analysis may be performed on the plurality of pressure indexes to combine them into one principal component that serves as the association index, or the Spearman correlation coefficient between each pressure index and the response index may be calculated and the most strongly correlated pressure index selected as the association index.
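The Spearman-based selection described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names (`ranks`, `spearman`, `pick_association_index`), the toy data, and the use of the absolute value of the coefficient as the measure of "most correlated" are assumptions made for the example and are not taken from the application.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman coefficient = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def pick_association_index(pressure, response):
    """pressure: dict of index name -> values; response: list of values."""
    scores = {name: abs(spearman(vals, response)) for name, vals in pressure.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy history: IoLatency rises monotonically with Iops.
io_latency = [1.0, 1.2, 1.9, 2.5, 4.0, 6.5]
pressure = {
    "IoSize": [4, 8, 4, 16, 8, 16],
    "Iops": [100, 150, 300, 420, 800, 1500],
    "Throughput": [0.4, 1.2, 1.1, 6.5, 6.0, 23.0],
}
best, scores = pick_association_index(pressure, io_latency)
```

With this toy data the index whose ranks match those of the response index exactly is selected, and clustering would then run on that index together with the response index.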

After the association index is determined, the samples in the training sample set may be clustered based on the association index and the response index to determine a plurality of sample classes. When there are too many pressure indexes, processing them yields a smaller number of association indexes, which reduces the complexity of the clustering and improves processing efficiency.

Step S202, after clustering is completed, removing as abnormal samples the sample classes whose class centers do not meet a preset condition, to obtain the first training sample set. The preset condition is used to exclude cases in which the response index rises because of a system abnormality. For example, in this embodiment, if the response index is greater than Y while the pressure index or association index is less than X, the high response index cannot be explained as a normal increase caused by an increase in the pressure index or association index. The preset condition can therefore be defined in terms of thresholds on the class center of the response index and on the class center of the pressure index or association index: when the class center of the response index is greater than one threshold and the class center of the pressure index or association index is less than another threshold, the class center of the sample class does not meet the preset condition, and the samples contained in that sample class are removed as abnormal samples.

FIG. 3 shows a schematic diagram of the clustering result obtained by clustering on the response index read-write delay and the pressure index throughput. The clustering result includes 5 sample classes, G0 to G4. According to the relation between the class centers and the thresholds Y and X in the preset condition, the samples of sample class G2 and sample class G4, which fall in the region marked Area, are determined to be abnormal samples.

In some embodiments of the present application, the samples in the initial training sample set may be standardized before the abnormal samples are removed. In this embodiment, for example, standardization eliminates the dimension (unit) of each index and makes the data more concentrated, which facilitates subsequent processing. For the standardized samples, when abnormal samples are removed, the samples of any sample class whose response-index class center exceeds 3 sigma while its pressure-index class center is within 3 sigma (or, more conservatively, 2 or 1.5 sigma) can be regarded as abnormal samples and removed. The samples in the remaining sample classes form the first training sample set.
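The class-center rule on standardized data can be sketched as below. This is only an illustration under stated assumptions: the cluster labels are taken as given (for example from a K-means run that is not shown), a single pressure or association index is used, "within 3 sigma" is read as a class center below the threshold, and the names `standardize` and `remove_abnormal_classes` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize(values):
    """Z-score each value using the population mean and standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def remove_abnormal_classes(response, pressure, labels, y_sigma=3.0, x_sigma=3.0):
    """Drop whole classes whose center has a high response but not a high pressure.

    Returns the indices of the kept samples and the set of removed class labels.
    """
    z_resp = standardize(response)
    z_pres = standardize(pressure)
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    abnormal = set()
    for label, idxs in groups.items():
        center_resp = mean(z_resp[i] for i in idxs)
        center_pres = mean(z_pres[i] for i in idxs)
        # High response that the pressure level cannot explain: treat as abnormal.
        if center_resp > y_sigma and center_pres < x_sigma:
            abnormal.add(label)
    kept = [i for i, label in enumerate(labels) if label not in abnormal]
    return kept, abnormal

# Hypothetical data: classes 0 and 1 follow the pressure, class 2 does not.
response = [1.0 + 0.01 * i for i in range(50)] + [2.0 + 0.01 * i for i in range(48)] + [30.0, 30.0]
pressure = [10.0 + i for i in range(50)] + [100.0 + i for i in range(48)] + [12.0, 12.0]
labels = [0] * 50 + [1] * 48 + [2] * 2
kept, abnormal = remove_abnormal_classes(response, pressure, labels)
```

The two samples of class 2 have a very high response index at a low pressure index, so the whole class is removed and the other 98 samples remain as the first training sample set.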

Step S102, performing unequal sampling on the first training sample set to obtain a second training sample set. Because the purpose of the scheme is to detect abnormalities in the system running state, a certain degree of real-time performance must be guaranteed. The sampling interval of the monitoring data is therefore short, and a large number of samples are generated. Processing all of the samples directly would incur a huge computational cost, so sampling is needed to reduce it.

In the scheme of this embodiment, unequal sampling is used: the sampling probability of a sample taken during operation in a high-pressure interval is greater than that of a sample taken during operation in a low-pressure interval. A high-pressure interval corresponds to system operation with a higher pressure index, and a low-pressure interval corresponds to system operation with a lower pressure index. In an actual scenario, the high-pressure and low-pressure intervals can be determined according to the operating conditions of the system; for example, distinguishing thresholds can be set manually, so that the operating state of the system is divided into several pressure intervals of different levels according to the pressure index.

In an actual scenario, because the system is only rarely in a high-pressure state during operation, the number of samples from high-pressure intervals in the historical data is often significantly smaller than the number from low-pressure intervals. If simple random sampling were used, the training sample set would likely consist mostly of low-pressure-interval samples with only a few high-pressure-interval samples, and this imbalance in the training samples would affect baseline fitting.

In some embodiments of the present application, building on the foregoing clustering of samples, unequal sampling may be performed in the following manner. First, a sampling weight is determined for each sample class according to the number of samples of that class in the first training sample set. The sampling weight of a sample class is negatively correlated with the number of samples in the class and positively correlated with the sampling probability of the samples in the class; that is, the more samples a class contains, the smaller its sampling weight and the smaller the probability that each of its samples is drawn. For example, in this embodiment, the sampling weight may be set to 1/sqrt(group_size), the reciprocal of the square root of group_size, where group_size is the number of samples in the sample class. Taking a sample class G1 with 10000 samples as an example, its sampling weight is 1/100, while a sample class G2 with 100 samples has a sampling weight of 1/10. The sampling probability may be proportional to the corresponding sampling weight; for example, the sampling probability of sample class G2 is 10 times that of sample class G1, that is, during sampling a sample in G2 is 10 times as likely to be drawn as a sample in G1.

Each sample class in the first training sample set is then sampled based on its sampling weight to obtain the second training sample set. Table 1 shows the sampling results after unequal sampling of several sample classes in the manner described above:

Class number	0	1	2	3	4
Number of samples	15122	4707	3115	7056	235
Sampling results	3860	1979	1507	2440	214

TABLE 1

The clustering result reflects, to some extent, the actual behavior of the system in a real scenario, namely that the number of samples in a high-pressure interval tends to be significantly smaller than the number in a low-pressure interval. For example, the samples in the sample class numbered 4 correspond to a higher pressure interval, and that class has the fewest samples in the first training sample set. After the sampling weights are determined in the above manner and unequal sampling is performed with the corresponding sampling probabilities, the sample class numbered 4 has the highest retention ratio (214 of 235 samples), so that samples corresponding to the high-pressure interval are retained as far as possible in the second training sample set and the accuracy of baseline fitting is ensured.
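The weighting rule 1/sqrt(group_size) and a weighted draw without replacement can be sketched as follows. The text does not state which sampler produced Table 1, so the key-based weighted sampling used here (Efraimidis-Spirakis style keys), the total of 10000 drawn samples, the fixed seed, and the function names are all assumptions for illustration; the resulting counts are not expected to match Table 1 exactly.

```python
import math
import random

def sampling_weights(class_sizes):
    """Sampling weight of each class: 1 / sqrt(number of samples in the class)."""
    return {label: 1.0 / math.sqrt(size) for label, size in class_sizes.items()}

def unequal_sample(labels, k, seed=0):
    """Draw k sample indices without replacement, favoring small classes.

    Each sample gets the key log(u) / w with u uniform in (0, 1]; the k largest
    keys are kept, so samples with a larger weight w are more likely to be drawn.
    """
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    weights = sampling_weights(sizes)
    rng = random.Random(seed)
    keyed = []
    for idx, label in enumerate(labels):
        u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
        keyed.append((math.log(u) / weights[label], idx))
    keyed.sort(reverse=True)
    return sorted(idx for _, idx in keyed[:k])

# Worked example from the text: G1 has 10000 samples, G2 has 100.
example = sampling_weights({"G1": 10000, "G2": 100})  # G1 -> 1/100, G2 -> 1/10

# Class sizes from Table 1, with a hypothetical target of 10000 samples.
class_sizes = {0: 15122, 1: 4707, 2: 3115, 3: 7056, 4: 235}
labels = [label for label, size in class_sizes.items() for _ in range(size)]
chosen = unequal_sample(labels, k=10000)
retention = {
    label: sum(1 for i in chosen if labels[i] == label) / size
    for label, size in class_sizes.items()
}
```

In this sketch `retention` gives the fraction of each class that survives sampling, which makes it easy to check that the smallest class is retained at a higher rate than the largest one.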

Step S103, performing baseline fitting based on the response index and the pressure index of the samples in the second training sample set to obtain a baseline for anomaly detection. In this step, a regression model is trained on the historical data to predict the response index on which anomaly detection is to be performed; for example, the response index read-write delay is predicted from the pressure indexes read-write size, number of reads/writes per second, and throughput. The predicted values of the response index under different pressure indexes form the baseline for anomaly detection.
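As a rough illustration of baseline fitting, the sketch below fits an ordinary least-squares line to a single pressure index. The application does not specify the regression model, so the linear form, the single input, the toy numbers, and the name `fit_linear_baseline` are assumptions; any regression model that predicts the response index from the pressure indexes could take its place.

```python
def fit_linear_baseline(pressure, response):
    """Fit response = intercept + slope * pressure by ordinary least squares.

    Returns a function that maps a pressure value to the predicted response (y_hat).
    """
    n = len(pressure)
    mean_x = sum(pressure) / n
    mean_y = sum(response) / n
    sxx = sum((x - mean_x) ** 2 for x in pressure)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pressure, response))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Hypothetical training data: Iops as the pressure index, IoLatency as the response index.
baseline = fit_linear_baseline([100, 200, 300, 400], [1.5, 2.5, 3.5, 4.5])
y_hat = baseline(250)  # predicted IoLatency at Iops = 250
```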

Step S104, detecting a test sample according to the baseline, and determining whether the system running state corresponding to the test sample is abnormal.

The test sample comprises a response index and a pressure index to be detected when the system operates. From the pressure index in the test sample and the baseline, the predicted value of the response index under that pressure index, y_hat, can be obtained; the response index in the test sample is the true value of the response index under that pressure index, y. In this embodiment, anomaly detection is realized by evaluating the difference between the true value and the predicted value.

Therefore, in some embodiments of the present application, when detecting a test sample according to the baseline, and determining whether the system operation state corresponding to the test sample is abnormal, whether a response index corresponding to a pressure index to be detected in the test sample exceeds an alarm threshold of the baseline may be determined according to the baseline, and if the response index exceeds the alarm threshold, the system operation state corresponding to the test sample is determined to be abnormal. Otherwise, if the alarm threshold is not exceeded, the system running state corresponding to the test sample is considered to be normal.

The alarm threshold of the baseline can be set according to the requirements of the actual application scenario, and can take the following forms:

(1) Check whether the true value of the response index deviates from the baseline by more than a times the usual jitter amplitude. In this case, the alarm threshold is y_limit_1 = y_hat_q3 + a × IQR, where y_hat_q3 is the third quartile of the response index in the second training sample set, that is, the value at the 75% position of the sorted response index, and IQR is the difference between the third quartile and the first quartile of the response index in the second training sample set, that is, the difference between the value at the 75% position and the value at the 25% position.

(2) Check whether the true value of the response index exceeds the baseline by more than an absolute amount b. In this case, the alarm threshold is y_limit_2 = y_hat + b.

(3) Check whether the true value of the response index exceeds c times the baseline. In this case, the alarm threshold is y_limit_3 = y_hat × c.

Here a, b, and c are adjustable hyperparameters. For example, their initial values can be set to 1.5, 0, and 1 respectively, and they can subsequently be adjusted repeatedly to minimize the target loss function of the baseline model and thereby improve detection accuracy.

In actual use, the alarm threshold of the baseline may be determined as the maximum of a first threshold y_limit_1, a second threshold y_limit_2, and a third threshold y_limit_3. The first threshold y_limit_1 is the sum of the third quartile y_hat_q3 of the response index in the second training sample set and a fourth calculated value, the fourth calculated value being the product of the interquartile range IQR of the response index in the second training sample set and the first hyperparameter a. The second threshold y_limit_2 is the sum of the baseline value y_hat corresponding to the pressure index to be detected and the second hyperparameter b. The third threshold y_limit_3 is the product of the baseline value y_hat corresponding to the pressure index to be detected and the third hyperparameter c. That is, the alarm threshold actually used for detection is y_limit = max(y_limit_1, y_limit_2, y_limit_3), so the system running state corresponding to a test sample is considered abnormal only when the test sample exceeds all three thresholds.
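The combined threshold can be sketched as follows, using the default hyperparameter values 1.5, 0, and 1 mentioned above. The function names and the toy numbers are hypothetical, and the quartile convention (the standard library's "inclusive" method) is an assumption, since the text does not say how quartiles are interpolated.

```python
from statistics import quantiles

def alarm_threshold(train_response, y_hat, a=1.5, b=0.0, c=1.0):
    """y_limit = max(y_hat_q3 + a * IQR, y_hat + b, y_hat * c)."""
    q1, _, q3 = quantiles(train_response, n=4, method="inclusive")
    iqr = q3 - q1
    y_limit_1 = q3 + a * iqr   # jitter-based threshold
    y_limit_2 = y_hat + b      # absolute offset above the baseline
    y_limit_3 = y_hat * c      # ratio above the baseline
    return max(y_limit_1, y_limit_2, y_limit_3)

def is_abnormal(y, train_response, y_hat, a=1.5, b=0.0, c=1.0):
    """A sample is abnormal only if its true value y exceeds the combined threshold."""
    return y > alarm_threshold(train_response, y_hat, a, b, c)

# Hypothetical numbers: response values 1..9 in the training set, baseline value 10.
train_response = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_limit = alarm_threshold(train_response, y_hat=10.0)   # q3 = 7, IQR = 4 -> 13.0
normal_case = is_abnormal(12.0, train_response, y_hat=10.0)
abnormal_case = is_abnormal(14.0, train_response, y_hat=10.0)
```

Taking the maximum means a value that exceeds only the baseline itself (here 12 against a baseline of 10) is not flagged until it also exceeds the jitter-based threshold.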

In a practical scenario, the collected sample data often has a problem of unequal variance, such as a large fluctuation of the index in a high pressure interval and a small fluctuation in a low pressure interval. Meanwhile, for the cloud computing system, because different services need to be provided, index fluctuation conditions of all the service clusters are different, for example, fluctuation of some service clusters is large in a high-pressure interval, and fluctuation of some service clusters is large in a low-pressure interval. In order to solve the problem, in the anomaly detection method provided by the embodiment of the application, before the test sample is detected, the test sample is subjected to numerical scaling so as to make the variances of the test samples equal.

In some embodiments of the application, the numerical scaling may be performed by applying a Box-Cox transformation to the test samples. The Box-Cox transformation is a generalized power transformation commonly used in statistical modeling, and the transformation formula can be set as follows: when λ is not equal to 0, y = (x^λ - 1)/λ, and when λ is equal to 0, y = log(x). Here λ is a parameter that controls how values are compressed: it determines whether the transformation compresses high-value points or low-value points, and to what degree. A most suitable value of λ can be estimated by the maximum likelihood method according to the characteristics of the index values in each service cluster, so that an optimal λ can be determined for each service cluster and the variances of test samples from different service clusters are equal after the Box-Cox transformation. When anomaly detection is then performed, there is no need to set separate alarm thresholds adapted to different service clusters; a single global alarm threshold can be used.
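The transformation and a simple maximum-likelihood estimate of λ can be sketched as below. The grid search over λ in [-2, 2], the function names, and the synthetic data are assumptions for illustration; the log-likelihood used is the standard Box-Cox profile likelihood under a normality assumption, (λ - 1) Σ log(x) - (n/2) log(variance of the transformed values), and inputs must be strictly positive.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lam, assuming the transformed data are normal."""
    n = len(values)
    transformed = [box_cox(v, lam) for v in values]
    mu = sum(transformed) / n
    var = sum((t - mu) ** 2 for t in transformed) / n
    return (lam - 1.0) * sum(math.log(v) for v in values) - 0.5 * n * math.log(var)

def estimate_lambda(values, grid=None):
    """Pick the lam on a grid that maximizes the log-likelihood."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))

# Hypothetical index values whose logarithms are evenly spread around zero,
# so a log transform (lam close to 0) is expected to fit best.
cluster_values = [math.exp(-2 + 0.1 * i) for i in range(41)]
best_lam = estimate_lambda(cluster_values)
scaled = [box_cox(v, best_lam) for v in cluster_values]
```

In a per-cluster setup, `estimate_lambda` would be run once on each service cluster's historical values and the resulting λ reused when scaling that cluster's test samples.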

Because the anomaly detection performed by the method of the above embodiment is completely unsupervised, the anomalies it detects are anomalies in the statistical sense. In an actual scenario, statistically significant anomalies sometimes differ from the judgment of operation and maintenance personnel and from the tolerance of the system, so false alarms can occur. Additional information can be obtained by introducing human knowledge through labeling, and this information can be used to improve the accuracy of anomaly detection. Therefore, in the anomaly detection method provided by some embodiments of the present application, manual labeling results may be obtained for a part of the test samples, and the hyperparameters may then be adjusted according to the detection results of the manually labeled test samples and their manual labeling results. Because the user only needs to label a part of the samples to be detected rather than all of the samples participating in detection, the labeling workload is very limited and labor cost is saved.

When the hyper-parameters are adjusted according to the manually labeled test sample detection result and the manually labeled result, a process flow shown in fig. 4 may be adopted, including:

Step S401, calculating the cost value after the hyperparameters are adjusted by a search algorithm, according to the detection results of the manually labeled test samples, their manual labeling results, and the detection results of the test samples that are not manually labeled.

Let there be N1 test samples that are not manually labeled and N2 test samples that are manually labeled, for example N1 = 10000 and N2 = 10.

For the N1 samples that are not manually labeled, the results of the anomaly detection are treated as correct. If the hyperparameters a, b, and c are changed, the alarm threshold y_limit changes as well, and the detection results may change accordingly: a test sample previously determined to be normal may be determined to be abnormal after the change, or a test sample previously determined to be abnormal may be determined to be normal. Let NP1 denote the number of unlabeled samples detected as abnormal before the hyperparameter adjustment, and NN1 the number of unlabeled samples detected as normal before the adjustment; in this embodiment, for example, NP1 is 50 and NN1 is 9950, with N1 = NP1 + NN1. After the hyperparameters are changed, the detection results of some samples may change. Let FNN1 denote the number of samples whose detection result changes from abnormal to normal, and FNP1 the number of samples whose detection result changes from normal to abnormal; in this embodiment, for example, FNN1 is 2 and FNP1 is 5. The cost value for the unlabeled samples after the hyperparameter adjustment can then be calculated as C1 = FNN1/NP1 + FNP1/NN1 = 2/50 + 5/9950.

For the N2 manually labeled samples, the detection results obtained with the adjusted hyperparameters a, b, and c may not agree with the manual labeling results. On this premise, and consistent with the definitions of NP1 and NN1 above, let NP2 denote the number of test samples manually labeled as abnormal and NN2 the number of test samples manually labeled as normal; in this embodiment, for example, NP2 is 6 and NN2 is 4, with N2 = NP2 + NN2. After the hyperparameters are changed, the detection results of some samples may disagree with the manual labeling results. Let FNN2 denote the number of samples manually labeled as abnormal but detected as normal, and FNP2 the number of samples manually labeled as normal but detected as abnormal; in this embodiment, for example, FNN2 is 1 and FNP2 is 2. The cost value for the labeled samples after the hyperparameter adjustment can then be calculated as C2 = FNN2/NP2 + FNP2/NN2 = 1/6 + 2/4.

By weighted summation, the cost value C after the hyperparameters are adjusted by the search algorithm can be calculated from C1 and C2, namely C = C1/N1 + C2/N2.
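The arithmetic of this embodiment can be reproduced with a short sketch. The function names cost_term and total_cost are illustrative only and do not appear in the application; the counts are the example values given above.

```python
def cost_term(fnn, np_count, fnp, nn_count):
    """Cost for one group of samples: FNN/NP + FNP/NN."""
    return fnn / np_count + fnp / nn_count


def total_cost(c1, n1, c2, n2):
    """Weighted sum used in this embodiment: C = C1/N1 + C2/N2."""
    return c1 / n1 + c2 / n2


# Example counts taken from the embodiment.
NP1, NN1, FNN1, FNP1 = 50, 9950, 2, 5   # samples without manual labels
NP2, NN2, FNN2, FNP2 = 6, 4, 1, 2       # manually labeled samples

C1 = cost_term(FNN1, NP1, FNP1, NN1)
C2 = cost_term(FNN2, NP2, FNP2, NN2)
C = total_cost(C1, NP1 + NN1, C2, NP2 + NN2)

print(f"C1={C1:.6f} C2={C2:.6f} C={C:.6f}")
```

Because N2 is much smaller than N1, dividing each term by its group size gives a disagreement with a manual label far more influence on C than a changed result on an unlabeled sample.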

Step S402, setting a target loss function, wherein the target loss function is related to the cost value after the hyperparameters are adjusted by the search algorithm.

Step S403, determining, according to the target loss function, the hyperparameters that minimize the target loss function. For example, in the embodiment of the present application, the target loss function related to the cost value C after the hyperparameters are adjusted by the search algorithm may be set as follows:

where w is an adjustment value that can be set to a constant between 10^-10 and 10^-6, a0, b0, and c0 are the hyperparameters before adjustment, and y_mean is the arithmetic mean of the response index over all samples.

Based on the target loss function, a search algorithm may be used to find an optimal set of hyperparameters that minimizes the target loss function. The optimal hyperparameters found in this way can be used to determine the alarm threshold for the next detection, so as to achieve more accurate anomaly detection.

In an actual scenario, any applicable search algorithm may be used. For the target loss function set in this embodiment, for example, the gradient cannot be calculated, so the Nelder-Mead algorithm may be used to search for the optimal hyperparameters. Because the Nelder-Mead algorithm is easily trapped in a local optimum, the search may be started from multiple initial values. For example, for each sample defined as normal, including labeled and unlabeled samples, the equation y = y_q3 + a × iqr may be solved for a. N samples yield N values of a, and the maximum of these values is selected as the initial value of a in the search algorithm. The initial values of the other two hyperparameters b and c can be set by the same method. When the search algorithm is run, each search may start from a point built from the initial value of one or two hyperparameters, from the default value [1.5, 0, 1], or from the previous hyperparameter values. In this way, the optimal hyperparameters can be found more accurately and efficiently.
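The choice of the initial value of a can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name initial_a and the sample values are hypothetical, y_q3 and iqr are assumed to have been computed already from the second training sample set, and the way the starting points are combined is only one reading of the paragraph above.

```python
def initial_a(normal_responses, y_q3, iqr):
    """Solve y = y_q3 + a * iqr for a on every normal sample and keep the maximum."""
    if iqr <= 0:
        raise ValueError("iqr must be positive")
    return max((y - y_q3) / iqr for y in normal_responses)


# Hypothetical response-index values of samples regarded as normal.
responses = [10.0, 12.0, 15.0, 18.0]
a0 = initial_a(responses, y_q3=14.0, iqr=4.0)

# Candidate starting points for a multi-start search (assumed combination):
# the default [1.5, 0, 1] and a point that uses the data-derived a0.
starts = [(1.5, 0.0, 1.0), (a0, 0.0, 1.0)]
print(a0, starts)
```

Starting from the largest a observed on normal samples places the first threshold at a level that no normal sample exceeds, which is a natural starting point before the search trades false alarms against missed anomalies.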

Based on the same inventive concept, an embodiment of the present application also provides an abnormality detection apparatus. The method corresponding to the apparatus is the abnormality detection method in the foregoing embodiment, and the principle by which the apparatus solves the problem is similar to that of the method.

When the abnormality detection apparatus provided by the embodiment of the present application performs abnormality detection, abnormal samples in the initial training sample set can be removed so that the samples used for training are normal samples, which avoids the influence of abnormal samples on the baseline fitting. In addition, unequal sampling retains as many as possible of the small number of samples in the high-pressure interval, so that the training samples do not lack high-pressure-interval samples and the fitted baseline has better detection capability.

In an actual scenario, the anomaly detection apparatus may be a network device, or a device formed by integrating a user device and a network device through a network. The user device includes, but is not limited to, terminal devices such as a personal computer, a smartphone, and a tablet computer. The network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud is composed of a large number of hosts or network servers based on cloud computing, which is a kind of distributed computing in which a group of loosely coupled computers forms one virtual computer.

Fig. 5 shows the structure of an abnormality detection apparatus according to an embodiment of the present application, which includes a cleaning module 510, a sampling module 520, a training module 530, and a detection module 540. The cleaning module 510 is configured to remove abnormal samples from the initial training sample set to obtain a first training sample set. The samples in the initial training sample set include historical data of the response index and the pressure indexes of the system; for example, the response index and pressure indexes recorded during the last half month of operation can be used as training samples. The response index may be a key index that reflects the overall operating condition of the system, for example the read-write delay. A pressure index is an index that has a certain correlation with the response index and can cause the response index to change. For the response index read-write delay, for example, the related pressure indexes may be the read-write size, the number of reads and writes per second, the throughput, and the like.

In some embodiments of the present application, the cleaning module 510 may remove the abnormal samples in the processing manner shown in fig. 2, where the processing flow includes the following processing steps:

In step S201, the clustering unit of the cleaning module clusters the samples in the initial training sample set based on the response index and the pressure indexes, and determines a plurality of sample classes. For example, the samples may be clustered using a clustering algorithm such as K-means. In an actual scenario, if the number of indexes is large, the clustering has to be performed in a high-dimensional space and its complexity increases. The number of pressure indexes can therefore be reduced before clustering to lower the complexity of the clustering process.

Step S202, after clustering is completed, the cleaning unit of the cleaning module removes, as abnormal samples, the sample classes whose class centers do not meet a preset condition, and obtains the first training sample set. The preset condition is used to eliminate cases in which the response index rises because of a system abnormality. For example, in the embodiment of the present application, if the response index is greater than Y while the pressure index or the associated index is less than X, the high value of the response index cannot be explained as a normal increase caused by an increase of the pressure index or the associated index. The preset condition can therefore be defined in terms of thresholds on the class center of the response index and on the class center of the pressure index or the associated index: when the class center of the response index is greater than one threshold and the class center of the pressure index or the associated index is less than another threshold, the class center of the sample class does not meet the preset condition, and the samples contained in that sample class are removed as abnormal samples.

In some embodiments of the present application, the apparatus may further include a preprocessing module, which may standardize the samples in the initial training sample set before the abnormal samples are removed. In this embodiment, for example, standardization eliminates the units of the individual indexes and makes the data more concentrated, which facilitates subsequent processing. For the standardized samples, when abnormal samples are removed, the samples of any sample class whose response-index class center exceeds 3 sigma while its pressure-index class center is within 3 sigma (or, more conservatively, within 2 or 1.5 sigma) can be regarded as abnormal samples and removed. The samples in the remaining sample classes form the first training sample set.
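The rejection rule on standardized class centers can be sketched as below. The function name split_classes, the class identifiers, and the center values are hypothetical. The sketch assumes that class centers have already been computed by a clustering step and are expressed in units of sigma, and it reads "within 3 sigma" as "below the pressure limit", in line with the threshold description of step S202.

```python
def split_classes(centers, response_limit=3.0, pressure_limit=3.0):
    """Split classes into kept and rejected ones.

    centers maps a class id to (response_center, pressure_center),
    both in standardized (sigma) units.
    """
    kept, rejected = [], []
    for class_id, (response_center, pressure_center) in centers.items():
        # High response that is not explained by high pressure: abnormal class.
        if response_center > response_limit and pressure_center < pressure_limit:
            rejected.append(class_id)
        else:
            kept.append(class_id)
    return kept, rejected


centers = {
    "A": (0.2, -0.1),  # ordinary load, ordinary response
    "B": (3.5, 0.4),   # high response under low pressure
    "C": (3.8, 3.6),   # high response explained by high pressure
}
kept, rejected = split_classes(centers)
print(kept, rejected)
```

Lowering pressure_limit to 2 or 1.5 gives the more conservative variants mentioned above: fewer classes are rejected, because the pressure must be even lower before a high response counts as unexplained.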

The sampling module 520 is configured to perform unequal sampling on the first training sample set to obtain a second training sample set. Because this scheme detects abnormalities in the running state of the system, it must operate close to real time. The sampling interval of the monitoring data is therefore short and a large number of samples is produced. Processing all samples directly would incur a very high computational cost, so the samples are subsampled to reduce that cost.

In the scheme of the embodiment of the present application, the sampling probability of samples from the high-pressure interval is greater than that of samples from the low-pressure interval, where the high-pressure interval corresponds to operation with high pressure-index values and the low-pressure interval corresponds to operation with low pressure-index values. In an actual scenario, the system is in a high-pressure state for only a small part of its running time, so the number of high-pressure-interval samples in the historical data is often significantly smaller than the number of low-pressure-interval samples. If simple random sampling were used, the training sample set would likely consist mostly of low-pressure-interval samples, with only a few high-pressure-interval samples, and this imbalance would degrade the baseline fitting.

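Specifically, the sampling module may determine a sampling weight for each sample class according to the number of samples of that class in the first training sample set, where the sampling weight of a sample class is inversely related to its number of samples and positively related to its sampling probability. Each sample class in the first training sample set is then sampled based on its sampling weight to obtain the second training sample set. Table 1 shows the sampling results after unequal sampling of several sample classes in the manner described above.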
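One way to realize unequal sampling is sketched below. The concrete weight formula is an assumption made for illustration: the text only requires that the weight be inversely related to the class size, and here the weight is used directly as the per-sample inclusion probability min(1, target / n). The names sampling_weights, unequal_sample, and per_class_target are hypothetical.

```python
import random


def sampling_weights(class_sizes, per_class_target):
    """Inclusion probability per class, inversely related to class size (assumed formula)."""
    return {cid: min(1.0, per_class_target / n) for cid, n in class_sizes.items()}


def unequal_sample(classes, per_class_target, seed=0):
    """Keep each sample with the inclusion probability of its class."""
    rng = random.Random(seed)
    sizes = {cid: len(samples) for cid, samples in classes.items()}
    weights = sampling_weights(sizes, per_class_target)
    return {
        cid: [s for s in samples if rng.random() < weights[cid]]
        for cid, samples in classes.items()
    }


# Many low-pressure samples, few high-pressure samples (synthetic data).
classes = {"low": list(range(10000)), "high": list(range(100))}
weights = sampling_weights({cid: len(s) for cid, s in classes.items()}, 100)
second_set = unequal_sample(classes, per_class_target=100)
print(weights, {cid: len(s) for cid, s in second_set.items()})
```

With this formula the small high-pressure class is kept in full, while the large low-pressure class is thinned to roughly the same size, so both intervals carry comparable weight in the baseline fitting.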

The training module 530 is configured to perform baseline fitting based on the response index and the pressure indexes of the samples in the second training sample set to obtain a baseline for anomaly detection. That is, a regression model is trained on historical data to predict the response index that is to be checked for abnormality. For example, the response index read-write delay is predicted from the pressure indexes read-write size, number of reads and writes per second, and throughput. The predicted values of the response index under different pressure-index values form the baseline for anomaly detection.
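A baseline fit can be illustrated with ordinary least squares on a single pressure index. This is a deliberately simplified stand-in: the passage above does not fix the regression model, and a practical model would take several pressure indexes as input. The names fit_line and baseline and the data values are hypothetical.

```python
def fit_line(pressure, response):
    """Least-squares fit of response = slope * pressure + intercept."""
    n = len(pressure)
    mean_x = sum(pressure) / n
    mean_y = sum(response) / n
    sxx = sum((x - mean_x) ** 2 for x in pressure)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pressure, response))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def baseline(pressure_value, slope, intercept):
    """Predicted response index (baseline value) for a given pressure index."""
    return slope * pressure_value + intercept


# Synthetic training data: delay grows linearly with request size.
pressure = [1.0, 2.0, 3.0, 4.0]
response = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(pressure, response)
print(slope, intercept, baseline(5.0, slope, intercept))
```

The value returned by baseline for the pressure index of a test sample plays the role of the "corresponding value on the baseline" against which the observed response index is later compared.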

The detection module 540 is configured to detect a test sample according to the baseline, and determine whether a system running state corresponding to the test sample is abnormal.

In some embodiments of the present application, when the detection module detects the test sample according to the baseline and determines whether the corresponding system running state is abnormal, it may determine, according to the baseline, whether the response index corresponding to the pressure index to be detected in the test sample exceeds the alarm threshold of the baseline. If the response index exceeds the alarm threshold, the system running state corresponding to the test sample is determined to be abnormal. Otherwise, the system running state corresponding to the test sample is considered normal.
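The threshold check can be sketched using the definition given in claims 7 and 17, where the alarm threshold is the maximum of three thresholds: the third quartile plus a times the interquartile distance, the baseline value plus b, and the baseline value times c. The function names and numeric inputs below are hypothetical; the defaults a = 1.5, b = 0, c = 1 are the default values mentioned in the description.

```python
def alarm_threshold(y_base, y_q3, iqr, a=1.5, b=0.0, c=1.0):
    """y_limit = max(y_q3 + a * iqr, y_base + b, y_base * c)."""
    return max(y_q3 + a * iqr, y_base + b, y_base * c)


def is_abnormal(y_observed, y_base, y_q3, iqr, a=1.5, b=0.0, c=1.0):
    """A test sample is abnormal when its response index exceeds the threshold."""
    return y_observed > alarm_threshold(y_base, y_q3, iqr, a, b, c)


# Hypothetical statistics of the response index in the second training set.
y_q3, iqr = 10.0, 4.0

print(alarm_threshold(12.0, y_q3, iqr))    # low baseline: the quartile term dominates
print(alarm_threshold(30.0, y_q3, iqr))    # high baseline: the baseline terms dominate
print(is_abnormal(20.0, 12.0, y_q3, iqr))  # observed delay above the threshold
```

Taking the maximum means the global quartile-based limit acts as a floor under low pressure, while under high pressure the limit follows the baseline, so an increase that the pressure index explains does not trigger an alarm.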

In a practical scenario, the collected sample data often exhibit unequal variance; for example, an index may fluctuate strongly in the high-pressure interval and only slightly in the low-pressure interval. Moreover, because a cloud computing system provides different services, the fluctuation patterns differ between service clusters: some service clusters fluctuate strongly in the high-pressure interval, while others fluctuate strongly in the low-pressure interval. To solve this problem, in the anomaly detection apparatus provided by the embodiment of the present application, the preprocessing module may also be used to scale the values of the test samples before they are detected, so as to equalize the variances of the test samples.

In some embodiments of the application, the test samples may be scaled by means of a Box-Cox transformation. The Box-Cox transformation is a generalized power transformation commonly used in statistical modeling. The transformation formula can be set as follows: when λ is not equal to 0, y = (x^λ - 1)/λ, and when λ is equal to 0, y = log(x). Here λ is a parameter that controls how the values are compressed: it determines whether the transformation compresses high values or low values, and to what degree. A suitable value of λ can be estimated by the maximum likelihood method from the characteristics of the index values in each service cluster, so that an optimal λ is determined for each service cluster and the values of test samples from different service clusters have equal variance after the Box-Cox transformation. As a result, there is no need to set a separate alarm threshold for each service cluster, and anomaly detection can use a single global alarm threshold.
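The transformation formula itself is short enough to write out. The sketch below covers only the transformation for a given λ; the maximum likelihood estimation of λ is not shown. The function name box_cox is illustrative, and the check for positive input reflects the standard requirement of the Box-Cox transformation rather than anything stated in the application.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam for lam != 0, log(x) for lam == 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


# With lam < 1, large values are compressed more strongly than small ones.
delays = [1.0, 4.0, 100.0]
print([box_cox(d, 0.5) for d in delays])
print([box_cox(d, 0.0) for d in delays])
```

A value of λ below 1 pulls in the large values, which suits clusters that fluctuate strongly in the high-pressure interval, whereas a value above 1 stretches them; this is why a separate λ per service cluster can bring the clusters to a comparable variance.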

Because the anomaly detection performed by the apparatus of this embodiment is completely unsupervised, the detected anomalies are anomalies in the statistical sense. In an actual scenario, statistical anomalies sometimes differ from what operation and maintenance personnel regard as anomalous and from what the system can tolerate, so false alarms can occur. Introducing human knowledge through labeling provides additional information, which can be used to improve the accuracy of anomaly detection. Therefore, in some embodiments of the present application, the anomaly detection apparatus may further include a closed-loop optimization module, which is configured to obtain manual labeling results for a portion of the test samples and then adjust the hyperparameters according to the detection results of the manually labeled test samples and the manual labeling results. Because the user only needs to label a portion of the detected samples rather than all of them, the labeling workload is very limited and labor cost is saved.

When the hyperparameters are adjusted according to the detection results of the manually labeled test samples and the manual labeling results, the closed-loop optimization module can adopt the processing flow shown in fig. 4, which comprises the following steps:

In summary, in the anomaly detection scheme provided by the embodiment of the application, the baseline fitting is performed by using the historical data of the system, and then the test sample is detected according to the baseline, so as to determine whether the system running state corresponding to the test sample is abnormal. During training, firstly, abnormal samples are removed from an initial training sample set to obtain a first training sample set, then unequal sampling is carried out on the first training sample set to obtain a second training sample set, and baseline fitting is carried out on the basis of the second training sample set.

Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Program instructions for carrying out the methods of the present application may be stored on fixed or removable recording media and/or transmitted over a data stream on a broadcast or other signal bearing medium and/or stored in a working memory of a computer device that operates in accordance with the program instructions. Herein, some embodiments according to the present application comprise a computing device as shown in fig. 6, comprising one or more memories 610 storing computer readable instructions and a processor 620 for executing the computer readable instructions, wherein the computer readable instructions, when executed by the processor, cause the device to perform methods and/or aspects in accordance with the various embodiments of the present application as described above.

Furthermore, some embodiments of the present application provide a computer readable medium having stored thereon computer program instructions executable by a processor to implement the methods and/or aspects of the various embodiments of the present application described above.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In some embodiments, the software program of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims

1. An anomaly detection method, the method comprising:

Removing abnormal samples from an initial training sample set to obtain a first training sample set, wherein the samples in the initial training sample set comprise historical data of response indexes and pressure indexes of a system, the response indexes are key indexes capable of reflecting the overall operation condition of the system, and the pressure indexes are indexes which have certain relevance with the response indexes and can cause the change of the response indexes;

2. The method of claim 1, wherein culling outlier samples from the initial training sample set to obtain a first training sample set comprises:

Clustering samples in the initial training sample set based on response indexes and pressure indexes to determine a plurality of sample classes;

And removing, as abnormal samples, the sample classes whose class centers do not meet a preset condition, to obtain a first training sample set.

3. The method of claim 2, wherein clustering samples in the initial training sample set based on the response indicator and the pressure indicator, determining a plurality of sample classes, comprises:

when the number of the pressure indexes is larger than a preset value, determining the associated index most relevant to the response index according to the pressure indexes, wherein the number of the associated indexes is smaller than or equal to the preset value;

And clustering samples in the training sample set based on the association index and the pressure index to determine a plurality of sample classes.

4. The method of claim 2, wherein performing unequal sampling on the first training sample set to obtain a second training sample set comprises:

determining sampling weights of sample classes according to the sample numbers of the sample classes in the first training sample set, wherein the sampling weights of the sample classes are inversely related to the sample numbers of the sample classes and positively related to the sampling probability;

And sampling each sample class in the first training sample set based on the sampling weight to obtain a second training sample set.

5. The method of any one of claims 1 to 4, wherein the method further comprises:

and carrying out standardization processing on the samples in the initial training sample set.

6. The method of claim 1, wherein detecting a test sample from the baseline, determining whether a system operating state corresponding to the test sample is abnormal, comprises:

judging whether a response index corresponding to a pressure index to be detected in a test sample exceeds an alarm threshold of the baseline according to the baseline;

if the alarm threshold value is exceeded, determining that the system running state corresponding to the test sample is abnormal.

7. The method of claim 6, wherein the alarm threshold for the baseline is determined from a maximum of a first threshold, a second threshold, and a third threshold, the first threshold being a sum of a third quartile value of the response indicator in the second training sample set and a quartile calculated value being a product of a quartile distance of the response indicator in the second training sample set and a first superparameter, the second threshold being a sum of a corresponding value of the pressure indicator to be detected on the baseline and a second superparameter, and the third threshold being a product of a corresponding value of the pressure indicator to be detected on the baseline and a third superparameter.

8. The method of claim 7, wherein the method further comprises:

Obtaining a manual labeling result of a part of test samples;

and adjusting the super parameters according to the detection result of the manually marked test sample and the manually marked result.

9. The method of claim 8, wherein adjusting the hyper-parameters based on the manually labeled test sample detection results and the manually labeled results comprises:

calculating the cost value after the super parameters are adjusted by adopting a search algorithm according to the manually marked test sample detection result and the manual marking result, and the test sample detection result which is not manually marked;

setting a target loss function, wherein the target loss function is related to a cost value after super parameters are adjusted by adopting a search algorithm;

And determining a super parameter which enables the target loss function to be minimum according to the target loss function.

10. The method according to any one of claims 6 to 9, wherein the method further comprises:

The test samples are scaled numerically to equalize the variances of the test samples.

11. An abnormality detection apparatus comprising:

a cleaning module, configured to remove abnormal samples from an initial training sample set to obtain a first training sample set, wherein the samples in the initial training sample set comprise historical data of response indexes and pressure indexes of the system, the response indexes are key indexes capable of reflecting the overall operation condition of the system, and the pressure indexes are indexes which have certain relevance with the response indexes and can cause the change of the response indexes;

12. The apparatus of claim 11, wherein the cleaning module comprises:

the clustering unit is used for clustering samples in the initial training sample set based on the response index and the pressure index to determine a plurality of sample classes;

The cleaning unit is used for removing the sample class which does not accord with the preset condition in the class center as an abnormal sample to obtain a first training sample set.

13. The device of claim 12, wherein the clustering unit is configured to determine, according to the pressure index, an associated index most relevant to the response index when the number of the pressure indexes is greater than a preset value, wherein the number of the associated indexes is less than or equal to the preset value; and clustering samples in the training sample set based on the association index and the pressure index to determine a plurality of sample classes.

14. The apparatus of claim 12, wherein the sampling module is configured to determine a sampling weight for each sample class according to a number of samples in each sample class in the first training sample set, where the sampling weight for the sample class is inversely related to the number of samples in the sample class and is positively related to a sampling probability; and sampling each sample class in the first training sample set based on the sampling weight to obtain a second training sample set.

15. The apparatus according to any one of claims 11 to 14, wherein the apparatus further comprises:

and the preprocessing module is used for carrying out standardization processing on the samples in the initial training sample set.

16. The device of claim 11, wherein the detection module is configured to determine, according to the baseline, whether a response indicator corresponding to a pressure indicator to be detected in the test sample exceeds an alarm threshold of the baseline; if the alarm threshold value is exceeded, determining that the system running state corresponding to the test sample is abnormal.

17. The apparatus of claim 16, wherein the alarm threshold for the baseline is determined from a maximum of a first threshold, a second threshold, and a third threshold, the first threshold being a sum of a third quartile value of the response indicator in the second training sample set and a quartile calculated value being a product of a quartile distance of the response indicator in the second training sample set and a first superparameter, the second threshold being a sum of a corresponding value of the pressure indicator to be detected on the baseline and a second superparameter, and the third threshold being a product of a corresponding value of the pressure indicator to be detected on the baseline and a third superparameter.

18. The apparatus of claim 17, wherein the apparatus further comprises:

The closed loop optimization module is used for acquiring a manual labeling result of a sample with an abnormal detection result; and adjusting the super parameters according to the manually marked test sample detection result and the manually marked result.

19. The device of claim 18, wherein the closed-loop optimization module is configured to calculate a cost value after the super parameter is adjusted by using a search algorithm according to the manually labeled test sample detection result and the manually labeled result, and the test sample detection result that is not manually labeled; setting a target loss function, wherein the target loss function is related to a cost value after super parameters are adjusted by adopting a search algorithm; and determining a super parameter which enables the target loss function to be minimum according to the target loss function.

20. The apparatus according to any one of claims 16 to 19, wherein the apparatus further comprises:

and the preprocessing module is used for carrying out numerical scaling on the test samples so as to make the variances of the test samples equal.

21. A computing device, wherein the device comprises a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the processor to perform the method of any one of claims 1 to 10.

22. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any of claims 1 to 10.