CN112882911A

CN112882911A - Abnormal performance behavior detection method, system, device and storage medium

Info

Publication number: CN112882911A
Application number: CN202110137565.4A
Authority: CN
Inventors: 任睿
Original assignee: Cetc Cyberspace Security Research Institute Co Ltd
Current assignee: Cetc Cyberspace Security Research Institute Co Ltd
Priority date: 2021-02-01
Filing date: 2021-02-01
Publication date: 2021-06-01
Anticipated expiration: 2041-02-01
Also published as: CN112882911B

Abstract

The application discloses an abnormal performance behavior detection method, system, device and computer-readable storage medium. Multiple anomaly detection algorithms are applied in advance to comprehensively analyze various abnormal events and their corresponding abnormal feature data. The abnormal events and abnormal feature data are then used to construct abnormal association relationships among abnormal events, between abnormal events and abnormal feature data, and among abnormal feature data, so that the events and the feature data are closely linked and can be analyzed as a whole. Finally, a knowledge graph model is constructed from the historical abnormal association relationships, historical abnormal events and historical abnormal feature data; this model can effectively and comprehensively analyze the current abnormal events of the data center and their corresponding abnormal feature data, making the detection result more comprehensive and accurate.

Description

Abnormal performance behavior detection method, system, device and storage medium

Technical Field

The present invention relates to the field of distributed storage, and in particular, to a method, a system, an apparatus, and a computer-readable storage medium for detecting abnormal performance behavior.

Background

Operation and maintenance management of data center infrastructure ensures that the data center environment meets the requirements of its facilities, customer SLAs, and the reliability needed for the normal operation of computer equipment. As data centers grow in scale, server node types become more complex, operation and maintenance problem types multiply, and problems occur unpredictably, system operation and maintenance faces increasing difficulty. The key to modern operation and maintenance is how to trigger intelligent decisions from monitoring data in large-scale scenarios, so as to achieve automated, intelligent operation.

Traditional automated operation and maintenance is triggered mainly by rule-based templates. However, existing server nodes are of complex types, operation and maintenance problems are numerous, and faults are difficult to locate quickly, so manually written rules often cannot solve the problem. A knowledge graph is high-quality structured data: it can be used to build a comprehensive operation and maintenance knowledge base and, combined with machine learning, to realize automated operation and maintenance. For example, state information of the server hardware, operating system, job scheduling system and computing applications, such as CPU utilization, job load and storage utilization, is analyzed and processed to form service operation data. Meanwhile, collected user information, device hardware information, virtual machine information and other information are used as node attributes to create entity nodes. Relationships among the entity nodes are then established and serve as the relationship data of the knowledge graph, thereby constructing the operation and maintenance knowledge graph and realizing intelligent operation and maintenance management.

At present, existing operation and maintenance knowledge graphs mainly comprise a Configuration Management Database (CMDB) and an operation and maintenance knowledge base, with an enterprise-specific knowledge base formed by automatically enriching operation and maintenance knowledge. However, a configuration change management library built around the CMDB depends on the change process to update configuration; with such a non-real-time update mechanism it cannot adapt to container and cloud environments, in which the relationships between environments and between resources are completely dynamic. Moreover, the topology of conventional configuration changes carries no time sequence, so the topology corresponding to a given failure time cannot be retrieved. Meanwhile, the traditional operation and maintenance knowledge base is static and narrow, and cannot meet the requirement of fast and accurate operation and maintenance.

In addition, most existing operation and maintenance knowledge graphs are constructed semi-automatically or manually, which causes two problems: (1) the operation and maintenance knowledge is incomplete, and potential relationships among the entities in the knowledge graph are not mined; (2) extensibility is poor, and new entities cannot be added to the knowledge graph automatically.

Therefore, a detection method capable of more comprehensively and effectively reflecting the abnormal event and the related abnormal data is needed.

Disclosure of Invention

In view of the above, the present invention provides a method, a system, a device and a computer readable storage medium for detecting abnormal performance behavior, which can perform abnormality detection and fault diagnosis more comprehensively and effectively. The specific scheme is as follows:

an abnormal performance behavior detection method, comprising:

acquiring performance data of a data center;

analyzing the performance data by using a pre-constructed knowledge graph model to obtain abnormal parameters of the abnormal event;

wherein the process of pre-constructing the knowledge graph model comprises:

extracting the characteristics of the historical performance data of the data center to obtain historical characteristic data corresponding to different historical events;

detecting historical characteristic data corresponding to different historical events by using an anomaly detection algorithm set to obtain a plurality of historical anomaly events and corresponding historical anomaly characteristic data;

establishing a historical abnormal association relation between each historical abnormal event and corresponding historical abnormal characteristic data by using each historical abnormal event and corresponding historical abnormal characteristic data; the historical abnormal feature data comprises indexes and performance factors;

and constructing the knowledge graph model by utilizing the historical abnormal association relation, the historical abnormal event and the historical abnormal characteristic data.

Optionally, the process of acquiring performance data of the data center includes:

acquiring hardware layer performance data, architecture layer performance data, system layer performance data and application layer performance data of the data center.

Optionally, the process of performing feature extraction on the historical performance data of the data center to obtain historical feature data corresponding to different historical events includes:

and performing feature extraction on the historical performance data of the data center to obtain historical feature data corresponding to different single scene historical events.

Optionally, the process of detecting historical feature data corresponding to different historical events by using the anomaly detection algorithm set to obtain a plurality of historical anomaly events and corresponding historical anomaly feature data includes:

the anomaly detection algorithm set comprises a load imbalance detection algorithm, a data skew detection algorithm, a data placement imbalance detection algorithm, an abnormal node detection algorithm, an abnormal index detection algorithm, an inter-process interference detection algorithm and a system fault category detection algorithm.

The invention also discloses an abnormal performance behavior detection system, which comprises:

the performance data acquisition module is used for acquiring performance data of the data center;

the knowledge graph analysis module is used for analyzing the performance data by utilizing a pre-constructed knowledge graph model to obtain abnormal parameters of the abnormal event;

wherein the knowledge-graph analysis module comprises:

the characteristic data extraction unit is used for extracting characteristics of historical performance data of the data center to obtain historical characteristic data corresponding to different historical events;

the abnormal event detection unit is used for detecting historical characteristic data corresponding to different historical events by using an abnormal detection algorithm set to obtain a plurality of historical abnormal events and corresponding historical abnormal characteristic data;

the association factor construction unit is used for constructing a historical abnormal association relation between each historical abnormal event and corresponding historical abnormal characteristic data by utilizing each historical abnormal event and corresponding historical abnormal characteristic data; the historical abnormal feature data comprises indexes and performance factors;

and the knowledge graph building unit is used for building the knowledge graph model by utilizing the historical abnormal association relation, the historical abnormal event and the historical abnormal characteristic data.

Optionally, the performance data acquiring module is specifically configured to acquire hardware layer performance data, architecture layer performance data, system layer performance data, and application layer performance data of the data center.

Optionally, the feature data extraction unit is specifically configured to perform feature extraction on the historical performance data of the data center to obtain historical feature data corresponding to different single-scene historical events.

Optionally, the abnormal event detecting unit is specifically configured to detect historical feature data corresponding to different historical events by using an abnormal detection algorithm set, so as to obtain a plurality of historical abnormal events and corresponding historical abnormal feature data;

The invention also discloses an abnormal performance behavior detection device, which comprises:

a memory for storing a computer program;

a processor for executing the computer program to implement the abnormal performance behavior detection method as described above.

The invention also discloses a computer readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the abnormal performance behavior detection method as described above.

The abnormal performance behavior detection method comprises the following steps: acquiring performance data of a data center; and analyzing the performance data by using a pre-constructed knowledge graph model to obtain abnormal parameters of the abnormal event. The process of pre-constructing the knowledge graph model comprises: extracting features from the historical performance data of the data center to obtain historical feature data corresponding to different historical events; detecting the historical feature data corresponding to different historical events by using an anomaly detection algorithm set to obtain a plurality of historical abnormal events and corresponding historical abnormal feature data; establishing a historical abnormal association relationship between each historical abnormal event and the corresponding historical abnormal feature data, where the historical abnormal feature data comprises indexes and performance factors; and constructing the knowledge graph model from the historical abnormal association relationships, the historical abnormal events and the historical abnormal feature data.

The invention applies multiple anomaly detection algorithms in advance to comprehensively analyze various abnormal events and their corresponding abnormal feature data. It then uses the abnormal events and abnormal feature data to construct abnormal association relationships among abnormal events, between abnormal events and abnormal feature data, and among abnormal feature data, so that the events and the feature data are closely associated and can be analyzed as a whole. Finally, the knowledge graph model constructed from the historical abnormal association relationships, historical abnormal events and historical abnormal feature data can effectively and comprehensively analyze the current abnormal events of the data center and their corresponding abnormal feature data, so the detection result is more comprehensive and accurate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic flow chart of a method for detecting abnormal performance behavior according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method for pre-constructing a knowledge graph model according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of an abnormal index detection algorithm according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an abnormal performance behavior detection system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a method for detecting abnormal performance behavior which, as shown in FIG. 1, comprises the following steps:

S11: Acquire performance data of the data center.

Specifically, the performance data of the data center is multi-dimensional and can be roughly divided into hardware layer, architecture layer, system layer and application layer performance data. A fine-grained, multi-layer performance data acquisition framework collects, from bottom to top and simultaneously, the performance data of the hardware layer, the architecture layer, the system layer, the big data framework layer and the application load layer of the data center system. This data serves as the raw input for performance analysis of the big data system and effectively describes the performance of a big data application over its whole life cycle, for use in subsequent correlation analysis and performance diagnosis.

Specifically, the hardware layer performance data mainly includes hardware parameters such as hardware temperature and power consumption. On the hardware layer, the power consumption and temperature information can be acquired mainly through a group of special registers, namely the hardware counters, provided by existing CPUs.

Specifically, the architecture layer performance data mainly comprises IPC (instructions per cycle), the instruction mix, TLB misses, cache misses, memory access bandwidth and the like. On the architecture layer, the hardware counters provided by existing CPUs can be used; they record the number of times each micro-architecture event occurs.

Specifically, the system layer performance data mainly includes CPU utilization, disk I/O, memory management, network and process information, and the like, as well as system logs. On Linux, system layer data is collected mainly from the proc file system that ships with Linux. The proc file system has no backing storage: when one of its files is read, the content is generated dynamically, and when a file is written, the write function associated with that file is called. Kernel components use this file system to expose an interface to user space for querying information and modifying software behavior, and the information covers almost every part of the kernel as well as the key performance parameters of the system. In addition, the system logs include the system RAS log, the system security audit log and the like; they mainly record hardware, software and system problems and can also be used to monitor events occurring in the system. System logs may be collected with Rsyslog, syslog or other log collection tools.
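As an illustration of collecting one system layer index from the proc file system, the following minimal sketch derives CPU utilization from two snapshots of `/proc/stat`. The function names (`parse_cpu_times`, `cpu_utilization`, `read_proc_stat`) are hypothetical and not part of the disclosure, and counting both `idle` and `iowait` as idle time is an assumption of this sketch.

```python
from pathlib import Path


def parse_cpu_times(stat_text: str) -> tuple:
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            # guest/guest_nice are already included in user/nice, so skip them.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values[:8])
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before: str, after: str) -> float:
    """CPU utilization in [0, 1] between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_times(before)
    busy1, total1 = parse_cpu_times(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def read_proc_stat(path: str = "/proc/stat") -> str:
    """Read one snapshot; only works on Linux hosts that mount procfs."""
    return Path(path).read_text()
```

In use, two calls to `read_proc_stat()` separated by a sampling interval would be passed to `cpu_utilization`; the parsing is kept separate from the file read so that it can be exercised on recorded snapshots.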

Specifically, on the application framework layer of the data center, the application layer performance data mainly comprises configuration information and logs related to the application framework; on the application load layer, it is mainly profiling information at the user code level. Different data collection tools can be used for performance data at different levels. For example, a log collection tool can collect the logs of the current applications, tasks and stages that the data center application framework outputs at runtime, and the performance-related data can then be parsed from these logs.

S12: Analyze the performance data by using a pre-constructed knowledge graph model to obtain abnormal parameters of the abnormal event.

Specifically, the knowledge graph model of the embodiment of the present invention is a knowledge graph model in which abnormal association relationships between various abnormal events and various abnormal feature data are pre-constructed, and the abnormal association relationships can comprehensively reflect the interaction relationships between various abnormal events and various abnormal feature data, so as to associate an abnormal event and abnormal feature data thereof in each single scene with abnormal events and abnormal feature data thereof in other scenes, and better reflect the abnormal situation of the data center in a complex scene in actual application.

Referring to FIG. 2, the process of pre-constructing the knowledge graph model may include S121 to S124:

S121: Perform feature extraction on the historical performance data of the data center to obtain historical feature data corresponding to different historical events.

Specifically, the historical performance data is data collected from the data center in the past; the acquisition process is the same as in S11 and is not described again here.

Specifically, after the historical performance data is obtained, feature extraction is performed on it. To make the various kinds of data easier to classify, analyze and extract, the behaviors and performance of the execution subjects of an application at the different levels of the data center system are described uniformly by defining events (Event) and indexes (Metric).

An event abstracts the behavior of an execution subject of the application at some level of the data center system. Events are divided into three types: (1) application layer events: the execution tasks, execution stages, user operations, etc. of the application; (2) system layer events: processor processes/threads, communication processes/threads, etc.; (3) hardware layer events: processor control instructions, memory access instructions, and the like.

An index is a performance value of the application at some level of the data center system. Indexes are divided into three types: (1) application layer indexes: the most intuitive performance observations of the load at the application layer, such as the amount of data processed per second; the processing logic of the application can also be analyzed, for example its algorithmic complexity and data change pattern (that is, how the data changes while the application runs); (2) system layer indexes: these reflect the behavior of the software runtime environment, the operating system, the hardware environment and the like, and mainly cover the individual performance of each system component and the interactions among components; (3) architecture layer indexes: these include the instruction mix and memory-related micro-architecture indexes. Common indexes are shown in the table; each index already carries information about a system component or execution subject, for example, CPU utilization shows the performance state of the processor.

Further, a data center has a great number of events and indexes, and different events and indexes may depend on or propagate to one another, so analyzing only a single abnormal event or index cannot diagnose the performance of the data center comprehensively and accurately. The performance problem of the data center is therefore decomposed: a complex performance problem is broken into single-scenario performance problems that can be solved one by one, and these are then correlated, which enables adaptive performance analysis. For single-scenario performance problems, a performance state (Status) and a performance factor (Factor) are defined. The performance state is the state of an event or index and is either Normal or Abnormal. A performance factor is an event or index that has a certain performance state, and mainly includes normal/abnormal events and normal/abnormal indexes.
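The four definitions above (Event, Metric, Status, Factor) can be pictured as a small data model. The following is a minimal sketch only; the class layout, the `Layer` enumeration and the field names are assumptions made for illustration and are not prescribed by the disclosure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Layer(Enum):
    """Level of the data center system an element belongs to (assumed names)."""
    APPLICATION = "application"
    SYSTEM = "system"
    HARDWARE = "hardware"
    ARCHITECTURE = "architecture"


class Status(Enum):
    """Performance state of an event or index."""
    NORMAL = "normal"
    ABNORMAL = "abnormal"


@dataclass(frozen=True)
class Event:
    """Abstracted behavior of an execution subject, e.g. a task or a stage."""
    name: str
    layer: Layer


@dataclass(frozen=True)
class Metric:
    """Performance value (index) observed at some layer, e.g. CPU utilization."""
    name: str
    layer: Layer


@dataclass(frozen=True)
class Factor:
    """An event or index together with its performance state."""
    subject: Union[Event, Metric]
    status: Status

    @property
    def is_abnormal(self) -> bool:
        return self.status is Status.ABNORMAL
```

For example, `Factor(Metric("cpu_utilization", Layer.SYSTEM), Status.ABNORMAL)` would represent an abnormal index in a single scenario.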

S122: Detect the historical feature data corresponding to different historical events by using an anomaly detection algorithm set to obtain a plurality of historical abnormal events and corresponding historical abnormal feature data.

Specifically, targeted anomaly detection algorithms are used to detect different kinds of anomalies, so that the historical abnormal events contained in the historical feature data can be analyzed comprehensively. For example, the algorithms can detect whether load imbalance exists in the data center, whether the data distribution of an application is skewed, whether the data placement of an application is imbalanced, whether abnormal nodes exist, whether abnormal indexes exist, whether processes interfere with one another, and which category of fault or failure is present. A plurality of historical abnormal events is thereby obtained, together with the historical abnormal feature data corresponding to each of them.

It is understood that the embodiments of the present invention are not limited to the specific anomalies and anomaly detection methods above; that is, data-driven and model-driven performance anomaly detection techniques are applied according to the specific circumstances of performance problems at different levels, so that various abnormal conditions can be detected.
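The idea of running a set of targeted detectors over the same feature data can be sketched as follows. This is an illustrative assumption only: the text above does not specify the detection rules, so the coefficient-of-variation rule for load imbalance, the median-ratio rule for abnormal nodes, the thresholds, and the feature keys `node_load` and `node_task_time` are all hypothetical.

```python
from statistics import mean, median, pstdev


def detect_load_imbalance(features, cv_threshold=0.5):
    """Assumed rule: flag imbalance when per-node load varies too much."""
    loads = features.get("node_load", [])
    if len(loads) < 2 or mean(loads) == 0:
        return []
    cv = pstdev(loads) / mean(loads)  # coefficient of variation
    return ["load_imbalance"] if cv > cv_threshold else []


def detect_abnormal_nodes(features, ratio=2.0):
    """Assumed rule: a node is abnormal if its task time exceeds ratio x median."""
    times = features.get("node_task_time", [])
    if not times:
        return []
    mid = median(times)
    return [f"abnormal_node:{i}" for i, t in enumerate(times)
            if mid > 0 and t > ratio * mid]


# The algorithm set is just an ordered collection of detector functions.
DETECTORS = [detect_load_imbalance, detect_abnormal_nodes]


def run_detectors(features, detectors=DETECTORS):
    """Apply every detector to the same feature data and collect the events."""
    events = []
    for detect in detectors:
        events.extend(detect(features))
    return events
```

Keeping each detector as an independent function makes it straightforward to add further algorithms (data skew, inter-process interference, fault category) to the set without changing the dispatch loop.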

S123: Construct a historical abnormal association relationship between each historical abnormal event and the corresponding historical abnormal feature data by using each historical abnormal event and the corresponding historical abnormal feature data.

Specifically, a single anomaly may trigger a chain reaction and easily cause further anomalies. Therefore, to study in depth the relationship between each abnormal event and the abnormal feature data, a historical abnormal association relationship between the historical abnormal events and the corresponding historical abnormal feature data is constructed.

Specifically, correlation analysis is performed on the performance elements of the data center, which include the abnormal events and the corresponding abnormal feature data; that is, the relationship between each historical abnormal event and its historical abnormal feature data is analyzed statistically, where the historical abnormal feature data includes indexes and performance factors. For example, the correlation coefficient, a statistic designed by the statistician Karl Pearson, measures the degree of linear correlation between the variables under study. Correlation analysis can examine the correlation not only between two performance factors but also among several. Meanwhile, an association mining algorithm can discover frequent association patterns that may exist among different performance elements.
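As a concrete instance of the correlation analysis mentioned above, the following minimal sketch computes the Pearson correlation coefficient between two index series of equal length. The function name `pearson` is hypothetical, and no threshold for deciding that two indexes are "correlated" is implied by the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value near +1 or -1 indicates a strong linear relationship between the two indexes, while a value near 0 indicates no linear relationship.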

Specifically, in an application scenario of a data center, the following four types of correlation relationships are defined: the correlation between the abnormal event and the abnormal event, the correlation between the index and the index, the correlation between the event and the index, and the correlation between the performance factor and the performance factor.

Further, the causal relationship between different performance elements can be expressed statistically by a probability or distribution function: with the occurrence of all other events held fixed, if the occurrence of an event A affects the probability that another event B occurs, and the two events are ordered in time (A occurs before B, that is, A precedes B), then A can be said to be a cause of B. For example, Granger causality can be used to determine, by statistical hypothesis testing, whether one of two variables helps predict the other. Alternatively, causal relationships among the performance elements of the system can be established through causal path mining and a probability model, from which the causal chain of a performance problem is then deduced.
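The Granger-style test mentioned above can be sketched as a comparison of two least-squares models: one predicting a series from its own lags only, and one that also includes lags of the candidate cause. The sketch below is a simplified, pure-Python illustration under stated assumptions; the function names are hypothetical, the lag order is chosen by the caller, and the returned F-statistic would still need to be compared against an F-distribution critical value, which is not done here.

```python
def _solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def _rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, target)) for i in range(k)]
    coef = _solve(xtx, xty)
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(design, target))


def granger_f_stat(cause, effect, lag=1):
    """F-statistic for the hypothesis that lags of `cause` help predict `effect`."""
    if len(cause) != len(effect):
        raise ValueError("series must have equal length")
    rows = range(lag, len(effect))
    target = [effect[t] for t in rows]
    # Restricted model: intercept + own lags of the effect series.
    restricted = [[1.0] + [effect[t - k] for k in range(1, lag + 1)] for t in rows]
    # Unrestricted model: additionally the lags of the candidate cause.
    unrestricted = [r + [cause[t - k] for k in range(1, lag + 1)]
                    for r, t in zip(restricted, rows)]
    rss_r = _rss(restricted, target)
    rss_u = _rss(unrestricted, target)
    dof = len(target) - len(unrestricted[0])
    return ((rss_r - rss_u) / lag) / (rss_u / dof)
```

A large statistic for `granger_f_stat(a, b)` together with a small one for `granger_f_stat(b, a)` is consistent with A preceding and helping to predict B, which matches the temporal-order condition described above.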

Specifically, based on the different performance elements, nine types of causal relationships may be defined:

1. Event to event: the occurrence of one event causes another event to occur.

2. Event to index: the occurrence of an event causes a change in an index.

3. Event to performance factor: the occurrence of an event causes a change in a performance factor.

4. Index to index: a change in one index causes a change in another index.

5. Index to event: a change in an index causes an event to occur.

6. Index to performance factor: a change in an index causes a change in a performance factor.

7. Performance factor to performance factor: a change in one performance factor causes a change in another.

8. Performance factor to event: a change in a performance factor causes an event to occur.

9. Performance factor to index: a change in a performance factor causes a change in an index.

S124: Construct the knowledge graph model by using the historical abnormal association relationships, the historical abnormal events and the historical abnormal feature data.

Specifically, the historical abnormal association relationships, historical abnormal events and historical abnormal feature data obtained above are integrated to yield multi-dimensional information about the data center's server hardware, operating system, job scheduling system and computing applications, and triples (entity-relationship-entity) are abstracted for the different operation and maintenance events. This lets the knowledge graph model analyze performance data in multiple dimensions and produce a more comprehensive and accurate anomaly detection result. The entities are the abstracted performance events, indexes and factors, and the relationships are the association relationships among the entities.
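The entity-relationship-entity triples can be held in a very small in-memory graph, as in the minimal sketch below. The `KnowledgeGraph` class, the relation labels (`causes`, `correlates_with`) and the entity naming scheme are assumptions made for illustration; the disclosure does not prescribe a storage format or query interface.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Stores (head, relation, tail) triples as an adjacency map."""

    def __init__(self):
        self._out = defaultdict(set)

    def add(self, head, relation, tail):
        """Record one entity-relationship-entity triple."""
        self._out[head].add((relation, tail))

    def neighbors(self, entity, relation=None):
        """Entities directly linked from `entity`, optionally by one relation."""
        return sorted(t for r, t in self._out.get(entity, ())
                      if relation is None or r == relation)

    def reachable(self, start, relation):
        """All entities reachable from `start` along edges of `relation`."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in self.neighbors(node, relation):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen
```

With triples such as `("event:data_skew", "causes", "metric:task_duration")`, a call to `reachable("event:data_skew", "causes")` would follow the causal chain from one abnormal event to the indexes and events it may affect.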

Therefore, the embodiment of the invention applies multiple anomaly detection algorithms in advance to comprehensively analyze various abnormal events and their corresponding abnormal feature data. It then uses the abnormal events and abnormal feature data to construct abnormal association relationships among abnormal events and between abnormal events and abnormal feature data, so that the events and the feature data are closely associated and can be analyzed as a whole. Finally, the knowledge graph model constructed from the historical abnormal association relationships, historical abnormal events and historical abnormal feature data can effectively and comprehensively analyze the current abnormal events of the data center and their corresponding abnormal feature data, so the detection result is more comprehensive and accurate.

The embodiment of the invention further discloses a specific abnormal performance behavior detection method which, compared with the previous embodiment, further explains and optimizes the technical scheme, as follows.

The anomaly detection algorithm set may specifically include a load imbalance detection algorithm, a data skew detection algorithm, a data placement imbalance detection algorithm, an abnormal node detection algorithm, an abnormal index detection algorithm, an inter-process interference detection algorithm, a system fault category detection algorithm, and other algorithms.

Specifically, the data placement imbalance detection algorithm addresses the following application scenario. Data placement is another important factor that affects task running time and load balancing. Whether data placement is balanced is judged mainly by data locality, which expresses how close the data is to the code that processes it. If the data and the code are not on the same node or rack, remote data transfer incurs overhead and slows the task's data processing; if the data is as close as possible to the processing code, the cost of remote data copying and data migration is reduced and the performance of the big data application improves.

On the Spark framework, the data locality priorities are:

(1) PROCESS_LOCAL: data and code are in the same JVM;

(2) NODE_LOCAL: data and code are on the same node;

(3) NO_PREF (no preference): the data can be processed equally well anywhere, meaning it has no locality preference;

(4) RACK_LOCAL: data and code are on the same rack;

(5) ANY: data and code are on machines in different racks.

The order from PROCESS_LOCAL to ANY runs from high priority to low priority.

On the Hadoop framework, the data locality priorities are:

(1) NODE_LOCALITY;

(2) RACK_LOCALITY;

(3) OFF_SWITCH (data center locality).

The order from NODE_LOCALITY to OFF_SWITCH again runs from high priority to low priority.

Since data locality of different priorities may affect task running time differently, the main task is to judge the influence of data locality on task running time. First, task running times are divided into two categories: (1) normal running time and (2) abnormal running time, where abnormal running times are those much longer than normal. Then, to evaluate the influence of the different data localities on task running time, an impact weight is set for each type of data locality.

Table 1 (weight settings for data locality priorities on the Spark framework) and Table 2 (weight settings for data locality priorities on the Hadoop framework) list the weights set for each type of data locality priority. The larger the weight, the greater the influence of that data locality on running time; a weight of 0 means that the data locality has no influence on running time.

TABLE 1

Data locality	ANY	RACK_LOCAL	NODE_LOCAL	PROCESS_LOCAL	NO_PREF
Priority weight	2	2	1	0	0

TABLE 2

Data locality	OFF_SWITCH	RACK_LOCALITY	NODE_LOCALITY
Priority weight	2	2	1

Specifically, Algorithm 2, the data placement imbalance detection algorithm based on Euclidean distance, is as follows:

The algorithm detects data placement imbalance based on Euclidean distance. First, the distance $dis_j$ between the running time of each task $j$ and the mean running time is calculated. Combining the mean distance $\mathrm{mean}(dis_j)$ and the standard deviation of the running times $\mathrm{std}(D^{Si})$, the formula

$$\left|\,|dis_j| - \mathrm{mean}(dis_j)\,\right| > 1.96 \times \mathrm{std}(D^{Si})$$

is used to judge whether the running time of task $j$ is abnormal; if so, it is added to the list of abnormal running times. Then, from the tasks with abnormal running times, the nodes on which those tasks ran and their data locality categories can be found, and for each data locality type $locality^t$ the number of abnormal running times can be counted.

Further, the priority weight of data locality is introduced, and the ratio $\mathrm{Ratio}(locality^t, k)$ defined above represents the proportion of data placement imbalance on node $k$. When $\mathrm{Ratio}(locality^t, k) > 0$, data placement on node $k$ is considered unbalanced.
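The detection flow described above can be illustrated with a minimal Python sketch. It is not the patent's Algorithm 2. Two points are assumptions made only for illustration: dis_j is read as the absolute distance between a task's running time and the mean running time, and the ratio is computed with a hypothetical formula (priority weight times the number of abnormal tasks, divided by the number of tasks on the node), because the original definition of Ratio is not reproduced in this text. The function names and the sample task list are likewise hypothetical; the weights are those of Table 1.

```python
from collections import Counter
from statistics import fmean, pstdev

# Priority weights copied from Table 1 (Spark); Table 2 would be used for Hadoop.
SPARK_WEIGHTS = {"ANY": 2, "RACK_LOCAL": 2, "NODE_LOCAL": 1,
                 "PROCESS_LOCAL": 0, "NO_PREF": 0}


def abnormal_task_indices(durations, z=1.96):
    """Indices j for which ||dis_j| - mean(dis)| > z * std(D)."""
    mean_duration = fmean(durations)
    # Assumed reading: dis_j is the absolute distance to the mean running time.
    dis = [abs(d - mean_duration) for d in durations]
    mean_dis = fmean(dis)
    threshold = z * pstdev(durations)
    return [j for j, x in enumerate(dis) if abs(x - mean_dis) > threshold]


def placement_imbalance(tasks, weights=SPARK_WEIGHTS):
    """tasks: list of (node, locality, duration) tuples for one stage.

    Returns {(node, locality): ratio}. The ratio formula is hypothetical:
    weight * abnormal task count / total task count on that node.
    """
    abnormal = abnormal_task_indices([duration for _, _, duration in tasks])
    abnormal_counts = Counter((tasks[j][0], tasks[j][1]) for j in abnormal)
    tasks_per_node = Counter(node for node, _, _ in tasks)
    return {
        (node, locality): weights[locality] * count / tasks_per_node[node]
        for (node, locality), count in abnormal_counts.items()
    }


# Hypothetical stage: 19 tasks of 10 s and one remote (ANY) task of 100 s.
tasks = ([("n1", "NODE_LOCAL", 10.0)] * 10
         + [("n2", "PROCESS_LOCAL", 10.0)] * 9
         + [("n2", "ANY", 100.0)])
ratios = placement_imbalance(tasks)
unbalanced_nodes = {node for (node, _), ratio in ratios.items() if ratio > 0}
```

In this sample only the 100 s task exceeds the 1.96 standard deviation criterion, so only node n2 is reported. An abnormal task whose locality has weight 0 would produce a ratio of 0 and would not mark its node as unbalanced, which matches the role of the weights in Tables 1 and 2.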

Specifically, the following is an application scenario of the abnormal index detection algorithm. To detect the abnormal indexes present in each stage Si, the performance indexes on node k are assembled into a performance index matrix $X_{si,k}$ of size m × n, where n is the number of collected performance indexes and m is the number of timestamps at which they are collected during stage Si.

Then, based on the constructed performance index matrix, the existing abnormal indexes are found through principal component analysis, time series transformation, standardization and outlier detection algorithms. Fig. 3 shows an exemplary diagram of an abnormal index detection process.

(1) Principal component analysis

Observation shows that not all indexes are strongly correlated with abnormal performance, and that different big data applications, and the execution behaviors of their different stages, affect different performance indexes to different degrees. To reduce the dimensionality of the data set, and hence its complexity, while retaining the features that contribute most to the variance, principal component analysis (PCA) is used. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables (the principal components); the number of principal components is less than or equal to the number of original variables.

In the implementation, the covariance matrix of the performance index matrix is calculated, and its eigenvalues and eigenvectors are obtained. The eigenvectors corresponding to the d largest eigenvalues (i.e. the directions of largest variance) are selected to form a new matrix, which reduces the dimension of the data features. In other words, the principal components of the performance index matrix $X_{si,k}$ are the n eigenvectors of its covariance matrix.

Then, the first d principal component indexes $PC_d$ are selected according to the cumulative contribution rate $CCRate_d$; that is, the eigenvectors whose cumulative contribution rate exceeds a given threshold are selected as the principal component vectors. In the experiments, 0.95 was used as the cumulative contribution rate threshold for selecting the principal component indexes.

Through principal component analysis, the performance index matrix is thus reduced in dimension: the original performance index matrix $X_{si,k}$ is converted into a principal component index matrix of size m × d.
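As an illustration of the dimension reduction step, the following sketch computes the covariance matrix, sorts its eigenvectors by eigenvalue, and keeps the smallest d whose cumulative contribution rate reaches the threshold. It assumes NumPy is available and that the matrix has at least two index columns; the function name `pca_reduce` and the sample matrix are hypothetical and are not taken from the patent.

```python
import numpy as np


def pca_reduce(X, ccr_threshold=0.95):
    """Project an m x n performance index matrix onto its first d principal components.

    d is the smallest number of components whose cumulative contribution
    rate (share of total variance) reaches ccr_threshold.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ccr = np.cumsum(eigvals) / eigvals.sum()    # cumulative contribution rate
    d = int(np.searchsorted(ccr, ccr_threshold) + 1)
    return centered @ eigvecs[:, :d], d


# Hypothetical matrix: 10 timestamps, 3 indexes; the second column is a
# multiple of the first and the third is constant.
t = np.arange(10, dtype=float)
X = np.column_stack([t, 2 * t, np.ones(10)])
Z, d = pca_reduce(X)
```

Because the three sample columns carry only one independent direction of variance, a single component already reaches the 0.95 threshold and the result has shape m × 1.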

(2) Time series transformation

Principal component analysis yields d principal component indexes. For each principal component index, the values of that index for the big data application stage Si on each node k form a time series. For example, the time series of the first principal component index PC1 is

The second principal component index PC2 has a time sequence of

The time series of the d-th principal component index PCd is

A mean transformation is adopted: the performance index values in each time series are averaged, as given by the mean transformation formula. The rationale is that if the mean of a certain performance index on some nodes differs greatly from that on the other nodes, that performance index can be inferred to be a potential abnormal index on those nodes.

Mean transformation formula:

(3) Standardization

Different performance indexes in a system usually have different units, and their value ranges may differ greatly. For example, CPU utilization and memory usage are typically expressed as percentages (%), with values between 0 and 1, whereas disk read/write bandwidth and network send/receive bandwidth are in MB/s or KB/s. A standardization method is therefore used to bring performance index values on different scales into a uniform range.

In this section, the index values obtained from the time series transformation are converted to the range 0 to 1 using linear Min-Max normalization. The Min-Max normalization expression is

$$y' = \frac{y - \min}{\max - \min}$$

where $y$ is the index value after the time series transformation, $\max$ is the maximum value of the index, $\min$ is the minimum value of the index, and $y'$ is the index value after Min-Max normalization. Although the Min-Max normalization method is simple and effective, it has the disadvantage that the maximum and minimum values may need to be recalculated whenever new data is added.
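The mean transformation and the Min-Max normalization can be sketched together as follows. Min-Max normalization maps a value y to (y - min) / (max - min). In this sketch the minimum and maximum are taken over the per-node mean values of one index, which is an assumption, since the text does not state the range over which they are computed. The function names and the sample series are hypothetical.

```python
from statistics import fmean


def mean_transform(series_by_node):
    """Collapse each node's time series of one index into its mean value."""
    return {node: fmean(series) for node, series in series_by_node.items()}


def min_max_normalize(value_by_node):
    """Scale values linearly into [0, 1]; a constant index maps to 0.0."""
    low = min(value_by_node.values())
    high = max(value_by_node.values())
    if high == low:
        # Avoid division by zero when every node reports the same value.
        return {node: 0.0 for node in value_by_node}
    return {node: (v - low) / (high - low) for node, v in value_by_node.items()}


# Hypothetical time series of one principal component index on three nodes.
series = {"hw073": [1.0, 3.0], "hw106": [4.0, 6.0], "hw114": [8.0, 10.0]}
means = mean_transform(series)
normalized = min_max_normalize(means)
```

The guard for a constant index is a design choice of this sketch; it also reflects the drawback noted in the text, because adding a new node or new samples changes min and max and therefore every normalized value.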

(4) Outlier detection based on distance and dimension

The main purpose of outlier detection is to find abnormal data or behaviors that differ significantly from the characteristic attributes or behaviors of normal data. Outliers generally make up a much smaller share of the data than normal points, but their influence cannot be ignored.

This section detects whether abnormal indexes exist on the cluster nodes at each stage of the big data application. The normalized index values on all computing nodes in the cluster form a set of index vectors, and outlier detection algorithms are then applied to find the abnormal indexes.

Specifically, an unsupervised outlier detection algorithm combining distance and dimension is proposed. In distance-based outlier detection in general, if at least a fraction pct of the objects in data set D lie at a distance greater than dmin from object o, then o is called a distance-based outlier with parameters pct and dmin, i.e., a DB(pct, dmin) outlier. Determining the values of pct and dmin and evaluating validity (deciding whether a DB(pct, dmin) outlier is a true outlier) require guidance from expert experience. With appropriate parameters set for the normalized performance index data, the distance-based outlier detection algorithm detects most abnormal values in the data set, but some are still missed. For example, suppose a set of values of the cpu_use index is [hw073: 0.006838, hw106: 0.15604399, hw114: 0.17810599]. With dmin set to 0.5 and pct set to 1, no abnormal value is detected, although 0.006838 can intuitively be regarded as an anomaly.
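The DB(pct, dmin) definition can be checked against the cpu_use example with a short sketch. It assumes that the fraction pct is measured over the other objects in the data set, excluding o itself, which the text does not specify; the function name `db_outliers` is hypothetical.

```python
def db_outliers(values, pct, dmin):
    """Return the keys that are DB(pct, dmin) outliers.

    values: {node: index value}. A node is reported when at least a
    fraction pct of the other values lie farther than dmin from its value.
    """
    outliers = []
    for node, v in values.items():
        others = [w for other, w in values.items() if other != node]
        if not others:
            continue
        far = sum(1 for w in others if abs(v - w) > dmin)
        if far / len(others) >= pct:
            outliers.append(node)
    return outliers


# The cpu_use values quoted in the text.
cpu_use = {"hw073": 0.006838, "hw106": 0.15604399, "hw114": 0.17810599}
missed = db_outliers(cpu_use, pct=1, dmin=0.5)   # dmin too large for this data
found = db_outliers(cpu_use, pct=1, dmin=0.1)    # a smaller, hand-picked dmin
```

With dmin = 0.5 no pair of values is that far apart, so nothing is reported, as the text states. A smaller dmin does flag hw073, which shows how sensitive the purely distance-based test is to the choice of dmin.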

A logarithmic method (for example, $\log_{10}$) is then used to transform the normalized data and obtain the order-of-magnitude dimension of each value, e.g. [hw073: 2, hw106: 0, hw114: 0]. Based on these logarithmized values, cpu_use on node hw073 can be considered an outlier.

In the pseudocode of the distance- and dimension-based abnormal index detection algorithm, the parameter pct defaults to 1 and the parameter dmin is adjustable. The steps of the algorithm are as follows:

(1) The order-of-magnitude dimension of each index value is obtained by the logarithmic method, and abnormal indexes are then detected with a dimension-based outlier detection algorithm. That is, the median of all index value dimensions is calculated, followed by the distance dis between each index value dimension and the median. If the distance dis of an index value dimension is greater than the mean distance avg(dis), that index value dimension is added to the suspect group SuspG. The distances dis(SuspG) between the index value dimensions in the suspect group and the median are then compared: if the difference between dis(SuspG) and avg(dis) is greater than the variance, the index value dimension is considered an outlier.

(2) Abnormal indexes are detected with a distance-based outlier detection algorithm. Specifically, the index values are divided into two classes, A and B: one is the larger class (the class containing more index values) and the other is the smaller class (the class containing fewer index values). The representative point of the larger class is calculated in two ways: as the maximum/minimum value of the larger class, or as the median of the larger class. The distance between every index value in the smaller class and the representative point of the larger class is then calculated, and if the distance exceeds the threshold dmin, the corresponding index value is considered an outlier. In the subsequent experimental evaluation, the outlier detection results obtained with the maximum/minimum and with the median as the representative of the larger class are compared under different dmin values.
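The two steps above can be sketched as follows. This is an illustrative reading rather than the patent's pseudocode, and several details are assumptions: the dimension is taken as the integer part of |log10(value)| (which reproduces the values 2, 0, 0 quoted for the cpu_use example and requires positive inputs), the "variance" in step (1) is taken as the population variance of the distances, and in step (2) the two classes are formed by cutting the sorted values at their widest gap, since the text does not say how classes A and B are obtained. All function names are hypothetical.

```python
import math
from statistics import fmean, median, pvariance


def magnitude(value):
    """Order-of-magnitude dimension of a positive value, e.g. 0.006838 -> 2."""
    return int(abs(math.log10(value)))


def dimension_outliers(values):
    """Step (1): flag nodes whose magnitude is far from the median magnitude."""
    dims = {node: magnitude(v) for node, v in values.items()}
    med = median(list(dims.values()))
    dis = {node: abs(d - med) for node, d in dims.items()}
    avg = fmean(list(dis.values()))
    var = pvariance(list(dis.values()))  # assumed: variance of the distances
    suspects = [node for node, x in dis.items() if x > avg]
    return [node for node in suspects if dis[node] - avg > var]


def distance_outliers(values, dmin, representative="median"):
    """Step (2): compare the smaller class with a representative of the larger."""
    ordered = sorted(values.items(), key=lambda item: item[1])
    if len(ordered) < 2:
        return []
    # Assumed split rule: cut the sorted values at the widest gap.
    gaps = [ordered[i + 1][1] - ordered[i][1] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    low, high = ordered[:cut], ordered[cut:]
    large, small = (low, high) if len(low) >= len(high) else (high, low)
    large_values = [v for _, v in large]
    if representative == "median":
        rep = median(large_values)
    else:
        # Max/min variant: the extreme of the larger class facing the smaller one.
        rep = max(large_values) if large is low else min(large_values)
    return [node for node, v in small if abs(v - rep) > dmin]


cpu_use = {"hw073": 0.006838, "hw106": 0.15604399, "hw114": 0.17810599}
by_dimension = dimension_outliers(cpu_use)
by_distance = distance_outliers(cpu_use, dmin=0.1)
```

On the cpu_use example, step (1) flags hw073 without any dmin parameter, while step (2) flags it only when dmin is small enough, which illustrates why the two criteria are combined.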

Correspondingly, the embodiment of the present invention further discloses an abnormal performance behavior detection system. As shown in Fig. 4, the system includes:

the performance data acquisition module 11 is used for acquiring performance data of the data center;

the knowledge graph analysis module 12 is configured to analyze the performance data by using a pre-constructed knowledge graph model to obtain an abnormal parameter of the abnormal event;

the knowledge graph analysis module 12 includes:

the association factor construction unit is used for constructing a historical abnormal association relation between each historical abnormal event and corresponding historical abnormal characteristic data by utilizing each historical abnormal event and corresponding historical abnormal characteristic data; the historical abnormal characteristic data comprises indexes and performance factors;

and the knowledge graph building unit is used for building a knowledge graph model by utilizing the historical abnormal association relation, the historical abnormal event and the historical abnormal characteristic data.

Specifically, the performance data acquiring module may be specifically configured to acquire hardware layer performance data, architecture layer performance data, system layer performance data, and application layer performance data of the data center.

Specifically, the feature data extraction unit may be specifically configured to perform feature extraction on historical performance data of the data center to obtain historical feature data corresponding to historical events of different single scenes.

Specifically, the abnormal event detection unit may be specifically configured to detect the historical feature data corresponding to different historical events by using the anomaly detection algorithm set, so as to obtain a plurality of historical abnormal events and corresponding historical abnormal feature data.
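To make the roles of the association construction unit and the knowledge graph building unit more concrete, the following is a minimal sketch of one possible data structure. It is not the patent's knowledge graph model, whose internal form is not described in this text. The class `KnowledgeGraph`, the helper functions, the event names, and the feature names are all hypothetical, and only event-to-feature associations are modeled.

```python
from collections import Counter, defaultdict


class KnowledgeGraph:
    """Undirected graph whose nodes are ("event", name) or ("feature", name)."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_association(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def related(self, node):
        return set(self.edges.get(node, ()))


def build_graph(history):
    """history: iterable of (abnormal event, [abnormal feature, ...]) pairs."""
    graph = KnowledgeGraph()
    for event, features in history:
        for feature in features:
            graph.add_association(("event", event), ("feature", feature))
    return graph


def candidate_events(graph, abnormal_features):
    """Rank historical events by how many current abnormal features they share."""
    hits = Counter()
    for feature in abnormal_features:
        for kind, name in graph.related(("feature", feature)):
            if kind == "event":
                hits[name] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical history and current observation.
history = [
    ("load_imbalance", ["cpu_use", "task_duration"]),
    ("data_placement_imbalance", ["task_duration", "data_locality"]),
]
graph = build_graph(history)
ranking = candidate_events(graph, ["task_duration", "data_locality"])
```

In this sketch, a current observation is matched to historical abnormal events by counting shared abnormal features; a fuller model would also need the event-to-event and feature-to-feature associations mentioned in the description.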

In addition, the embodiment of the invention also discloses an abnormal performance behavior detection device, which comprises:

a memory for storing a computer program;

a processor for executing a computer program to implement the abnormal performance behavior detection method as described above.

In addition, the embodiment of the invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when being executed by a processor, the computer program realizes the abnormal performance behavior detection method.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The technical content provided by the present invention is described in detail above, and the principle and the implementation of the present invention are explained in this document by applying specific examples, and the above description of the examples is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. An abnormal performance behavior detection method, comprising:

acquiring performance data of a data center;

2. The abnormal performance behavior detection method according to claim 1, wherein the process of obtaining performance data of the data center comprises:

3. The abnormal performance behavior detection method according to claim 2, wherein the process of extracting the features of the historical performance data of the data center to obtain the historical feature data corresponding to different historical events comprises:

4. The abnormal performance behavior detection method of claim 3, wherein the step of detecting historical feature data corresponding to different historical events by using the abnormal performance detection algorithm set to obtain a plurality of historical abnormal events and corresponding historical abnormal feature data comprises:

5. An abnormal performance behavior detection system, comprising:

wherein the knowledge-graph analysis module comprises:

6. The abnormal performance behavior detection system of claim 5, wherein the performance data obtaining module is specifically configured to obtain hardware layer performance data, architecture layer performance data, system layer performance data, and application layer performance data of a data center.

7. The abnormal performance behavior detection system according to claim 6, wherein the feature data extraction unit is specifically configured to perform feature extraction on historical performance data of the data center to obtain historical feature data corresponding to historical events of different single scenes.

8. The system according to claim 7, wherein the abnormal performance behavior detection unit is specifically configured to detect historical feature data corresponding to different historical events by using an abnormal detection algorithm set, so as to obtain a plurality of historical abnormal events and corresponding historical abnormal feature data;

9. An abnormal performance behavior detection apparatus, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the abnormal performance behavior detection method of any of claims 1 to 4.

10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the abnormal performance behavior detection method of any one of claims 1 to 4.