CN107682319B

CN107682319B - Enhanced angle anomaly factor-based data flow anomaly detection and multi-verification method

Info

Publication number: CN107682319B
Application number: CN201710823063.0A
Authority: CN
Inventors: 首照宇; �田�浩; 邹风波; 张彤; 程夏威; 文辉; 赵晖; 莫建文; 汪延国; 曾情; 李希成
Original assignee: Guilin Yuhui Information Technology Co ltd; Guilin University of Electronic Technology
Current assignee: Guilin Yuhui Information Technology Co ltd; Guilin University of Electronic Technology
Priority date: 2017-09-13
Filing date: 2017-09-13
Publication date: 2020-07-03
Anticipated expiration: 2037-09-13
Also published as: CN107682319A

Abstract

The method comprises the following steps: 1) processing a real-time data stream; 2) setting a data set S in a sliding window; 3) initializing parameters k, r, and ξ; 4) obtaining a distance matrix dist; 5) obtaining an r-neighborhood point set; 6) obtaining the angle factor and local density of the r-neighborhood point set; 7) obtaining dissimilarity degrees; 8) obtaining a cluster center factor for each data point; 9) obtaining an attribution matrix; 10) determining cluster centers and clustering; 11) performing anomaly detection on each cluster separately; and 12) performing multiple verification. The method applies sliding window and basic window techniques to construct an efficient data stream processing model, reduces memory occupancy, and offers good real-time performance, high anomaly detection accuracy, and low time complexity.

Description

Enhanced angle anomaly factor-based data flow anomaly detection and multi-verification method

Technical Field

The invention relates to data flow anomaly detection and data clustering, in particular to a data flow anomaly detection and multiple verification method based on enhanced angle anomaly factors.

Background

The rapid development of network technology and the continuous improvement of social informatization lead to the explosive increase of information quantity, so that various industries generate massive, high-speed and dynamic stream data, such as network intrusion monitoring, commercial transaction management and analysis, video monitoring, sensor network monitoring and the like. Due to the characteristics of real-time infinite dynamic data flow and the like, the traditional static data anomaly detection method cannot accurately and effectively analyze and process the large-scale dynamically-increased flow data, so that the construction of a real-time effective anomaly detection method suitable for the data flow becomes particularly important.

Researchers have proposed different data stream anomaly detection methods for the practical problems encountered at different stages. Existing methods can be roughly classified as density-based, angle-based, and clustering-based. Density-based methods use density as the basic anomaly measure and construct a dynamically updatable anomaly factor that measures the degree of anomaly of the data. Pokrajac et al. introduced the static anomaly detection method LOF into data streams and developed INCLOF, an incremental local anomaly detection method applicable to dynamic data streams; as new data are inserted, INCLOF deletes historical data and dynamically updates the anomaly factor of each data point. Ke Gao et al. improved INCLOF by introducing the sliding window idea and proposed the n-INCLOF method, which updates only the anomaly factors of the data objects in the sliding window at the current moment. In some cases a data point appears anomalous at one moment but not at the next; to address this, Karimian S H et al. proposed the I-IncLOF method, which introduces multiple verification: only data objects that appear anomalous throughout the entire sliding process of the window are judged to be confirmed anomaly points. I-IncLOF greatly reduces the misjudgment rate but performs poorly in the multi-dimensional case. Xinjielilu et al. proposed the INCLOCI method, which introduces a multi-granularity anomaly factor, MDEF, and can detect anomalous clusters as well as scattered anomaly points.
To address the reduced effectiveness of similarity measures such as distance and density in high-dimensional data spaces, some researchers proposed angle-based measures. The basic idea is that the angles formed by an anomaly point with other points are generally small and vary within a narrow range, whereas the angles formed by a normal point with other points are large and vary widely. H. P. Kriegel et al. proposed the angle-based anomaly detection method ABOD, which takes the variance of the angles as the anomaly factor ABOF measuring the degree of anomaly of a data point; ABOD retains high detection accuracy in high-dimensional spaces. Ye H proposed DSABOD, an angle-based data stream anomaly detection method that dynamically updates the anomaly factor of each data point relative to its neighborhood points as data points flow continuously into memory. DSABOD offers a new approach to anomaly detection in high-dimensional data streams, but traditional angle-based data stream methods suffer from a low anomaly detection rate.
Clustering-based data stream anomaly detection comprises two stages: clustering the data points, and performing anomaly detection on the data points within each cluster. Elahi M et al. proposed a clustering-based method that combines K-Means and LOF and defines anomaly factors by region, which improves the accuracy of anomaly detection. Thakran Y et al. proposed a method that combines DBSCAN with W-K-Means. It clusters the data block at the current moment with DBSCAN to obtain candidate anomaly points and initial clusters, merges in the candidate anomaly points awaiting multiple verification from the previous moment, and clusters again with W-K-Means to obtain the candidate anomaly points and normal clusters of the current moment. It also applies multiple verification to the candidate anomaly points and deletes misjudged ones to release memory. Throughout the process the method dynamically adjusts the parameters MinPts and Epsilon required by DBSCAN and the attribute weights of W-K-Means. The method detects anomalies with high accuracy, but it requires too many manually set parameters and heavy manual intervention, its complexity is high, and its effectiveness in multi-dimensional space is poor.

Data flow anomaly detection is a research hotspot and difficulty in the field of data mining nowadays, and the main aim is to accurately detect information which does not conform to a conventional mode in real time from a complex data environment which is dynamically changed.

Disclosure of Invention

Aiming at the problems of high time complexity, large memory occupation, low use efficiency, excessive human parameter intervention, low effectiveness in a multi-dimensional data environment and the like of the traditional method, the invention provides a data flow abnormity detection and multi-verification method based on an enhanced angle abnormity factor. The method can reduce the occupancy rate of the memory, and has good real-time performance, high accuracy rate of abnormal detection and low time complexity.

The technical scheme for realizing the purpose of the invention is as follows:

a method for data flow abnormity detection and multiple verification based on an enhanced angle abnormity factor comprises the following steps:

1) processing the real-time data stream: for various real-time data streams acquired by a data acquisition terminal, according to the minimum unit forming data and the time sequence during acquisition, forming data blocks from data acquired in a time period, forming a data set S processed by a sliding window from a plurality of data blocks, and preparing for the processing in the step 2);

2) setting the data set S in the sliding window: the processing in step 1) yields the data set S in the current sliding window, where $S = \{X_1, X_2, \ldots, X_n\}$ contains n data points and each data point is represented by its attributes, $X_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, where $x_{i1}, x_{i2}, \ldots, x_{id}$ are the d attributes of data point $X_i$, used for subsequent clustering and anomaly detection;

3) setting the initialization parameters k, r, and ξ, where k is the number of nearest neighbors of a data point, r is the spatial neighborhood radius of a data point, and ξ is the adjustment coefficient of the anomaly decision threshold; the anomaly decision threshold is $\theta = \mu + \xi \cdot \delta$, where μ and δ are the mean and standard deviation of the enhanced angle anomaly factors of all data points;

4) obtaining a distance matrix dist: using the data set S from step 2), calculate the distances between all pairs of data points to obtain an n × n distance matrix $dist = [d_{ij}]_{n \times n}$, calculated by formula (1):

$$d_{ij} = \sqrt{\sum_{k=1}^{d} (x_{ik} - y_{jk})^2} \tag{1}$$

where $X_i$ and $Y_j$ are data points in the set S, $x_{ik}$ is the k-th attribute of data point $X_i$, and $y_{jk}$ is the k-th attribute of data point $Y_j$;

5) obtaining an r-neighborhood point set: using the spatial neighborhood radius r, obtain the r-neighborhood point set of each data point, that is, the set of all data points enclosed by a circle of radius r centered on that point;

6) obtaining the angle factor and local density of the r-neighborhood point set: using the distance matrix dist, obtain the angle factor of the r-neighborhood point set and the local density of the r-neighborhood point set, where $N_r(x_i)$ is the r-neighborhood of data point $x_i$;

7) obtaining the dissimilarity $\delta(x_i)$: after sorting the local densities of the r-neighborhood point sets obtained in step 6), calculate the corresponding dissimilarity $\delta(x_i)$;

8) obtaining the cluster center factor $\tau(x_i)$ of each data point: combining step 6) and step 7), obtain the cluster center factor $\tau(x_i)$, calculated by formula (5):

The cluster center factor $\tau(x_i)$ measures the degree to which a data point lies at a cluster center;

9) obtaining an attribution matrix: sort the cluster center factors of all data points obtained in step 8) in descending order to obtain $\tau(p_1) \ge \tau(p_2) \ge \ldots \ge \tau(p_n)$, where $p_i$ denotes the original serial number of the corresponding data point, and obtain the attribution matrix $F = [f_1, f_2, \ldots, f_n]$ used for clustering;

10) determining cluster centers and clustering: determine the cluster centers and cluster the data set S using the cluster center factor and the attribution matrix; all data points with the same class label form a set, that is, a cluster, giving m ($m = C_{center\_id}$) clusters $C_1, C_2, \ldots, C_m$, where $C_{center\_id}$ is the serial number of the cluster center; this completes the clustering of the data set S;

11) performing anomaly detection on each cluster separately: for the clusters $C_i$ (i = 1, 2, …, m) obtained in step 10), anomaly detection is performed on each cluster $C_1, C_2, \ldots, C_m$ of the clustered data set S in turn to obtain the anomaly point set $O_i$ of each cluster, and finally the set of all anomaly points in S, $O = \{O_1, \ldots, O_m\}$. The formulas involved in anomaly detection are as follows. The intra-cluster angle factor is formula (7):

where A, B, and C are data points in the data set, $C_i$ denotes the i-th cluster, each data point in the cluster has d-dimensional attributes, and the neighborhood count is the number of data points in the neighborhood of data point q;

the local increment value $H(X_j)$ is formula (8):

the sum of distances to the k nearest neighbors $L(X_j)$ is formula (9):

where the k-neighborhood of data point $X_j$ consists of its k nearest neighbors within the cluster to which it belongs;

the enhanced angle anomaly factor $EAOF(X_j)$ is formula (10):

where o is the center of the cluster to which $X_j$ belongs, $dist(o, X_j)$ is the distance between $X_j$ and that cluster center, the angle factor is that of each data point in cluster $C_i$ (i = 1, 2, …, m) relative to the cluster, and $H(X_j)$ is the local increment value;

12) multiple verification: all candidate anomaly points are verified multiple times; candidates that still appear anomalous after a limited number of verifications are judged to be confirmed anomaly points and are output and stored, while candidates that appear normal during verification are discarded directly.
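The bookkeeping behind step 12) can be sketched as follows. This is only an illustration under assumptions that the text does not fix: the class name `MultiVerifier`, the limit `max_checks = 3`, and the use of point identifiers are all hypothetical, and the source does not state how many verifications count as "limited".

```python
from dataclasses import dataclass, field


@dataclass
class MultiVerifier:
    """Tracks candidate anomalies across successive sliding windows."""
    max_checks: int = 3                            # hypothetical verification limit
    counts: dict = field(default_factory=dict)     # point id -> consecutive anomalous checks
    confirmed: list = field(default_factory=list)  # ids judged to be confirmed anomalies

    def update(self, anomalous_now):
        """anomalous_now: ids flagged as anomalous in the current window."""
        anomalous_now = set(anomalous_now)
        # A candidate that now looks normal is discarded directly.
        for pid in list(self.counts):
            if pid not in anomalous_now:
                del self.counts[pid]
        # Each flagged point gains one verification; promote it at the limit.
        for pid in sorted(anomalous_now):
            if pid in self.confirmed:
                continue
            self.counts[pid] = self.counts.get(pid, 0) + 1
            if self.counts[pid] >= self.max_checks:
                self.confirmed.append(pid)
                del self.counts[pid]
        return self.confirmed
```

Calling `update` once per window keeps memory bounded by the number of live candidates, since anything that appears normal is dropped immediately.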

The processing in step 1) means that the data acquired by the data acquisition terminal are cached in stream form and the cached data are divided into data blocks $E_0, E_1, E_2, \ldots$. Each data block represents a basic window, and each sliding window W contains 2 basic windows; data insertion and deletion are realized by combining the basic window and the sliding window, as follows: when time advances from $T_i$ to $T_{i+1}$, the sliding window slides from $W_i$ to $W_{i+1}$, the new basic window $E_{i+1}$ is merged in and the historical basic window $E_{i-1}$ is removed, and the candidate anomaly points detected in $W_i$ at time $T_i$ are merged into $W_{i+1}$ for multiple verification.
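A minimal sketch of this window mechanics is given below. The class and method names are hypothetical, and data points are assumed to be hashable values (for example tuples) so that carried-over candidates can be compared with the points already in the window.

```python
from collections import deque


class SlidingWindow:
    """Sliding window W made of the most recent basic windows (2 by default)."""

    def __init__(self, n_basic=2):
        self.blocks = deque(maxlen=n_basic)  # the oldest block is evicted automatically
        self.candidates = []                 # candidates carried over from W_i

    def slide(self, new_block, prev_candidates=()):
        """Merge basic window E_{i+1}, drop E_{i-1}, and carry candidates over."""
        self.blocks.append(list(new_block))
        self.candidates = list(prev_candidates)
        return self.data()

    def data(self):
        """Data set S of the current window: its blocks plus carried candidates."""
        points = [p for block in self.blocks for p in block]
        # A carried candidate may already have left the window; re-add it so
        # that it can be verified again (points still present are not duplicated).
        points += [c for c in self.candidates if c not in points]
        return points
```

Using `deque(maxlen=2)` means memory holds at most two basic windows plus the outstanding candidates, which is the memory saving the text attributes to this design.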

The angle factor of the r-neighborhood point set in step 6) is calculated by formula (2):

The local density of the r-neighborhood point set in step 6) is calculated by formula (3):

where $N_r(p)$ is the r-neighborhood of data point p and q is any data point in the r-neighborhood set of p. The local density depends on both the number and the positions of the neighborhood data points: the more neighborhood data points there are, and the closer they lie to the center of the data set, the larger the local density.
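The r-neighborhood $N_r(p)$ used here is built from the distance matrix of step 4) and the radius r of step 5). The sketch below shows one way to compute both with NumPy. It assumes Euclidean distance, an inclusive radius (distance ≤ r), and that a point is not counted as its own neighbor; the function names are hypothetical.

```python
import numpy as np


def distance_matrix(S):
    """n x n matrix of pairwise Euclidean distances between the rows of S."""
    S = np.asarray(S, dtype=float)
    diff = S[:, None, :] - S[None, :, :]      # shape (n, n, d)
    return np.sqrt((diff ** 2).sum(axis=-1))


def r_neighborhoods(dist, r):
    """For each point i, the indices j != i with dist[i, j] <= r."""
    within = dist <= r                        # new boolean array
    np.fill_diagonal(within, False)           # a point is not its own neighbor
    return [np.flatnonzero(row) for row in within]
```

For example, with the points (0, 0), (3, 4), and (0, 1) and r = 1.5, only the first and third points are neighbors of each other.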

The dissimilarity $\delta(x_i)$ in step 7) is calculated after the local densities of all data points are sorted in descending order, by formula (4):

where $p_i$ and $p_j$ are the serial numbers of the corresponding data points; when i = 1, j ≥ 2, and when i ≥ 2, j < i.
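Formula (4) itself is not reproduced in this text, so the sketch below is an assumption rather than the patented formula. It follows the usual density-peaks convention, which matches the index ranges stated above: for i ≥ 2 the dissimilarity is taken as the minimum distance to any point of higher density rank (j < i), and for the densest point (i = 1) as the maximum distance to the other points (j ≥ 2). The function name is hypothetical.

```python
import numpy as np


def dissimilarity(dist, density):
    """Return delta for every point from pairwise distances and local densities."""
    dist = np.asarray(dist, dtype=float)
    # order[0], order[1], ... are p_1, p_2, ...: indices by descending density.
    order = np.argsort(-np.asarray(density, dtype=float), kind="stable")
    delta = np.empty(len(order))
    # i = 1: the densest point is compared with all other points (j >= 2).
    delta[order[0]] = dist[order[0], order[1:]].max()
    # i >= 2: compare only with points of higher density rank (j < i).
    for i in range(1, len(order)):
        delta[order[i]] = dist[order[i], order[:i]].min()
    return delta
```

Under this convention a point that is far from every denser point receives a large dissimilarity, which is what makes it a plausible center of a different cluster.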

The attribution matrix $F = [f_1, f_2, \ldots, f_n]$ in step 9) records the attribution relationship between data points, and each of its elements is given by formula (6):

where $\{p_i\}$ denotes the original subscript indices of the data points after the cluster center factors $\tau(x_i)$ are sorted in descending order.

The data stream anomaly detection method is divided into two processes: data stream processing and data stream anomaly detection. The data stream processing process converts the dynamic data stream into static data blocks, which facilitates subsequent anomaly detection and ensures the real-time performance and efficiency of the whole detection. The anomaly detection process performs anomaly detection on the static data set produced by the data stream processing process; to improve detection accuracy, clustering is performed first and anomaly detection second. In this technical scheme, the real-time data stream processing method that combines the sliding window and the basic window is the core of the data stream processing process; it reduces memory occupancy and improves the quality of subsequent anomaly detection. The cluster center factor and the attribution matrix are two parameters newly introduced in this technical scheme for determining cluster centers and clustering; they allow the cluster centers of a multi-dimensional data space to be determined quickly and effectively and accurate clustering to be performed on that basis. The enhanced angle anomaly factor is another important parameter of the technical scheme; it compensates for some of the defects of traditional anomaly factors, retains the effectiveness of the angle measure in multi-dimensional space, and is the core of the anomaly detection part.

The method applies sliding window and basic window technologies, constructs an efficient data stream processing model, reduces the occupancy rate of the memory, and has good real-time performance, high accuracy of abnormal detection and low time complexity.

Drawings

FIG. 1 is a schematic flow chart of the method in the example;

FIG. 2 is a schematic diagram of the data point distribution in the sliding window at time $t_1$ in the embodiment;

FIG. 3 is a schematic diagram of the data point distribution in the sliding window at time $t_2$ in the embodiment;

FIG. 4 is a diagram illustrating the combination of the sliding window and the base window to process the real-time data stream and the multiple verification processes in one embodiment;

FIG. 5 is a graph illustrating an exemplary angular measure of data points;

FIG. 6 is a schematic diagram illustrating a data point distribution of the U-shaped cluster data based on the conventional angle measurement method in the embodiment;

FIG. 7 is a schematic diagram illustrating a data point distribution of multi-cluster data misjudged based on a conventional angle measurement method in an embodiment;

FIG. 8 is a diagram illustrating the distribution of original coordinates of a data set in an embodiment;

FIG. 9 is a schematic diagram showing a local density-degree of dissimilarity distribution in the example;

FIG. 10 is a diagram showing the distribution of the cluster center factors in the embodiment;

FIG. 11a is a schematic diagram of the distribution of the data set 1 in the example;

FIG. 11b is a diagram showing the distribution of outliers in the data set 1 according to the example;

FIG. 11c is a schematic diagram showing the abnormal point identifiers detected by the abnormal detection of the data set 1 in the embodiment;

FIG. 11d is a schematic diagram showing the normal points of data set 1 that anomaly detection identified as anomaly points in the embodiment;

FIG. 12a is a schematic diagram of the distribution of the data set 2 in the example;

FIG. 12b is a diagram illustrating an actual abnormal distribution of the data set 2 in the embodiment;

FIG. 12c is a schematic diagram showing the abnormal point identifiers detected by the abnormal detection of the data set 2 in the embodiment;

FIG. 12d is a diagram illustrating the data set 2 with abnormal points detected as normal point identifiers by abnormal detection in the embodiment.

Detailed Description

The invention will be further illustrated, but not limited, by the following description of the embodiments with reference to the accompanying drawings.

Referring to fig. 1, a method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factors includes the following steps:

1) processing the real-time data stream: the various real-time data streams acquired by the data acquisition terminal are processed. Real-time data streams are dynamic and changeable: some data objects appear anomalous in the current sliding window but appear normal in the sliding window at the next moment, as shown in FIG. 2 and FIG. 3. FIG. 2 is the distribution of data points in the sliding window at time $t_1$, where point P' appears anomalous; as data points continue to flow in, more and more data points accumulate around P'. FIG. 3 is the distribution of data points in the sliding window at time $t_2$, where point P' now appears normal;

2) setting the data set S in the sliding window: the processing in step 1) yields the data set S in the current sliding window; let $S = \{X_1, X_2, \ldots, X_n\}$ contain n data points, each represented by its attributes, for subsequent clustering and anomaly detection;

3) setting the initialization parameters k, r, and ξ, where k is the number of nearest neighbors of a data point, r is the spatial neighborhood radius of a data point, and ξ is the adjustment coefficient of the anomaly decision threshold; the anomaly decision threshold is $\theta = \mu + \xi \cdot \delta$, where μ and δ are the mean and standard deviation of the enhanced angle anomaly factors of all data points;

6) obtaining the angle factor and local density of the r-neighborhood point set, where $N_r(x_i)$ is the r-neighborhood of data point $x_i$. As shown in FIG. 5, the method builds on the angle measure, which calculates the angles formed at a data point by every pair of other data points and then takes their variance. For the core region point $A_1$, the angles formed with pairs of other points vary over a wide range, so the variance is large; for the anomaly point $A_3$, the angles formed with pairs of other points vary over a very narrow range, so the variance is small; and for the boundary point $A_2$, the range of the angles lies between those of $A_1$ and $A_3$, so its variance lies between that of a core region point and that of an anomaly point. This measure has some defects, however, as shown in FIG. 6 and FIG. 7. In FIG. 6, the anomaly point $B_1$ lies at the center of the U-shaped cluster, so the angles it forms with pairs of surrounding points vary widely and the variance is large, whereas the angles formed at the edge point $B_2$ vary little and the variance is small. Similarly, in FIG. 7 the anomaly point $D_1$ lies between the two clusters, so the angles it forms with pairs of points from the two clusters vary widely, whereas the angles formed at the edge point $D_2$ vary little. In both cases the result is the opposite of the actual situation, which leads to missed detections and misjudgments;

The cluster center factor $\tau(x_i)$ measures the degree to which a data point lies at a cluster center. It is an improved parameter factor used in the method of this embodiment to determine the cluster centers of a multi-dimensional data space quickly and effectively, and it is a crucial step in clustering. The implementation process is shown in FIG. 8, FIG. 9, and FIG. 10. FIG. 8 shows that the data set consists of two clusters, with point 13 and point 25 as their respective cluster centers; FIG. 9 shows the ρ-δ (local density versus dissimilarity) distribution of the points obtained by formulas (3) and (4), in which both the local density and the dissimilarity of points 13 and 25 are large; FIG. 10 shows the cluster center factors obtained by formula (5) after sorting in descending order, in which points 13 and 25 have the largest cluster center factors and are therefore the most likely cluster centers;

10) determining cluster centers and clustering: determine the cluster centers and cluster the data set S using the cluster center factor and the attribution matrix; all data points with the same class label form a set, that is, a cluster, giving m ($m = C_{center\_id}$) clusters $C_1, C_2, \ldots, C_m$ and completing the clustering of the data set S;

11) performing anomaly detection on each cluster separately: for each cluster $C_i$ (i = 1, 2, …, m) obtained in step 10), the intra-cluster angle factor is formula (7):

where A, B, and C are data points in the data set and the neighborhood count is the number of data points in the neighborhood of data point q; the local increment value $H(X_j)$ is formula (8):

the sum of distances to the k nearest neighbors $L(X_j)$ is formula (9):

the enhanced angle anomaly factor $EAOF(X_j)$ is formula (10):

where the angle factor is that of each data point in cluster $C_i$ (i = 1, 2, …, m) relative to the cluster, and $H(X_j)$ is the local increment value;

12) multiple verification: all candidate anomaly points are verified multiple times; candidates that still appear anomalous after a limited number of verifications are judged to be confirmed anomaly points and are output and stored, while candidates that appear normal during verification are discarded directly, which improves the accuracy of anomaly detection.
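The plain angle-variance idea discussed under step 6) and FIG. 5 can be illustrated with a short sketch. It computes the unweighted variance of the angles formed at one point by every pair of other points; this is an assumption made for illustration and is not formula (7) of the method, which is not reproduced in this text and may weight the angles differently. The function name is hypothetical, and the sketch assumes no other point coincides with the point under test.

```python
import itertools

import numpy as np


def angle_variance(points, idx):
    """Variance of the angles at points[idx] formed with every pair of other points."""
    P = np.asarray(points, dtype=float)
    others = np.delete(P, idx, axis=0) - P[idx]   # vectors from the point to the others
    angles = []
    for u, v in itertools.combinations(others, 2):
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding error
    return float(np.var(angles))
```

A point surrounded by a ring of neighbors sees angles spread between 0 and π and gets a large variance, while a point far outside the ring sees all neighbors in nearly the same direction and gets a small variance. The same property produces the failure cases of FIG. 6 and FIG. 7, where an anomaly surrounded by cluster points also scores a large variance.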

The processing in step 1) means that the data acquired by the data acquisition terminal are cached in stream form and the cached data are divided into data blocks $E_0, E_1, E_2, \ldots$. Each data block represents a basic window, and each sliding window W contains 2 basic windows; data insertion and deletion are realized by combining the basic window and the sliding window, as shown in FIG. 4: when time advances from $T_i$ to $T_{i+1}$, the sliding window slides from $W_i$ to $W_{i+1}$, the new basic window $E_{i+1}$ is merged in and the historical basic window $E_{i-1}$ is removed, and the candidate anomaly points detected in $W_i$ at time $T_i$ are merged into $W_{i+1}$ for multiple verification.

The dissimilarity $\delta(x_i)$ in step 7) is calculated after the local densities of all data points are sorted in descending order, by formula (4). The dissimilarity measures the likelihood that data points belong to different clusters. For a given data set S, the local densities obtained in step 6) are sorted in descending order, where $\{p_i\}$ denotes the original subscript indices after the descending sort of local density and $d(p_i, p_j)$ denotes the Euclidean distance between data points $p_i$ and $p_j$; the dissimilarity of a data point $p_i$ is then defined as follows:

The determination of cluster centers and clustering in step 10) proceeds as follows. First, define the cluster center serial number as $C_{center\_id}$ and the class label of a data point as $C_{cluster\_label}$, and initialize the cluster center serial number to 1, i.e., $C_{center\_id} = 1$; the data point with the largest cluster center factor obtained in step 8) is labeled 1. Then the whole data set S is traversed conditionally in the descending subscript order $\{p_i\}$ obtained in step 9): if the distance condition involving r is satisfied (where r is the initial neighborhood radius parameter), the point is defined as a new cluster center and the class label is increased by 1; in this way all cluster centers are obtained. Next, using the cluster centers obtained and the attribution matrix $F = [f_1, f_2, \ldots, f_n]$ from step 9), the points attributed to the same cluster center are given the same label (the class label), as follows: the whole data set S is traversed in the descending subscript order $\{p_i\}$ obtained in step 9); if $p_i$ is not a cluster center, the label indicated by its entry in the attribution matrix is assigned to $p_i$; otherwise $p_i$ keeps its own label. Finally, all data points with the same class label form a set, that is, a cluster, giving m ($m = C_{center\_id}$) clusters $C_1, C_2, \ldots, C_m$ and completing the clustering of the data set S.

Step 11) is to perform anomaly detection on each clustered cluster, and the anomaly detection specifically includes the following steps:

① for any cluster $C_i$ (i = 1, 2, …, m), calculate the angle factor of each data point in the cluster relative to the cluster by formula (7):

where $C_i$ (i = 1, 2, …, m) denotes any cluster after clustering;

② calculate the local increment value $H(X_j)$ of each data point in the cluster relative to its spatial r-neighborhood by formula (8):

The local increment reflects how dense the data points are within the spatial neighborhood of the cluster to which they belong; the count term in formula (8) is the number of data points in the r-neighborhood of data point $X_j$ within its cluster;

③ calculate the distance $dist(o, X_j)$ between each data point and the center of the cluster to which it belongs, using the cluster centers determined in step 10);

④ calculate the sum of distances $L(X_j)$ from each data point to its k nearest neighbors by formula (9):

where the k-neighborhood of data point $X_j$ consists of its k nearest neighbors within the cluster to which it belongs. The sum $L(X_j)$ reflects how near or far the data point is from the surrounding data points, which avoids defects of the angle-based anomaly factor such as the one at $B_1$ in FIG. 6;

⑤ calculate the enhanced angle anomaly factor $EAOF(X_j)$ of each data point by formula (10):

where the angle factor is that of each data point in cluster $C_i$ (i = 1, 2, …, m) relative to the cluster, and $H(X_j)$ is the local increment value. The enhanced angle anomaly factor EAOF retains the strong measurement performance of the angle measure in multi-dimensional space and also incorporates distance and density, which compensates for the defects of traditional methods based on the angle anomaly factor;

⑥ calculate the mean μ and standard deviation δ of the enhanced angle anomaly factors of all data points obtained in ⑤, and use them to compute the anomaly decision threshold $\theta = \mu + \xi \cdot \delta$, where ξ is the initially set adjustment coefficient of the anomaly decision threshold;

⑦ compare the enhanced angle anomaly factor $EAOF(X_j)$ obtained in ⑤ with the decision threshold θ obtained in ⑥; if $EAOF(X_j) > \theta$, mark the point as a candidate anomaly of the cluster and store it in the cluster's candidate anomaly point set $O_i$.
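Two of the sub-steps above are fully specified by the text and can be sketched directly: the k-nearest-neighbor distance sum of ④ and the threshold decision of ⑥ and ⑦. The EAOF values themselves depend on formulas (7), (8), and (10), which are not reproduced here, so the sketch takes them as a given input array. The function names are hypothetical, `dist` is assumed to be the distance matrix restricted to one cluster, and the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np


def knn_distance_sum(dist, k):
    """L(X_j): sum of the distances from each point to its k nearest neighbors."""
    d = np.sort(np.asarray(dist, dtype=float), axis=1)
    return d[:, 1:k + 1].sum(axis=1)          # column 0 is the zero self-distance


def candidate_anomalies(eaof, xi):
    """Indices whose score exceeds theta = mu + xi * delta, and theta itself."""
    eaof = np.asarray(eaof, dtype=float)
    theta = eaof.mean() + xi * eaof.std()     # population standard deviation
    return np.flatnonzero(eaof > theta), theta
```

A larger ξ raises θ and flags fewer candidates, so ξ trades missed detections against false alarms; because θ is computed from the scores themselves, no absolute cutoff has to be chosen by hand.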

The embodiment provides a data stream anomaly detection and multiple verification method based on enhanced angle anomaly factors, which adopts a technology of combining a sliding window and a basic window, constructs a high-efficiency real-time data stream processing technology, and introduces the enhanced angle anomaly factors, thereby solving the problems of high memory occupancy rate and low data processing efficiency of the traditional method, and simultaneously ensuring the advantages of high real-time performance, high anomaly detection accuracy and low time complexity.

In order to verify the effectiveness of the method of the present embodiment, the following will be further explained by comparing the simulation results:

In this embodiment, verification is performed on both artificially generated data sets and a real data set, and the method is compared with the traditional method I-IncLOF and with the weighted-clustering-based unsupervised data stream anomaly detection method proposed by Thakran et al. (abbreviated as method I). The experimental data set information is shown in Table 1; the three data sets differ in dimensionality, data volume, and data characteristics.

The data distribution of artificial data set 1 is shown in FIG. 11a. It contains 1615 data points in total and consists of 5 clusters and 15 discrete points. Cluster 1, cluster 2 and cluster 3 each consist of 500 data points generated by the Gaussian distributions N₁(μ₁,Σ₁), N₂(μ₂,Σ₂) and N₃(μ₃,Σ₃), respectively. Cluster 4 and cluster 5 each consist of 50 data points generated by the Gaussian distributions N₄(μ₄,Σ₄) and N₅(μ₅,Σ₅), respectively; because N₄ and N₅ generate very few data points, these two clusters are regarded as abnormal clusters. In addition, 15 discrete abnormal points are randomly generated according to the distribution characteristics of the data set, so the data set contains 115 abnormal points in total. Their distribution is shown in FIG. 11b, where the abnormal points are marked by circles. During the experiment, the abnormal clusters and the discrete abnormal points are randomly mixed in among the normal clusters. The following parameters are used by the Gaussian distributions to generate data set 1:

μ₁ = [+1 +1], μ₂ = [-1 -1], μ₃ = [+1 -1], μ₄ = [-1 +1], μ₅ = [0 0]

The data distribution of artificial data set 2 is shown in FIG. 12a. It contains 860 data points and consists of 3 normal clusters, 1 abnormal cluster and 48 discrete abnormal points, where the abnormal cluster consists of 21 abnormal points. The data set therefore contains 69 abnormal points, whose distribution is shown in FIG. 12b.

The real data set, Breast Cancer, is listed in Table 1. It comes from the UCI Machine Learning Repository, contains 699 data points and consists of two normal clusters. To verify the validity of the method, 34 abnormal points are added to the real data set according to statistical characteristics such as its mean and variance, and are used for the comparison and verification of anomaly detection.

In the verification experiment of the method of this embodiment, the length of a basic window is set to 20, and two basic windows form one sliding window. The number of nearest neighbors k is 3, and the spatial neighborhood radius r is taken as the mean of the first 20% of the distances between data points in the sliding window at the current moment, sorted in descending order. The adjustment coefficient ξ of the anomaly decision threshold is 2.5, and the number of multiple verifications is 3. The detection rate and the false positive rate, which best reflect the effectiveness of an anomaly detection method, are selected for comparison. FIGS. 11a to 11d and FIGS. 12a to 12d show the visualized experimental results for data set 1 and data set 2.
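The neighborhood radius rule stated above can be sketched as follows. This is a literal reading of the sentence: all pairwise distances in the current sliding window are sorted in descending order and the first 20% are averaged. Euclidean distance, the helper name neighborhood_radius, and rounding the 20% count down (with a minimum of one distance) are assumptions that the text does not specify.

```python
from itertools import combinations
from math import dist


def neighborhood_radius(points, percent=20):
    """Mean of the first `percent`% of pairwise distances, sorted descending."""
    # Assumption: Euclidean distance between data points.
    distances = sorted(
        (dist(p, q) for p, q in combinations(points, 2)), reverse=True
    )
    # Assumption: integer floor of the percentage, but at least one distance.
    count = max(1, len(distances) * percent // 100)
    head = distances[:count]
    return sum(head) / len(head)


# Hypothetical one-dimensional window with five points: 10 pairwise distances,
# of which the two largest (4.0 and 3.0) are averaged.
window = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
r = neighborhood_radius(window)
print(r)
```

For a sliding window of 40 points (two basic windows of length 20) there are 780 pairwise distances, so under this reading 156 of them would be averaged.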

For artificial data set 1, FIGS. 11a to 11d show that the method effectively detects the 2 abnormal clusters and the 15 discrete abnormal points, achieving zero missed detections. As can be seen from FIG. 11d, 3 normal points are mistakenly detected as abnormal points: although they are generated by the normal Gaussian distributions, they lie slightly away from the normal clusters and appear abnormal in all 3 consecutive verifications, so they are determined to be abnormal points.

For artificial data set 2, FIGS. 12a to 12d show that the method remains effective in three-dimensional data space. As can be seen from FIGS. 12b, 12c and 12d, all points in the abnormal cluster are detected, and 47 of the 48 discrete abnormal points are detected, with one discrete abnormal point missed. The missed point lies close to a normal cluster, so it appears normal in one of the multiple verifications and is therefore determined to be a normal point.
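The multiple-verification behavior described in these two cases (a candidate is confirmed only if it appears abnormal in every one of the 3 verifications, and is released as normal as soon as it appears normal once) can be sketched as a small tracker. The class name MultiVerifier, the status strings, and the choice to count the first check as one of the 3 are illustrative assumptions rather than details given in the text.

```python
class MultiVerifier:
    """Track how many consecutive checks have flagged each candidate point."""

    def __init__(self, times=3):
        self.times = times  # number of verifications, 3 in the experiment
        self.hits = {}      # candidate id -> consecutive abnormal checks so far

    def update(self, point_id, is_abnormal):
        """Record one check and return 'anomaly', 'normal' or 'pending'."""
        if not is_abnormal:
            # One normal appearance is enough to release the candidate.
            self.hits.pop(point_id, None)
            return "normal"
        count = self.hits.get(point_id, 0) + 1
        if count >= self.times:
            self.hits.pop(point_id, None)
            return "anomaly"
        self.hits[point_id] = count
        return "pending"


verifier = MultiVerifier(times=3)
# A point that looks abnormal three times in a row is confirmed.
confirmed = [verifier.update("p", True) for _ in range(3)]
# A point that looks normal on its second check is released.
released = [verifier.update("q", True), verifier.update("q", False)]
print(confirmed, released)
```

The first sequence mirrors the 3 false alarms in data set 1, and the second mirrors the missed point in data set 2.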

In addition to verifying its effectiveness, the method of this embodiment is compared with the conventional methods to further demonstrate its advantages. Table 2 gives the statistical information of the experimental results, that is, the detailed results of the comparative experiments on the three data sets. As can be seen from Table 2, the method of this embodiment has a high detection rate and a low false positive rate, its effectiveness is significantly better than that of the other two methods, and its advantage becomes more pronounced as the dimension of the data set increases. Method I combines the W-K-Means and DBSCAN methods and dynamically updates the parameters required by DBSCAN and the weight of each dimension, so it adapts well to dynamic data streams; however, because it uses a conventional distance- and density-based anomaly measure, its effectiveness decreases as the dimension increases. The I-IncLOF method is based on local density and is likewise affected by the curse of dimensionality: it performs well when the data dimension is low, but its effectiveness deteriorates as the dimension increases.
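For reference, the two comparison metrics can be computed as in the sketch below. The text does not define them, so the usual definitions are assumed: the detection rate is the fraction of true abnormal points that are detected, and the false positive rate is the fraction of normal points flagged as abnormal. The function names are hypothetical, and the counts are the ones quoted in the description above, not values taken from Table 2.

```python
def detection_rate(detected_anomalies, total_anomalies):
    """Assumed definition: detected true anomalies / all true anomalies."""
    return detected_anomalies / total_anomalies


def false_positive_rate(false_alarms, total_normal):
    """Assumed definition: normal points flagged as abnormal / all normal points."""
    return false_alarms / total_normal


# Artificial data set 1: all 115 abnormal points detected, and 3 of the
# 1615 - 115 = 1500 normal points flagged by mistake.
dr1 = detection_rate(115, 115)
fpr1 = false_positive_rate(3, 1615 - 115)

# Artificial data set 2: the 21 cluster points plus 47 of the 48 discrete
# points detected, out of 69 abnormal points.
dr2 = detection_rate(21 + 47, 69)
print(dr1, fpr1, dr2)
```

If Table 2 uses different definitions, its figures may differ from the values this sketch produces.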

The verification on different data sets and the comparative analysis against the traditional methods show that the data stream anomaly detection and multi-verification method based on the enhanced angle anomaly factor provided by this embodiment is both effective and feasible.

TABLE 1

TABLE 2

Claims

1. A method for data stream anomaly detection and multiple verification based on an enhanced angle anomaly factor, characterized by comprising the following steps:

1) processing the real-time data stream: for the various real-time data streams acquired by a data acquisition terminal, grouping the data obtained within a time period into data blocks according to the minimum unit constituting the data and the time order of acquisition, and forming the data set S processed by the sliding window from a plurality of data blocks;

Wherein

6) obtaining the angle factor of the r-neighborhood point set

and the local density of the r-neighborhood point set

wherein N_r is the r-neighborhood of data point x_i;

8) obtaining the cluster center factor τ(x_i) of each data point: combining step 6) and step 7), the cluster center factor τ(x_i) is given by formula (5):

11) performing anomaly detection on each cluster separately: after each cluster C_i (i = 1, 2, …, m) is obtained in step 10), anomaly detection is performed on each of the clusters C₁, C₂, ..., C_m of the clustered data set S to obtain the anomaly point set O_i of each cluster, and finally the set of all anomaly points O = {O₁, ..., O_m} in the data set S is obtained. The formulas involved in anomaly detection are as follows. The intra-cluster angle factor is given by formula (7):

where A, B, C are data points in the data set,

is the number of data points within the neighborhood of data point q;

the local increment value H(X_j) is given by formula (8):

the sum of distances to the k nearest neighbors L(X_j) is given by formula (9):

wherein,

the enhanced angle anomaly factor EAOF(X_j) is given by formula (10):

2. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the processing in step 1) means that the data collected by the data collection terminal is buffered in stream form and the buffered data is divided into data blocks E₀, E₁, E₂, ..., each data block representing a basic window and each sliding window W containing 2 basic windows; the insertion and deletion of data is realized by combining the basic window and the sliding window as follows: when time T_i changes to time T_{i+1}, the sliding window slides from W_i to W_{i+1}, accompanied by the merging of the new basic window E_{i+1} and the removal of the historical basic window E_{i-1}, while the candidate anomaly points detected in W_i at time T_i are merged into W_{i+1} for multiple verification.

3. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the calculation formula of the angle factor of the r neighborhood point set in step 6) is formula (2):

4. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the calculation formula of the local density of the r-neighborhood point set in step 6) is formula (3):

5. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, characterized in that, for the dissimilarity δ(x_i) in step 7), the local densities of all data points are sorted in descending order, and the calculation formula of the dissimilarity δ(x_i) is formula (4):

6. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the attribution matrix F = [f₁, f₂, ..., f_n] in step 9) is used to record the attribution relationship between data points, and the expression of each element is formula (6):