CN110941542A - Sequence integration high-dimensional data anomaly detection system and method based on elastic network - Google Patents

Publication number
CN110941542A
Authority
CN
China
Prior art keywords
abnormal
data
anomaly
layer
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911076540.7A
Other languages
Chinese (zh)
Other versions
CN110941542B (en)
Inventor
陈南
钱偲书
张晶
张露维
宋轶慧
刘文意
陈晨
邵佳炜
李科心
李静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
State Grid Corp of China SGCC
State Grid Shanghai Electric Power Co Ltd
Original Assignee
Nanjing University of Aeronautics and Astronautics
State Grid Corp of China SGCC
State Grid Shanghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics, State Grid Corp of China SGCC, State Grid Shanghai Electric Power Co Ltd
Priority to CN201911076540.7A
Publication of CN110941542A
Application granted
Publication of CN110941542B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an integrated high-dimensional data anomaly detection system based on an elastic network. The system comprises a single-layer system for each dimension of the high-dimensional data and an assembly integration module connected to the single-layer system of every dimension. Each single-layer system comprises: a data module; an anomaly scoring module whose first input end is connected to the data module; a selection module whose input end is connected to the first output end of the anomaly scoring module; an elastic network module whose input end is connected to the selection module and whose output end is connected to the second input end of the anomaly scoring module; and a single-layer integration module connected to the second output end of the anomaly scoring module. The assembly integration module is connected to the single-layer integration module of every dimension. The invention addresses the large individual prediction error, low detection precision and poor stability of high-dimensional data anomaly detection: it reduces the error and raises the precision of the individual prediction models for high-dimensional data and ensures the stability of anomaly detection.

Description

Sequence integration high-dimensional data anomaly detection system and method based on elastic network
Technical Field
The invention relates to the technical field of high-dimensional data anomaly detection, in particular to a sequence integration high-dimensional data anomaly detection system and method based on an elastic network.
Background
Anomalous data detection typically identifies data objects that do not conform to the general data distribution or that deviate significantly from the majority of data objects. It provides an important reference in a wide range of fields such as medical diagnosis, fraud detection and information security. The data generated in these fields are high-dimensional numerical data, such as thousands of molecular or gene expression features in bioinformatics, thousands of data features in transaction fraud, and the many complex information features in network attacks.
High-dimensional data means data with a large number of dimensions, typically hundreds or thousands or even more. There are two major difficulties in analyzing and processing high-dimensional numerical data. The first is that Euclidean distance loses its usefulness: in a low-dimensional space, Euclidean distance is meaningful and can be used to measure similarity between data, but in a high-dimensional space distances carry little meaning. The second is the curse of dimensionality: as the number of dimensions increases, the computational load rises rapidly, and the complexity and cost of analyzing and processing high-dimensional data grow exponentially. The detection of anomalous data in high-dimensional numerical data therefore faces the following challenges:
(1) High-dimensional numerical data typically contains features and noisy data that are unrelated to the anomalous data. These irrelevant features and noisy data interfere with anomaly detection in high-dimensional numerical data.
(2) As the dimensionality of the data increases, concepts from low-dimensional space such as neighborhood, distance and nearest neighbor lose their meaning, so conventional anomalous data detection methods based on distance, density and the like can no longer be used.
(3) When feature extraction is used to reduce the dimensionality of high-dimensional data, how to measure the accuracy of the extracted features is an open problem.
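The loss of meaning of Euclidean distance in high dimensions can be illustrated with a small experiment. The following is only a sketch under stated assumptions: the points are drawn uniformly at random, and the helper name `relative_contrast` and the chosen sizes are hypothetical and not taken from the patent. It measures how much the farthest neighbor differs from the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for distances from one query point to all others."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)

# Same number of points, very different dimensionality (illustrative sizes only).
low = relative_contrast(200, 2)
high = relative_contrast(200, 1000)
print(f"contrast in 2 dimensions: {low:.2f}, in 1000 dimensions: {high:.2f}")
```

In the high-dimensional case the nearest and farthest points end up at almost the same distance, which is why neighborhood-based detectors lose discriminating power.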
Many methods exist for anomalous data detection, such as distance-based, density-based and tree-based methods. However, because of their computational complexity and efficiency, these methods are costly when applied to high-dimensional data, and they do not perform particularly well at detecting anomalies in it. They therefore cannot be applied directly to high-dimensional data; the data must first be processed and only then passed to these methods.
For anomalous data detection in high-dimensional numerical data, the data is typically mapped into a low-dimensional space in a way that retains the information related to the anomalous data, and the anomalous data is then detected in that low-dimensional space. Later, techniques based on unsupervised representation learning began to emerge, such as subspace feature selection methods, neural networks and manifold learning methods.
Subspace-based feature selection methods look for feature subsets related to the anomalous data in order to reduce the influence of irrelevant features, and then perform conventional anomalous data detection on those subsets. This approach typically separates subset selection from anomalous data detection, so features unrelated to the anomalous data may end up being used for detection. It may therefore reduce the accuracy of anomalous data detection and increase its deviation.
Methods based on neural networks and manifold learning focus on preserving the regularity information of the data (e.g., data structure, neighborhood information), which is then used for learning tasks such as clustering and data compression. The information they retain therefore often contains redundant data.
To address the limitations of the above methods and the challenges of anomaly detection in high-dimensional numerical data, anomalous data detection methods based on ensemble learning later appeared. These methods combine multiple prediction models to exploit "strength in numbers" when detecting anomalous data. Although ensemble learning can reduce the detection error of the overall prediction model to some extent, it does not reduce the error of each individual prediction model. The CARE method, which is based on reducing the error of the individual prediction models, addresses that problem but performs poorly on anomaly detection in high-dimensional data. The CINFO method, which is based on sequence integration, performs feature extraction and anomalous data detection on high-dimensional data by constructing a sequence of anomalous data detection models. However, it uses a fixed threshold when selecting anomalous data, which suits only data sets whose proportion of anomalous data matches that threshold. In addition, it extracts features with Lasso regression, and when several variables or features are multicollinear, Lasso keeps only one of them, chosen essentially arbitrarily, so the selection is too random and its stability cannot be guaranteed.
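The difference between Lasso and an elastic-net penalty on collinear features can be sketched with a toy coordinate-descent solver. This is an illustrative sketch, not the solver used by CINFO or by the invention: the objective form, the mixing parameter `l1_ratio`, the function names and the toy data (two identical columns) are all assumptions made for the example.

```python
def soft_threshold(value, bound):
    """Shrink value towards zero by bound (the proximal step of the L1 penalty)."""
    if value > bound:
        return value - bound
    if value < -bound:
        return value + bound
    return 0.0

def fit_coordinate_descent(X, y, lam, l1_ratio, sweeps=500):
    """Minimise (1/2n)||y - Xw||^2 + lam*(l1_ratio*||w||_1 + (1-l1_ratio)/2*||w||_2^2)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that leaves column j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam * l1_ratio) / (scale + lam * (1.0 - l1_ratio))
    return w

# Toy data (assumed): two perfectly collinear columns, target = 2 * column.
z = [-1.0, 1.0, -1.0, 1.0]
X_toy = [[v, v] for v in z]
y_toy = [2.0 * v for v in z]

lasso_w = fit_coordinate_descent(X_toy, y_toy, lam=0.1, l1_ratio=1.0)  # pure L1 penalty
enet_w = fit_coordinate_descent(X_toy, y_toy, lam=0.1, l1_ratio=0.5)   # mixed L1 + L2 penalty
print("lasso:", lasso_w, "elastic net:", enet_w)
```

With this update order the Lasso fit puts practically all weight on the first column and leaves the second at about zero, whereas the elastic-net fit gives both columns the same weight. Which column Lasso keeps depends on the order of the updates, which is the instability described above.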
Disclosure of Invention
The invention aims to provide a sequence integration high-dimensional data anomaly detection system and method based on an elastic network. The system and the method aim to solve the problems of large individual prediction error, low detection precision and poor stability of high-dimensional data anomaly detection, realize small error and high precision of a high-dimensional data individual prediction model and ensure the stability of anomaly detection.
Because high-dimensional data has many dimensions and the amount of computation rises rapidly as the number of dimensions grows, anomaly detection is performed separately in each dimension of the high-dimensional data to reduce the computation. To achieve the above object, the invention provides an integrated high-dimensional data anomaly detection system based on an elastic network, which comprises a single-layer system corresponding to each dimension of the high-dimensional data and an assembly integration module connected to the single-layer system of each dimension;
each single-layer system comprises:
a data module for receiving the single-layer initial data of one dimension of the high-dimensional data;
an anomaly scoring module whose first input end is connected to the data module, for performing a first anomaly scoring on the single-layer initial data to obtain the anomaly score vector of the single-layer initial data;
a selection module whose input end is connected to the first output end of the anomaly scoring module, for selecting from the single-layer initial data according to the anomaly score vector to obtain an abnormal data set;
an elastic network module whose input end is connected to the selection module and whose output end is connected to the second input end of the anomaly scoring module, for extracting features from the abnormal data set according to the anomaly score vector to generate a feature vector and a mean square error;
the anomaly scoring module being further used for performing a second anomaly scoring on the feature vector and the mean square error to obtain an abnormal feature vector with anomaly scores;
a single-layer integration module connected to the second output end of the anomaly scoring module, for performing a first integration of the output mean square errors and the abnormal feature vectors with anomaly scores to obtain a single-layer anomaly result;
and the assembly integration module is connected to the single-layer integration module of each single-layer system and performs a second integration of the single-layer anomaly results output by the single-layer systems to obtain the final anomaly result.
The invention also provides an integrated high-dimensional data anomaly detection method based on the elastic network, which comprises the following steps:
receiving the single-layer initial data of each dimension of the high-dimensional data, and performing a first anomaly scoring on the single-layer initial data to obtain its anomaly score vector;
selecting from the single-layer initial data according to the anomaly score vector to obtain an abnormal data set;
extracting features from the abnormal data set according to the anomaly score vector to generate a feature vector and a mean square error;
performing a second anomaly scoring according to the feature vector and the mean square error to obtain an abnormal feature vector with anomaly scores;
comparing the mean square error with the initial mean square error value set by the elastic network module: when the mean square error is greater than the initial value, outputting the mean square error; when it is smaller than the initial value, repeating the above operations on the single-layer initial data of that dimension until the mean square error of a cycle is greater than that of the previous cycle, and then outputting the mean square error of that cycle;
performing a first integration of the output mean square errors and the abnormal feature vectors with anomaly scores to obtain the single-layer anomaly result of each dimension;
and performing a second integration of the single-layer anomaly results of all dimensions of the high-dimensional data to obtain the final anomaly result.
Most preferably, the single-layer initial data is $X_i$ ($1 \le i \le N$) and satisfies:
$X_i = (x_1, x_2, \ldots, x_M)$
wherein M is the number of features in the single-layer initial data; the high-dimensional data is X and satisfies:
$X = \{X_1, X_2, \ldots, X_N\}$
wherein N is the number of dimensions of the high-dimensional data.
Most preferably, the first and/or second anomaly scoring uses the isolation forest method, which comprises sampling, building isolation trees, calculating path lengths and normalizing the path lengths.
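The four steps named above (sampling, building isolation trees, calculating path lengths, normalizing) can be sketched as follows. This is a minimal illustration of the general isolation forest technique, not the patent's implementation: the tree count, sample size, function names and toy data are assumptions, and the input is assumed to contain at least two rows.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path(n):
    """Expected path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([r for r in rows if r[feature] < split], depth + 1, max_depth, rng),
        "right": build_tree([r for r in rows if r[feature] >= split], depth + 1, max_depth, rng),
    }

def path_length(row, node, depth=0):
    """Depth at which row lands in a leaf, plus a correction for unsplit leaf points."""
    if "size" in node:
        return depth + average_path(node["size"])
    branch = "left" if row[node["feature"]] < node["split"] else "right"
    return path_length(row, node[branch], depth + 1)

def isolation_scores(data, n_trees=100, sample_size=64, seed=0):
    """Anomaly score in (0, 1) per row; higher means easier to isolate."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))                     # step 1: sampling
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]                             # step 2: isolation trees
    norm = average_path(sample_size)
    return [2.0 ** (-sum(path_length(row, t) for t in trees) / (n_trees * norm))
            for row in data]                                      # steps 3-4: path length, normalize

# Toy data (assumed): 60 points around the origin plus one distant point.
demo_rng = random.Random(1)
demo = [[demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)] for _ in range(60)] + [[10.0, 10.0]]
demo_scores = isolation_scores(demo)
```

The distant point is isolated after very few splits, so its average path is short and its score is the highest in the list.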
Most preferably, selecting from the single-layer initial data according to the anomaly score vector to obtain the abnormal data set comprises the following steps:
calculating the expectation $E(S_i) = \mu$ and the variance $D(S_i) = \sigma^2$ of the anomaly score vector $S_i$;
calculating an outlier candidate function from the expectation $E(S_i)$ and the variance $D(S_i)$; the outlier candidate function is $H(S_i, \alpha)$ and satisfies:
$H(S_i, \alpha) = S_i - \mu - \alpha\sigma$
wherein α is the threshold set by the selection module in each dimension, and σ is the square root of the variance;
using the expectation $E(S_i)$, the variance $D(S_i)$ and the Chebyshev inequality to make a selection judgment on the anomaly score vector $S_i$; the judgment result $P(S \ge \mu + \alpha\sigma)$ satisfies:
$P(S \ge \mu + \alpha\sigma) \le P(|S - \mu| \ge \varepsilon) \le \dfrac{\sigma^2}{\varepsilon^2} = \dfrac{1}{\alpha^2}$
wherein ε is any sufficiently small positive number and satisfies $\varepsilon = \alpha\sigma$;
selecting from the anomaly score vector $S_i$ according to the judgment result $P(S \ge \mu + \alpha\sigma)$ to generate the abnormal data set C, which satisfies:
[formula image missing from the source text: definition of the abnormal data set C]
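The selection steps above can be sketched as follows. Because the formula defining the set C is not available in the text, the membership rule used here (keep every item whose candidate function value is non-negative, i.e. whose score is at least μ + ασ) is an assumption, as are the function name and the toy scores. The returned bound 1/α² is the Chebyshev limit on the fraction of items that can be selected.

```python
import statistics

def select_candidates(scores, alpha):
    """Return indices whose score is at least mu + alpha*sigma, plus the Chebyshev bound."""
    mu = statistics.fmean(scores)        # expectation E(S)
    sigma = statistics.pstdev(scores)    # square root of the variance D(S)
    # Assumed rule: H(s, alpha) = s - mu - alpha*sigma >= 0 marks an outlier candidate.
    chosen = [j for j, s in enumerate(scores) if s - mu - alpha * sigma >= 0]
    bound = 1.0 / alpha ** 2             # upper limit on the selected fraction
    return chosen, bound

# Toy scores (assumed): nine ordinary items and one with a high anomaly score.
toy_scores = [0.1] * 9 + [0.9]
chosen, bound = select_candidates(toy_scores, alpha=1.5)
print(chosen, bound)
```

Raising α tightens the bound and shrinks the selected set, which is how a per-dimension threshold controls the size of C without fixing the proportion of anomalies in advance.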
Most preferably, the feature extraction further comprises the following steps:
taking the anomaly score vector $S_i$ as the target feature and the abnormal data set C as the predictors, constructing a sparse regression model and solving for the regression coefficients ω; the sparse regression model is ElN(C, λ) and satisfies:
[formula image missing from the source text: definition of the sparse regression model ElN(C, λ)]
wherein λ is a non-negative regularization parameter, K is the number of data in the abnormal data set C, and T is the number of cycles when the cyclic operation ends;
extracting from the abnormal data set C the features most relevant to the regression coefficients ω as the feature vector F, and obtaining the mean square error mse; the feature vector F satisfies:
$F = \{X_i \mid \omega_i \ne 0,\ 1 < i < K\}$
wherein $\omega_i$ is the regression coefficient of the i-th abnormal data in the abnormal data set C.
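For orientation, an elastic-net sparse regression is commonly written in the following textbook form. This is an assumed standard formulation, not necessarily the expression used in the patent: the mixing parameter $\rho$, the row vectors $c_k$ (the k-th item of C) and the scalars $s_k$ (its anomaly score) are introduced here only for illustration, with one coefficient per predictor.

```latex
\hat{\omega} \;=\; \arg\min_{\omega}\;
  \frac{1}{2K}\sum_{k=1}^{K}\bigl(s_k - \omega^{\top} c_k\bigr)^{2}
  \;+\; \lambda\Bigl(\rho\,\lVert\omega\rVert_{1}
  \;+\; \frac{1-\rho}{2}\,\lVert\omega\rVert_{2}^{2}\Bigr),
  \qquad \lambda \ge 0,\; 0 \le \rho \le 1 .
```

Setting $\rho = 1$ recovers Lasso; any $\rho < 1$ adds the quadratic term that lets correlated predictors share weight instead of one being picked arbitrarily.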
Most preferably, the calculation of the mean square error mse further comprises the following steps:
when the mean square error mse is smaller than the preset initial mean square error value $mse_0$, repeating the above operations on the single-layer initial data of that dimension until the mean square error $mse_T$ of the T-th cycle is greater than the mean square error $mse_{T-1}$ of the previous cycle, and outputting the mean square error $mse_T$ of the T-th cycle;
scoring the feature vector F and the output mean square errors $mse_t$ ($1 \le t \le T$) by the second anomaly scoring, through the steps of sampling, building isolation trees, calculating path lengths and normalizing the path lengths, to obtain the abnormal feature vector Q with anomaly scores.
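The stopping rule described above can be sketched as a small loop. This is one possible reading, stated as an assumption: the cycle stops at the first round whose mean square error exceeds the previous value, with the preset initial value playing the role of the previous value in the first round. The callback `round_mse` and the safety cap `max_rounds` are hypothetical and do not appear in the patent.

```python
def run_rounds(round_mse, mse0, max_rounds=50):
    """Run cycles until the mean square error rises; return the error of every cycle."""
    history = []
    previous = mse0                      # the preset initial value acts as mse_0
    for t in range(1, max_rounds + 1):
        current = round_mse(t)           # one pass: select, extract features, re-score
        history.append(current)
        if current > previous:           # error went up: this is the final cycle T
            break
        previous = current
    return history

# Toy error sequence (assumed) standing in for real cycles.
fake_errors = [0.5, 0.3, 0.2, 0.25, 0.1]
kept = run_rounds(lambda t: fake_errors[t - 1], mse0=1.0)
print(kept)
```

The returned list holds one error per cycle, which is the input the subsequent integration step needs for its weights.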
Most preferably, the first integration comprises the following steps:
summing the mean square errors $mse_t$ of the cycles, $1 \le t \le T$, to obtain the mean square error sum SUM, which satisfies:
$SUM = \sum_{t=1}^{T} mse_t$
wherein T is the number of cycles when the cyclic operation ends;
subtracting the mean square error $mse_t$ of the t-th cycle from the mean square error sum SUM to obtain the error term $MSE_t$, which satisfies:
$MSE_t = SUM - mse_t,\ 1 \le t \le T$;
normalizing the error terms $MSE_t$ to obtain the weight $\gamma_t$ of each cycle, which satisfies:
$\gamma_t = \dfrac{MSE_t}{\sum_{k=1}^{T} MSE_k}$
unitizing the abnormal feature vector $Q_t$ of the t-th cycle to obtain the unit abnormal feature vector $\tau_t$, which satisfies:
$\tau_t = \dfrac{Q_t}{\lVert Q_t \rVert}$
computing the single-layer anomaly result of the i-th dimension from the weights $\gamma_t$ and the unit abnormal feature vectors $\tau_t$; the result satisfies:
[formula image missing from the source text: single-layer anomaly result of the i-th dimension]
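The first integration can be sketched as follows. Several details are assumptions because the corresponding formulas are not available in the text: the weights are the error terms divided by their total, the unit vectors use the Euclidean norm, the single-layer result is taken to be the weighted sum of the unit vectors, all score vectors have the same length, and a single cycle falls back to equal weights. The function name and toy numbers are hypothetical.

```python
import math

def first_integration(mse_per_round, q_per_round):
    """Weight each cycle's unit score vector by how small that cycle's error was."""
    total = sum(mse_per_round)                              # SUM
    error_terms = [total - m for m in mse_per_round]        # MSE_t = SUM - mse_t
    norm = sum(error_terms)
    if norm == 0:                                           # e.g. a single cycle (assumed fallback)
        weights = [1.0 / len(error_terms)] * len(error_terms)
    else:
        weights = [e / norm for e in error_terms]           # gamma_t (assumed normalization)
    combined = [0.0] * len(q_per_round[0])
    for gamma, q in zip(weights, q_per_round):
        length = math.sqrt(sum(v * v for v in q))           # assumed Euclidean norm, non-zero
        for j, v in enumerate(q):
            combined[j] += gamma * v / length               # gamma_t * tau_t
    return weights, combined

# Toy input (assumed): three cycles, two scored items each.
weights, single_layer = first_integration([0.2, 0.3, 0.5], [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]])
print(weights, single_layer)
```

Cycles with a smaller mean square error receive a larger weight, so the better-fitting cycles dominate the single-layer result.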
Most preferably, the second integration averages the single-layer anomaly results of the N dimensions; the final anomaly result is Z and satisfies:
[formula image missing from the source text: definition of the final anomaly result Z]
The method addresses the large individual prediction error, low detection precision and poor stability of high-dimensional data anomaly detection: it reduces the error and raises the precision of the individual prediction models for high-dimensional data and ensures the stability of anomaly detection.
Compared with the prior art, the invention has the following beneficial effects:
1. The system of the invention uses a multi-level sequence ensemble learning model based on an elastic network (MRENSE) to detect data anomalies in each dimension, which reduces the amount of computation and achieves anomaly detection for high-dimensional numerical data.
2. The system extracts features from the data through the elastic network module and then performs anomaly scoring on the extracted feature vectors, which addresses the large individual prediction error of high-dimensional data anomaly detection and keeps the error of the individual prediction models in the system small.
3. The system obtains the abnormal feature vectors by performing anomaly scoring twice on the data of each dimension, which addresses the low precision and poor stability of high-dimensional data anomaly detection and ensures the high precision and stability of the system.
Drawings
FIG. 1 is a schematic structural diagram of an integrated high-dimensional data anomaly detection system according to the present invention;
FIG. 2 is a flowchart of the integrated high-dimensional data anomaly detection method provided by the present invention.
Detailed Description
The invention will be further described by the following specific examples in conjunction with the drawings, which are provided for illustration only and are not intended to limit the scope of the invention.
Example 1
Because high-dimensional data has many dimensions and the amount of computation rises rapidly as the number of dimensions grows, anomaly detection is performed separately in each dimension of the high-dimensional data to reduce the computation.
The invention provides an integrated high-dimensional data anomaly detection system based on an elastic network, which comprises a single-layer system 1 corresponding to each dimension in high-dimensional data and an assembly integration module 2 connected with the single-layer system of each dimension, as shown in figure 1.
The single-layer system 1 comprises a data module 3, an anomaly scoring module 4, a selection module 5, an elastic network module 6 and a single-layer integration module 7. The data module 3 receives the single-layer initial data of one dimension of the high-dimensional data. The first input end of the anomaly scoring module 4 is connected to the data module 3, and the module performs a first anomaly scoring on the single-layer initial data to obtain the anomaly score vector of the single-layer initial data. The input end of the selection module 5 is connected to the first output end of the anomaly scoring module 4, and the module selects from the single-layer initial data according to the anomaly score vector to obtain an abnormal data set. The input end of the elastic network module 6 is connected to the selection module 5 and its output end is connected to the second input end of the anomaly scoring module 4; the module extracts features from the abnormal data set according to the anomaly score vector to generate a feature vector and a mean square error. The anomaly scoring module 4 further performs a second anomaly scoring on the feature vector and the mean square error to obtain an abnormal feature vector with anomaly scores. The single-layer integration module 7 is connected to the second output end of the anomaly scoring module 4 and performs a first integration of the output mean square errors and the abnormal feature vectors with anomaly scores to obtain a single-layer anomaly result.
The assembly integration module 2 is connected to the single-layer integration module 7 of each single-layer system and performs a second integration of the single-layer anomaly results output by the single-layer systems to obtain the final anomaly result.
Example 2
Based on the same inventive concept, the invention also provides an integrated high-dimensional data anomaly detection method based on the elastic network, as shown in fig. 2, the method comprises the following steps:
receiving the single-layer initial data of each dimension of the high-dimensional data X, wherein the single-layer initial data is $X_i$ ($1 \le i \le N$) and satisfies:
$X_i = (x_1, x_2, \ldots, x_M)$
wherein M is the number of features in the single-layer initial data; the high-dimensional data is X and satisfies:
$X = \{X_1, X_2, \ldots, X_N\}$
wherein N is the number of dimensions of the high-dimensional data. The single-layer initial data $X_i$ is transmitted to the anomaly scoring module for the first anomaly scoring to obtain the anomaly score vector $S_i$ of the single-layer initial data $X_i$. The first anomaly scoring uses the isolation forest method, which comprises sampling, building isolation trees, calculating path lengths and normalizing the path lengths. The single-layer initial data $X_i$ is then selected from according to the anomaly score vector $S_i$ to obtain the abnormal data set C, which comprises the following steps:
calculating the expectation $E(S_i) = \mu$ and the variance $D(S_i) = \sigma^2$ of the anomaly score vector $S_i$;
calculating an outlier candidate function from the expectation $E(S_i)$ and the variance $D(S_i)$ of the anomaly score vector $S_i$; the outlier candidate function is $H(S_i, \alpha)$ and satisfies:
$H(S_i, \alpha) = S_i - \mu - \alpha\sigma$
wherein α is the threshold set by the selection module in each dimension; α takes a different value in each dimension and can be specified by the user; σ is the square root of the variance.
The abnormal data set C consists of data whose distribution differs from that of most of the high-dimensional data, or that deviates significantly from most of the high-dimensional data objects, and it is only a small part of the whole data set. The number of elements K in the abnormal data set C is therefore controlled by setting the threshold α of the selection module 5.
Because the threshold α of the selection module 5 takes a different value in each dimension, the number K of elements in the abnormal data set C differs from dimension to dimension; integrating the single-layer anomaly results of the individual dimensions then makes the final anomaly result more reliable.
Using the expectation $E(S_i)$ and the variance $D(S_i)$ of the anomaly score vector $S_i$ together with the Chebyshev inequality, a selection judgment is made on the anomaly score vector $S_i$; the judgment result is $P(S \ge \mu + \alpha\sigma)$ and satisfies:
$P(S \ge \mu + \alpha\sigma) \le P(|S - \mu| \ge \varepsilon) \le \dfrac{\sigma^2}{\varepsilon^2} = \dfrac{1}{\alpha^2}$
wherein ε is any sufficiently small positive number and $\varepsilon = \alpha\sigma$.
According to the judgment result $P(S \ge \mu + \alpha\sigma)$, the anomaly score vector $S_i$ is selected from to generate the abnormal data set C, which satisfies:
[formula image missing from the source text: definition of the abnormal data set C]
Features are extracted from the abnormal data set C according to the anomaly score vector $S_i$ to generate the feature vector F and the mean square error mse; the feature extraction further comprises the following steps:
taking the anomaly score vector $S_i$ as the target feature and the abnormal data set C as the predictors, constructing a sparse regression model and solving for the regression coefficients ω; the sparse regression model is ElN(C, λ) and satisfies:
[formula image missing from the source text: definition of the sparse regression model ElN(C, λ)]
wherein λ is a non-negative regularization parameter, K is the number of data in the abnormal data set C, and T is the number of cycles when the cyclic operation ends.
As the regularization parameter λ gradually increases, the number of non-zero coefficients in the regression coefficients ω gradually decreases, which completes the sparse regression on the high-dimensional data.
The regularization parameter λ is selected in the elastic network module 6; an inappropriate λ may cause over-fitting or under-fitting. The optimal λ is selected on the abnormal data set C by 10-fold cross-validation so that the mean square error mse is minimized.
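Selecting λ by 10-fold cross-validation can be sketched as below. This is an illustration under assumptions rather than the module's actual code: the elastic-net objective, the fixed mixing value `l1_ratio=0.5`, the candidate grid, the function names and the synthetic data are all chosen for the example, and the data set is assumed to have at least as many rows as folds.

```python
import random

def soft_threshold(value, bound):
    """Shrink value towards zero by bound."""
    return max(value - bound, 0.0) - max(-value - bound, 0.0)

def elastic_net(X, y, lam, l1_ratio=0.5, sweeps=100):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam*(r*||w||_1 + (1-r)/2*||w||_2^2)."""
    n, p = len(X), len(X[0])
    scale = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                      for i in range(n)) / n
            w[j] = soft_threshold(rho, lam * l1_ratio) / (scale[j] + lam * (1.0 - l1_ratio))
    return w

def mean_squared_error(X, y, w):
    return sum((yi - sum(a * b for a, b in zip(row, w))) ** 2 for row, yi in zip(X, y)) / len(y)

def cross_validate_lambda(X, y, lambdas, folds=10, seed=0):
    """Return the candidate lambda with the lowest average held-out error."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = 0.0
        for f in range(folds):
            test_idx = order[f::folds]                   # every folds-th shuffled row
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            w = elastic_net([X[i] for i in train_idx], [y[i] for i in train_idx], lam)
            err += mean_squared_error([X[i] for i in test_idx], [y[i] for i in test_idx], w)
        if err / folds < best_err:
            best_lam, best_err = lam, err / folds
    return best_lam, best_err

# Synthetic data (assumed): 40 rows, 3 predictors, the third one irrelevant.
demo_rng = random.Random(0)
X_demo = [[demo_rng.gauss(0, 1) for _ in range(3)] for _ in range(40)]
y_demo = [2.0 * r[0] - 1.0 * r[1] + demo_rng.gauss(0, 0.05) for r in X_demo]
best_lam, best_err = cross_validate_lambda(X_demo, y_demo, [0.001, 0.1, 1.0, 10.0])
```

Very large candidates shrink every coefficient to zero (under-fitting), so their held-out error is high and the search settles on one of the smaller values.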
The features most relevant to the regression coefficients ω are extracted from the abnormal data set C as the feature vector F, and the mean square error mse is obtained; the feature vector F satisfies:
$F = \{X_i \mid \omega_i \ne 0,\ 1 < i < K\}$
wherein $\omega_i$ is the regression coefficient of the i-th abnormal data in the abnormal data set C.
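Once the coefficients ω are available, keeping the entries with non-zero coefficients and computing the mean square error takes only a few lines. In this sketch the layout is an assumption: each predictor is stored as a list of values, `omega` holds one coefficient per predictor, and a small tolerance `tol` decides what counts as non-zero. The names and toy numbers are hypothetical.

```python
def extract_features(columns, target, omega, tol=1e-12):
    """Keep predictors with a non-zero coefficient and report the fit's mean square error."""
    kept = [j for j, w in enumerate(omega) if abs(w) > tol]          # omega_j != 0
    n = len(target)
    predictions = [sum(omega[j] * columns[j][i] for j in kept) for i in range(n)]
    mse = sum((t - p) ** 2 for t, p in zip(target, predictions)) / n
    return kept, [columns[j] for j in kept], mse

# Toy input (assumed): three predictors, the second has a zero coefficient.
toy_columns = [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0], [0.0, 1.0, 0.0]]
toy_target = [2.0, 3.0, 7.0]
kept, feature_vector, toy_mse = extract_features(toy_columns, toy_target, [2.0, 0.0, -1.0])
print(kept, toy_mse)
```

Dropping the zero-coefficient entries is what reduces the dimensionality handed to the second anomaly scoring.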
The feature vector F and the mean square error mse are transmitted back to the anomaly scoring module for the second anomaly scoring to obtain the abnormal feature vector Q with anomaly scores; in the second anomaly scoring, the feature vector F goes through the steps of sampling, building isolation trees, calculating path lengths and normalizing the path lengths.
The feature extraction of the elastic network module 6 reduces the dimensionality of the high-dimensional data to a certain degree, so the second anomaly scoring with the isolation forest method is easier than the first.
The calculation of the mean square error mse further comprises the following step: when the mean square error mse is smaller than the preset initial mean square error value $mse_0$, the above operations are repeated on the single-layer initial data of that dimension until the mean square error $mse_T$ of the T-th cycle is greater than the mean square error $mse_{T-1}$ of the previous cycle, and the mean square error $mse_T$ of the T-th cycle is output.
The output mean square errors $mse_t$ ($1 \le t \le T$) and the abnormal feature vectors Q with anomaly scores are integrated for the first time to obtain the single-layer anomaly result of each dimension.
The first integration comprises the following steps:
summing the mean square errors $mse_t$ of the cycles ($1 \le t \le T$) to obtain the mean square error sum SUM, which satisfies:
$SUM = \sum_{t=1}^{T} mse_t$
wherein T is the number of cycles when the cyclic operation ends;
subtracting the mean square error $mse_t$ of the t-th cycle from the mean square error sum SUM to obtain the error term $MSE_t$, which satisfies:
$MSE_t = SUM - mse_t,\ 1 \le t \le T$;
normalizing the error terms $MSE_t$ to obtain the weight $\gamma_t$ of each cycle, which satisfies:
$\gamma_t = \dfrac{MSE_t}{\sum_{k=1}^{T} MSE_k}$
unitizing the abnormal feature vector $Q_t$ of the t-th cycle to obtain the unit abnormal feature vector $\tau_t$, which satisfies:
$\tau_t = \dfrac{Q_t}{\lVert Q_t \rVert}$
computing the single-layer anomaly result of the i-th dimension from the weights $\gamma_t$ and the unit abnormal feature vectors $\tau_t$; the result satisfies:
[formula image missing from the source text: single-layer anomaly result of the i-th dimension]
The single-layer abnormal result of each dimension in the high-dimensional data X
Figure BDA0002262650200000106
is transmitted to the assembly integration module for the second integration to obtain the final abnormal result; the second integration averages the single-layer abnormal results of the N dimensions; the final abnormal result is Z, which satisfies:
Figure BDA0002262650200000107
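The two integration steps can be sketched as follows. The formula images for the weights, the unitization and the final average are not reproduced in this text, so the sketch makes explicit assumptions: the weights are the error terms divided by their sum, unitization divides by the Euclidean norm, and the second integration is an element-wise mean. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def first_integration(mse_list: Sequence[float],
                      q_list: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-cycle abnormal feature vectors Q_t into one result.

    Assumptions: gamma_t = MSE_t / sum(MSE_t) and tau_t = Q_t / ||Q_t||_2.
    """
    total = sum(mse_list)                    # SUM
    errors = [total - m for m in mse_list]   # MSE_t = SUM - mse_t
    norm = sum(errors)
    if norm == 0:                            # e.g. T == 1: fall back to equal weights
        gammas = [1.0 / len(mse_list)] * len(mse_list)
    else:
        gammas = [e / norm for e in errors]  # smaller mse_t gives larger weight
    result = [0.0] * len(q_list[0])
    for gamma, q in zip(gammas, q_list):
        length = math.sqrt(sum(v * v for v in q))
        if length == 0:
            continue                         # a zero vector contributes nothing
        for k, v in enumerate(q):
            result[k] += gamma * v / length
    return result


def second_integration(layer_results: Sequence[Sequence[float]]) -> List[float]:
    """Average the single-layer results of the N dimensions element-wise."""
    n = len(layer_results)
    return [sum(col) / n for col in zip(*layer_results)]
```

Subtracting each error from SUM before normalizing means that cycles with a lower mean square error receive a larger weight in the single-layer result.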
The working principle of the invention is as follows:
Single-layer initial data of each dimension in the high-dimensional data is received, and first anomaly scoring is performed on it to obtain an anomaly score vector; the single-layer initial data is selected according to the anomaly score vector to obtain an abnormal data set; features of the abnormal data set are extracted according to the anomaly score vector to generate a feature vector and a mean square error; second anomaly scoring is performed on the feature vector and the mean square error to obtain an abnormal feature vector with anomaly scores; first integration is performed on the output mean square error and the abnormal feature vector to obtain a single-layer abnormal result; and second integration is performed on the single-layer abnormal results of all dimensions in the high-dimensional data to obtain the final abnormal result.
In conclusion, the method and the device solve the problems of large individual prediction error, low detection precision and poor stability of high-dimensional data anomaly detection, realize small error and high precision of the high-dimensional data individual prediction model, and ensure the stability of anomaly detection.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (10)

1. An integrated high-dimensional data anomaly detection system based on an elastic network is characterized by comprising a single-layer system corresponding to each dimension in high-dimensional data and an assembly integration module connected with the single-layer system of each dimension;
the single layer system comprises:
a data module, configured to receive single-layer initial data of each dimension in the high-dimensional data;
an anomaly scoring module, a first input end of which is connected with the data module, configured to perform first anomaly scoring on the single-layer initial data to obtain an anomaly score vector of the single-layer initial data;
a selection module, an input end of which is connected with a first output end of the anomaly scoring module, configured to select from the single-layer initial data according to the anomaly score vector to obtain an abnormal data set;
an elastic network module, an input end of which is connected with the selection module and an output end of which is connected with a second input end of the anomaly scoring module, configured to perform feature extraction on the abnormal data set according to the anomaly score vector to generate a feature vector and a mean square error;
the anomaly scoring module being further configured to perform second anomaly scoring on the feature vector and the mean square error to obtain an abnormal feature vector with anomaly scores;
a single-layer integration module, connected with a second output end of the anomaly scoring module, configured to perform first integration on the output mean square error and the abnormal feature vector to obtain a single-layer abnormal result;
and the assembly integration module is connected with the single-layer integration module of each single-layer system and is configured to perform second integration on the single-layer abnormal results output by the single-layer systems to obtain a final abnormal result.
2. An integrated high-dimensional data anomaly detection method based on an elastic network is characterized by comprising the following steps:
receiving single-layer initial data of each dimension in high-dimensional data, and performing first anomaly scoring on the single-layer initial data to obtain an anomaly score vector in the single-layer initial data;
selecting the single-layer initial data according to the anomaly score vector to obtain an abnormal data set;
extracting the features of the abnormal data set according to the anomaly score vector to generate a feature vector and a mean square error;
performing second anomaly scoring according to the feature vector and the mean square error to obtain an abnormal feature vector with anomaly scores;
performing first integration on the output mean square error and the abnormal feature vector to obtain a single-layer abnormal result of each dimension;
and performing second integration on the single-layer abnormal result of each dimension to obtain a final abnormal result.
3. The method for integrated anomaly detection of high-dimensional data based on elastic network as claimed in claim 2, wherein the single-layer initial data is $X_i$, $1 \le i \le N$, and satisfies:
$$X_i = (x_1, x_2, \ldots, x_M)$$
wherein M is the number of features in the single-layer initial data; the high-dimensional data is X and satisfies the following conditions:
$$X = \{X_1, X_2, \ldots, X_N\}$$
wherein N is the number of dimensions in the high dimensional data.
4. An integrated elastic network-based high-dimensional data anomaly detection method according to claim 2, characterized in that the first anomaly scoring and/or the second anomaly scoring is based on an isolation forest method comprising: sampling, establishing isolation trees, calculating the path length and normalizing the path length.
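The path-length normalization named in claim 4 can be sketched as follows. The claim does not spell out a formula, so the sketch assumes the commonly used isolation forest normalization, in which the mean path length of a point is divided by the expected path length of an unsuccessful binary-search-tree lookup over the sample size. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def expected_path_length(sample_size: int) -> float:
    """Expected path length c(n) used to normalize isolation-tree depths."""
    if sample_size <= 1:
        return 0.0
    if sample_size == 2:
        return 1.0
    harmonic = math.log(sample_size - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (sample_size - 1) / sample_size


def anomaly_score(mean_path_length: float, sample_size: int) -> float:
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / expected_path_length(sample_size))
```

Under this normalization a point whose mean path length equals the expected length scores 0.5, and points isolated after only a few splits score higher.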
5. The method for detecting the anomaly of the integrated high-dimensional data based on the elastic network as claimed in claim 2, wherein the step of selecting the single-layer initial data according to the anomaly score vector to obtain an anomaly data set comprises the following steps:
calculating the expectation $E(S_i) = \mu$ and the variance $D(S_i) = \sigma^2$ of the anomaly score vector $S_i$;
according to the expectation $E(S_i)$ and the variance $D(S_i)$, calculating an outlier candidate function; the outlier candidate function is $H(S_i, \alpha)$, and satisfies:
$$H(S_i, \alpha) = S_i - \mu - \alpha\sigma$$
wherein $\alpha$ is the threshold value set by the selection module in each layer, and $\sigma$ is the square root of the variance;
according to the expectation $E(S_i)$ and the variance $D(S_i)$, using the Chebyshev inequality to make a selection judgment on the anomaly score vector $S_i$, the judgment result $P(S \ge \mu + \alpha\sigma)$ satisfying:
Figure FDA0002262650190000021
wherein $\varepsilon$ is any sufficiently small positive number and satisfies the condition of $\alpha\sigma$;
according to the judgment result $P(S \ge \mu + \alpha\sigma)$, performing selective differentiation on the anomaly score vector $S_i$ to generate an abnormal data set C, which satisfies:
Figure FDA0002262650190000022
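The selection step of claim 5 can be sketched as follows. The formula image that defines the abnormal data set C is not reproduced here, so the sketch assumes that C keeps every item whose outlier candidate function is non-negative, and that the variance is the population variance. The helper `chebyshev_upper_bound` uses the standard Chebyshev inequality, which may differ in form from the one shown in the original figure. All function names are hypothetical.

```python
import statistics
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_abnormal(data: Sequence[T], scores: Sequence[float],
                    alpha: float) -> List[T]:
    """Keep items whose score satisfies H(S, alpha) = S - mu - alpha*sigma >= 0."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # assumed: population standard deviation
    return [x for x, s in zip(data, scores) if s - mu - alpha * sigma >= 0]


def chebyshev_upper_bound(alpha: float) -> float:
    """Upper bound on P(S >= mu + alpha*sigma) from Chebyshev's inequality."""
    return min(1.0, 1.0 / (alpha * alpha))
```

The bound shows why a larger threshold alpha yields a smaller abnormal data set: for any score distribution, at most a fraction 1/alpha^2 of the scores can lie alpha standard deviations above the mean.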
6. The method for detecting the anomaly of the integrated high-dimensional data based on the elastic network as claimed in claim 2, wherein the step of extracting the features of the abnormal data set according to the anomaly score vector to generate a feature vector and a mean square error comprises the following steps:
taking the anomaly score vector $S_i$ as the target feature and the abnormal data set C as the predictor, constructing a sparse regression model and solving for the regression coefficient $\omega$; the sparse regression model is ElN(C, λ), and satisfies:
Figure FDA0002262650190000031
wherein λ is a nonnegative regularization parameter, and K is the number of data in the abnormal data set C; t is the cycle number when the cycle operation is finished;
extracting, according to the regression coefficient $\omega$, the most relevant features from the abnormal data set C, which yields a feature vector F and the mean square error mse; the feature vector F satisfies:
$$F = \{X_i \mid \omega_i \neq 0,\ 1 < i < K\}$$
wherein $\omega_i$ is the regression coefficient of the i-th abnormal datum in the abnormal data set C.
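The sparse regression of claim 6 can be sketched as follows. The formula image for ElN(C, λ) is not reproduced here, so the sketch assumes the widely used elastic net objective $\frac{1}{2K}\lVert S - C\omega\rVert_2^2 + \lambda\rho\lVert\omega\rVert_1 + \frac{\lambda(1-\rho)}{2}\lVert\omega\rVert_2^2$, solved by coordinate descent. The mixing ratio `l1_ratio` (ρ), the sweep count and all function names are assumptions that do not appear in the claim.

```python
from typing import List, Sequence


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(rows: Sequence[Sequence[float]], target: Sequence[float],
                lam: float, l1_ratio: float = 0.5, sweeps: int = 200) -> List[float]:
    """Coordinate descent for the assumed elastic net objective."""
    n, p = len(rows), len(rows[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for row, y in zip(rows, target):
                # Prediction from every predictor except column j.
                partial = sum(w[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * (y - partial)
                z += row[j] * row[j]
            denom = z / n + lam * (1.0 - l1_ratio)
            w[j] = soft_threshold(rho / n, lam * l1_ratio) / denom if denom else 0.0
    return w


def mean_squared_error(rows, target, w) -> float:
    """Mean of squared residuals between target and the fitted linear model."""
    residuals = [y - sum(c * v for c, v in zip(w, row)) for row, y in zip(rows, target)]
    return sum(r * r for r in residuals) / len(residuals)


def selected_features(w: Sequence[float]) -> List[int]:
    """Indices with non-zero coefficients, mirroring F = {X_i | omega_i != 0}."""
    return [j for j, c in enumerate(w) if c != 0.0]
```

The L1 part of the penalty drives weakly related coefficients to exactly zero, which is what makes the non-zero test in the definition of F a usable feature selector.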
7. The method for integrated high-dimensional data anomaly detection based on elastic network according to claim 2, characterized in that said calculation of mean square error mse further comprises the following steps:
when the mean square error mse is smaller than a preset initial value $mse_0$, repeatedly performing the above operations on the single-layer initial data of the dimension until the mean square error $mse_T$ of the T-th cycle is greater than the mean square error $mse_{T-1}$ of the previous cycle, and outputting the mean square error $mse_T$ of the T cycles.
8. The method of claim 2, wherein the feature vector F and the output mean square error $mse_t$ ($1 \le t \le T$) are subjected to the second anomaly scoring, and an abnormal feature vector Q with anomaly scores is obtained through the steps of sampling, establishing isolation trees, calculating the path length and normalizing the path length.
9. The method for integrated high-dimensional data anomaly detection based on elastic network according to claim 2, characterized in that said first integration comprises the following steps:
summing the mean square errors $mse_t$ of the cycles, wherein $1 \le t \le T$, to obtain the mean square error sum SUM, which satisfies:
$$\mathrm{SUM} = \sum_{t=1}^{T} mse_t$$
wherein T is the number of cycles when the cyclic operation ends;
subtracting the mean square error $mse_t$ of the t-th cycle from the mean square error sum SUM to obtain an error term $MSE_t$, which satisfies:
$$MSE_t = \mathrm{SUM} - mse_t, \quad 1 \le t \le T;$$
normalizing the error term $MSE_t$ to obtain the weight $\gamma_t$ for each cycle, which satisfies:
Figure FDA0002262650190000041
unitizing the abnormal feature vector $Q_t$ of the t-th cycle to obtain the unit abnormal feature vector $\tau_t$, which satisfies:
Figure FDA0002262650190000042
computing, from the weight $\gamma_t$ and the unit abnormal feature vector $\tau_t$, the single-layer abnormal result of the i-th dimension
Figure FDA0002262650190000045
which satisfies:
Figure FDA0002262650190000043
10. The elastic network-based sequence integration high-dimensional data anomaly detection method according to claim 2, wherein the second integration is performed by averaging the single-layer abnormal results of the N dimensions; the final abnormal result is Z and satisfies the following conditions:
Figure FDA0002262650190000044
Application number: CN201911076540.7A; priority date: 2019-11-06; filing date: 2019-11-06; title: Sequence integration high-dimensional data anomaly detection system and method based on elastic network; status: Active; granted as CN110941542B (en)

Publications (2)

Publication Number Publication Date
CN110941542A true CN110941542A (en) 2020-03-31
CN110941542B CN110941542B (en) 2023-08-25

Family

ID=69906630

Country Status (1)

Country Link
CN (1) CN110941542B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant