CN109462521B - Network flow abnormity detection method suitable for source network load interaction industrial control system - Google Patents

Info

Publication number
CN109462521B
Authority
CN
China
Prior art keywords
data
module
flow
training
classification
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811415563.1A
Other languages
Chinese (zh)
Other versions
CN109462521A (en)
Inventor
吴克河
张晓良
何辉
张明
朱红勤
余刚刚
吴屹浩
杨东锴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
North China Electric Power University
Nanjing Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
North China Electric Power University
Nanjing Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Application filed by State Grid Jiangsu Electric Power Co Ltd, North China Electric Power University, and Nanjing Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority to CN201811415563.1A
Publication of CN109462521A
Application granted
Publication of CN109462521B
Legal status: Expired - Fee Related

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/14: Network analysis or design
    • H04L41/142: Network analysis or design using statistical or mathematical methods
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L41/147: Network analysis or design for predicting network behaviour
    • H04L43/00: Arrangements for monitoring or testing data switching networks
    • H04L43/02: Capturing of monitoring data
    • H04L43/022: Capturing of monitoring data by sampling
    • H04L43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876: Network utilisation, e.g. volume of load or congestion level
    • H04L43/0894: Packet rate
    • H04L43/16: Threshold monitoring
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/24: Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441: Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Algebra (AREA)
  • Environmental & Geological Engineering (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a network traffic anomaly detection method suitable for a source network load interaction industrial control system. The method adopts a two-layer classification mechanism. An OCSVM model first performs a first classification; this classifier passes most normal traffic and is tuned so that as much abnormal traffic as possible is detected. The data judged abnormal by the OCSVM model (which may include some normal traffic) is then classified a second time by a GBDT algorithm. The second classification identifies the normal traffic that was falsely detected in the first classification, and this traffic is added to the samples for retraining, which improves detection accuracy. The invention maintains traffic detection accuracy while achieving high detection efficiency, and meets the traffic detection requirements of a source network load interaction industrial control system.

Description

Network flow abnormity detection method suitable for source network load interaction industrial control system
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a network flow abnormity detection method suitable for a source network load interaction industrial control system.
Background
With the construction of the global energy internet and the rapid development of extra-high-voltage power grids and distributed energy, new loads with the dual characteristics of source and load, such as electric vehicles and controllable users, continue to emerge. The spatio-temporal distribution of power flow in the grid is becoming increasingly complex, and interaction and cooperative control between the grid and its users are becoming ever more important and urgent.
Against the background of source network load interaction, industrial control systems are widely deployed in power supply companies, power plants and substations, and are extending to the new-energy generation side and the user side. Security control is multi-level and multi-type, with frequent exchange of monitoring and control information. The acquisition, transmission and execution of operating information and control instructions are exposed to risks such as eavesdropping, tampering and interruption, and the connection of large numbers of widely dispersed new-energy generation equipment and user equipment makes securing the system more difficult. Monitoring the network traffic of the source network load system in real time and discovering network anomalies promptly is therefore of great significance for the stability and security of the system.
At present, abnormal traffic is mainly detected as follows: labeled traffic data is used to train a classifier that distinguishes normal traffic data from abnormal traffic data, and the classifier is then used to detect abnormal traffic.
This approach trains on specific historical traffic data, and once the historical data becomes outdated, judgments on the live network can be badly wrong, so detection accuracy in practical applications is low. It is also difficult to achieve both accuracy and efficiency, so the approach cannot be used directly to detect network traffic anomalies in a source network load interaction industrial control system.
Disclosure of Invention
Purpose of the invention: to address the problems in the prior art, the invention provides a network traffic anomaly detection method suitable for a source network load interaction industrial control system. The method adopts a self-learning two-layer detection model; through self-learning the detection model updates itself and adapts to changes in the environment, which improves detection accuracy and detection efficiency.
Technical scheme: to solve the above technical problem, the invention provides a network traffic anomaly detection method suitable for a source network load interaction industrial control system, comprising the following steps:
(1) a data acquisition module collects traffic data in the source network load interaction industrial control system in real time, computes statistics of the data features in the traffic data, and passes the traffic feature data to a data processing module for processing;
(2) the data processing module processes offline sample data or online test data and applies the processed data to the first classification module;
(3) the processed training sample data 1 and the traffic data held in the self-learning module, obtained in step (6), together form new sample data; this data is preprocessed by a data preprocessing module, and the preprocessed data is applied to a first training module;
(4) the data processed in step (2) is taken as input, the model is trained by the first training module of step (3), and the data then enters the first classification module, which detects whether the traffic is normal; if normal, the traffic is output as normal, and if not, the method proceeds to step (6);
(5) the data processing module processes the training sample data 2, and the processed data is applied to a second training module;
(6) the second training module trains on its data; the trained model and the abnormal traffic data obtained in step (4) enter a second classification module, which detects whether the traffic is normal; if normal, the data is added to the self-learning module and the method returns to step (3); if abnormal, the abnormal traffic is output and an alarm is raised.
Further, when the data processed by the data preprocessing module in step (3) is applied to the first training module, dimensionality reduction is required, which comprises the following steps:
(3.1) center the original d-dimensional sample data set according to the centering formula [formula image not reproduced], where the sample data is $\{(X^{(1)}, y^{(1)}), (X^{(2)}, y^{(2)}), \dots, (X^{(n)}, y^{(n)})\}$;
(3.2) construct the covariance matrix of the samples, using the covariance formula [formula image not reproduced];
(3.3) compute the eigenvalues of the covariance matrix and the corresponding eigenvectors; the eigenvectors of the covariance matrix represent the principal components, and the importance of each eigenvector is determined by the magnitude of its eigenvalue;
(3.4) select the k eigenvectors corresponding to the k largest eigenvalues;
(3.5) construct the mapping matrix W from the k eigenvectors;
(3.6) reduce the d-dimensional data to a k-dimensional vector Z through the mapping matrix W: $z^{(i)} = W^{T} X^{(i)}$.
Further, the specific steps by which the first classification module detects whether the traffic is normal in step (4) are as follows:
(4.1) select an adjustable parameter $v$ and a kernel function [formula image not reproduced];
(4.2) train on the sample data by solving the optimization problem [formula images not reproduced]; choose any $\alpha^{*}$ satisfying the condition [formula image not reproduced] and compute [formula image not reproduced], where the samples whose $\alpha^{*}$ satisfy [formula image not reproduced] are the support vectors;
(4.3) obtain the decision function $f(x)$ [formula image not reproduced]; if $f(x) > 0$ the data is judged to be normal, and if $f(x) < 0$ the data is judged to be abnormal, where $N_{sv}$ is the number of support vectors.
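The formula images for steps (4.1) to (4.3) did not survive extraction. For orientation only, the standard ν-formulation of the one-class SVM dual, which is consistent with the quantities named in the text (adjustable parameter $v$, kernel $K$, multipliers $\alpha^{*}$, $N_{sv}$ support vectors), is sketched below. That the patent's images use exactly this form and notation is an assumption; $l$ denotes the number of training samples and $\rho^{*}$ the hyperplane offset.

```latex
% Dual problem of the nu-one-class SVM (standard form, assumed here)
\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\,\alpha_j\,K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le \frac{1}{v\,l}, \qquad \sum_{i=1}^{l}\alpha_i = 1

% Offset, computed from any multiplier strictly inside its bounds
\rho^{*} = \sum_{i=1}^{l}\alpha_i^{*}\,K(x_i, x_j)
\quad \text{for any } j \text{ with } 0 < \alpha_j^{*} < \frac{1}{v\,l}

% Decision function over the support vectors (those x_i with \alpha_i^{*} > 0)
f(x) = \sum_{i=1}^{N_{sv}} \alpha_i^{*}\,K(x_i, x) - \rho^{*}
```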
Further, the specific steps by which the second classification module detects whether the traffic is normal in step (6) are as follows:
(6.1) start generating the GBDT classifier;
(6.2) initialize the loss function [formula image not reproduced], where the loss function is $L(y, F) = \log(1 + e^{-2yF}),\ y \in \{-1, 1\}$;
(6.3) initialize the classification model from the loss function [formula image not reproduced];
(6.4) compute the value of the negative gradient of the loss function at the current model and use it as the estimate of the residual [formula image not reproduced];
(6.5) construct a CART tree on the residuals [formula image not reproduced]; each training sample is finally assigned to a leaf node, and the predicted value $\gamma_{jm}$ of each leaf node is given by [formula image not reproduced]; update $F_m(x) = F_{m-1}(x) + \gamma_{jm}$ to obtain a stronger classification model;
(6.6) judge whether the ending condition is met, namely that the gradient has reached its minimum or the number of iterations has reached the set value;
(6.7) generate the GBDT classification model $F(x)$; a threshold $\theta$ is set, and the traffic is normal when $F(x) < \theta$ and abnormal when $F(x) > \theta$.
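The formula images for steps (6.3) to (6.5) are missing from the extracted text. For the stated loss $L(y, F) = \log(1 + e^{-2yF})$, the standard two-class gradient boosting formulation gives the expressions below. They are offered as a plausible reading, not as a transcription of the patent's images; $\bar{y}$ (the mean label), $\tilde{y}_i$ (the residual of sample $i$) and $R_{jm}$ (the set of samples in leaf $j$ of the $m$-th tree) are notation introduced here.

```latex
% Initial model
F_0(x) = \frac{1}{2}\log\frac{1 + \bar{y}}{1 - \bar{y}}

% Negative gradient of the loss at the current model, used as the residual
\tilde{y}_i = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}
            = \frac{2\,y_i}{1 + e^{\,2 y_i F_{m-1}(x_i)}}

% Leaf value (one Newton step within leaf R_{jm})
\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} \tilde{y}_i}
                   {\sum_{x_i \in R_{jm}} |\tilde{y}_i|\,\bigl(2 - |\tilde{y}_i|\bigr)}

% Model update: each sample receives the value of the leaf it falls into
F_m(x) = F_{m-1}(x) + \sum_{j} \gamma_{jm}\,\mathbf{1}(x \in R_{jm})
```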
Further, the specific steps for constructing the CART tree in step (6.5) are as follows:
(6.5.1) start generating the CART tree;
(6.5.2) select a feature and divide all samples into a left subtree and a right subtree according to the feature value;
(6.5.3) compute the variance of the left subtree and of the right subtree; let [symbol image not reproduced] be the mean of the node labels in the left subtree and [symbol image not reproduced] the mean of the node labels in the right subtree; the variance is computed as [formula image not reproduced];
(6.5.4) select the feature split point that minimizes the sum of the variances of the left and right subtrees and perform one division;
(6.5.5) continue dividing downward in the same way;
(6.5.6) judge whether an ending condition is met; if it is met, output the generated CART tree and end; if not, return to step (6.5.4).
Further, the ending condition in step (6.5.6) is any of the following: the node is a pure node, that is, the target variable values of all its records are the same; the depth of the tree reaches a pre-specified maximum; the maximum decrease in impurity is less than a pre-specified value; the number of records in the node is less than the pre-specified minimum number of records per node; all records in the node have the same predictor variable values.
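The tree construction and ending conditions above can be sketched as a small regression tree. This is an illustrative sketch under assumptions, not the patent's implementation: the "variance" criterion is taken to be the sum of squared deviations from the subtree mean (the original formula image is missing), candidate thresholds are the observed feature values, and the names `best_split`, `build_tree`, `predict`, `max_depth`, `min_records` and `min_gain` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean (assumed 'variance' criterion)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(X, y):
    """Return (cost, feature, threshold) minimising left + right spread, or None."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # a split must put records on both sides
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best


def build_tree(X, y, depth=0, max_depth=3, min_records=2, min_gain=1e-9):
    node_cost = sse(y)
    # Ending conditions: pure node, maximum depth reached, too few records.
    if node_cost <= 1e-12 or depth >= max_depth or len(y) < min_records:
        return {"leaf": mean(y)}
    split = best_split(X, y)
    # Ending conditions: identical predictor values, or too small an improvement.
    if split is None or node_cost - split[0] < min_gain:
        return {"leaf": mean(y)}
    _, f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    args = (depth + 1, max_depth, min_records, min_gain)
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left], *args),
        "right": build_tree([X[i] for i in right], [y[i] for i in right], *args),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's mean label."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]
```

Each of the five ending conditions listed in the text maps to one of the two early-return checks in `build_tree`.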
Further, the self-learning module in step (3) performs self-learning as follows:
(3.1) initialize the self-learning module, setting the sample capacity and the condition that triggers training; the trigger condition may be a specific time or a specific state, for example a timed trigger;
(3.2) monitor whether any traffic has been falsely detected;
(3.3) if traffic has been falsely detected, it is combined with the original training samples to form new training samples: if the training samples have not reached the set capacity, the new sample is added directly; otherwise it replaces the earliest training sample;
(3.4) judge whether the trigger condition is met; if not, return to step (3.2);
(3.5) if the trigger condition is met, retrain the classification model.
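The self-learning steps can be sketched with a bounded buffer. This is a minimal illustration under assumptions: the class name `SelfLearningBuffer`, its methods, and the idea of passing the trigger as a callable are all hypothetical, and the example trigger counts new samples instead of using the timed trigger mentioned in the text.

```python
from collections import deque


class SelfLearningBuffer:
    """Holds training samples up to a fixed capacity and retrains on a trigger."""

    def __init__(self, base_samples, capacity, trigger):
        # deque(maxlen=...) discards the earliest sample once capacity is reached,
        # which mirrors "otherwise it replaces the earliest training sample".
        self.samples = deque(base_samples, maxlen=capacity)
        self.trigger = trigger  # callable(buffer) -> bool, hypothetical interface
        self.pending = 0        # falsely detected samples added since last retrain

    def add_false_detection(self, sample):
        """Step (3.3): merge a falsely detected sample into the training set."""
        self.samples.append(sample)
        self.pending += 1

    def maybe_retrain(self, train):
        """Steps (3.4)-(3.5): retrain only when the trigger condition holds."""
        if not self.trigger(self):
            return None
        self.pending = 0
        return train(list(self.samples))
```

A time-based trigger would only need a callable that compares the current time with the time of the last retraining.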
Compared with the prior art, the invention has the advantages that:
The invention adopts a two-layer classification mechanism. An OCSVM model first performs a first classification; this classifier passes most normal traffic and is tuned so that as much abnormal traffic as possible is detected. The data judged abnormal by the OCSVM model (which may include some normal traffic) is then classified a second time by a GBDT algorithm. The second classification identifies the normal traffic that was falsely detected in the first classification, and this traffic is added to the samples for retraining, which improves detection accuracy. The method achieves high detection efficiency while maintaining traffic detection accuracy, and meets the traffic detection requirements of a source network load interaction industrial control system. It can be used to detect network traffic anomalies in source network load interaction industrial control systems and in other industrial control systems.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of the application of the data processed by the data preprocessing module to the first training module in FIG. 1;
FIG. 3 is a flowchart illustrating the first classification module in FIG. 1 detecting whether the flow rate is normal;
FIG. 4 is a flowchart illustrating the second classification module of FIG. 1 detecting whether the flow rate is normal;
FIG. 5 is a flow diagram of constructing the CART tree of FIG. 4;
FIG. 6 is a flow chart of the self-learning in FIG. 1.
Detailed Description
The invention is further elucidated with reference to the drawings and the detailed description.
As shown in fig. 1 to 6, the present invention provides a method for detecting network traffic anomaly suitable for a source network load industrial control system, which includes the following steps:
(1) a data acquisition module collects traffic data in the source network load interaction industrial control system in real time, computes statistics of the data features in the traffic data, and passes the traffic feature data to a data processing module for processing;
(2) the data processing module processes offline sample data or online test data and applies the processed data to the first classification module;
(3) the processed training sample data 1 and the traffic data held in the self-learning module, obtained in step (6), together form new sample data; this data is preprocessed by a data preprocessing module, and the preprocessed data is applied to a first training module;
(4) the data processed in step (2) is taken as input, the model is trained by the first training module of step (3), and the data then enters the first classification module, which detects whether the traffic is normal; if normal, the traffic is output as normal, and if not, the method proceeds to step (6);
(5) the data processing module processes the training sample data 2, and the processed data is applied to a second training module;
(6) the second training module trains on its data; the trained model and the abnormal traffic data obtained in step (4) enter a second classification module, which detects whether the traffic is normal; if normal, the data is added to the self-learning module and the method returns to step (3); if abnormal, the abnormal traffic is output and an alarm is raised.
The self-learning training system is composed of a data acquisition module, a data processing module, a first training module, a second training module, a first classification module, a second classification module and a self-learning module.
The data acquisition module acquires flow data in the industrial control system with source network load interaction in real time, counts data characteristics in the flow data, and inputs the characteristic data of the flow into the data processing module for processing.
When collecting data, the invention uses a sliding time window to compute statistics of the data features. Let the window size be W and the sliding step be L (L < W), where W and L are both numbers of data packets. During collection, the earliest L packets in the window are replaced by the L newest packets, statistics are computed over the window, and the resulting feature values form one feature vector.
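The window mechanism described above can be sketched as a generator over a packet stream. This is a minimal sketch, assuming packets arrive as an iterable of arbitrary objects; the function name `sliding_windows` and its parameter names are hypothetical, with `window` and `step` standing for W and L.

```python
from collections import deque
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sliding_windows(packets: Iterable[T], window: int, step: int) -> Iterator[List[T]]:
    """Yield successive windows of `window` packets, advancing by `step` packets."""
    if not 0 < step < window:
        raise ValueError("expected 0 < step < window, i.e. L < W")
    buf = deque(maxlen=window)  # appending to a full deque drops the earliest packet
    fresh = 0                   # packets received since the last emitted window
    for pkt in packets:
        buf.append(pkt)
        fresh += 1
        # Emit once the window is full and `step` new packets have arrived.
        if len(buf) == window and fresh >= step:
            yield list(buf)
            fresh = 0
```

With W = 4 and L = 2, consecutive windows share two packets, so each packet contributes to two feature vectors.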
The data processing module processes off-line sample data or on-line test data, and the preprocessed data can be applied to the next module.
The first training module trains on the processed sample data 1 to obtain the classification model of the first classification module.
The second training module trains on the processed sample data 2 to obtain the classification model of the second classification module.
The first classification module takes the data processed by the data processing module as input and performs a preliminary classification of the corresponding traffic. The first classification mainly addresses recognition efficiency, and its accuracy requirement is comparatively low.
Traffic judged abnormal by the first classification module is judged more precisely by the second classification module. The second classification module mainly addresses judgment accuracy, and its efficiency requirement is comparatively low. If the second classification module judges the traffic to be normal, the first classifier is deemed to have made a false detection, and this traffic data is input into the self-learning module.
The self-learning module combines this traffic data with the original sample data 1 to form new sample data and, under a predetermined trigger condition, retrains on the new sample data through the first training module, which improves the accuracy of the first classification module. The predetermined trigger condition may be, for example, a fixed time.
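The interplay of the two classification modules and the self-learning module can be sketched as one routing function. This is only an illustration: the function name `detect`, the convention that a classifier returns +1 for normal and -1 for abnormal, and the use of a plain list as the self-learning store are assumptions made for the example.

```python
def detect(features, first_classifier, second_classifier, self_learning_samples):
    """Route one feature vector through both layers; return 'normal' or 'abnormal'."""
    # Layer 1: fast, tolerant classifier that passes most normal traffic.
    if first_classifier(features) == 1:
        return "normal"
    # Layer 2: slower, more accurate classifier, applied only to suspected traffic.
    if second_classifier(features) == 1:
        # Layer 1 made a false detection: keep the sample for later retraining.
        self_learning_samples.append(features)
        return "normal"
    return "abnormal"  # confirmed by both layers; the caller raises the alarm
```

Because the second classifier only sees what the first one rejects, its higher cost is paid on a small fraction of the traffic.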
The data acquisition module collects traffic data and extracts the feature data of the traffic. The source network load interaction industrial control system is mainly used to monitor and control the source network load system, and transmits data using specific acquisition and control protocols such as the 104 protocol and the 61850 protocol. Besides conventional network attack behaviors such as abnormal burst traffic, port scanning, and vulnerability or service scanning, the source network load system also faces a special attack threat: traffic whose attack payload conforms to the standard format of the source network load system protocols.
Normal traffic differs markedly from abnormal traffic. Abnormal traffic typically changes the distributions of the source IP address, destination IP address, source port, destination port and protocol type. In a source network load industrial control system, traffic whose attack payload conforms to the standard protocol format typically changes the distributions of the type identification, the application service type and the messages of different formats.
To detect network traffic anomalies in the source network load system and to detect different attacks, the data acquisition module collects the following traffic data: source address, destination address, source port, destination port, transport layer protocol type, application layer protocol type, and the TCP flag fields ACK and SYN. To detect the special attack threat against the source network load system, application layer feature data of the source network load industrial control system must also be collected. Taking the 104 protocol as an example, this includes: type identification (including monitoring, control and file transmission type identifications), message length (messages can be divided into different classes by length), cause of transmission, application service type, and frame type (S frame, U frame and I frame).
After the data are collected, statistics is carried out through a sliding time window, and the obtained characteristic attributes comprise the following parts: the entropy value of the source IP address, the entropy value of the destination IP address, the entropy value of the source port, the entropy value of the destination port, the proportion of each transmission layer protocol, the proportion of each application layer protocol, the ratio of SYN number to ACK number, and the ratio of the packet sending number and the packet receiving number of the destination port in the window W. The characteristic attributes suitable for the source network load industrial control system further comprise: entropy of type identification, proportion of each length message, proportion of each transmission reason, entropy of application service type, and proportion of each format frame.
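A few of the window statistics listed above can be sketched as follows. The sketch assumes each packet is a dictionary with the hypothetical keys `src_ip`, `dst_ip`, `src_port`, `dst_port`, `transport`, `syn` and `ack`, and uses base-2 Shannon entropy; the text does not state the logarithm base, so that choice is an assumption. The application-layer features (type identification, cause of transmission and so on) would follow the same pattern.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2, an assumed choice) of the value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def proportions(values):
    """Share of each distinct value, e.g. the proportion of each protocol."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}


def window_features(packets):
    """Compute a subset of the statistical features for one window of packets."""
    syn = sum(1 for p in packets if p["syn"])
    ack = sum(1 for p in packets if p["ack"])
    return {
        "src_ip_entropy": entropy(p["src_ip"] for p in packets),
        "dst_ip_entropy": entropy(p["dst_ip"] for p in packets),
        "src_port_entropy": entropy(p["src_port"] for p in packets),
        "dst_port_entropy": entropy(p["dst_port"] for p in packets),
        "transport_share": proportions(p["transport"] for p in packets),
        # Ratio of SYN count to ACK count; infinite when no ACK was seen.
        "syn_ack_ratio": syn / ack if ack else float("inf"),
    }
```

A port scan, for example, would raise the destination-port entropy sharply while leaving the source-address entropy low.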
In the first training module training stage, the data processing module performs data standardization and data feature selection.
For data feature selection, principal component analysis (PCA) is used to reduce the dimensionality of the feature data. Let the sample data be $\{(X^{(1)}, y^{(1)}), (X^{(2)}, y^{(2)}), \dots, (X^{(n)}, y^{(n)})\}$. Dimensionality reduction by PCA comprises the following steps:
Step 201: Center the original d-dimensional sample data set according to the centering formula [formula image not reproduced].
Step 202: Construct the covariance matrix of the samples, using the covariance formula [formula image not reproduced].
Step 203: Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors. The eigenvectors of the covariance matrix represent the principal components, and the importance of each eigenvector is determined by the magnitude of its eigenvalue.
Step 204: Select the k eigenvectors corresponding to the k largest eigenvalues.
Step 205: Construct the mapping matrix W from the k eigenvectors.
Step 206: Reduce the d-dimensional data to a k-dimensional vector Z through the mapping matrix W: $z^{(i)} = W^{T} X^{(i)}$.
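Steps 201 to 206 can be sketched in plain Python. Several choices here are assumptions, since the original formula images are missing: centering subtracts the per-feature mean, the covariance is normalised by n (not n - 1), and the leading eigenvectors are found by power iteration with deflation, a stand-in for whatever eigen-solver a real implementation would use. All function names are hypothetical.

```python
import math
import random


def center(rows):
    """Step 201: subtract the per-feature mean from every sample."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mu[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Step 202: d x d covariance matrix of centered data (normalised by n)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
            for a in range(d)]


def top_eigenvectors(cov, k, iters=500, seed=0):
    """Steps 203-204: k leading eigenvectors via power iteration and deflation."""
    rng = random.Random(seed)
    d = len(cov)
    mat = [row[:] for row in cov]
    vectors = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(mat[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue belonging to v.
        lam = sum(v[a] * mat[a][b] * v[b] for a in range(d) for b in range(d))
        vectors.append(v)
        # Deflate: remove the found component so the next iteration finds the next one.
        mat = [[mat[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return vectors


def pca_reduce(rows, k):
    """Steps 205-206: build W from k eigenvectors and compute z = W^T x per sample."""
    centered = center(rows)
    W = top_eigenvectors(covariance(centered), k)
    Z = [[sum(x[j] * w[j] for j in range(len(x))) for w in W] for x in centered]
    return Z, W
```

Here `W` is stored as a list of k eigenvectors, so each entry of `Z` is the projection of one centered sample onto those k directions.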
The one-class support vector machine (OCSVM) can train an anomaly detection model from samples of only one class, detects anomalies accurately, and is computationally efficient. It is intended for cases where only one class of samples is available for training the classifier. The idea of the standard SVM is to construct a generalized optimal separating surface such that the two classes of data points in the training set lie, as far as possible, on opposite sides of the surface and the margin between them is as large as possible. The OCSVM instead treats the coordinate origin as the abnormal sample and constructs an optimal hyperplane in the feature space that maximizes the margin between the target data and the origin. The task of OCSVM classification is to find a function f(x): if f(x) is positive, the data x is considered normal, and if f(x) is negative, the data x is considered abnormal.
In a source network load interaction industrial control system most data is normal, so the one-class support vector machine is highly efficient at detection. Therefore, the first training module trains a one-class support vector machine (OCSVM) to obtain the first classification module used for classification. That is, the OCSVM module comprises an OCSVM training module and an OCSVM classification module.
The one-class support vector machine solves the following quadratic programming problem:
[objective function image not reproduced]
s.t. $f(x_i) = \phi(x_i)\cdot\omega - \rho \ge -\xi_i,\quad \xi_i \ge 0$
where $x_i$ are the samples in the original space, $l$ is the number of training samples, $\phi$ is the mapping from the original space to the feature space, $\omega$ and $\rho$ are respectively the normal vector and the offset of the required hyperplane in the feature space, the adjustable parameter $v \in (0, 1)$ is an upper bound on the proportion of error samples among all samples, and the slack variable $\xi_i$ measures the degree to which a training sample is misclassified.
A radial kernel function [formula image not reproduced] and an adjustable parameter $v$ are selected, and the following optimization problem is solved:
[formula images not reproduced]
Any $\alpha^{*}$ satisfying the condition [formula image not reproduced] is chosen and used to compute [formula image not reproduced], where the samples whose $\alpha^{*}$ satisfy [formula image not reproduced] are the support vectors.
The decision function is then obtained:
[formula image not reproduced]
If $f(x) > 0$, +1 is returned, meaning the data is normal; if $f(x) < 0$, -1 is returned, meaning the data is abnormal. $N_{sv}$ is the number of support vectors.
To balance the detection efficiency and accuracy of the overall method, a fault tolerance factor $\zeta > 0$ is defined in the OCSVM classification module: +1 is returned when $f(x) > \zeta$, and -1 otherwise. By adjusting $\zeta$ appropriately, a small number of false detections of normal data is accepted (that is, some normal traffic is judged abnormal) so that abnormal traffic is detected as far as possible.
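The effect of the fault tolerance factor can be sketched with a hand-built decision function. The sketch assumes an already trained model given as support vectors, multipliers and offset, and assumes a Gaussian kernel of the form exp(-gamma * ||a - b||^2) as the "radial kernel function"; the kernel image is missing from the source, so that form, the parameter `gamma` and all function names are assumptions.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian radial kernel (assumed form): exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def decision_value(x, support_vectors, alphas, rho, gamma=1.0):
    """f(x) = sum_i alpha_i * K(x_i, x) - rho over the support vectors."""
    weighted = sum(a * rbf_kernel(sv, x, gamma)
                   for sv, a in zip(support_vectors, alphas))
    return weighted - rho


def first_layer_label(x, support_vectors, alphas, rho, zeta=0.0, gamma=1.0):
    """Return +1 (normal) only when f(x) exceeds the fault tolerance factor zeta."""
    return 1 if decision_value(x, support_vectors, alphas, rho, gamma) > zeta else -1
```

Raising `zeta` moves borderline points from +1 to -1, so more traffic is handed to the second classifier in exchange for fewer missed anomalies.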
When the flow data is classified for the first time through the first classification module, if the classification result is abnormal, the flow data needs to be classified more accurately through the second classifier. The first classification module judges that a small amount of false detection (normal data are judged as abnormal flow) possibly exists in abnormal flow, the second classification module has higher accuracy than the first classification module, and the second classification module is used for identifying the false detection flow and providing the partial flow for the self-learning module to learn. The first classification module filters a large amount of normal flow for the second classification module, and reduces the burden of the second classification module.
A Gradient Boosting Decision Tree (GBDT) method is selected in the second training module to train the classification model. A GBDT is an ensemble of weak classification models: each weak classifier gives a predicted value, and these values are combined with certain weights into the final predicted value. In general, the goal of training is to find a model whose predicted value F(x) for an input variable approaches the true value y. The GBDT algorithm trains only one weak base model (weak classifier) at a time, so that the predicted value of each base model approaches the part of the true value it needs to predict, and then combines the predicted values of the base models by weighting. GBDT is sensitive to abnormal data and has a good classification effect.
For the samples $\{(X^{(1)}, y^{(1)}), (X^{(2)}, y^{(2)}), \dots, (X^{(n)}, y^{(n)})\}$, the purpose of the GBDT algorithm is to find a mapping $F(X)$ that minimizes the loss function $L(y, F(X))$, i.e.
$$F^* = \arg\min_{F} \sum_{i=1}^{n} L\left(y^{(i)}, F(X^{(i)})\right)$$
In the GBDT algorithm, gradient boosting requires M iterations in total. Each iteration produces a model, and that model is required to minimize the loss function on the training set. Using gradient descent, each iteration moves in the negative gradient direction of the loss function, so the loss becomes smaller and smaller and the model more and more accurate.
The GBDT algorithm comprises the following steps:
Step 401: Start generating the GBDT classifier.
Step 402: Initialize the loss function.
Figure RE-GDA0001961035730000092
In the present invention, the loss function is chosen as $L(y, F) = \log(1 + e^{-2yF})$, $y \in \{-1, 1\}$.
Step 403: initializing a classification model from said loss function, having
Figure RE-GDA0001961035730000093
Step 404: the value of the negative gradient of the loss function at the current model is calculated as an estimate of the residual error.
Figure RE-GDA0001961035730000101
Step 405: in the residual error
Figure RE-GDA0001961035730000102
The CART tree is constructed, each training sample is finally divided into corresponding leaf nodes, and the predicted values of the leaf nodes are as follows:
Figure RE-GDA0001961035730000103
update Fm(x)=Fm-1(x)+γjmAnd obtaining a stronger classification model.
Step 406: Determine whether the end condition is met, i.e. the gradient reaches a minimum or the number of iterations reaches the set value.
Step 407: Generate the GBDT classification model F(x). A threshold θ is set: when F(x) < θ the traffic is normal, and when F(x) > θ the traffic is abnormal.
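Steps 401 to 407 can be sketched in a few lines of plain Python. This is an illustrative toy, not the patented implementation: the weak learner is a depth-1 regression tree rather than a full CART tree, the closed-form initial value and leaf value are the standard choices for the loss $\log(1 + e^{-2yF})$ and are assumptions here, and the data, the `theta` value and the convention that +1 marks abnormal traffic are hypothetical.

```python
import math


def negative_gradient(y, F):
    """Pseudo-residual of L(y, F) = log(1 + exp(-2yF)) for a label y in {-1, +1}."""
    return 2.0 * y / (1.0 + math.exp(2.0 * y * F))


def fit_stump(X, r):
    """Depth-1 regression tree on residuals r: (feature, threshold, left_idx, right_idx)."""
    best, best_sse = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            sse = 0.0
            for idx in (left, right):
                mean = sum(r[i] for i in idx) / len(idx)
                sse += sum((r[i] - mean) ** 2 for i in idx)
            if sse < best_sse:
                best, best_sse = (f, t, left, right), sse
    return best


def leaf_value(idx, r):
    """One Newton step for this loss: sum(r) / sum(|r| * (2 - |r|))."""
    den = sum(abs(r[i]) * (2.0 - abs(r[i])) for i in idx)
    return sum(r[i] for i in idx) / den if den > 1e-12 else 0.0


def fit_gbdt(X, y, n_rounds=10):
    y_bar = sum(y) / len(y)  # both classes must be present, else the log diverges
    f0 = 0.5 * math.log((1.0 + y_bar) / (1.0 - y_bar))
    F = [f0] * len(X)
    stumps = []
    for _ in range(n_rounds):  # end condition: fixed number of iterations
        r = [negative_gradient(yi, Fi) for yi, Fi in zip(y, F)]
        f, t, left, right = fit_stump(X, r)
        g_left, g_right = leaf_value(left, r), leaf_value(right, r)
        for i in left:
            F[i] += g_left
        for i in right:
            F[i] += g_right
        stumps.append((f, t, g_left, g_right))
    return f0, stumps


def score(model, x):
    """F(x): initial value plus the leaf value each stump assigns to x."""
    f0, stumps = model
    return f0 + sum(gl if x[f] <= t else gr for f, t, gl, gr in stumps)


X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
y = [-1, -1, -1, 1, 1, 1]  # assumed convention: +1 marks abnormal traffic
model = fit_gbdt(X, y)
theta = 0.0                # hypothetical decision threshold
```

With this convention, `score(model, x) > theta` flags a sample as abnormal, matching the rule in step 407.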
When the GBDT generates weak classifiers, CART trees are used. A CART tree is generated as follows:
Step 501: Start generating the CART tree.
Step 502: Select a feature and divide all samples into a left subtree and a right subtree according to the feature value.
Step 503: Compute the variances of the left subtree and the right subtree. Let
$$\bar{y}_L$$
be the mean of the labels of the nodes in the left subtree, and
$$\bar{y}_R$$
be the mean of the labels of the nodes in the right subtree. The variance is calculated as
$$\sum_{i \in L} \left(y^{(i)} - \bar{y}_L\right)^2 + \sum_{i \in R} \left(y^{(i)} - \bar{y}_R\right)^2$$
where $L$ and $R$ denote the sets of samples in the left and right subtrees.
Step 504: Select the feature and split point that minimize the sum of the variances of the left and right subtrees, and perform the first division.
Step 505: Continue dividing downward by the same method.
Step 506: Determine whether an end condition is met. The end conditions are: the node is pure, i.e. all records have the same target variable value; the depth of the tree reaches a pre-specified maximum; the maximum decrease in impurity is less than a pre-specified value; the number of records in the node is less than the pre-specified minimum; all records in the node have the same predictor variable values.
Step 507: If an end condition is met, output the generated CART tree.
Step 508: End the flow.
In the detection stage, the data processing module selects features from the real-time traffic data and standardizes the data.
The first classification module is obtained through training by the first training module. It takes the data processed by the data processing module as input and performs a preliminary classification of the corresponding traffic, using the OCSVM classifier trained by the first training module to detect traffic anomalies.
The second classification module is obtained through training by the second training module. The second classifier is mainly concerned with detection accuracy and has higher detection precision; it classifies using the GBDT algorithm. If abnormal traffic is detected, an alarm is raised or the detection result is submitted to other systems for security defense.
The self-learning module stores the data judged normal by the second classification module (false detections of the first classifier). Under a preset trigger condition, this data is combined with the original sample data 1 into a new sample set to retrain the first classifier. Through the continuous learning of the self-learning module, the accuracy of the first classification module improves, the method adapts to different network environments, and the detection efficiency of the whole model increases.
Specifically, the system comprises a data processing module, a first training module, a second training module, a data acquisition module, a first classification module, a second classification module and a self-learning module. The data processing module processes offline sample data or online test data so that the preprocessed data can be used by the next module. The first training module trains on the processed sample data 1 to obtain the classification model of the first classification module. The second training module trains on the processed sample data 2 to obtain the classification model of the second classification module. The data acquisition module collects traffic data in the source network load industrial control system in real time and inputs the feature data of the traffic into the data processing module for processing. The first classification module takes the data processed by the data processing module as input and performs a preliminary classification of the corresponding traffic. In this first classification, to balance the detection efficiency and the accuracy of the whole method, a small number of false detections of normal data is allowed (that is, normal traffic judged as abnormal traffic) so that as much abnormal traffic as possible is detected. Traffic judged abnormal by the first classification module is classified more accurately by the second classification module. If the second classification module judges the traffic to be normal, the first classifier has made a false detection, and the data is input into the self-learning module. The self-learning module adds this traffic data to sample data 1, and the first training module retrains at scheduled times, which improves the accuracy of the first classification module.
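The routing between the two classification modules and the self-learning module can be summarized in a short sketch. The function name `detect` and the two toy classifiers are hypothetical stand-ins for the trained OCSVM and GBDT models; both are assumed to return +1 for normal and -1 for abnormal traffic.

```python
from typing import Callable, List, Sequence

Sample = Sequence[float]
Classifier = Callable[[Sample], int]  # returns +1 (normal) or -1 (abnormal)


def detect(x: Sample, first: Classifier, second: Classifier,
           false_detections: List[Sample]) -> str:
    """Route one traffic sample through the two classification stages."""
    if first(x) == 1:
        return "normal"             # fast path: most traffic stops here
    if second(x) == 1:
        false_detections.append(x)  # first stage was wrong; keep for retraining
        return "normal"
    return "abnormal"               # both stages agree: raise an alarm


# Hypothetical stand-ins for the trained first and second classifiers.
def strict_first(x: Sample) -> int:
    return 1 if x[0] < 1.0 else -1


def accurate_second(x: Sample) -> int:
    return 1 if x[0] < 3.0 else -1


buffer: List[Sample] = []
results = [detect(x, strict_first, accurate_second, buffer)
           for x in ([0.2], [2.0], [5.0])]
```

Only the middle sample reaches the self-learning buffer: the first stage flags it, and the second stage overrules that decision.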
FIG. 2 is a schematic view of the PCA dimensionality reduction process according to the present invention. Let the sample data be $\{(X^{(1)}, y^{(1)}), (X^{(2)}, y^{(2)}), \dots, (X^{(n)}, y^{(n)})\}$. Dimensionality reduction by principal component analysis comprises the following steps:
Step 201: Start feature dimensionality reduction.
Step 202: Center the original d-dimensional sample data set according to the formula
$$X^{(i)} \leftarrow X^{(i)} - \frac{1}{n} \sum_{j=1}^{n} X^{(j)}$$
Step 203: Construct the covariance matrix of the samples. The covariance formula is
$$C = \frac{1}{n} \sum_{i=1}^{n} X^{(i)} {X^{(i)}}^{T}$$
Step 204: Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors. The eigenvectors of the covariance matrix represent the principal components, and their importance is determined by the magnitude of the eigenvalues.
Step 205: Select the k eigenvectors corresponding to the k largest eigenvalues.
Step 206: Construct the mapping matrix W from the k eigenvectors.
Step 207: Reduce the d-dimensional data to a k-dimensional vector Z by the mapping matrix W: $Z^{(i)} = W^{T} X^{(i)}$
Step 208: Feature dimensionality reduction is complete.
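Steps 202 to 207 can be traced with a small, dependency-free sketch. The eigenvectors are found here by power iteration with deflation, which is an assumption made to keep the example self-contained; the patent does not say how the eigen-decomposition is computed, and the 1/n normalization of the covariance is likewise assumed. The helper names and the toy data are hypothetical.

```python
import math


def center(X):
    """Step 202: subtract the per-feature mean from every sample."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def covariance(Xc):
    """Step 203: d x d covariance matrix of already centered data."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(row[a] * row[b] for row in Xc) / n for b in range(d)]
            for a in range(d)]


def top_eigenvectors(C, k, iters=500):
    """Steps 204-206: k leading eigenvectors via power iteration and deflation."""
    d = len(C)
    C = [row[:] for row in C]
    vectors = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d  # start vector; assumed not orthogonal to the answer
        for _ in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-12:
                break
            v = [c / norm for c in w]
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
        vectors.append(v)
        # Deflate: remove the component just found before searching for the next one.
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return vectors


def project(Xc, W):
    """Step 207: map each centered d-dimensional sample to k dimensions."""
    return [[sum(w[j] * row[j] for j in range(len(row))) for w in W] for row in Xc]


# Toy data lying on the line x2 = 2 * x1, so one component captures everything.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
Xc = center(X)
W = top_eigenvectors(covariance(Xc), k=1)
Z = project(Xc, W)
```

In practice a library eigen-solver would replace `top_eigenvectors`; the rest of the pipeline stays the same.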
FIG. 3 is a schematic diagram of the OCSVM learning process according to the present invention. OCSVM training proceeds as follows:
Step 301: Select an adjustable parameter $\nu$ and a kernel function $K(x, y)$. Here the radial basis kernel function is chosen:
$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$
Step 302: Train the OCSVM classifier on the processed sample data.
Step 303: Obtain the classifier, whose decision function can be expressed as
$$f(x) = \sum_{i=1}^{N_{sv}} \alpha_i^* K(x_i, x) - \rho^*$$
Figure RE-GDA0001961035730000123
where $x_i$ is a support vector.
FIG. 4 is a schematic diagram of the GBDT generation process according to the present invention. Generating the GBDT prediction model comprises the following steps:
Step 401: Start generating the GBDT classifier.
Step 402: Initialize the loss function.
Figure RE-GDA0001961035730000125
In the present invention, the loss function is chosen as $L(y, F) = \log(1 + e^{-2yF})$, $y \in \{-1, 1\}$.
Step 403: Initialize the classification model from the loss function:
$$F_0(x) = \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}}$$
where $\bar{y}$ is the mean of the training labels.
Step 404: Compute the negative gradient of the loss function at the current model as an estimate of the residual:
$$\tilde{y}_i = -\left[\frac{\partial L(y^{(i)}, F(X^{(i)}))}{\partial F(X^{(i)})}\right]_{F = F_{m-1}} = \frac{2 y^{(i)}}{1 + e^{2 y^{(i)} F_{m-1}(X^{(i)})}}$$
Step 405: On the residuals
$$\{(X^{(i)}, \tilde{y}_i)\}_{i=1}^{n}$$
a CART tree is constructed, and each training sample is finally assigned to a leaf node $R_{jm}$. The predicted value of each leaf node at this point is:
$$\gamma_{jm} = \frac{\sum_{X^{(i)} \in R_{jm}} \tilde{y}_i}{\sum_{X^{(i)} \in R_{jm}} |\tilde{y}_i| \, (2 - |\tilde{y}_i|)}$$
Update $F_m(x) = F_{m-1}(x) + \gamma_{jm}$ for $x \in R_{jm}$ to obtain a stronger learner.
Step 406: Determine whether the end condition is met, i.e. the gradient reaches a minimum or the number of iterations reaches the set value.
Step 407: Generate the GBDT classification model F(x). A threshold θ is set: when F(x) < θ the traffic is normal, and when F(x) > θ the traffic is abnormal.
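The negative gradient used in step 404 follows directly from the loss chosen in step 402. The short derivation below is a sketch; the symbol $\tilde{y}$ for the residual is an assumed notation.

```latex
\begin{aligned}
L(y, F) &= \log\left(1 + e^{-2yF}\right), \qquad y \in \{-1, 1\} \\
\frac{\partial L}{\partial F}
  &= \frac{-2y\, e^{-2yF}}{1 + e^{-2yF}}
   = \frac{-2y}{1 + e^{2yF}} \\
\tilde{y}
  &= -\left.\frac{\partial L}{\partial F}\right|_{F = F_{m-1}(x)}
   = \frac{2y}{1 + e^{2y F_{m-1}(x)}}
\end{aligned}
```

At $F_{m-1}(x) = 0$ the residual equals the label $y$ itself, and it shrinks toward 0 as $y F_{m-1}(x)$ grows, so samples the current model already classifies confidently contribute little to the next tree.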
FIG. 5 is a schematic diagram of generating a CART tree according to the present invention. A CART tree is generated as follows:
Step 501: Start generating the CART tree.
Step 502: Select a feature and divide all samples into a left subtree and a right subtree according to the feature value.
Step 503: Compute the variances of the left subtree and the right subtree. Let
$$\bar{y}_L$$
be the mean of the labels of the nodes in the left subtree, and
$$\bar{y}_R$$
be the mean of the labels of the nodes in the right subtree. The variance is calculated as
$$\sum_{i \in L} \left(y^{(i)} - \bar{y}_L\right)^2 + \sum_{i \in R} \left(y^{(i)} - \bar{y}_R\right)^2$$
where $L$ and $R$ denote the sets of samples in the left and right subtrees.
Step 504: Select the feature and split point that minimize the sum of the variances of the left and right subtrees, and perform the first division.
Step 505: Continue dividing downward by the same method.
Step 506: Determine whether an end condition is met. The end conditions are: the node is pure, i.e. all records have the same target variable value; the depth of the tree reaches a pre-specified maximum; the maximum decrease in impurity is less than a pre-specified value; the number of records in the node is less than the pre-specified minimum; all records in the node have the same predictor variable values.
Step 507: If an end condition is met, output the generated CART tree.
Step 508: End the flow.
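The recursive procedure in steps 501 to 508 can be sketched as a small regression tree. The sketch is illustrative only: it scores splits by the summed squared deviation in each child (variance multiplied by the record count), it implements four of the five end conditions and omits the minimum impurity decrease check, and the names `build_tree`, `max_depth` and `min_records` are hypothetical.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean (variance times record count)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Steps 502-504: (feature, threshold) minimising the left + right score, or None."""
    best, best_score = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            score = sum_sq_dev(left) + sum_sq_dev(right)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(X, y, depth=0, max_depth=3, min_records=2):
    """Steps 505-507: split recursively until an end condition holds."""
    leaf = {"value": sum(y) / len(y)}
    # End conditions: pure node, maximum depth reached, too few records.
    if len(set(y)) == 1 or depth >= max_depth or len(y) < min_records:
        return leaf
    split = best_split(X, y)
    if split is None:  # all records share the same predictor values
        return leaf
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           depth + 1, max_depth, min_records),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            depth + 1, max_depth, min_records),
    }


def predict(tree, x):
    """Walk from the root to a leaf and return the mean label stored there."""
    while "value" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


X = [[1.0], [2.0], [3.0], [10.0], [11.0]]
y = [0.0, 0.0, 0.0, 1.0, 1.0]
tree = build_tree(X, y)
```

On this toy data a single split at 3.0 already yields two pure leaves, so the recursion stops after one level.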
FIG. 6 is a flow chart of the self-learning method of the present invention.
Step 601: Initialize the self-learning module and set the sample capacity and the training trigger condition. The trigger condition may be a specific time or a specific state, such as a timed trigger.
Step 602: Monitor whether any traffic has been falsely detected.
Step 603: If falsely detected traffic exists, combine it with the original training samples into a new training sample set. If the training set has not reached the set capacity, add the traffic directly; otherwise, replace the earliest training sample.
Step 604: Determine whether the trigger condition is met. If not, return to step 602.
Step 605: If the trigger condition is met, retrain the classification model.
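The capacity and trigger logic of steps 601 to 605 maps naturally onto a bounded queue. In the sketch below the class name `SelfLearningModule`, the count-based trigger and the `train` callback are assumptions for illustration; the description allows a timed or state-based trigger as well.

```python
from collections import deque


class SelfLearningModule:
    def __init__(self, original_samples, capacity, trigger_count):
        # Step 601: a deque with maxlen drops its oldest item once it is full,
        # which gives the "replace the earliest training sample" behaviour.
        self.samples = deque(original_samples, maxlen=capacity)
        self.trigger_count = trigger_count  # assumed trigger: N new samples
        self.pending = 0

    def report_false_detection(self, sample):
        """Steps 602-603: store a sample the first classifier misjudged."""
        self.samples.append(sample)
        self.pending += 1

    def maybe_retrain(self, train):
        """Steps 604-605: call train(samples) only when the trigger is met."""
        if self.pending < self.trigger_count:
            return None
        self.pending = 0
        return train(list(self.samples))


module = SelfLearningModule([1, 2, 3], capacity=4, trigger_count=2)
module.report_false_detection(4)
first_try = module.maybe_retrain(len)   # below the trigger: no retraining
module.report_false_detection(5)        # set is full: sample 1 is replaced
second_try = module.maybe_retrain(len)  # trigger met: "train" on 4 samples
```

Here `len` stands in for a real training routine, so `second_try` simply reports how many samples the retraining would have seen.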
In summary, the network traffic anomaly detection method for the source network load interaction industrial control system adopts a two-layer detection structure that balances detection accuracy and detection efficiency, and uses a self-learning method to improve detection accuracy. Beyond the source network load interaction industrial control system, the technique can also be applied in many other fields and has high popularization value.
The above description is only an example of the present invention and is not intended to limit the present invention. All equivalents which come within the spirit of the invention are therefore intended to be embraced therein. Details not described herein are well within the skill of those in the art.

Claims (2)

1. A network flow abnormity detection method suitable for a source network load interaction industrial control system is characterized by comprising the following steps:
(1) the method comprises the steps that flow data in a source network load interaction industrial control system are collected in real time through a data collection module, data characteristics in the flow data are counted, and the characteristic data of the flow are input into a data processing module to be processed;
(2) the data processing module processes off-line sample data or on-line test data and applies the processed data to the first classification module;
(3) forming new sample data by the sample data 1 after training and the flow data in the self-learning module obtained in the step (6), preprocessing the data by a data preprocessing module, and applying the data processed by the data preprocessing module to a first training module;
(4) taking the data processed in the step (2) as input, training the data through the first training module obtained in the step (3), entering the trained data into a first classification module, and detecting whether the flow is normal through the first classification module; if so, outputting that the flow is normal and ending; if not, entering the step (5);
(5) the data processing module is used for processing the data of the sample data 2 after training processing, and the data processed by the data processing module is applied to a second training module;
(6) training the data in the second training module, entering the trained data and the abnormal flow data obtained in the step (4) into a second classification module, and detecting whether the flow is normal through the second classification module; if the flow is normal, adding the data into the self-learning module and entering the step (3); if the flow is abnormal, outputting that the flow is abnormal and giving an alarm.
2. The method for detecting the network traffic abnormality applicable to the source network load interaction industrial control system according to claim 1, wherein the self-learning module in the step (3) has the following specific steps:
(3.1) initializing a self-learning module, and setting the volume of the sample and the condition of training triggering; the trigger condition may be set to a specific time or a specific state;
(3.2) monitoring whether the flow is detected by mistake;
(3.3) if the flow is detected by mistake, forming a new training sample with the original training sample; if the training sample does not reach the set sample capacity, directly adding the training sample; otherwise, replacing the earliest training sample;
(3.4) judging whether the trigger condition is met, if the trigger condition is not met, returning to the step (3.2);
(3.5) if the triggering condition is met, retraining the classification model.
CN201811415563.1A 2018-11-26 2018-11-26 Network flow abnormity detection method suitable for source network load interaction industrial control system Expired - Fee Related CN109462521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811415563.1A CN109462521B (en) 2018-11-26 2018-11-26 Network flow abnormity detection method suitable for source network load interaction industrial control system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811415563.1A CN109462521B (en) 2018-11-26 2018-11-26 Network flow abnormity detection method suitable for source network load interaction industrial control system

Publications (2)

Publication Number Publication Date
CN109462521A CN109462521A (en) 2019-03-12
CN109462521B true CN109462521B (en) 2020-11-20

Family

ID=65611591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811415563.1A Expired - Fee Related CN109462521B (en) 2018-11-26 2018-11-26 Network flow abnormity detection method suitable for source network load interaction industrial control system

Country Status (1)

Country Link
CN (1) CN109462521B (en)

Also Published As

Publication number Publication date
CN109462521A (en) 2019-03-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201120