CN108540451A - A method for classifying and detecting attacks using machine learning techniques - Google Patents
- Publication number
- CN108540451A CN108540451A CN201810202552.9A CN201810202552A CN108540451A CN 108540451 A CN108540451 A CN 108540451A CN 201810202552 A CN201810202552 A CN 201810202552A CN 108540451 A CN108540451 A CN 108540451A
- Authority
- CN
- China
- Prior art keywords
- data
- value
- training
- training data
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a method for classifying and detecting attacks using machine learning techniques, and belongs to the field of information security technology. Specifically: 1. Collect network data and pre-process it to obtain training data. 2. Build and train a multistage classifier. 3. Use the trained multistage classifier to classify and detect test data. Compared with the prior art, the proposed method has the following advantages: 1. The pre-processing of the collected data reduces the data scale while removing some irrelevant data, improving overall efficiency. 2. The use of a multistage classifier and ensemble learning overcomes the limited fitting accuracy of a single classifier and substantially improves the detection accuracy of the system. 3. The data-partitioning design based on the improved random forest algorithm allows the detection of different attack types to be implemented as a parallel algorithm, improving the overall detection speed of the system.
Description
Technical field
The present invention relates to a method for classifying and detecting attacks using machine learning techniques, and belongs to the field of information security technology.
Background technology
While the development of network and computer technology has brought convenience to people's lives, the security of network systems has also raised new concerns. Because the number and variety of network attacks are growing exponentially, networks and information systems face serious security threats. In this context, research on methods for protecting network security has important theoretical and practical value.
To protect the security of network systems, and to identify and prevent attacks from inside and outside the system as well as unauthorized user behavior, researchers have proposed active monitoring techniques for network systems. In this approach, monitoring nodes actively generate load on the monitored network and analyze the collected data, thereby obtaining status information about the monitored network and producing corresponding decisions. The main research focus of active monitoring systems is building suitable models to classify and detect the monitored information. The evaluation metrics of classification-detection techniques include detection time, detection accuracy, and false-alarm rate. As the number of monitoring nodes increases, active monitoring of the network becomes more complex, so analysis methods that perform better in both detection accuracy and detection time are needed.
Machine learning simulates human learning with computers: a learner is built from existing experience, the learner is then used to predict unknown data, and the learner is continuously refined during this process. Introducing machine learning techniques into active monitoring improves the precision of data analysis. Machine learning models commonly used in active monitoring systems include SVMs, neural networks, logistic regression, and Bayesian networks.
Invention content
The purpose of the present invention is to address the low detection accuracy, long response time, and high false-alarm rate of existing active monitoring techniques for attacks in large-scale networks, by proposing a method for classifying and detecting attacks using machine learning techniques. The method improves the quality of the data to be tested and reduces the data scale through a complete feature-engineering procedure, and on this basis builds an ensemble machine learning model based on the random forest algorithm and the support vector machine algorithm, which classifies the processed data to predict attacks in the network system.
The purpose of the present invention is achieved through the following technical solutions.
The specific operations of the proposed method for classifying and detecting attacks using machine learning techniques are:
Step 1: Collect network data and pre-process it to obtain training data. The training data is divided into normal data and attack data; the attack data is further divided into multiple classes according to attack type. The number of attack types is denoted by the symbol N, where N is a positive integer. There are no fewer than 3000 training samples of each type.
The specific operations for obtaining the training data are:
Step 1.1: Collect network data from the network system. The network data includes web-content-related features, network-traffic-related features, and network-connection-related features.
Step 1.2: Pre-process the network data, specifically:
Step 1.2.1: Clean the network data by removing records with missing feature values and records whose feature values fall outside the normal value range.
Step 1.2.2: Standardize the cleaned network data. Specifically: map character-type data to numerical values, or apply a binary numerical transformation. The standardized network data is represented in feature-vector form.
Step 1.2.3: Normalize the standardized network data by formula (1), so that the value of each feature of the network data lies in the range [0, 1]. Formula (1), reconstructed here from the surrounding definitions, is the standard min-max normalization:

new_v = (v - min) / (max - min)    (1)

where new_v denotes the value of any feature of the network data (denoted by the symbol V) after normalization, new_v ∈ [0, 1]; v denotes the original value of feature V in the network data; max denotes the maximum of the original values of feature V over the whole network data; and min denotes the minimum of the original values of feature V over the whole network data.
Step 1.3: Use the distance-based local outlier factor (LOF) algorithm to compute the local outlier factor of the pre-processed network data. Specifically:
Step 1.3.1: Represent each network data record as an m-dimensional feature vector, denoted by the symbol s, s = {x1, x2, x3, ..., xm}, where m is the number of features a network data record contains and m is a positive integer; x1, x2, x3, ..., xm denote the m features. The feature vectors are then mapped into the m-dimensional feature space, so each feature vector corresponds to a point in the feature space.
Step 1.3.2: Denote by the symbol p the point in feature space corresponding to any m-dimensional feature vector, and compute the local outlier factor of point p by formula (2). Formulas (2) and (3), reconstructed here from the surrounding definitions in the standard LOF form, are:

LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|    (2)

where LOF_k(p) denotes the local outlier factor of the k-neighborhood of point p; the value of k is set manually, k > 10; N_k(p) denotes the k-neighborhood point set of point p, that is, all points within the k-distance of point p; lrd_k(o) denotes the local reachability density of point o, o ∈ N_k(p); and lrd_k(p) denotes the local reachability density of point p, computed by formula (3):

lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} dist(p, o)    (3)

where |N_k(p)| denotes the number of points in the k-neighborhood of point p, and dist(p, o) denotes the distance from point p to point o.
Step 1.4: Judge whether point p is an outlier according to the local outlier factor LOF_k(p).
Denote the outlier threshold by the symbol ε, where ε is a manually set value with range (1, 2]. When LOF_k(p) > ε, mark point p as an abnormal point; the feature vector corresponding to point p is then abnormal data, and the abnormal data is deleted. Otherwise, execute the operation of step 1.5.
Through the operation of step 1.4, the data remaining after removing the abnormal data is taken as the basic training data.
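Steps 1.3 and 1.4 can be sketched with scikit-learn's LocalOutlierFactor. This is a stand-in, not the patent's own LOF computation; scikit-learn stores the factor negated in `negative_outlier_factor_`, and the k and ε values below are just for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def remove_outliers(X, k=10, eps=1.2):
    """Drop rows whose local outlier factor exceeds eps (steps 1.3-1.4)."""
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_  # sklearn stores LOF negated
    return X[scores <= eps]

# a tight cluster plus one obvious outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), [[10.0, 10.0]]])
X_clean = remove_outliers(X, k=5, eps=1.2)
```

Points inside the dense cluster score near 1 and survive the ε = 1.2 threshold, while the isolated point scores far above it and is deleted, matching the decision rule of step 1.4.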
Step 1.5: If the dimension m of the basic training data is greater than Q, execute the operation of step 1.6. Otherwise, take the basic training data as the training data. Here Q is a manually set value, Q ≥ 40.
Step 1.6: Apply a feature selection algorithm to reduce the dimension of the basic training data to T, where T is a manually set value, T ≤ 40.
Candidate feature selection algorithms include: the variance selection method, the information gain method, the mutual information method, the chi-square test method, and tree-model-based feature selection methods.
Step 1.7: Analyze and confirm the validity of the feature selection algorithm. The specific operations are:
Step 1.7.1: Select a machine learning model for classifying the basic training data. The machine learning model is a binary classification model used to divide the basic training data into normal data or attack data. Candidate machine learning models include logistic regression models, decision trees, and perceptrons.
Step 1.7.2: Take the basic training data obtained in step 1.4 as the input of the machine learning model to obtain classification results, then compute the classification accuracy of the basic training data, denoted by the symbol L1.
Step 1.7.3: Take the dimension-reduced basic training data obtained in step 1.6 as the input of the machine learning model, obtain classification results, and compute the classification accuracy of the dimension-reduced basic training data, denoted by the symbol L2.
Step 1.7.4: Denote the evaluation threshold by the symbol δ, δ ∈ (0.9, 1). If L2/L1 < δ, increase T and repeat the operations of step 1.6 to step 1.7. Otherwise, end this step and take the dimension-reduced basic training data as the training data.
Step 2: Build and train the multistage classifier. The specific operation steps are:
Step 2.1: Improve the voting rule of the random forest algorithm and build the improved random forest classifier. Specifically:
Step 2.1.1: Construct n decision trees, where n is a manually set value, n > 500.
Step 2.1.2: Denote the classification threshold by the symbol θ, where θ is a manually set value, θ ∈ (0, 1).
Step 2.1.3: Denote a type of network data by the symbol y, y ∈ Y, where Y is the set of network data types.
Step 2.1.4: Obtain the classification result of a training sample by formula (4). Formula (4), reconstructed here from the surrounding definitions as a thresholded vote, is:

f(x) = 1 if (1/n) Σ_{i=1..n} I( f_i(x_i) = y ) > θ, and f(x) = 0 otherwise    (4)

where x denotes a training sample, and f(x) denotes the value of the classification function indicating whether training sample x belongs to type y; x_i denotes the input of the i-th decision tree, a sample of training data x; and f_i(x_i) denotes the value of the classification function of the i-th decision tree, built by random sampling.
If the value of f(x) is 1, the initial type of training sample x is labeled as class y.
Step 2.1.5: Take the training data obtained in step 1 as the input of the improved random forest classifier, and complete the initial type labeling of the training data through the classification operation. Since the number of attack types is N, there are (N+1) types in total including the normal data type.
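The thresholded voting of step 2.1.4 can be sketched on top of scikit-learn's RandomForestClassifier. This is an assumption-laden stand-in: `predict_proba` averages the per-tree votes, which approximates the vote fraction in the reconstructed formula (4), and θ = 0.35 below is the embodiment's value used only for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def threshold_vote(forest, X, target_class, theta=0.35):
    """Formula (4) sketch: assign target_class only when the trees' vote share
    for that class exceeds theta."""
    idx = list(forest.classes_).index(target_class)
    vote_share = forest.predict_proba(X)[:, idx]  # mean vote across the trees
    return (vote_share > theta).astype(int)

# toy 1-D data: class 0 near 0.0, class 1 near 1.0
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
labels = threshold_vote(forest, np.array([[1.1], [0.1]]), target_class=1)
```

Because θ is below 0.5, a sample can clear the threshold for more than one type; the per-type AdaBoost-SVM classifiers of step 2.2 then refine these initial labels.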
Step 2.2: Use an iterative method to build the integrated classifier of support vector machines (SVM) based on the adaptive boosting (AdaBoost) idea. The weights of each training sample and of the integrated classifier are computed in the iterative process as follows:
Step 2.2.1: For m training samples, denote the weights of the 1st to m-th training samples by the symbols w1, w2, w3, ..., wm, and set their initial values to 1/m. Denote the current iteration number by the symbol t, and set its initial value t = 0.
Step 2.2.2: Build the t-th base classifier, a support vector machine with a Gaussian kernel function, denoted by the symbol g_t; its classification function is shown in formula (5). Take the training data set as the input of base classifier g_t and obtain the training result. Formula (5), reconstructed here in the standard kernel-SVM form implied by the surrounding definitions, is:

g(x) = sgn( Σ_r α_r y_r K(x_r, x) + b )    (5)

where g(x) denotes the value of the classification function of base classifier g_t; sgn(·) denotes the sign function; a denotes a training sample; α_r denotes the coefficient of a support vector; b is the bias, whose initial value is a manually set value in the range [0, 0.5]; and K(x_r, x_s) is the Gaussian kernel function.
Step 2.2.3: Select the quadratic loss function as the loss function of base classifier g_t. Compute the value of the loss function of g_t, and set a manually chosen threshold for the loss function of g_t.
Step 2.2.4: If the value of the loss function of base classifier g_t is less than the threshold, the multistage classifier is obtained, and its classification function is shown in formula (6); then compute the model weight of base classifier g_t by formula (7), denoted by the symbol d_t, and end the operation. If the value of the loss function of g_t is not less than the threshold, execute the operation of step 2.2.5. Formulas (6) and (7), reconstructed here in the standard AdaBoost form, are:

g = sgn( Σ_{t=1..T} d_t g_t(x) )    (6)
d_t = (1/2) ln( (1 - e_t) / e_t ),  where  e_t = Σ_a w_t(a) · I( g_t(a) ≠ y_t(a) )    (7)

where g denotes the value of the classification function of the multistage classifier; T is the total number of iterations; a denotes the a-th training sample; w_t(a) denotes the weight of training sample a in the t-th iteration; g_t(a) denotes the training result of base classifier g_t on training sample a; and y_t(a) denotes the true label of training sample a.
Step 2.2.5: Adjust the training data weights w1, w2, w3, ..., wm by formula (8), and normalize the weight of each training sample so that the weights of all training samples sum to 1. Then increment the iteration number by 1, return to step 2.2.2, and execute the operation of step 2.2.2. Formula (8), reconstructed here in the standard AdaBoost form consistent with the definition of β, is:

w_{t+1}(a) = w_t(a) · exp( -β · d_t ) / Z_t    (8)

where w_{t+1}(a) denotes the weight of training sample a in the (t+1)-th iteration; Z_t is the normalization factor; and β is a regulation coefficient: β = 1 when the training result of sample a is identical to the true label, and β = -1 when the training result of sample a differs from the true label.
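The iterative procedure of step 2.2 can be sketched with a small hand-rolled boosting loop over RBF-kernel SVMs. This is a simplified sketch, not the patented algorithm: the standard AdaBoost weighted-error stopping rule replaces the patent's quadratic-loss threshold test, and labels are assumed to be in {-1, +1}:

```python
import numpy as np
from sklearn.svm import SVC

def adaboost_svm(X, y, rounds=5):
    """Step 2.2 sketch: iteratively train Gaussian-kernel SVM base
    classifiers g_t, re-weighting the training samples each round."""
    m = len(X)
    w = np.full(m, 1.0 / m)                   # step 2.2.1: uniform weights
    learners, d = [], []
    for _ in range(rounds):
        g = SVC(kernel="rbf").fit(X, y, sample_weight=w)
        pred = g.predict(X)
        err = w[pred != y].sum()              # weighted training error
        if err == 0:                          # perfect base classifier
            learners.append(g); d.append(1.0)
            break
        if err >= 0.5:                        # no better than chance: stop
            break
        d_t = 0.5 * np.log((1 - err) / err)   # model weight, formula (7) analogue
        learners.append(g); d.append(d_t)
        w = w * np.exp(-d_t * y * pred)       # formula (8) analogue, beta = +/-1
        w = w / w.sum()                       # renormalize so weights sum to 1
    return learners, d

def adaboost_predict(learners, d, X):
    """Formula (6) analogue: sign of the d_t-weighted vote of the base SVMs."""
    return np.sign(sum(d_t * g.predict(X) for g, d_t in zip(learners, d)))

# toy separable data standing in for one (attack type) vs. rest problem
X = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]])
y = np.array([-1, -1, -1, 1, 1, 1])
learners, d = adaboost_svm(X, y, rounds=3)
pred = adaboost_predict(learners, d, X)
```

Misclassified samples gain weight after each round (β = -1 raises their weight, β = +1 lowers it), so later base SVMs concentrate on the samples the earlier ones got wrong.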
Step 2.3: Train the support vector machines (SVM) based on the adaptive boosting (AdaBoost) idea.
Each type of training data obtained in step 2.1 is separately used as the input of one AdaBoost-based support vector machine (SVM). Through training, (N+1) trained AdaBoost-based support vector machines (SVM) are obtained.
Through the operation of step 2, a trained multistage classifier is obtained.
Step 3: Use the trained multistage classifier to classify and detect the test data.
On the basis of the operation of step 2, the test data is input to the multistage classifier to obtain the final classification result.
Advantageous effects
Compared with the prior art, the proposed method for classifying and detecting attacks using machine learning techniques has the following advantages:
1. The pre-processing of the collected data reduces the data scale while removing some irrelevant data, improving overall efficiency.
2. The use of a multistage classifier and ensemble learning overcomes the limited fitting accuracy of a single classifier and substantially improves the detection accuracy of the system.
3. The data-partitioning design based on the improved random forest algorithm allows the detection of different attack types to be implemented as a parallel algorithm, improving the overall detection speed of the system.
Description of the drawings
Fig. 1 is the operational flowchart of the method for classifying and detecting attacks using machine learning techniques in the specific embodiment of the invention.
Specific implementation mode
The technical solution of the present invention is further described below with reference to the drawings and a specific embodiment.
Network data is classified using the proposed method for classifying and detecting attacks with machine learning techniques. The operating process is shown in Fig. 1, and the specific steps are:
Step 1: Collect network data and pre-process it to obtain training data.
The KDD CUP99 data is obtained, and a 10% sample of the KDD CUP99 data set is taken as the network data of this embodiment.
The training data is divided into normal (Normal) data and attack data; the attack data is further divided into 4 classes according to attack type. There are no fewer than 3000 training samples of each type. The attack types are: denial of service (DOS), monitoring or probing (Probing), remote unauthorized access (R2L), and illegal user-privilege escalation (U2R).
The specific operations for obtaining the training data are:
Step 1.1: The KDD CUP99 data is obtained, and a 10% sample of the KDD CUP99 data set is taken as the network data of this embodiment, containing 41 attributes, as shown in Table 1.
Table 1: Features of the KDD CUP99 data
Step 1.2: The network data is pre-processed. The KDD CUP99 data contains 38 numerical-type attributes and 3 character-type attributes, namely Protocol_type, Flag, and Service. The specific processing is:
Step 1.2.1: Clean the network data by removing records with missing feature values and records whose feature values fall outside the normal value range.
Step 1.2.2: Standardize the cleaned network data. Specifically: map character-type data to numerical values, or apply a binary numerical transformation to character strings. The character attribute Protocol_type denotes the protocol type and takes the values TCP, ICMP, and UDP, which are numericalized to 0, 1, and 2 respectively. The numericalized values of the feature attribute Flag are shown in Table 2. The feature attribute Service has dozens of distinct values, which are numericalized directly to 1, 2, 3, ... in data-dictionary order.
Table 2: Numerical mapping of the Flag attribute
The standardized network data is represented in feature-vector form.
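The numericalization of step 1.2.2 can be sketched as follows. The Protocol_type map is the fixed one from this embodiment; assigning Flag and Service codes in order of first appearance is an assumption standing in for the data-dictionary order, and the record layout (the three character attributes first) is also an illustrative assumption:

```python
PROTOCOL_MAP = {"tcp": 0, "icmp": 1, "udp": 2}  # fixed map from the embodiment

def encode_categoricals(records):
    """Numericalize the three character-type KDD CUP99 attributes (step 1.2.2).
    protocol_type uses the fixed map above; flag and service receive integer
    codes in order of first appearance."""
    flag_codes, service_codes = {}, {}
    encoded = []
    for proto, flag, service, *numeric_rest in records:
        f = flag_codes.setdefault(flag, len(flag_codes))
        s = service_codes.setdefault(service, len(service_codes))
        encoded.append([PROTOCOL_MAP[proto], f, s, *numeric_rest])
    return encoded

demo = encode_categoricals([
    ("tcp", "SF", "http", 0.1),
    ("udp", "S0", "dns", 0.2),
    ("tcp", "SF", "dns", 0.3),
])
```

After this mapping every attribute is numeric, so the min-max normalization of step 1.2.3 can be applied uniformly to the full 41-dimensional vector.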
Step 1.2.3: Normalize the standardized network data by formula (1), so that the value of each feature of the network data lies in the range [0, 1].
Here new_v denotes the value of any feature V of the network data after normalization, new_v ∈ [0, 1]; v denotes the original value of feature V in the network data; max denotes the maximum of the original values of feature V over the whole network data; and min denotes the minimum of those original values.
Step 1.3: Use the distance-based local outlier factor (LOF) algorithm to compute the local outlier factor of the pre-processed network data. Specifically:
Step 1.3.1: Represent each network data record as an m-dimensional feature vector, m = 41. The feature vector is denoted by the symbol s, s = {x1, x2, x3, ..., xm}; x1, x2, x3, ..., xm denote the m features. The feature vectors are then mapped into the m-dimensional feature space, so each feature vector corresponds to a point in the feature space.
Step 1.3.2: Denote by the symbol p the point in feature space corresponding to any m-dimensional feature vector, and compute the local outlier factor of point p by formula (2).
Here LOF_k(p) denotes the local outlier factor of the k-neighborhood of point p; the value of k is set manually, k = 50; N_k(p) denotes the k-neighborhood point set of point p, that is, all points within the k-distance of point p; lrd_k(o) denotes the local reachability density of point o, o ∈ N_k(p); and lrd_k(p) denotes the local reachability density of point p, computed by formula (3), in which |N_k(p)| denotes the number of points in the k-neighborhood of point p and dist(p, o) denotes the distance from point p to point o.
Step 1.4: Judge whether point p is an outlier according to the local outlier factor LOF_k(p).
Denote the outlier threshold by the symbol ε, where ε is a manually set value, ε = 1.2. When LOF_k(p) > ε, mark point p as an abnormal point; the feature vector corresponding to point p is then abnormal data, and the abnormal data is deleted. Otherwise, execute the operation of step 1.5.
Through the operation of step 1.4, the data remaining after removing the abnormal data is taken as the basic training data.
Step 1.5: If the dimension m of the basic training data is greater than Q, execute the operation of step 1.6. Otherwise, take the basic training data as the training data. Here Q = 40.
Step 1.6: Apply the information gain method to reduce the dimension of the basic training data to T, where T is a manually set value, T = 20.
Step 1.7: Analyze and confirm the validity of the feature selection algorithm. The specific operations are:
Step 1.7.1: A logistic regression model is selected as the machine learning model for classifying the basic training data. The logistic regression model is a binary classification model used to divide the basic training data into normal data or attack data.
Step 1.7.2: Take the basic training data obtained in step 1.4 as the input of the logistic regression model to obtain classification results, then compute the classification accuracy of the basic training data, denoted by the symbol L1.
Step 1.7.3: Take the dimension-reduced basic training data obtained in step 1.6 as the input of the logistic regression model, obtain classification results, and compute the classification accuracy of the dimension-reduced basic training data, denoted by the symbol L2.
Step 1.7.4: Denote the evaluation threshold by the symbol δ, δ = 0.95. If L2/L1 < δ, increase T and repeat the operations of step 1.6 to step 1.7. Otherwise, end this step and take the dimension-reduced basic training data as the training data.
Through the operation of step 1, training data of T = 24 dimensions is obtained; the selected features correspond to the following numbers in Table 1:
5, 3, 6, 23, 26, 37, 2, 12, 30, 2, 3, 9, 10, 13, 15, 18, 1, 2, 7, 1, 14, 22, 4, 7.
Step 2: Build and train the multistage classifier.
Step 2.1: The voting rule of the random forest algorithm is improved, and the improved random forest algorithm is used to assign initial type labels to the training data. Specifically:
Step 2.1.1: Construct n decision trees, n = 1000.
Step 2.1.2: Denote the classification threshold by the symbol θ, where θ is a manually set value, θ = 0.35.
Step 2.1.3: Denote a type of network data by the symbol y, y ∈ Y, where Y is the set of network data types. The type set of the network data is Y = {normal, denial of service, probing, remote unauthorized access, illegal privilege escalation}.
Step 2.1.4: Obtain the classification result of a training sample by formula (4).
Here x denotes a training sample, and f(x) denotes the value of the classification function indicating whether training sample x belongs to type y; x_i denotes the input of the i-th decision tree, a sample of training data x; and f_i(x_i) denotes the value of the classification function of the i-th decision tree, built by random sampling.
If the value of f(x) is 1, the initial type of training sample x is labeled as class y.
Step 2.1.5: Take the training data obtained in step 1 as the input of the improved random forest classifier, and complete the initial type labeling of the training data through the classification operation. Since the number of attack types is N, there are (N+1) types in total including the normal data type.
Step 2.2: Use an iterative method to build the integrated classifier of support vector machines (SVM) based on the adaptive boosting (AdaBoost) idea. The weights of each training sample and of the integrated classifier are computed in the iterative process as follows:
Step 2.2.1: For a training data set containing m training samples, denote the weights of the 1st to m-th training samples by the symbols w1, w2, w3, ..., wm, and set their initial values to 1/m, where m denotes the number of training samples. Denote the current iteration number by the symbol t, and set its initial value t = 0.
Step 2.2.2: Build the t-th base classifier, a support vector machine with a Gaussian kernel function, denoted by the symbol g_t; its classification function is shown in formula (5). Take the training data set as the input of base classifier g_t and obtain the training result.
Here g(x) denotes the value of the classification function of base classifier g_t; sgn(·) denotes the sign function; a denotes a training sample; α_r denotes the coefficient of a support vector; b is the bias, whose initial value is a manually set value in the range [0, 0.5]; and K(x_r, x_s) is the Gaussian kernel function.
For the base classifier of the normal type, b = 0.0.
For the base classifier of the denial of service (DOS) type, b = 0.13.
For the base classifier of the monitoring or probing (Probing) type, b = 0.18.
For the base classifier of the remote unauthorized access (R2L) type, b = 0.16.
For the base classifier of the illegal user-privilege escalation (U2R) type, b = 0.21.
Step 2.2.3: Select the quadratic loss function as the loss function of base classifier g_t. Compute the value of the loss function of g_t, and set a manually chosen threshold for the loss function of g_t.
Step 2.2.4: If the value of the loss function of base classifier g_t is less than the threshold, the multistage classifier is obtained, and its classification function is shown in formula (6); then compute the model weight of base classifier g_t by formula (7), denoted by the symbol d_t, and end the operation. If the value of the loss function of g_t is not less than the threshold, execute the operation of step 2.2.5.
Here g denotes the value of the classification function of the multistage classifier, and T is the total number of iterations. In formula (7), a denotes the a-th training sample; w_t(a) denotes the weight of training sample a in the t-th iteration; g_t(a) denotes the training result of base classifier g_t on training sample a; and y_t(a) denotes the true label of training sample a.
Step 2.2.5: Adjust the training data weights w1, w2, w3, ..., wm by formula (8), and normalize the weight of each training sample so that the weights of all training samples sum to 1. Then increment the iteration number by 1, return to step 2.2.2, and execute the operation of step 2.2.2.
In formula (8), w_{t+1}(a) denotes the weight of training sample a in the (t+1)-th iteration; β is a regulation coefficient: β = 1 when the training result of sample a is identical to the true label, and β = -1 when the training result of sample a differs from the true label.
Step 2.3: Train the support vector machines (SVM) based on the adaptive boosting (AdaBoost) idea.
Each type of training data obtained in step 2.1 is separately used as the input of one AdaBoost-based support vector machine (SVM). Through training, 5 trained AdaBoost-based support vector machines (SVM) are obtained.
Through the operation of step 2, a trained multistage classifier is obtained.
Step 3: Use the trained multistage classifier to classify and detect the test data.
On the basis of the operation of step 2, the test data is separately input to each per-type classifier of the multistage classifier to obtain the final classification result.
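The final routing of step 3 can be sketched as a one-vs-rest decision. This is a hedged stand-in: plain binary RBF-SVMs replace the per-type AdaBoost-SVM ensembles of step 2, and taking the largest decision value as the winning type is an illustrative assumption about how the per-type outputs are combined:

```python
import numpy as np
from sklearn.svm import SVC

def train_per_type(X, y, classes):
    """One binary RBF-SVM per type, a stand-in for the (N+1) per-type
    AdaBoost-SVM ensembles of step 2."""
    return [SVC(kernel="rbf").fit(X, np.where(y == c, 1, -1)) for c in classes]

def multistage_predict(models, classes, X):
    """Step 3 sketch: feed the test data to every per-type classifier and
    pick the type with the largest decision value."""
    scores = np.stack([m.decision_function(X) for m in models])
    return np.asarray(classes)[np.argmax(scores, axis=0)]

# three well-separated toy clusters standing in for normal/attack types
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(0.0, 0.3, size=(10, 2)) for c in centers])
y = np.repeat(np.arange(3), 10)
models = train_per_type(X, y, classes=[0, 1, 2])
pred = multistage_predict(models, [0, 1, 2], X)
```

Because each per-type classifier is independent, the (N+1) evaluations can run in parallel, which is the parallelism advantage claimed for the data-partitioning design.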
In this embodiment, the quantities of training data and test data for each attack type, together with the number of new-attack-type samples in the test set, are shown in Table 3.
Table 3: Experimental data and sample-size statistics
To assess the validity of the classification-detection method, the classification accuracy, false-alarm rate, and missed-report rate are used as the evaluation metrics of the classification method. The detection metrics for each type of data in this embodiment are shown in Table 4.
Table 4: Detection metrics for each type of data
Claims (3)
1. a kind of method carrying out classification and Detection to attack with machine learning techniques, it is characterised in that:It is specifically grasped
As:
Step 1: acquiring network data and being pre-processed, training data is obtained;The training data is divided into normal data and attacks
Hit data;The attack data are divided into according to different attack types as plurality of classes, and the quantity of attack type is indicated with symbol N;N
For positive integer;The quantity of each type training data is no less than 3000;
The concrete operations for obtaining the training data are:
Step 1.1: collect network data from the network system; the network data includes web-content-related features, network-traffic-related features, and network-connection-related features;
Step 1.2: preprocess the network data, specifically:
Step 1.2.1: perform data cleansing on the network data, removing records with missing feature values and records whose feature values fall outside the valid value range;
Step 1.2.2: standardize the cleansed network data, specifically: map character-type data to numerical values or apply a binarizing numerical transformation; after standardization, each network data record is represented as a feature vector;
Step 1.2.3: normalize the standardized network data by formula (1), new_v = (v - min) / (max - min), so that the value of every feature falls within the range [0, 1];
where new_v denotes the value of any feature V in the network data after normalization, new_v ∈ [0, 1]; v denotes the original value of feature V; max denotes the maximum of the original values of feature V over all network data; and min denotes the minimum of the original values of feature V over all network data;
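The min-max normalization of step 1.2.3 can be sketched as follows; this is a minimal illustration, and the function name is ours rather than the patent's:

```python
def min_max_normalize(column):
    """Formula (1): new_v = (v - min) / (max - min), scaling one
    feature column into [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                 # constant feature: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

For example, `min_max_normalize([2, 4, 6])` yields `[0.0, 0.5, 1.0]`.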
Step 1.3: use the distance-based local outlier factor (LOF) algorithm to compute the local outlier factor of the preprocessed network data;
Step 1.4: judge from the local outlier factor LOF_k(p) whether the point p is an outlier;
the outlier threshold is denoted by the symbol ε, a manually set value with range (1, 2]; when LOF_k(p) > ε, the point p is marked as an abnormal point, the feature vector corresponding to p is treated as abnormal data, and the abnormal data is deleted; otherwise, the operation of step 1.5 is executed;
through the operation of step 1.4, the data remaining after the abnormal data is removed is taken as the basic training data;
Step 1.5: if the dimension m of the basic training data satisfies m > Q, execute the operation of step 1.6; otherwise, take the basic training data as the training data; Q is a manually set value, Q ≥ 40;
Step 1.6: apply a feature-selection algorithm to reduce the dimension of the basic training data to T; T is a manually set value, T ≤ 40;
Step 1.7: analyse and confirm the validity of the feature-selection algorithm; the concrete operations are:
Step 1.7.1: select a machine learning model for classifying the basic training data; the machine learning model is a binary classification model that divides the basic training data into normal data or attack data; the machine learning model includes logistic regression models, decision trees, and perceptrons;
Step 1.7.2: take the basic training data obtained in step 1.4 as the input of the machine learning model to obtain classification results, then compute the classification accuracy of the basic training data, denoted by the symbol L_1;
Step 1.7.3: take the dimension-reduced basic training data obtained in step 1.6 as the input of the machine learning model to obtain classification results, and compute the classification accuracy of the dimension-reduced basic training data, denoted by the symbol L_2;
Step 1.7.4: the assessment threshold is denoted by the symbol δ, δ ∈ (0.9, 1); if L_2/L_1 < δ, enlarge T and repeat the operations of steps 1.6 to 1.7; otherwise, end this step and take the dimension-reduced basic training data as the training data;
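The validity check of step 1.7 can be sketched as a loop that enlarges T until the accuracy ratio clears the threshold δ. The `evaluate` callback, the starting dimension, and the step size are illustrative assumptions, not fixed by the claim:

```python
def choose_dimension(evaluate, t_start=10, t_max=40, delta=0.95, step=5):
    """evaluate(T) must return (L1, L2): classification accuracy on the
    full data and on the T-dimensional reduced data (steps 1.7.2-1.7.3).
    Enlarge T until L2/L1 >= delta, as in step 1.7.4, or T reaches t_max."""
    for t in range(t_start, t_max + 1, step):
        l1, l2 = evaluate(t)
        if l2 / l1 >= delta:
            return t          # reduction accepted at this dimension
    return t_max              # fall back to the largest allowed T
```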
Step 2: construct and train the multistage classifier; the concrete operation steps are:
Step 2.1: improve the voting principle of the random forest algorithm to construct an improved random forest classifier, specifically:
Step 2.1.1: construct n decision trees; n is a manually set value, n > 500;
Step 2.1.2: the classification threshold is denoted by the symbol θ, a manually set value, θ ∈ (0, 1);
Step 2.1.3: a type of network data is denoted by the symbol y, y ∈ Y, where Y is the type set of the network data;
Step 2.1.4: obtain the classification result of a training record by formula (4);
where x denotes a training record; f(x) denotes the value of the classification function indicating whether x belongs to type y; x_i denotes the input of the i-th decision tree, a sample drawn from the training record x; f_i(x_i) denotes the value of the classification function of the i-th decision tree, built by random sampling;
if the value of f(x) is 1, the starting type of the training record x is labelled as class y;
Step 2.1.5: take the training data obtained in step 1 as the input of the improved random forest classifier and perform the classification operation, completing the labelling of the starting types of the training data; since the number of attack types is N, plus the normal data type, there are (N + 1) types in total;
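Formula (4) itself is rendered only as an image in the source; one plausible reading of the improved voting principle, using the threshold θ from step 2.1.2, is sketched below. The exact aggregation in the patent may differ:

```python
def improved_forest_vote(tree_votes, theta):
    """Label a record with class y only when the fraction of the n
    decision trees voting for y exceeds theta; return None when no
    class is voted for confidently enough (no starting label)."""
    counts = {}
    for y in tree_votes:
        counts[y] = counts.get(y, 0) + 1
    best = max(counts, key=counts.get)
    if counts[best] / len(tree_votes) > theta:
        return best           # f(x) = 1 for this class
    return None               # f(x) = 0: below the voting threshold
```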
Step 2.2: construct the integrated support-vector-machine classifier based on the adaptive boosting (AdaBoost) idea with an iterative method; the weight computation for each training record and for the integrated classifier during iteration is, specifically:
Step 2.2.1: for m training records, denote the weights of the 1st to the m-th records by w_1, w_2, w_3 … w_m respectively, and set their initial values to 1/m; denote the current iteration number by the symbol t, with initial value t = 0;
Step 2.2.2: construct the t-th base classifier, a support vector machine with a Gaussian function as its kernel, denoted by the symbol g_t; the classification function of this classifier is shown in formula (5); take the training data set as the input of the base classifier g_t and obtain the training result;
where g(x) denotes the value of the classification function of the base classifier g_t; sgn(·) denotes the sign function; a is a training record; the support vectors appear in formula (5); b is the bias, whose initial value is a manually set value in the range [0, 0.5]; K(x_r, x_s) is the Gaussian kernel function;
Step 2.2.3: select the quadratic loss function as the loss function of the base classifier g_t; compute the value of the loss function of g_t, the threshold of the loss function being a manually set value;
Step 2.2.4: if the value of the loss function of g_t is less than the threshold, the multistage classifier is obtained, its classification function being given by formula (6); then compute the model weight of the base classifier g_t by formula (7), denoted by the symbol d_t, and end the operation; if the value of the loss function of g_t is not less than the threshold, execute the operation of step 2.2.5;
where g denotes the value of the classification function of the multistage classifier and T is the total number of iterations;
where a denotes the a-th training record; w_t(a) denotes the weight of training record a in the t-th iteration; g_t(a) denotes the training result of the base classifier g_t on training record a; y_t(a) denotes the true result of training record a;
Step 2.2.5: adjust the training-data weights w_1, w_2, w_3 … w_m with formula (8), and standardize the weight of each training record so that the weights of all training records sum to 1; then increment the iteration number by 1, return to step 2.2.2, and execute the operation of step 2.2.2;
where w_{t+1}(a) denotes the weight of training record a in the (t + 1)-th iteration; β is a regulation coefficient: when the training result of record a matches the true result, β = 1; when the training result of record a differs from the true result, β = −1;
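Formula (8) is likewise an image in the source; a common AdaBoost-style update consistent with the β = ±1 convention of step 2.2.5 is sketched below. The exponential form is our assumption, not the patent's stated formula:

```python
import math

def update_weights(weights, correct):
    """Raise the weight of records the base SVM got wrong (beta = -1)
    and lower it for correct ones (beta = +1), then renormalize so the
    weights sum to 1, as step 2.2.5 requires."""
    beta = [1 if ok else -1 for ok in correct]
    raw = [w * math.exp(-b) for w, b in zip(weights, beta)]
    total = sum(raw)
    return [w / total for w in raw]
```

Misclassified records thus receive a larger share of the weight in the next iteration, focusing the next base classifier on the hard cases.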
Step 2.3: train the support vector machines based on the adaptive boosting AdaBoost idea;
each type of training data obtained in step 2.1 is input separately to a support vector machine based on the AdaBoost idea; through training, (N + 1) trained AdaBoost-based support vector machines are obtained;
through the operation of step 2, a trained multistage classifier is obtained;
Step 3: classify and detect the test data with the trained multistage classifier;
building on the operation of step 2, the test data is input to the multistage classifier to obtain the final classification results.
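The two-stage flow of steps 2 and 3 (a coarse starting label from the improved random forest, then a final decision by the AdaBoost-based SVM trained for that type) can be sketched as follows; all callables here are illustrative stand-ins:

```python
def multistage_classify(x, forest_label, svm_by_type):
    """Stage 1: the improved random forest assigns a starting type.
    Stage 2: the AdaBoost-SVM trained for that type gives the final
    classification result for the test record x."""
    coarse = forest_label(x)
    return svm_by_type[coarse](x)
```

For instance, with a stub forest that always labels 'dos' and a per-type SVM that confirms or rejects that label, `multistage_classify` returns the SVM's final verdict.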
2. The method for classifying and detecting attack behaviour with machine learning techniques according to claim 1, characterized in that the concrete operations of step 1.3, computing the local outlier factor of the preprocessed network data with the distance-based local outlier factor (LOF) algorithm, are:
Step 1.3.1: represent each network data record as an m-dimensional feature vector, denoted by the symbol s, s = {x_1, x_2, x_3 … x_m}, where m is the number of features a network data record contains and m is a positive integer; x_1, x_2, x_3 … x_m denote the m features respectively; then map each feature vector into the m-dimensional feature space, each feature vector corresponding to one point of the feature space;
Step 1.3.2: denote by the symbol p the point in feature space corresponding to any m-dimensional feature vector, and compute the local outlier factor of the point p by formula (2);
where LOF_k(p) denotes the local outlier factor of the k-neighbourhood of the point p; the value of k is manually specified, k > 10; N_k(p) denotes the k-neighbourhood point set of p, its members being all points within the k-distance of p; lrd_k(o) denotes the local reachability density of the point o, o ∈ N_k(p); lrd_k(p) denotes the local reachability density of the point p and is computed by formula (3);
where |N_k(p)| denotes the number of points in the k-neighbourhood of p, and dist(p, o) denotes the distance from the point p to the point o.
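Formulas (2) and (3) as stated in this claim can be sketched directly. Note that this follows the claim's simplified local reachability density, which uses plain distances rather than reachability distances; the function names are ours:

```python
def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn(points, p, k):
    """Indices of the k nearest neighbours of points[p], excluding p."""
    ranked = sorted((dist(points[p], points[q]), q)
                    for q in range(len(points)) if q != p)
    return [q for _, q in ranked[:k]]

def lrd(points, p, k):
    """Formula (3) as written in the claim: local reachability density
    lrd_k(p) = |N_k(p)| / sum of dist(p, o) over the k-neighbourhood."""
    nb = knn(points, p, k)
    s = sum(dist(points[p], points[o]) for o in nb)
    return len(nb) / s if s > 0 else float('inf')

def lof(points, p, k):
    """Formula (2): mean ratio of the neighbours' local reachability
    densities to lrd_k(p); values well above 1 indicate an outlier."""
    nb = knn(points, p, k)
    return sum(lrd(points, o, k) for o in nb) / (len(nb) * lrd(points, p, k))
```

On a tight cluster plus one far-away point, the far point's LOF is much greater than 1, so it would be cut by the threshold ε of step 1.4, while interior points score close to 1.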
3. The method for classifying and detecting attack behaviour with machine learning techniques according to claim 1 or 2, characterized in that the feature-selection algorithm of step 1.6 includes: the variance selection method, the information-gain method, the mutual-information method, the chi-square test method, and feature-selection methods based on tree models.
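As a concrete instance of the first listed algorithm, variance selection can be sketched as follows; the column-major input layout and the threshold value are illustrative choices:

```python
def variance_select(columns, threshold):
    """Keep the indices of feature columns whose variance exceeds the
    threshold; near-constant features carry little information for
    classification and are dropped."""
    kept = []
    for i, col in enumerate(columns):
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        if var > threshold:
            kept.append(i)
    return kept
```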
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810202552.9A CN108540451A (en) | 2018-03-13 | 2018-03-13 | A method of classification and Detection being carried out to attack with machine learning techniques |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108540451A true CN108540451A (en) | 2018-09-14 |
Family
ID=63484291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810202552.9A Pending CN108540451A (en) | 2018-03-13 | 2018-03-13 | A method of classification and Detection being carried out to attack with machine learning techniques |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108540451A (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109379379A (en) * | 2018-12-06 | 2019-02-22 | 中国民航大学 | Based on the network inbreak detection method for improving convolutional neural networks |
CN109450880A (en) * | 2018-10-26 | 2019-03-08 | 平安科技(深圳)有限公司 | Detection method for phishing site, device and computer equipment based on decision tree |
CN109450876A (en) * | 2018-10-23 | 2019-03-08 | 中国科学院信息工程研究所 | A kind of DDos recognition methods and system based on various dimensions state-transition matrix feature |
CN109660522A (en) * | 2018-11-29 | 2019-04-19 | 华东师范大学 | The mixed intrusion detection method based on deep layer self-encoding encoder towards Integrated Electronic System |
CN109741175A (en) * | 2018-12-28 | 2019-05-10 | 上海点融信息科技有限责任公司 | Based on artificial intelligence to the appraisal procedure of credit again and equipment for purchasing automobile-used family by stages |
CN110008976A (en) * | 2018-12-05 | 2019-07-12 | 阿里巴巴集团控股有限公司 | A kind of network behavior classification method and device |
CN110133146A (en) * | 2019-05-28 | 2019-08-16 | 国网上海市电力公司 | A kind of Diagnosis Method of Transformer Faults and system considering unbalanced data sample |
CN110478911A (en) * | 2019-08-13 | 2019-11-22 | 苏州钛智智能科技有限公司 | The unmanned method of intelligent game vehicle and intelligent vehicle, equipment based on machine learning |
CN110581840A (en) * | 2019-07-24 | 2019-12-17 | 中国科学院信息工程研究所 | Intrusion detection method based on double-layer heterogeneous integrated learner |
CN110719279A (en) * | 2019-10-09 | 2020-01-21 | 东北大学 | Network anomaly detection system and method based on neural network |
CN111107077A (en) * | 2019-12-16 | 2020-05-05 | 中国电子科技网络信息安全有限公司 | SVM-based attack flow classification method |
CN112269907A (en) * | 2020-11-02 | 2021-01-26 | 山东万里红信息技术有限公司 | Processing method of health big data of Internet of things |
CN112333706A (en) * | 2019-07-16 | 2021-02-05 | ***通信集团浙江有限公司 | Internet of things equipment anomaly detection method and device, computing equipment and storage medium |
CN112367303A (en) * | 2020-10-21 | 2021-02-12 | 中国电子科技集团公司第二十八研究所 | Distributed self-learning abnormal flow cooperative detection method and system |
CN112398779A (en) * | 2019-08-12 | 2021-02-23 | 中国科学院国家空间科学中心 | Network traffic data analysis method and system |
CN112559591A (en) * | 2020-12-08 | 2021-03-26 | 晋中学院 | Outlier detection system and detection method for cold roll manufacturing process |
CN112583844A (en) * | 2020-12-24 | 2021-03-30 | 北京航空航天大学 | Big data platform defense method for advanced sustainable threat attack |
CN113114618A (en) * | 2021-03-02 | 2021-07-13 | 西安电子科技大学 | Internet of things equipment intrusion detection method based on traffic classification recognition |
CN113239025A (en) * | 2021-04-23 | 2021-08-10 | 四川大学 | Ship track classification method based on feature selection and hyper-parameter optimization |
CN113347181A (en) * | 2021-06-01 | 2021-09-03 | 上海明略人工智能(集团)有限公司 | Abnormal advertisement flow detection method, system, computer equipment and storage medium |
CN114039837A (en) * | 2021-11-05 | 2022-02-11 | 奇安信科技集团股份有限公司 | Alarm data processing method, device, system, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150193693A1 (en) * | 2014-01-06 | 2015-07-09 | Cisco Technology, Inc. | Learning model selection in a distributed network |
CN104794192A (en) * | 2015-04-17 | 2015-07-22 | 南京大学 | Multi-level anomaly detection method based on exponential smoothing and integrated learning model |
CN107276999A (en) * | 2017-06-08 | 2017-10-20 | 西安电子科技大学 | A kind of event detecting method in wireless sensor network |
2018-03-13: CN CN201810202552.9A patent/CN108540451A/en active Pending
Non-Patent Citations (1)
Title |
---|
JIA Bin: "A distributed detection model of DDoS attack traffic based on combined classifiers", Journal of Huazhong University of Science and Technology (Natural Science Edition) *
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109450876B (en) * | 2018-10-23 | 2020-12-22 | 中国科学院信息工程研究所 | DDos identification method and system based on multi-dimensional state transition matrix characteristics |
CN109450876A (en) * | 2018-10-23 | 2019-03-08 | 中国科学院信息工程研究所 | A kind of DDos recognition methods and system based on various dimensions state-transition matrix feature |
CN109450880A (en) * | 2018-10-26 | 2019-03-08 | 平安科技(深圳)有限公司 | Detection method for phishing site, device and computer equipment based on decision tree |
CN109660522A (en) * | 2018-11-29 | 2019-04-19 | 华东师范大学 | The mixed intrusion detection method based on deep layer self-encoding encoder towards Integrated Electronic System |
CN109660522B (en) * | 2018-11-29 | 2021-05-25 | 华东师范大学 | Deep self-encoder-based hybrid intrusion detection method for integrated electronic system |
CN110008976A (en) * | 2018-12-05 | 2019-07-12 | 阿里巴巴集团控股有限公司 | A kind of network behavior classification method and device |
CN109379379B (en) * | 2018-12-06 | 2021-03-02 | 中国民航大学 | Network intrusion detection method based on improved convolutional neural network |
CN109379379A (en) * | 2018-12-06 | 2019-02-22 | 中国民航大学 | Based on the network inbreak detection method for improving convolutional neural networks |
CN109741175A (en) * | 2018-12-28 | 2019-05-10 | 上海点融信息科技有限责任公司 | Based on artificial intelligence to the appraisal procedure of credit again and equipment for purchasing automobile-used family by stages |
CN110133146A (en) * | 2019-05-28 | 2019-08-16 | 国网上海市电力公司 | A kind of Diagnosis Method of Transformer Faults and system considering unbalanced data sample |
CN112333706B (en) * | 2019-07-16 | 2022-08-23 | ***通信集团浙江有限公司 | Internet of things equipment anomaly detection method and device, computing equipment and storage medium |
CN112333706A (en) * | 2019-07-16 | 2021-02-05 | ***通信集团浙江有限公司 | Internet of things equipment anomaly detection method and device, computing equipment and storage medium |
CN110581840B (en) * | 2019-07-24 | 2020-10-16 | 中国科学院信息工程研究所 | Intrusion detection method based on double-layer heterogeneous integrated learner |
CN110581840A (en) * | 2019-07-24 | 2019-12-17 | 中国科学院信息工程研究所 | Intrusion detection method based on double-layer heterogeneous integrated learner |
CN112398779B (en) * | 2019-08-12 | 2022-11-01 | 中国科学院国家空间科学中心 | Network traffic data analysis method and system |
CN112398779A (en) * | 2019-08-12 | 2021-02-23 | 中国科学院国家空间科学中心 | Network traffic data analysis method and system |
CN110478911A (en) * | 2019-08-13 | 2019-11-22 | 苏州钛智智能科技有限公司 | The unmanned method of intelligent game vehicle and intelligent vehicle, equipment based on machine learning |
CN110719279A (en) * | 2019-10-09 | 2020-01-21 | 东北大学 | Network anomaly detection system and method based on neural network |
CN111107077B (en) * | 2019-12-16 | 2021-12-21 | 中国电子科技网络信息安全有限公司 | SVM-based attack flow classification method |
CN111107077A (en) * | 2019-12-16 | 2020-05-05 | 中国电子科技网络信息安全有限公司 | SVM-based attack flow classification method |
CN112367303A (en) * | 2020-10-21 | 2021-02-12 | 中国电子科技集团公司第二十八研究所 | Distributed self-learning abnormal flow cooperative detection method and system |
CN112269907A (en) * | 2020-11-02 | 2021-01-26 | 山东万里红信息技术有限公司 | Processing method of health big data of Internet of things |
CN112559591A (en) * | 2020-12-08 | 2021-03-26 | 晋中学院 | Outlier detection system and detection method for cold roll manufacturing process |
CN112559591B (en) * | 2020-12-08 | 2023-06-13 | 晋中学院 | Outlier detection system and detection method for cold roll manufacturing process |
CN112583844A (en) * | 2020-12-24 | 2021-03-30 | 北京航空航天大学 | Big data platform defense method for advanced sustainable threat attack |
CN112583844B (en) * | 2020-12-24 | 2021-09-03 | 北京航空航天大学 | Big data platform defense method for advanced sustainable threat attack |
CN113114618A (en) * | 2021-03-02 | 2021-07-13 | 西安电子科技大学 | Internet of things equipment intrusion detection method based on traffic classification recognition |
CN113239025A (en) * | 2021-04-23 | 2021-08-10 | 四川大学 | Ship track classification method based on feature selection and hyper-parameter optimization |
CN113347181A (en) * | 2021-06-01 | 2021-09-03 | 上海明略人工智能(集团)有限公司 | Abnormal advertisement flow detection method, system, computer equipment and storage medium |
CN114039837A (en) * | 2021-11-05 | 2022-02-11 | 奇安信科技集团股份有限公司 | Alarm data processing method, device, system, equipment and storage medium |
CN114039837B (en) * | 2021-11-05 | 2023-10-31 | 奇安信科技集团股份有限公司 | Alarm data processing method, device, system, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108540451A (en) | A method of classification and Detection being carried out to attack with machine learning techniques | |
Khan et al. | An improved convolutional neural network model for intrusion detection in networks | |
CN107395590B (en) | A kind of intrusion detection method classified based on PCA and random forest | |
Özgür et al. | A review of KDD99 dataset usage in intrusion detection and machine learning between 2010 and 2015 | |
Ibrahimi et al. | Management of intrusion detection systems based-KDD99: Analysis with LDA and PCA | |
CN106973038B (en) | Network intrusion detection method based on genetic algorithm oversampling support vector machine | |
CN107862347A (en) | A kind of discovery method of the electricity stealing based on random forest | |
Raj et al. | Applications of pattern recognition algorithms in agriculture: a review | |
US20080086493A1 (en) | Apparatus and method for organization, segmentation, characterization, and discrimination of complex data sets from multi-heterogeneous sources | |
CN102291392A (en) | Hybrid intrusion detection method based on bagging algorithm | |
CN114124482B (en) | Access flow anomaly detection method and equipment based on LOF and isolated forest | |
CN105354198A (en) | Data processing method and apparatus | |
Chandolikar et al. | Efficient algorithm for intrusion attack classification by analyzing KDD Cup 99 | |
Chen et al. | Pattern recognition using clustering algorithm for scenario definition in traffic simulation-based decision support systems | |
CN107483451A (en) | Based on serial parallel structural network secure data processing method and system, social networks | |
CN113542241A (en) | Intrusion detection method and device based on CNN-BiGRU mixed model | |
Mir et al. | An experimental evaluation of bayesian classifiers applied to intrusion detection | |
CN116865994A (en) | Network data security prediction method based on big data | |
Gong et al. | Intrusion detection system combining misuse detection and anomaly detection using genetic network programming | |
Machoke et al. | Performance Comparison of Ensemble Learning and Supervised Algorithms in Classifying Multi-label Network Traffic Flow | |
Thanh et al. | An approach to reduce data dimension in building effective network intrusion detection systems | |
CN115392351A (en) | Risk user identification method and device, electronic equipment and storage medium | |
Liu et al. | Network intrusion detection based on chaotic multi-verse optimizer | |
CN113837481A (en) | Financial big data management system based on block chain | |
Chareka et al. | A study of fitness functions for data classification using grammatical evolution |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180914 |